Comparing Strings in FoxPro
Writing an article about comparing strings? Only recently this became an issue for many developers in other languages. Up to then most languages only had one function to determine if two strings are identical or not. Even better: This function behaves the same all the time! FoxPro was always a little different. Comparing strings can be quite a difficult matter. Not only do we have two operators: "=" and "==". They also operate differently depending on multiple SET commands. In this article I like to discuss how comparing strings works in VFP, what it depends on and how various functions are affected by different settings.
= or == ?
The most frequently used operator to compare strings is "=". With the default setting of SET EXACT OFF this operator only compares strings up to length of the right-hand string. If this string is shorter than the left one, both strings are considered to be equal nonetheless. The other way round this is not true. If the right-hand string is longer than the left-hand one, both strings are different:
? "String which is longer" = "String"
.T.
? "String" = "String which is longer"
.F.
If you need to compare strings literally without any external impact you have to use the "==" operator
With "==", on the other hand, both strings have not only to be identical, but also have the same length. Comparing strings with "==" is therefore called a binary string comparison. This operator is also independent from various settings like SET EXACT. That means, if you need to compare strings literally without any external impact you have to use the "==" operator. One additional benefit is that comparing strings with "==" is a little faster than the "=" operator. The following sample demonstrates the usage of this operator for binary comparisons:
? "String" == "String"
.T.
? "String1" == "String"
.F.
SET EXACT
What role does SET EXACT have in all this? SET EXACT can take one of two values: ON and OFF. The default setting is OFF which causes FoxPro to behave as outlined above. With "ON" you turn on the exact string comparison. It's easy to mistakenly assume that SET EXACT ON is the same as the "==" operator. This is especially true since the help file constantly talks about "exact comparison" regarding the "==" operator. At first glance this assumption seems to be true:
? "String" = "String"
.T.
? "String1" = "String"
.F.
These are the same results that the "==" operator returned in the sample above. Be careful, though! The "=" operator always ignores trailing blanks:
? "String " = "String" && .T.
? "String" = "String " && .T.
? "String " == "String" && .F.
? "String" == "String " && .F.
The reason for this behavior is easily explained. Without this detail you wouldn't be able to compare regular character fields when SET EXACT is ON, unless you constantly use ALLTRIM(). As it is implemented, you can compare fields which are padded with blanks. Just recently with the introduction of VARCHAR in Visual FoxPro this issue has become less serious. In other words, when you use "=" and SET EXACT ON, you should remember that FoxPro performs an implicit RTRIM() on both strings before comparing them. The following lines are therefore identical:
SET EXACT ON
? String1 = String2
? RTRIM(String1) = RTRIM(String2)
Knowing this you can easily explain why in FoxPro you should never compare a string with an empty string. When SET EXACT is OFF, such an expression always returns .T.:
? "String" = ""
.T.
If you revert the expression, FoxPro returns .F. as expected:
? "" = "String"
.F.
Not only is SET EXACT ON important for comparing strings, but it is for the order as well. That's a huge difference to other languages that follow the common rule: "If A equals B, B must equal A". Once again you need to take SET EXACT ON into account:
? "String" = ""
.F.
It appears that you can compare a string with an empty string. Stay away from code like, though, as blanks are ignored here as well:
? " " = ""
.T.
If you want to know if a string is empty, use the binary comparison operator (==) or check if the length is 0 by calling the LEN() function. Why don't I recommend to use EMPTY() which is meant for checking for an empty string? The question is: What is an empty string? For instance, a string is empty when it doesn't contain a character; in other words, when it's "". You probably know that a string is also empty when it contains blanks. Less known seems to be the fact that a string is empty when it contains any of the following codes: CHR(9) = tab, CHR(13) = carriage return and CHR(10) = line feed. The following expression therefore returns .T.:
? EMPTY( " "+Chr(9)+Chr(13)+Chr(10) )
If you want to ensure that an expression only contains blanks or nothing at all, you need to use LEN(ALLTRIM(String)))=0 as condition. ALLTRIM() removes blanks, but doesn't touch any of the other characters that EMPTY() ignores.
In the European area SET COLLATE is another important command
SET COLLATE
Don't think I'm done with the article yet. SET EXACT is only one of the commands that affects string comparisons. In the European area SET COLLATE is another important command. According to the help file this command affects indexing and sorting operations. Its impact on string comparisons is not immediately visible. Take a look at the following program:
SET COLLATE TO "MACHINE"
? "String" = "string" && .F.
SET COLLATE TO "GENERAL"
? "String" = "string" && .T.
The impact of SET COLLATE is far from just being relevant for fields and indexes. All string comparisons are performed according to the current SET COLLATE setting. This behavior makes sense, because you definitely want to compare table fields the same way you did in the index, even after storing the field value into a memory variable. Nonetheless, there's one big trap involved here. Pay attention when you want to compare data that is not actually text:
? Chr(1) = Chr(2)
.F.
SET COLLATE TO "GENERAL"
? Chr(1) = Chr(2)
.T.
With GENERAL all binary characters that are not letters are identical. No rule without exception, though: this doesn't impact CHR(9), the tabulator.
? Chr(1) = Chr(9)
.F.
Why would you need binary characters, you might ask? Many developers use binary encoded strings as a primary key for tables. With FoxPro 2.x we avoided binary encoded strings due to various code pages issues. Visual FoxPro introduced a new field type called "Character (binary)". Visual FoxPro doesn't convert such a string at all. Hence, you can use all possible 256 characters as a primary key.
Unfortunately that's only half the truth. In fact, when SET COLLATE has a different setting than MACHINE, such fields still respect the sort order:
SET COLLATE TO "GENERAL"
CREATE TABLE Test( Feld C(4) NOCPTRANS )
INSERT INTO Test (Feld) VALUE (CHR(1)+"ABC")
? Test.Feld = CHR(2)+"ABC"
As you can see, both values are treated the same, even when using binary fields. If you somehow use binary characters, make sure to SET COLLATE to MACHINE when you compare any of these fields. I'd like to recommend == for binary strings. There's another gotcha with this operator though. Let me get back to my previous sample, this time using the "==" operator:
? test.Feld == CHR(2)+"ABC"
.F.
Fantastic! That's just what we wanted. Now we check the exact opposite. That is, are the strings different?
? test.Feld # CHR(2)+"ABC"
.F.
Hmm… Two strings which are neither equal nor different? Unfortunately, FoxPro doesn't have an operator for binary not equal. "#" (and "<>" and "!=") Works just like "=". Hence, SET EXACT and SET COLLATE both have impact. If you want to find out if binary strings differ, you have to use the following line:
? NOT (test.Feld == CHR(2)+"ABC") .T.
Settings and other comparison operators
This raises the question what other functions are commands are affected by these settings. For one there's the not equal operator "#". It always returns the exact opposite of "=". So don't use "#" if you actually meant "NOT ==". Those are two different beasts.
Operators such as less than, greater than, less or equal than and greater or equal than behave like "=". They depend on the current setting of SET EXACT and SET COLLATE. The simplest statement you can make is that any of these returns .F. when "=" returns .T. For instance, "String " > "String" returns .F., because trailing blanks are ignored and therefore both strings are equal. The impact of SET EXACT is more difficult to grasp:
SET EXACT ON
? "String1" > "String" && .T.
SET EXACT OFF
? "String1" > "String" && .F.
Greater than (just as >=) behaves like "=". In the second expression both strings are considered to be equal. Therefore none of them could be greater than the other one. Less than behaves accordingly:
SET EXACT ON
? "String1" < "String" && .F.
SET EXACT OFF
? "String1" < "String" && .F.
The first expression returns .F., because "String1" is greater than "String". The second expression returns .F., because both strings are equal with SET EXACT OFF. Once more blanks add an extra level of confusion. Let's approach this systematically. If the expression on the left side contains more blanks than the right one, but is otherwise identical, we get .F. with all SET EXACT settings:
SET EXACT OFF
? "String " < "String" && .F.
? "String " > "String" && .F.
SET EXACT ON
? "String " < "String" && .F.
? "String " > "String" && .F.
Except for the last line that's what we expect. A string with an additional blank hardly can be smaller. With SET EXACT OFF it can't be greater, because blanks are ignored. However, even with SET EXACT ON it's not greater. That's because even with SET EXACT ON trailing blanks are ignored. Both strings are equal. Now let's look at the same constellation, this time with the blank being on the right side:
SET EXACT OFF
? "String" < "String " && .T.
? "String" > "String " && .F.
SET EXACT ON
? "String" < "String " && .F.
? "String" > "String " && .F.
Once again a string without blank can't be greater than one with blanks. The first expression is what we would expect since strings are compared up to length of the right side. Wouldn't you expect a .T. in the third line, though? Again we've been caught by the implicit RTRIM() rule in Visual FoxPro. With SET EXACT ON "String" and "String " are always identical, no matter in which order you write them. That's why "<" and ">" always return .F. if the difference between two strings is just the number of blanks at the end.
As you can see there's no exact string comparison available for less than, greater than, and so forth following the rules of SET EXACT ON. How can we determine that "String " is greater than "String"? As far as FoxPro is concerned, those strings are identical with SET EXACT ON. Hence, we check if both strings are identical. If that's the case we check if the left string is greater than the right string. This would mean that the left string is greater, because it must differ in only the number of blanks:
PROCEDURE IsGreater
PARAMETER cStr1, cStr2
IF SET("EXACT") = "ON"
IF cStr1 = cStr2
RETURN LEN(cStr1) > LEN(cStr2)
ELSE
RETURN cStr1 > cStr2
ENDIF
ELSE
RETURN cStr1 > cStr2
ENDIF
It's even more difficult with a binary comparison according to the rules of "==". We can't use "<" or ">" as they depend on the current SET COLLATE setting. To avoid them we first check for binary equality using "==". If both strings differ, we need to find out which of the two is greater. Next we check the length. The longer string must be the one that is greater, as well. If both strings have the same length and differ, we need to check the ANSI code of each character until we find a difference.
This solution leads to another problem. If you compare a string you need to know whether you are actually comparing characters (that is the ANSI code) or letters. With SET COLLATE TO "MACHINE" there's no difference between the two. For all other collate sequences there is a major difference:
SET COLLATE TO "GENERAL"
? "s" = "S" && .T.
? ASC("s") = ASC("S") && .F.
That's even more obvious when one letter corresponds to multiple characters. In the German language, for instance, the eszet character (ß) can also be written as "ss". The same applies to "Œ" and "OE" in other languages:
? "ß" = "ss" && .T.
? "oe" = "Œ" && .T.
How would you compare the ANSI code of Œ or ß with the ANSI code of two other characters? Letters (or logical characters) as FoxPro uses them in sort orders do not have an ANSI code. Of course, you can retrieve the ANSI code of such a character. But it's of limited use. In no case it's a valid representation of the letter.
More functions to compare
There are more functions in FoxPro that deal with comparing strings. You can divide them into two groups: a) Functions that compare the entire string, and b) functions that compare substrings. The first group follows the rules for "=", the second group those of "==".
An important function of the first group is INLIST(). You can always replace INLIST() with a series of individual string comparisons. For example,
INLIST(Var,"String1","String2")
is identical to
Var="String1" OR Var="String2"
This explains why INLIST() works the way it does and requires you to deal with any of the special behavior of "=". An extreme example is the following:
SET EXACT ON
SET COLLATE TO "GENERAL"
? INLIST( "ßt ", "SST" ) && .T.
Internally, "ß" is converted to "ss". Lower case letters are converted to upper case letters. RTRIM() is applied to the left hand string.
Another function that follows the same pattern is BETWEEN(). This function performs the equivalent of the following expression:
BETWEEN( Var, "String1", "String2" )
Var >= "String1" AND var <= "String2"
Again we can use the sample from above:
SET EXACT ON
SET COLLATE TO "GENERAL"
? BETWEEN( "ßb ", "SSA ", "SSC" ) && .T.
The third function I like to mention is ASCAN(). Internally, it's a loop that compares every array element with the search value. Once more the old example:
DIMENSION aTest[2]
aTest[1] = "SSA "
aTest[2] = "ßb "
? ASCAN(atest,"SSB") && returns 2
Be careful when the number of blanks or the interpretation of characters as letters is significant. If you really need an exact search such as the one performed by "==" with any of these functions, you need to write the code yourself. This is especially risky when you use primary keys consisting of binary characters. If you want to search an array of primary keys, make sure you use your own function. The same is true for queries that return a range of primary keys using BETWEEN. In doubt, try to find a different expression to avoid unwanted result records.
It's kind of frustrating that AT() and its operator "$" do not follow the same rules, but perform a binary comparison. This leads to the confusing situation that two strings are identical, but not contained in each other:
SET COLLATE TO "GENERAL"
? "String" = "string" && .T. ? "String" $ "string" && .F.
Keep this in mind when you search fields in a table using a non-MACHINE collate sequence. In this case you access the table with two different methods. "=", ">", "<", BETWEEN(), and so forth, respect the current collate sequences, whereas AT() and "$" only find records that physically contain the search expression. When dealing with SQL SELECT statements the same is true for LIKE or LIKE().
If you want to find all strings that contain the word "Straße" (street) no matter if it's written as "Strasse", "Straße" or "straße", you first have to adjust the case. Afterwards, you have to list all the same combinations that SET COLLATE uses. In our example, that's "STRASSE" and "STRAßE":
BROWSE FOR "STRASSE" $ UPPER(Feld) OR "STRAßE" $ UPPER(Feld)
Conclusion
Whenever you compare strings make sure you understand which of the various methods you actually need and which would cause incorrect results. The following list gives a few considerations:
- Can you ignore the remainder of the left string when the right string is shorter?
- Are blanks relevant?
- Do you compare text or binary characters?
- Is the case (lower case or upper case) relevant?
Make sure that in your application SET COLLATE and SET EXACT are set up according to the desired comparison method. If needed, replace FoxPro functions with your own functions, if the native one uses the wrong method.