The REGEXP, LIKE, and SIMILAR TO search conditions are similar in that they all attempt to match a pattern to a string. Also, all three attempt to match an entire string, not a substring within the string.
The basic syntax for all three search conditions is similar:
expression search-condition pattern
REGEXP supports a superset of regular expression syntax supported by SIMILAR TO. In addition, for compatibility with other products, the REGEXP search condition supports several syntax extensions. Also, REGEXP and SIMILAR TO have a different default escape character and process the characters underscore ( _ ), percent ( % ), and caret ( ^ ) differently. REGEXP behavior matches closely with Perl 5 (except where Perl syntax and operators are not supported).
LIKE syntax for pattern is simple and supports a small set of wildcards, but does not support the full regular expression syntax.
SIMILAR TO syntax for pattern allows a robust pattern matching using the regular expression syntax defined in the ANSI/ISO SQL standard.
When performing comparisons, REGEXP behavior is different from LIKE and SIMILAR TO. For REGEXP comparisons, the database server uses code point values in the database character set for comparisons. This is consistent with other regular expression implementations such as Perl.
For LIKE and SIMILAR TO, the database server uses the equivalence and sort order in the database collation for comparisons. This is consistent with how the database evaluates comparison operators such as > and =.
The difference in character comparison methods means that results for matching and range evaluation for REGEXP and LIKE/SIMILAR differ as well.
Differences in matching Since REGEXP uses code point values, it only matches a literal in a pattern if it is the exact same character. REGEXP matching is therefore not impacted by such things as database collation, case-sensitivity, or accent sensitivity. For example, 'A' could never be returned as a match for 'a'.
Since LIKE and SIMILAR TO use the database collation, results are impacted by case- and accent-sensitivity when determining character equivalence. For example, if the database collation is case- and accent-insensitive, matches are case- and accent-insensitive. So, an 'A' could be returned as a match for 'a'.
Differences in range evaluation
Since REGEXP uses code points for range evaluation, a character is considered to be in the range if its code point
value is equal to, or between, the code point values for the start and end of the range. For example, the comparison
x REGEXP '[A-C]', for the single character x, is equivalent to
CAST(x AS BINARY) >= CAST(A AS BINARY) AND CAST(x AS BINARY) <= CAST(C AS BINARY) .
Since LIKE and SIMILAR TO use the collation sort order for range evaluation, a character is considered to be in the range
if its position in the collation is the same as, or between, the position of the start and end characters for the range. For
example, the comparison
x SIMILAR TO '[A-C]' (where x is a single character) is equivalent to
x >= A AND x <= C, and the comparison operators are evaluated using the collation sort ordering.
The following table shows the set of characters included in the range
'[A-C]' as evaluated by LIKE, SIMILAR TO, and REGEXP. Both databases use the 1252LATIN1 collation, but the first database is case-insensitive,
while the second one is case sensitive.
|LIKE/SIMILAR TO '[A-C]'||REGEXP '[A-C]'|
|demo.db (case-insensitive)||A,B,C,a,b,c,ª,À,Á,Â,Ã,Ä,Å,Æ,Ç,à,á,â,ã,ä,å,æ,ç||A, B, C|
|charsensitive.db (case-sensitive)||A,B,C,b,c,À,Á,Â,Ã,Ä,Å,Æ,Ç,ç||A, B, C|
The following can be observed in the results:
LIKE and SIMILAR TO include accented characters in the range.
LIKE and SIMILAR TO include different characters depending on database case-sensitivity. Specifically, they include any lowercase letters found within the range, which you may not have anticipated when searching on a case-sensitive database.
Similarly, on a case-sensitive database, some characters included in the range might appear to be inconsistent. For example,
SIMILAR TO '[A-C]' on a case-sensitive database includes A, b, B, c, C but not a because a occurs before the uppercase A in the sort order.
REGEXP returns only A, B, C regardless of database case sensitivity. If you want the range to include lowercase letters, you
must add them to the range definition. For example,
the REGEXP set of characters does not change, regardless of database case-sensitivity.
Even though your database uses a different collation, or has different case- or accent-sensitivity settings than the examples above, you can perform a similar test to see what is returned by LIKE, SIMILAR TO, or REGEXP by connecting to the database and executing any of these statements:
SELECT CHAR( row_num ) FROM RowGenerator WHERE CHAR( row_num ) LIKE '[A-C]'; SELECT CHAR( row_num ) FROM RowGenerator WHERE CHAR( row_num ) REGEXP '[A-C]'; SELECT CHAR( row_num ) FROM RowGenerator WHERE CHAR( row_num ) SIMILAR TO '[A-C]';
Discuss this page in DocCommentXchange.
|Copyright © 2012, iAnywhere Solutions, Inc. - SQL Anywhere 12.0.1|