LIKE, REGEXP, and SIMILAR TO search conditions

The REGEXP, LIKE, and SIMILAR TO search conditions are similar in that they all attempt to match a pattern to a string. Also, all three attempt to match an entire string, not a substring within the string.

The basic syntax for all three search conditions is similar:

expression search-condition pattern
LIKE, REGEXP, and SIMILAR TO: differences in pattern definition

REGEXP, LIKE, and SIMILAR TO search conditions differ in how you define pattern:

  • REGEXP supports a superset of regular expression syntax supported by SIMILAR TO. In addition, for compatibility with other products, the REGEXP search condition supports several syntax extensions. Also, REGEXP and SIMILAR TO have a different default escape character. REGEXP behavior matches closely with Perl 5 (except where Perl syntax and operators are not supported).

  • LIKE syntax for pattern is simple and supports a small set of wildcards, but does not support the full regular expression syntax.

  • SIMILAR TO syntax for pattern allows a robust pattern matching using the regular expression syntax defined in the ANSI/ISO SQL standard.

LIKE, REGEXP, and SIMILAR TO: differences in character comparisons

When performing comparisons, REGEXP behavior is different from LIKE and SIMILAR TO. For REGEXP comparisons, the database server uses code point values in the database character set for comparisons. This is consistent with other regular expression implementations such as Perl.

For LIKE and SIMILAR TO, the database server uses the equivalence and sort order in the database collation for comparisons. This is consistent with how the database evaluates comparison operators such as > and =.

The difference in character comparison methods means that results for matching and range evaluation for REGEXP and LIKE/SIMILAR differ as well.

  • Differences in matching   Since REGEXP uses code point values, it only matches a literal in a pattern if it is the exact same character. REGEXP matching is therefore not impacted by such things as datebase collation case-sensitivity or accent sensitivity. For example, 'A' could never be returned as a match for 'a'.

    Since LIKE and SIMILAR TO use the database collation, results are impacted by case- and accent-sensitivity when determining character equivalence. For example, if the database collation is case- and accent-insensitive, matches are case- and accent-insensitive. So, an 'A' could be returned as a match for 'a'.

  • Differences in range evaluation   Since REGEXP uses code points for range evaluation, a character is considered to be in the range if its code point value is equal to, or between, the code point values for the start and end of the range. For example, the comparison x REGEXP '[A-C]', for the single character x, is equivalent to CAST(x AS BINARY) >= CAST(A AS BINARY) AND CAST(x AS BINARY) <= CAST(C AS BINARY) .

    Since LIKE and SIMILAR TO use the collation sort order for range evaluation, a character is considered to be in the range if its position in the collation is the same as, or between, the position of the start and end characters for the range. For example, the comparison x SIMILAR TO '[A-C]' (where x is a single character) is equivalent to x >= A AND x <= C, and the comparison operators are evaluated using the collation sort ordering.

    The following table shows the set of characters included in the range '[A-C]' as evaluated by LIKE, SIMILAR TO, and REGEXP. Both databases use the 1252LATIN1 collation, but the first database is case-insensitive, while the second one is case sensitive.

    LIKE/SIMILAR TO '[A-C]' REGEXP '[A-C]'
    demo.db (case-insensitive) A,B,C,a,b,c,ª,À,Á,Â,Ã,Ä,Å,Æ,Ç,à,á,â,ã,ä,å,æ,ç A, B, C
    charsensitive.db (case-sensitive) A,B,C,b,c,À,Á,Â,Ã,Ä,Å,Æ,Ç,ç A, B, C

    The following can be observed in the results:

    • LIKE and SIMILAR TO include accented characters in the range.

    • LIKE and SIMILAR TO include different characters depending on database case-sensitivity. Specifically, they include any lower case letters found within the range, which you may not have anticipated when searching on a case-sensitive database.

      Similarly, on a case-insensitive database, some characters you assume would be included in the range are not. For example, SIMILAR TO '[a-c]' on a case-insensitve database includes a, A, b, B, c, but not C because C occurs after lower case c in the sort order.

    • REGEXP returns only A, B, C regardless of database case sensitivity. If you want the range to include lower case letters, you must add them to the range definition. For example, REGEXP '[a-cA-C]'.

    • the REGEXP set of characters does not change, regardless of database case-sensitivity.

    Even though your database uses a different collation, or has different case- or accent-sensitivity settings than the examples above, you can perform a similar test to see what is returned by LIKE, SIMILAR TO, or REGEXP by connecting to the database and executing any of these statements:

    SELECT CHAR( row_num ) FROM RowGenerator WHERE CHAR( row_num ) LIKE '[A-C]';
    SELECT CHAR( row_num ) FROM RowGenerator WHERE CHAR( row_num ) REGEXP '[A-C]';
    SELECT CHAR( row_num ) FROM RowGenerator WHERE CHAR( row_num ) SIMILAR TO '[A-C]';

See also

LIKE search condition
REGEXP search condition
SIMILAR TO search condition