Regular expressions syntax

Regular expressions are supported with the SIMILAR TO, and REGEXP search conditions, and the REGEXP_SUBSTR function. For SIMILAR TO, regular expression syntax is consistent with the ANSI/ISO SQL standard. For REGEXP and REGEXP_SUBSTR, regular expression syntax and support is consistent with Perl 5.

Regular expressions are used by REGEXP and SIMILAR TO to match a string, whereas regular expressions are used by REGEXP_SUBSTR to match a substring. To achieve substring matching behavior for REGEXP and SIMILAR TO, you can specify wildcards on either side of the pattern you are trying to match. For example, REGEXP '.*car.*' matches car, carwash, and vicar. Or, you can rewrite your query to make use the REGEXP_SUBSTR function. See REGEXP_SUBSTR function [String].

Regular expression matching with SIMILAR TO is case- and accent-insensitive. REGEXP and REGEXP_SUBSTR is not affected by the database accent and case sensitivity. See LIKE, REGEXP, and SIMILAR TO: differences in character comparisons.

Regular expressions: Metacharacters

Metacharacters are symbols or characters that have a special meaning within a regular expression.

The treatment of metacharacters can vary depending on:

  • whether the regular expression is being used with the SIMILAR TO or REGEXP search conditions, or the REGEXP_SUBSTR function

  • whether the metacharater is inside of a character class in the regular expression

Before continuing, you should understand the definition of a character class. A character class is a set of characters enclosed in square brackets, against which characters in a string are matched. For example, in the syntax SIMILAR TO 'ab[1-9]', [1-9] is a character class and matches one digit in the range of 1 to 9, inclusive. The treatment of metacharacters in a regular expression can vary depending on whether the metacharater is placed inside a character class. Specifically, most metacharacters are handled as regular characters when positioned inside of a character class.

For SIMILAR TO (only), the metacharacters *, ?, +, _, |, (, ), { must be escaped within a character class.

To include a literal minus sign (-), caret (^), or right-angle bracket (]) character in a character class, it must be escaped.

The list of supported regular expression metacharacters is provided below. Almost all metacharacters are treated the same when used by SIMILAR TO, REGEXP, and REGEXP_SUBSTR:

Character Additional information
[ and ]

Left and right square brackets are used to specify a character class. A character class is a set of characters to match against.

With the exception of the hyphen (-) and the caret (^), metacharacters and quantifiers (such as * and {m}, respectively) specified within a character class have no special meaning and are evaluated as actual characters.

SQL Anywhere also supports sub-character classes such as POSIX character classes. See Regular expressions: Special sub-character classes.

* The asterisk can be used to match a character 0 or more times. For example, REGEXP '.*abc' matches a string that ends with abc, and starts with any prefix. So, aabc, xyzabc, and abc match, but bc and abcc do not.
? The question mark can be used to match a character 0 or 1 times. For example, 'colou?r' matches color and colour.
+ The plus sign can be used to match a character 1 or more times. For example, 'bre+' matches bre and bree, but not br.
-

A hyphen can be used within a character class to denote a range. For example, REGEXP '[a-e]' matches a, b, c, d, and e.

For details on how ranges are evaluated by REGEXP and SIMILAR TO, see LIKE, REGEXP, and SIMILAR TO: differences in character comparisons.

%

The percent sign can be used with SIMILAR TO to match any number of characters.

The percent sign is not considered a metacharacter for REGEXP and REGEXP_SUBSTR. When specified, it matches a percent sign (%).

_ (underscore character)

The underscore can be used with SIMILAR TO to match a single character.

The underscore is not considered a metacharacter for REGEXP and REGEXP_SUBSTR. When specified, it matches an underscore (_).

|

The pipe symbol is used to specify alternative patterns to use for matching the string. In a string of patterns separated by a vertical bar, the vertical bar is interpreted as an OR and matching stops at the first match made starting from the leftmost pattern. So, you should list the patterns in descending order of preference. You can specify an unlimited number of alternative patterns.

( and ) Left and right parenthesis are metacharacters when used for grouping parts of the regular expression. For example, (ab)* matches zero or more repetitions of ab. As with mathematical expressions, you use grouping to control the order in which the parts of a regular expression are evaluated.
{ and }

Left and right curly braces are metacharacters when used for specifying quantifiers. Quantifiers specify the number of times a pattern must repeat to constitute a match. For example:

  • {m}   Matches a character exactly m times. For example, '519-[0-9]{3}-[0-9]{4}' matches a phone number in the 519 area code (providing the data is formatted in the manner defined in the syntax).

  • {m,}   Matches a character at least m times. For example, '[0-9]{5,}' matches any string of five or more digits.

  • {m,n}   Matches a character at least m times, but not more than n times. For example, SIMILAR TO '_{5,10}' matches any string with between 5 and 10 (inclusive) characters.

\ The backslash is used as an escape character for metacharacters. It can also be used to escape non-metacharacters.
^

For REGEXP and REGEXP_SUBSTR, when a caret is outside a character class, the caret matches the start of a string. For example, '^[hc]at' matches hat and cat, but only at the beginning of the string.

When used inside a character class, the following behavior applies:

  • REGEXP and REGEXP_SUBSTR   When the care is the first character in a character class, it matches anything other than the characters in the character set. For example, REGEXP '[^abc]' matches any character other than a, b, or c.

    If the caret is not the first character inside the square brackets, it matches a caret. For example, REGEXP_SUBSTR '[a-e^c]' matches a, b, c, d, e, and ^.

  • SIMILAR TO   For SIMILAR TO, the caret is treated as a subtraction operator. For example, SIMILAR TO '[a-e^c]' matches a, b, d, and e.

$

When used with REGEXP and REGEXP_SUBSTR, matches the end of a string. For example, SIMILAR TO 'cat$' matches cat, but not catfish.

When used with SIMILAR TO, it matches a question mark.

.

When used with REGEXP and REGEXP_SUBSTR, matches any single character. For example, REGEXP 'a.cd' matches any string of four characters that starts with a and ends with cd.

When used with SIMILAR TO, matches a period (.).

:

The colon is used within a character set to specify a subcharacter class. For example, '[[:alnum:]]'.

Regular expressions: Special sub-character classes

Sub-character classes are special character classes embedded within a larger character class. In addition to custom character classes where you define the set of characters to match (for example, [abxq4] limits the set of matching characters to a, b, x, q, and 4), SQL Anywhere supports sub-character classes such as most of the POSIX character classes. For example, [[:alpha:]] represents the set of all upper- and lower-case letters.

The REGEXP search condition and the REGEXP_SUBSTR function support all the syntax conventions in the table below, but the SIMILAR TO search expression does not. Conventions supported by SIMILAR TO have a Y in the SIMILAR TO column.

In REGEXP and when using the REGEXP_SUBSTR function, sub-character classes can be negated using a caret. For example, [[:^alpha:]] matches the set of all characters except alpha characters.

Sub-character class Additional information SIMILAR TO
[:alpha:]

Matches upper- and lowercase alphabetic characters in the current collation. For example, '[0-9]{3}[[:alpha:]]{2}' matches three digits, followed by two letters.

Y
[:alnum:]

Match digits, and upper- and lowercase alphabetic characters in the current collation. For example, '[[:alnum:]]+' matches a string of one or more letters and numbers.

Y
[:digit:]

Match digits in the current collation. For example, '[[:digit:]-]+' matches a string of one or more digits or dashes. Likewise, '[^[:digit:]-]+' matches a string of one or more characters that are not digits or dashes.

Y
[:lower:]

Match lowercase alphabetic characters in the current collation. For example, '[[:lower:]]' does not match A because A is uppercase.

Y
[:space:]

Match a single blank (' '). For example, the following statement searches Contacts.City for any city with a two word name:

SELECT City 
FROM Contacts
WHERE City REGEXP '.*[[:space:]].*';
Y
[:upper:]

Match uppercase alphabetic characters in the current collation. For example, '[[:upper:]ab]' matches one of: any upper case letter, a, or b.

Y
[:whitespace:]

Match a whitespace character such as space, tab, formfeed, and carriage return.

Y
[:ascii:]

Match any seven-bit ASCII character (ordinal value between 0 and 127).

[:blank:]

Match a blank space, or a horizontal tab.

[[:blank:]] is equivalent to [ \t].

[:cntrl:]

Match ASCII characters with an ordinal value of less than 32, or character value 127 (control characters). Control characters include newline, form feed, backspace, and so on.

[:graph:]

Match printed characters.

[[:graph:]] is equivalent to [[:alnum:][:punct:]].

[:print:]

Match printed characters and spaces.

[[:print:]] is equivalent to [[:graph:][:whitespace:]].

[:punct:]

Match one of: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.

The [:punct:] sub-character class may not include non-ASCII punctuation characters available in the current collation.

[:word:]

Match alphabetic, digit, or underscore characters in the current collation.

[[:word:]] is equivalent to [[:alnum:]_].

[:xdigit:]

Match a character that is in the character class [0-9A-Fa-f].

Regular expressions: Other supported syntax conventions

The following syntax conventions are supported by the REGEXP search condition and the REGEXP_SUBSTR function, and they assume that the backslash is the escape character. These conventions are not supported by the SIMILAR TO search expression.

Regular expression syntax Name and meaning
\0xxx

Matches the character whose value is \0xxx, where xxx is any sequence of octal digits, and 0 is a zero. For example, \0134 matches a backslash.

\a

Matches the bell character.

\A

Used outside a character set to match the start of a string.

Equivalent to ^ used outside a character set.

\b

Matches a backspace character.

\B

Matches the backslash character (\).

\cX

Matches a named control character. For example, \cZ for ctrl-Z.

\d

Matches a digit in the current collation. For example, the following statement searches Contacts.Phone for all phone numbers that end with 00:

SELECT Surname, Surname, City, Phone
   FROM Contacts
   WHERE Phone REGEXP '\\d{8}00';

\d can be used both inside and outside character classes, and is equivalent to [[:digit:]].

\D

Matches anything that is not a digit. This is the opposite of \d.

\D can be used both inside and outside character classes, and is equivalent to [^[:digit:]].

Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s]. The latter matches any character that is not a digit or whitespace. So it matches x, but not 8. The former, however, matches any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, [\D\S] matches any character, digit, whitespace or otherwise.

\e

Matches the escape character.

\E

Ends the treatment of metacharacters as non-metacharacters, initiated by a \Q.

For a list of regular expression metacharacters, see Regular expressions: Metacharacters.

\f

Matches a form feed.

\n

Matches a new line.

\Q

Treat all metacharacters as non-metacharacters, until \E is encountered. For example, \Q[$\E is equivalent to \[\$.

For a list of regular expression metacharacters, see Regular expressions: Metacharacters.

\r

Matches a carriage return.

\s

Matches a space or a character treated as whitespace. For example, the following statement returns all product names from Products.ProductName that have at least one space in the name:

SELECT Name 
FROM Products
WHERE Name REGEXP '.*\\s.*'

\s can be used both inside and outside character classes, and is equivalent to [[:whitespace:]]. See Regular expressions: Special sub-character classes.

\S

Matches a non-whitespace character. This is the opposite of \s, and is equivalent to [^[:whitespace:]].

\S can be used both inside and outside character classes. See Regular expressions: Special sub-character classes.

Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s]. The latter matches any character that is not a digit or whitespace. So it matches x, but not 8. The former, however, matches any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, [\D\S] matches any character, digit, whitespace or otherwise.

\t

Matches a horizontal tab.

\v

Matches a vertical tab.

\w

Matches a alphabetic character, digit, or underscore in the current collation. For example, the following statement returns all surnames from Contacts.Surname that are exactly seven alpha-numeric characters in length:

SELECT Surname 
FROM Contacts 
WHERE Surname REGEXP '\\w{7}';

\w can be used both inside and outside character classes. See Regular expressions: Special sub-character classes.

Equivalent to [[:alnum:]_]..

\W

Matches anything that is not an alphabetic character, digit, or underscore in the current collation. This is the opposite of \w, and is equivalent to [^[:alnum:]_].

This regular expression can be used both inside and outside character classes. See Regular expressions: Special sub-character classes.

\xhh

Matches the character whose value is 0xhh, where hh is, at most, two hex digits. For example, \x2D is equivalent to a hyphen.

Equivalent to \x{hh}.

\x{hhh}

Matches the character whose value is 0xhhh, where hhh is, at most, eight hex digits.

\z and \Z

Matches the position (not character) at the end of the string.

Equivalent to $.

Regular expressions: Assertions

Assertions test whether a condition is true, and affect the position in the string where matching begins. Assertions do not return characters; the assertion pattern is not included in the final match. These assertions are supported by the REGEXP search condition and the REGEXP_SUBSTR function. These conventions are not supported by the SIMILAR TO search expression.

Lookahead and lookbehind assertions can be useful with REGEXP_SUBSTR when trying to split a string. For example, you can return the list of street names (without the street numbers) in the Address column of the Customers table by executing the following statement:

SELECT REGEXP_SUBSTR( Street, '(?<=^\\S+\\s+).*$' ) 
FROM Customers;

Another example is if you want to use a regular expression to verify that a password conforms to certain rules. You could use a zero width assertion similar to the following:

IF password REGEXP '(?=.*[[:digit:]])(?=.*[[:alpha:]].*[[:alpha:]])[[:word:]]{4,12}' 
   MESSAGE 'Password conforms' TO CLIENT;
ELSE
   MESSAGE 'Password does not conform' TO CLIENT;
END IF

The password is valid when the following are true:

  • password has at least one digit (zero width positive assertion with [[:digit:]])

  • password has at least two alphabetic characters (zero width positive assertion with [[:alpha:]].*[[:alpha:]])

  • password contains only alpha-numeric or underscore characters ([[:word:]])

  • password is at least 4 characters, and at most 12 characters ({4,12})

The following table contains the assertions supported by SQL Anywhere:

Syntax Meaning
(?=pattern)

Positive lookahead zero-width assertion   Looks to see if the current position in the string is immediately followed by an occurrence of pattern, without pattern becoming part of the match string. 'A(?=B)' matches an A that is followed by a B, without making the B part of the match.

For example, SELECT REGEXP_SUBSTR( 'in new york city', 'new(?=\\syork)'); returns the substring new since it is immediately followed by ' york' (note the space before york).

(?!pattern)

Negative lookahead zero-width assertions   Looks to see if the current position in the string is not immediately followed by an occurrence of pattern, without pattern becoming part of the match string. So, 'A(?!B)' matches an A that is not followed by a B.

For example,SELECT REGEXP_SUBSTR('new jersey', 'new(?!\\syork)'); returns the substring new.

(?<=pattern)

Positive lookbehind zero-width assertions   Looks to see if the current position in the string is immediately preceded by an occurrence of pattern, without pattern becoming part of the match string. So, '(?<=A)B' matches a B that is immediately preceded by an A, without making A part of the match.

For example, SELECT REGEXP_SUBSTR('new york', '(?<=new\\s)york'); returns the substring york.

(?<!pattern)

Negative lookbehind zero-width assertions   Looks to see if the current position in the string is not immediately preceded by an occurrence of pattern, without pattern becoming part of the match string.

For example, SELECT REGEXP_SUBSTR('about york', '(?<!new\\s)york'); returns the substring york.

(?>pattern)

Possessive local subexpression   Matches only the largest prefix of the remaining string that matches pattern.

For example, in 'aa' REGEXP '(?>a*)a', (?>a*) matches (and consumes) the aa, and never just the leading a. As a result, 'aa' REGEXP '(?>a*)a' evaluates to false.

(?:pattern)

Non-capturing block   This is functionally equivalent to just pattern, and is provided for compatibility.

For example, in 'bb' REGEXP '(?:b*)b', (?:b*) matches (and consumes) the bb. However, unlike possessive local subexpression, the last b in bb is given up to allow the whole match to succeed (that is, to allow the matching to the b found outside the non-capturing block).

Likewise, 'a(?:bc|b)c' matches abcc, and abc. In matching abc, backtracking on the final c in bc takes place so that the c outside the group can be used to make the match successful.

(?#text)

Used for comments. The content of text is ignored.

See also