Regular expression syntax

Regular expression syntax varies from product to product. SQL Anywhere syntax for regular expressions matches closely the Perl 5 syntax, but also supports some conventions found in other products. Use the following section to understand the full list of syntax conventions supported by SQL Anywhere.

Depending on what you are searching for, and how your data is structured, there can be several ways to construct the regular expression you need for matching. The syntax examples in the next sections are very basic, and are designed to show the syntax in simple context. In many cases, there will be several ways you can construct the expression to return the same result.

Regular expression matching is case insensitive on case-insensitive databases.

Note

The REGEXP, and SIMILAR TO search conditions attempt to match an entire string, not a substring within the string. For example, 'car' matches only car, not carwash or vicar. However, you could specify wildcards on either side of the pattern you are trying to match, to achieve substring matching behavior. For example, '%car%' matches car, as well as carwash, and vicar. Alternatively, you can rewrite your query to make use the REGEXP_SUBSTR function, which is designed to search for substrings within a string. See REGEXP_SUBSTR function [String].

Metacharacters used for regular expression syntax

Metacharacters are symbols or characters that have a special meaning within a regular expression. The list of regular expression metacharacters are:

Regular expressions: Wildcards and quantifiers

Wildcards allow you to define the type of characters to look for, without specifying what the character should be. For example, the underscore character (_) matches any single character.

Quantifiers specify the number of times a pattern must repeat to constitute a match. For example, .*abc means "find a pattern that ends with abc, and starts with any character repeated any number of times". Consequently, aabc, xyzabc, and abc match, but bc and abcc do not.

The REGEXP search condition and the REGEXP_SUBSTR function support all of the syntax conventions in the table below, but the SIMILAR TO search expression does not. Conventions supported by SIMILAR TO have a Y in the SIMILAR TO column.

Character Additional information SIMILAR TO
% Matches any number of characters (including no characters). Y
. Matches any single character (equivalent to '_'). For example, 'a.cd' matches any string of four characters that starts with a and ends with cd.
_ (underscore character) Matches any single character. For example, 's_t' matches any string of three characters that starts with s and ends with t. Y
( ) Used for grouping parts of the regular expression so that you can apply quantifiers. For example, (ab)* will match zero or more repetitions of ab. As with mathematical expressions, you can use grouping to control the order in which the parts of a regular expression are evaluated. Y
[ ]

Matches a single character against the enclosed set of characters. Square brackets around a set of characters constitute a character class. Wildcards and quantifiers have no special meaning within the brackets.

You can use a hyphen in a character class to denote a range. For example, '[a-ce]' matches a, b, c, and e. Note that the position in the range depends on the collation ordering.

To include a literal minus sign (-), caret (^), or right-angle bracket (]) character in the matching set of characters, it must be escaped.

For SIMILAR TO (but not REGEXP or REGEXP_SUBSTR function), the meta characters *, ?, +, _, |, (, ), { must be escaped within a character class.

When a caret as the first character inside of a bracket expression, it matches a single character not listed in the brackets. For example, '[^a-c]' matches any character except a, b, or c.

If a caret is between characters in a character set, it matches anything to the left of the caret, minus those in the set on the right side of the caret. For example, [a-e^b] matches a, c, d, and e.

SQL Anywhere also supports sub-character classes such as POSIX character classes. See Regular expressions: Special sub-character classes.

Y
| Matches against the expressions on either side of the vertical bar (also referred to as alternation). The vertical bar is interpreted as an OR, and matching stops at the first match made starting from the leftmost pattern. Consequently, you should list the patterns in descending order of preference. You can specify an unlimited number of alternative patterns. Y
* Matches a character 0 or more times. For example, 'bre*' matches br, bre, and bree. Y
? Matches a character 0 or 1 times. For example, 'colou?r' matches color and colour. Y
+ Matches a character 1 or more times. For example, 'bre+' matches bre and bree, but not br. Y
{m} Matches a character exactly m times. For example, '519-[0-9]{3}-[0-9]{4}' matches a phone number in the 519 area code (providing the data is formatted in the manner defined in the syntax). Y
{m, } Matches a character at least m times. For example, '[0-9]{5,}' matches any string of five or more numbers. Y
{m, n} Matches a character at least m times, but not more than n times. For example, '_{5,10}' matches any string with between 5 and 10 (inclusive) characters. Y
Regular expressions: Assertions

Assertions test whether a condition is true, and affect the position in the string where matching begins. Assertions do not return characters, in that the assertion pattern is not included in the final match. These assertions are supported by the REGEXP search condition and the REGEXP_SUBSTR function. These conventions are not supported by the SIMILAR TO search expression.

Lookahead and lookbehind assertions can be useful with REGEXP_SUBSTR when trying to split a string. For example, you can return the list of street names (without the street numbers) in the Address column of the Customers table by executing the following statement:

SELECT REGEXP_SUBSTR( Street, '(?<=^\\S+\\s+).*$' ) 
FROM Customers;

Another example is if you want to use a regular expression to verify that a password conforms to certain rules. You could use a zero width assertion similar to the following:

IF password REGEXP '(?=%[[:digit:]])(?=%[[:alpha:]]%[[:alpha:]])[[:word:]]{4,12}' 
   MESSAGE 'Password conforms' TO CLIENT;
ELSE
   MESSAGE 'Password does not conform' TO CLIENT;
END IF

The password is valid if all of the following are true:

  • password has at least one digit (zero width positive assertion with [[:digit:]])
  • password has at least two alphabetic characters (zero width positive assertion with [[:alpha:]]%[[:alpha:]])
  • password contains only alpha-numeric or underscore characters ([[:word:]])
  • password is at least 4 characters, and at most 12 characters ({4,12})

The following table contains the assertions supported by SQL Anywhere:

Syntax Meaning
(?=pattern)

Positive lookahead zero-width assertion   Looks to see if the current position in the string is immediately followed by an occurrence of pattern, without pattern becoming part of the match string. 'A(?=B)' matches an A that is followed by a B, without making the B part of the match.

For example, SELECT REGEXP_SUBSTR( 'in new york city', 'new(?=\\syork)'); returns the substring new since it is immediately followed by ' york' (note the space before york).

(?!pattern)

Negative lookahead zero-width assertions   Looks to see if the current position in the string is not immediately followed by an occurrence of pattern, without pattern becoming part of the match string. So, 'A(?!B)' matches an A that is not followed by a B.

For example,SELECT REGEXP_SUBSTR('new jersey', 'new(?!\\syork)'); returns the substring new.

(?<=pattern)

Positive lookbehind zero-width assertions   Looks to see if the current position in the string is immediately preceded by an occurrence of pattern, without pattern becoming part of the match string. So, '(?<=A)B' matches a B that is immediately preceded by an A, without making A part of the match.

For example, SELECT REGEXP_SUBSTR('new york', '(?<=new\\s)york'); returns the substring york.

(?<!pattern)

Negative lookbehind zero-width assertions   Looks to see if the current position in the string is not immediately preceded by an occurrence of pattern, without pattern becoming part of the match string.

For example, SELECT REGEXP_SUBSTR('about york', '(?<!new\\s)york'); returns the substring york.

(?>pattern)

Possessive local subexpression   Matches only the largest prefix of the remaining string that matches pattern.

For example, in 'aa' REGEXP '(?>a*)a', (?>a*) matches (and consumes) the aa, and never just the leading a. As a result, 'aa' REGEXP '(?>a*)a' evaluates to false.

(?:pattern)

Non-capturing block   This is functionally equivalent to just pattern, and is provided for compatibility.

For example, in 'bb' REGEXP '(?:b*)b', (?:b*) matches (and consumes) the bb. However, unlike possessive local subexpression, the last b in bb is given up to allow the whole match to succeed (that is, to allow the matching to the b found outside of the non-capturing block).

Likewise, 'a(?:bc|b)c' matches abcc, and abc. In matching abc, backtracking on the final c in bc takes place so that the c outside of the group can be used to make the match successful.

(?#text)

Used for comments. The content of text is ignored.

Regular expressions: Special sub-character classes

Character classes are a set of characters, enclosed in square brackets, against which characters in a string are matched. Sub-character classes are special character classes embedded within a larger character class. In addition to custom character classes where you define the set of characters to match (for example, [abxq4] limits the set of matching characters to a, b, x, q, and 4), SQL Anywhere supports sub-character classes such as most of the POSIX character classes. For example, [[:alpha:]] represents the set of all upper- and lower-case letters.

The REGEXP search condition and the REGEXP_SUBSTR function support all of the syntax conventions in the table below, but the SIMILAR TO search expression does not. Conventions supported by SIMILAR TO have a Y in the SIMILAR TO column.

In REGEXP and when using the REGEXP_SUBSTR function, sub-character classes can be negated using a caret. For example, [[:^alpha:]] matches the set of all characters except alpha characters.

Sub-character class Additional information SIMILAR TO
[:alpha:]

Matches upper- and lowercase alphabetic characters in the current collation. For example, '[0-9]{3}[[:alpha:]]{2}' matches three digits, followed by two letters.

Y
[:alnum:]

Match digits, and upper- and lowercase alphabetic characters in the current collation. For example, '[[:alnum:]]+' matches a string of one or more letters and numbers.

Y
[:digit:]

Match digits in the current collation. For example, '[[:digit:]-]+' matches a string of one or more digits or dashes. Likewise, '[^[:digit:]-]+' matches a string of one or more characters that are not digits or dashes.

Y
[:lower:]

Match lowercase alphabetic characters in the current collation. For example, '[[:lower:]]' does not match A in a case-sensitive database because A is uppercase.

Equivalent to [:alpha:] in a case-insensitive database.

Y
[:space:]

Match a single blank (' '). For example, the following statement searches Contacts.City for any city with a two word name:

SELECT City 
FROM Contacts
WHERE City REGEXP '%[[:space:]]%';
Y
[:upper:]

Match uppercase alphabetic characters in the current collation. For example, '[[:upper:]ab]' matches A, C, and b on a case-insensitive database.

Equivalent to [:alpha:] in a case-insensitive database.

Y
[:whitespace:]

Match a whitespace character such as space, tab, formfeed, and carriage return.

Y
[:ascii:]

Match any seven-bit ASCII character (ordinal value between 0 and 127).

[:blank:]

Match a blank space, or a horizontal tab.

[[:blank:]] is equivalent to [ \t].

[:cntrl:]

Match ASCII characters with an ordinal value of less than 32, or character value 127 (control characters). Control characters include newline, form feed, backspace, and so on.

[:graph:]

Match printed characters.

[[:graph:]] is equivalent to [[:alnum:][:punct:]].

[:print:]

Match printed characters and spaces.

[[:print:]] is equivalent to [[:graph:][:whitespace:]].

[:punct:]

Match one of: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.

The [:punct:] sub-character class may not include non-ASCII punctuation characters available in the current collation.

[:word:]

Match alphabetic, digit, or underscore characters in the current collation.

[[:word:]] is equivalent to [[:alnum:]_].

[:xdigit:]

Match a character that is both ASCII and in the character class [0-9A-Fa-f].

[[:xdigit:]] is equivalent to (?=[[:ascii:]])[0-9A-Fa-f].

Regular expressions: Other supported syntaxes

The following syntax conventions are supported by the REGEXP search condition and the REGEXP_SUBSTR function. These conventions are not supported by the SIMILAR TO search expression.

Regular expression syntax Name and meaning
^

When the caret is outside of a character set, the caret matches the start of a string. For example, '^[hc]at' matches hat and cat, but only at the beginning of the string.

A caret used inside a character class denotes negation or subtraction. See the description of angle brackets ([]) for character sets for information on the use of the caret in character sets: Regular expressions: Wildcards and quantifiers.

$

Matches the end of a string. For example, cat$ matches cat, but not catfish.

\0xxx

Matches the character whose value is \0xxx, where xxx is any sequence of octal digits, and 0 is a zero. For example, \0134 matches a backslash.

\a

Matches the bell character.

\A

Used outside of a character set to match the start of a string.

Equivalent to ^ used outside of a character set.

\b

Matches a backspace character.

\B

Matches the backslash character (\).

\cX

Matches a named control character. For example, \cZ for ctrl-Z.

\d

Matches a digit in the current collation. For example, the following statement searches Contacts.Phone for all phone numbers that end with 00:

SELECT Surname, Surname, City, Phone
   FROM Contacts
   WHERE Phone REGEXP '\\d{8}00';

\d can be used both inside and outside of character classes, and is equivalent to [[:digit:]].

\D

Matches anything that is not a digit. This is the opposite of \d.

\D can be used both inside and outside of character classes, and is equivalent to [[^:digit:]].

Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s]. The latter matches any character that is not a digit or whitespace. So it matches x, but not 8. The former, however, matches any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, [\D\S] matches any character, digit, whitespace or otherwise.

\e

Matches the escape character.

\E

Ends the treatment of metacharacters as non-metacharacters, initiated by a \Q.

For a list of regular expression metacharacters, see Metacharacters used for regular expression syntax.

\f

Matches a form feed.

\n

Matches a new line.

\Q

Treat all metacharacters as non-metacharacters, until \E is encountered. For example, \Q[$\E is equivalent to \[\$.

For a list of regular expression metacharacters, see Metacharacters used for regular expression syntax.

\r

Matches a carriage return.

\s

Matches a space or a character treated as whitespace. For example, the following statement returns all product names from Products.ProductName that have at least one space in the name:

SELECT Name 
FROM Products
WHERE Name REGEXP '%\\s%'

\s can be used both inside and outside of character classes, and is equivalent to [[:whitespace:]]. See Regular expressions: Special sub-character classes.

\S

Matches a non-whitespace character. This is the opposite of \s, and is equivalent to [^[:whitespace:]].

\S can be used both inside and outside of character classes. See Regular expressions: Special sub-character classes.

Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s]. The latter matches any character that is not a digit or whitespace. So it matches x, but not 8. The former, however, matches any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, [\D\S] matches any character, digit, whitespace or otherwise.

\t

Matches a horizontal tab.

\v

Matches a vertical tab.

\w

Matches a alphabetic character, digit, or underscore in the current collation. For example, the following statement returns all surnames from Contacts.Surname that are exactly seven alpha-numeric characters in length:

SELECT Surname 
FROM Contacts 
WHERE Surname REGEXP '\\w{7}';

\w can be used both inside and outside of character classes. See Regular expressions: Special sub-character classes.

Equivalent to [[:alnum:]].

\W

Matches anything that is not an alphabetic character, digit, or underscore in the current collation. This is the opposite of \w, and is equivalent to [^[:alnum:]_].

This regular expression can be used both inside and outside of character classes. See Regular expressions: Special sub-character classes.

\xhh

Matches the character whose value is 0xhh, where hh is, at most, two hex digits. For example, \x2D is equivalent to a hyphen.

Equivalent to \x{hh}.

\x{hhh}

Matches the character whose value is 0xhhh, where hhh is, at most, eight hex digits.

\z and \Z

Matches the position (not character) at the end of the string.

Equivalent to $.

See also