Text configuration object settings

SQL Anywhere provides two default text configuration objects, default_char for use with CHAR data, and default_nchar for use with NCHAR and CHAR data. Note that while default_nchar can be used with any data, character set conversion will be performed. For information about the settings for default_char and default_nchar text configuration objects, see Default text configuration objects.

You can test how a text configuration object affects term breaking using the sa_char_terms and sa_nchar_terms system procedures. See sa_char_terms system procedure, and sa_nchar_terms system procedure.

The following table explains text configuration object settings and how they impact what is indexed and how a full text search query is interpreted for text indexes using the text configuration. For examples, see Example text configuration objects.

External prefilter algorithm (PREFILTER) Prefiltering is the process of extracting text data from file types such as Word, PDF, HTML, and XML. In the context of text indexing, prefiltering allows you to extract only the data you want indexed, and avoid indexing unnecessary content such as HTML tags. For certain types of documents (for example, Microsoft Word documents), prefiltering is required to make full text indexes useful.

SQL Anywhere does not provide a built-in prefilter feature. However, you can create an external prefilter library to perform prefiltering according to your requirements, and then alter your text configuration object to point to it.

The following table explains the impact that the value of PREFILTER EXTERNAL NAME has on text indexing and on how query strings are handled:

Index type	Text indexes	Query strings
GENERIC and NGRAM	An external prefilter takes an input value (a document) and filters it according to the rules specified by the prefilter library. The resulting text is then passed to the term breaker before building or updating the text index.	Query strings are not passed through a prefilter, so the setting of the PREFILTER EXTERNAL NAME clause has no impact on query strings.

The ExternalLibrariesFullText directory in your SQL Anywhere install contains prefilter and term breaker sample code for you to explore. This directory is found under your Samples directory. For the location of your Samples directory, see Samples directory.
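To make the idea of prefiltering concrete, the following Python sketch strips HTML markup so that only the text content would reach the term breaker. It is a conceptual illustration only: a real external prefilter is a compiled library referenced by PREFILTER EXTERNAL NAME, and the names TagStrippingPrefilter and prefilter_html are hypothetical.

```python
from html.parser import HTMLParser


class TagStrippingPrefilter(HTMLParser):
    """Hypothetical prefilter: keeps text content and discards the markup."""

    def __init__(self):
        super().__init__()
        self._chunks = []

    def handle_data(self, data):
        # Called for the text found between tags; the tags themselves are dropped.
        self._chunks.append(data)

    def text(self):
        # Join the collected chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def prefilter_html(document: str) -> str:
    """Return only the text that should be handed to the term breaker."""
    parser = TagStrippingPrefilter()
    parser.feed(document)
    parser.close()
    return parser.text()


html_doc = "<html><body><h1>Red table</h1><p>A <b>sturdy</b> table.</p></body></html>"
print(prefilter_html(html_doc))  # Red table A sturdy table.
```

Without this step, tag names such as html, body, and h1 would be broken into terms and indexed alongside the document text.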

Term breaker algorithm (TERM BREAKER) The TERM BREAKER setting specifies the algorithm to use for breaking strings into terms. The choices are GENERIC for storing terms, or NGRAM for storing n-grams. For GENERIC, you can use the built-in term breaker algorithm, or an external term breaker. See TERM BREAKER clause, ALTER TEXT CONFIGURATION statement.

The following table explains the impact that the value of TERM BREAKER has on text indexing and on how query strings are handled:

Text indexes

GENERIC text index: Performance of GENERIC text indexes can be faster than NGRAM text indexes. However, you cannot perform fuzzy searches on GENERIC text indexes.

When building a GENERIC text index using the built-in algorithm, groups of alphanumeric characters appearing between non-alphanumeric characters are processed as terms by the database server, and have positions assigned to them.

When building a GENERIC text index using a term breaker external library, terms and their positions are defined by the external library.

Once the terms have been identified by the term breaker, any term that exceeds the term length restrictions or that is found in the stoplist is counted but not inserted in the text index.

NGRAM text index: An n-gram is a group of characters of length n, where n is the value of MAXIMUM TERM LENGTH.

When building an NGRAM text index, the database server treats as a term any group of alphanumeric characters between non-alphanumeric characters. Once the terms are defined, the database server breaks the terms into n-grams. In doing so, terms shorter than n, and n-grams that are in the stoplist, are discarded.

For example, for an NGRAM text index with MAXIMUM TERM LENGTH 3, the string 'my red table' is represented in the text index as the following n-grams: red tab abl ble. The term my is discarded because it is shorter than 3 characters.

For n-grams, the positional information of the n-grams is stored, not the positional information for the original terms.

Query strings

When parsing a CONTAINS query, the database server extracts keywords and special characters from the query string and then applies the term breaker algorithm to the remaining terms. For example, if the query string is 'ab_cd* AND b*', the * and the keyword AND are extracted, and the character strings ab_cd and b are given to the term breaker algorithm to parse separately.

For more information about keywords and special characters in full text search, see CONTAINS search condition.

GENERIC text index: When querying a GENERIC text index, terms in the query string are processed in the same manner as if they were being indexed. Matching is performed by comparing query terms to terms in the text index.

NGRAM text index: When querying an NGRAM text index, terms in the query string are processed in the same manner as if they were being indexed. Matching is performed by comparing n-grams from the query terms to n-grams from the indexed terms. For information about how prefix searching is performed in NGRAM text indexes, see Prefix searching.

If not defined, the default for TERM BREAKER is taken from the setting in the default text configuration object. If a term breaker is not defined in the default text configuration object, the internal term breaker is used. See Default text configuration objects.
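The term breaking rules described for the built-in GENERIC and NGRAM algorithms can be approximated with a short Python sketch. This is not the database server's implementation. It assumes ASCII alphanumerics only and 1-based positions, and the function names generic_terms and ngram_terms are hypothetical. To see what the server actually produces, use the sa_char_terms or sa_nchar_terms system procedure.

```python
import re

# Assumption: only ASCII letters and digits count as alphanumeric here.
TERM_PATTERN = re.compile(r"[A-Za-z0-9]+")


def generic_terms(text):
    """Return (term, position) pairs for runs of alphanumeric characters."""
    return [(m.group(0), pos)
            for pos, m in enumerate(TERM_PATTERN.finditer(text), start=1)]


def ngram_terms(text, n):
    """Break each term into overlapping n-grams, discarding terms shorter than n."""
    grams = []
    for term, _position in generic_terms(text):
        if len(term) < n:
            continue  # terms shorter than n produce no n-grams
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams


print(generic_terms("my red table"))   # [('my', 1), ('red', 2), ('table', 3)]
print(ngram_terms("my red table", 3))  # ['red', 'tab', 'abl', 'ble']
```

The second call reproduces the 'my red table' example with MAXIMUM TERM LENGTH 3. Stoplist handling is left out of this sketch.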

Minimum term length setting (MINIMUM TERM LENGTH) The MINIMUM TERM LENGTH setting specifies the minimum length, in characters, for terms inserted in the index or searched for in a full text query. MINIMUM TERM LENGTH is not relevant for NGRAM text indexes.

MINIMUM TERM LENGTH has special implications on prefix searching. See Prefix searching.

The value of MINIMUM TERM LENGTH must be greater than 0. If you set it higher than MAXIMUM TERM LENGTH, then MAXIMUM TERM LENGTH is automatically adjusted to be equal to MINIMUM TERM LENGTH.

If not defined, the default for MINIMUM TERM LENGTH is taken from the setting in the default text configuration object, which is typically 1. See Default text configuration objects.

The following table explains the impact that the value of MINIMUM TERM LENGTH has on text indexing and on how query strings are handled:

Index type	Text indexes	Query strings
GENERIC	The text index will not contain words shorter than MINIMUM TERM LENGTH.	Query terms shorter than MINIMUM TERM LENGTH are ignored because they cannot exist in the text index.
NGRAM	This setting is ignored.	The MINIMUM TERM LENGTH setting has no impact on full text queries on NGRAM text indexes.

Maximum term length setting (MAXIMUM TERM LENGTH) The MAXIMUM TERM LENGTH setting is used differently depending on the term breaker algorithm.

The value of MAXIMUM TERM LENGTH must be less than or equal to 60. If you set it lower than the MINIMUM TERM LENGTH, then MINIMUM TERM LENGTH is automatically adjusted to be equal to MAXIMUM TERM LENGTH.

If not defined, the default for MAXIMUM TERM LENGTH is taken from the setting in the default text configuration object, which is typically 20. See Default text configuration objects.
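The way MINIMUM TERM LENGTH and MAXIMUM TERM LENGTH adjust each other can be modeled with a small Python sketch. The class TermLengthSettings is hypothetical and only mirrors the rules stated here. Rejecting a minimum above 60 or a maximum below 1 is an assumption inferred from the stated limits, not documented behavior.

```python
from dataclasses import dataclass

MAX_ALLOWED = 60  # upper bound stated for MAXIMUM TERM LENGTH


@dataclass
class TermLengthSettings:
    minimum: int = 1   # typical default described in the text
    maximum: int = 20  # typical default described in the text

    def set_minimum(self, value: int) -> None:
        # The text requires a value greater than 0; the upper bound is an assumption.
        if not 1 <= value <= MAX_ALLOWED:
            raise ValueError("MINIMUM TERM LENGTH must be between 1 and 60")
        self.minimum = value
        # A minimum above the maximum pulls the maximum up to match.
        self.maximum = max(self.maximum, value)

    def set_maximum(self, value: int) -> None:
        # The text requires a value of 60 or less; the lower bound is an assumption.
        if not 1 <= value <= MAX_ALLOWED:
            raise ValueError("MAXIMUM TERM LENGTH must be between 1 and 60")
        self.maximum = value
        # A maximum below the minimum pulls the minimum down to match.
        self.minimum = min(self.minimum, value)


settings = TermLengthSettings()
settings.set_minimum(25)
print(settings)  # TermLengthSettings(minimum=25, maximum=25)
settings.set_maximum(4)
print(settings)  # TermLengthSettings(minimum=4, maximum=4)
```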

The following table explains the impact that the value of MAXIMUM TERM LENGTH has on text indexing and on how query strings are handled:

Index type	Text indexes	Query strings
GENERIC	MAXIMUM TERM LENGTH specifies the maximum length, in characters, for terms inserted in the text index.	Query terms longer than MAXIMUM TERM LENGTH are ignored because they cannot exist in the text index.
NGRAM	MAXIMUM TERM LENGTH determines the length of the n-grams that terms are broken into. An appropriate choice of length for n-grams depends on the language. Typical values are 4 or 5 characters for English, and 2 or 3 characters for Chinese.	Query terms are broken into n-grams of length n, where n is the same as MAXIMUM TERM LENGTH. Then, the database server uses the n-grams to search the text index. Terms shorter than MAXIMUM TERM LENGTH are ignored because they will not match the n-grams in the text index. Therefore, proximity searches do not work unless arguments are prefixes of length n.
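The effect of the two length settings on indexing and querying can be sketched in Python as follows. This is a simplified model, not the server's algorithm: it assumes ASCII alphanumeric term breaking, and the function names generic_index_terms and ngram_query_grams are hypothetical.

```python
import re


def break_terms(text):
    """Split text into runs of ASCII alphanumeric characters (a simplification)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def generic_index_terms(text, min_len, max_len):
    """GENERIC: keep only terms whose length lies within the configured bounds."""
    return [t for t in break_terms(text) if min_len <= len(t) <= max_len]


def ngram_query_grams(query_term, n):
    """NGRAM: break a query term into n-grams; terms shorter than n yield nothing."""
    if len(query_term) < n:
        return []
    return [query_term[i:i + n] for i in range(len(query_term) - n + 1)]


text = "a red table of extraordinarily long words"
print(generic_index_terms(text, min_len=2, max_len=10))
# ['red', 'table', 'of', 'long', 'words']

print(ngram_query_grams("my", 3))     # []
print(ngram_query_grams("table", 3))  # ['tab', 'abl', 'ble']
```

In the GENERIC case, the one-character term a and the 15-character term extraordinarily fall outside the bounds and are not kept. In the NGRAM case, a query term shorter than n produces no n-grams and so cannot match anything.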

Stoplist setting (STOPLIST) The stoplist setting specifies the terms that must not be indexed.

If not defined, the default for this setting is taken from the setting in the default text configuration object, which typically has an empty stoplist. See Default text configuration objects.

Index type	STOPLIST impact on text index	STOPLIST impact on query terms
GENERIC	Terms that are in the stoplist are not inserted into the text index.	Query terms that are in the stoplist are ignored because they cannot exist in the text index.
NGRAM	The text index will not contain the n-grams formed from the terms in the stoplist.	Terms in the stoplist are broken into n-grams and the n-grams are used for the term filtering. Likewise, query terms are broken into n-grams and any that match n-grams in the stoplist are dropped because they cannot exist in the text index.

The settings in the text configuration object are applied to the stoplist when it is parsed. That is, the specified term breaker and the minimum and maximum term length settings are applied.

Stoplists in NGRAM text indexes can cause unexpected results because the stoplist is stored in n-gram form, and not the stoplist terms you specified. For example, in an NGRAM text index where MAXIMUM TERM LENGTH is 3, if you specify STOPLIST 'there', the following n-grams are stored as the stoplist: the her ere. This impacts the ability to query for any terms that contain the n-grams the, her, and ere.
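The side effect of an NGRAM stoplist can be illustrated with a Python sketch. It models only the behavior described above, assumes ASCII alphanumeric term breaking, and uses hypothetical function names (ngrams, surviving_grams).

```python
import re


def ngrams(text, n):
    """Break each alphanumeric term into n-grams, skipping terms shorter than n."""
    grams = []
    for term in re.findall(r"[A-Za-z0-9]+", text):
        if len(term) >= n:
            grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams


def surviving_grams(text, stoplist, n):
    """Return the n-grams of text that are not blocked by the stoplist's n-grams."""
    blocked = set(ngrams(stoplist, n))
    return [g for g in ngrams(text, n) if g not in blocked]


print(ngrams("there", 3))                     # ['the', 'her', 'ere']
print(surviving_grams("mother", "there", 3))  # ['mot', 'oth']
print(surviving_grams("theory", "there", 3))  # ['heo', 'eor', 'ory']
```

Although only there was placed in the stoplist, the terms mother and theory each lose the n-grams they share with it, which changes what can be matched for them.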

Note

The restrictions that apply to string literals also apply to stoplists. For example, apostrophes must be escaped. For more information on formatting string literals, see String literals.

The Samples directory contains sample code that loads stoplists for several languages. These sample stoplists are recommended for use only on GENERIC text indexes. For the location of the Samples directory, see Samples directory.

Date, time, and timestamp format settings When a text configuration object is created, the values of the date_format, time_format, timestamp_format, and timestamp_with_time_zone_format options for the current connection are stored with the text configuration object. These option values control how DATE, TIME, and TIMESTAMP columns are formatted for the text indexes built using the text configuration object. You cannot specify these option values directly when you create the text configuration object; the stored settings reflect those in effect for the connection that created it. However, you can change the stored values afterward. See How to alter a text configuration object, and the SAVE OPTION VALUES clause of the ALTER TEXT CONFIGURATION statement.
