Text configuration object settings

The following tables explain text configuration object settings, how they affect what is indexed, and how a full text search query is interpreted. For examples of text configuration objects and their impact on TEXT indexes and full text searching, see “Text configuration object setting interpretations”.

Term breaker algorithm (TERM BREAKER)

The TERM BREAKER setting specifies the algorithm to use for breaking strings into terms. Sybase IQ supports GENERIC (the default) for storing terms.

Note: NGRAM term breakers for storing n-grams (an n-gram is a group of characters of length n, where n is the value of MAXIMUM TERM LENGTH) are supported only in IN SYSTEM tables, and cannot be used in Sybase IQ TEXT indexes.

Regardless of the term breaker you specify, the database server records in the TEXT index the original positional information for terms when they are inserted into the TEXT index. In the case of n-grams, the positional information of the n-grams is stored, not the positional information for the original terms.

Table 2-3: TERM BREAKER impact

GENERIC TEXT index

To TEXT index: When building a GENERIC TEXT index (the default), groups of alphanumeric characters appearing between non-alphanumeric characters are processed as terms by the database server. After the terms have been defined, terms that exceed the term length settings, and terms found in the stoplist, are counted but not inserted in the TEXT index.

Performance on GENERIC TEXT indexes can be faster than NGRAM TEXT indexes. However, you cannot perform fuzzy searches on GENERIC TEXT indexes.

To query terms: When querying a GENERIC TEXT index, terms in the query string are processed in the same manner as if they were being indexed. Matching is performed by comparing query terms to terms in the TEXT index.

NGRAM TEXT index*

To TEXT index: When building an NGRAM TEXT index, the database server treats as a term any group of alphanumeric characters between non-alphanumeric characters. Once the terms are defined, the database server breaks the terms into n-grams. In doing so, terms shorter than n, and n-grams that are in the stoplist, are discarded.

For example, for an NGRAM TEXT index with MAXIMUM TERM LENGTH 3, the string 'my red table' is represented in the TEXT index as these n-grams: red tab abl ble.

To query terms: When querying an NGRAM TEXT index, terms in the query string are processed in the same manner as if they were being indexed. Matching is performed by comparing n-grams from the query terms to n-grams from the indexed terms.

*NGRAM TEXT indexes are supported only in IN SYSTEM tables.
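The two term breakers described in Table 2-3 can be illustrated with a short Python simulation. This is only a sketch of the behaviour described above, not Sybase IQ code: the function names generic_terms and ngram_terms are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and stoplist and term length filtering are left out.

```python
import re

def generic_terms(text):
    """Split text into runs of alphanumeric characters (simplified GENERIC breaker).

    Assumption: only ASCII letters and digits count as alphanumeric here.
    """
    return re.findall(r"[A-Za-z0-9]+", text)

def ngram_terms(text, n):
    """Break each term into overlapping n-grams; terms shorter than n yield nothing."""
    grams = []
    for term in generic_terms(text):
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams

print(generic_terms("my red table"))   # ['my', 'red', 'table']
print(ngram_terms("my red table", 3))  # ['red', 'tab', 'abl', 'ble']
```

The second call reproduces the 'my red table' example: the term my is shorter than 3 characters and contributes no n-grams.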

Minimum term length setting (MINIMUM TERM LENGTH)

The MINIMUM TERM LENGTH setting specifies the minimum length, in characters, for terms inserted in the index or searched for in a full text query. MINIMUM TERM LENGTH is not relevant for NGRAM TEXT indexes.

MINIMUM TERM LENGTH has special implications for prefix searching. The value of MINIMUM TERM LENGTH must be greater than 0. If you set it higher than MAXIMUM TERM LENGTH, then MAXIMUM TERM LENGTH is automatically adjusted to be equal to MINIMUM TERM LENGTH.

The default for MINIMUM TERM LENGTH is taken from the setting in the default text configuration object, which is typically 1.

Table 2-4: MINIMUM TERM LENGTH impact

GENERIC TEXT index

To TEXT index: For GENERIC TEXT indexes, the TEXT index will not contain words shorter than MINIMUM TERM LENGTH.

To query terms: When querying a GENERIC TEXT index, query terms shorter than MINIMUM TERM LENGTH are ignored because they cannot exist in the TEXT index.

NGRAM TEXT index*

To TEXT index: For NGRAM TEXT indexes, this setting is ignored.

To query terms: The MINIMUM TERM LENGTH setting has no impact on full text queries on NGRAM TEXT indexes.

*NGRAM TEXT indexes are supported only in IN SYSTEM tables.
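The effect of MINIMUM TERM LENGTH on a GENERIC TEXT index can be sketched in Python as follows. This is a simulation under stated assumptions, not server code: the value 3, the helper name keep_terms, and the sample strings are all hypothetical, and alphanumeric is assumed to mean ASCII letters and digits.

```python
import re

MINIMUM_TERM_LENGTH = 3  # hypothetical value chosen for illustration

def keep_terms(text, minimum=MINIMUM_TERM_LENGTH):
    """Return the alphanumeric terms that are at least `minimum` characters long."""
    terms = re.findall(r"[A-Za-z0-9]+", text)
    return [t for t in terms if len(t) >= minimum]

# The same filter is applied when indexing and when interpreting a query.
indexed = set(keep_terms("an ox ate the hay"))
query = keep_terms("an ox")

print(sorted(indexed))  # ['ate', 'hay', 'the']
print(query)            # []
```

Because both query terms are shorter than the minimum, nothing is left to compare against the index.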

Maximum term length setting (MAXIMUM TERM LENGTH)

The MAXIMUM TERM LENGTH setting is used differently, depending on the term breaker algorithm. The value of MAXIMUM TERM LENGTH must be less than or equal to 60. If you set it lower than the MINIMUM TERM LENGTH, then MINIMUM TERM LENGTH is automatically adjusted to be equal to MAXIMUM TERM LENGTH.

The default for this setting is taken from the setting in the default text configuration object, which is typically 20.
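The way the two length settings adjust each other can be modelled with a small Python class. This is a toy model, not a Sybase IQ API: the class TermLengths and its methods are hypothetical, the defaults 1 and 20 are the typical values mentioned in this section, and rejecting a maximum below 1 is an assumption of the sketch that the text does not state.

```python
class TermLengths:
    """Toy model of the MINIMUM/MAXIMUM TERM LENGTH pair (not a Sybase IQ API)."""

    def __init__(self, minimum=1, maximum=20):
        self.minimum = minimum
        self.maximum = maximum

    def set_minimum(self, value):
        if value <= 0:
            raise ValueError("MINIMUM TERM LENGTH must be greater than 0")
        self.minimum = value
        # A minimum above the maximum drags the maximum up with it.
        if self.minimum > self.maximum:
            self.maximum = self.minimum

    def set_maximum(self, value):
        # The lower bound of 1 is an assumption of this sketch.
        if value < 1 or value > 60:
            raise ValueError("MAXIMUM TERM LENGTH must be between 1 and 60")
        self.maximum = value
        # A maximum below the minimum drags the minimum down with it.
        if self.maximum < self.minimum:
            self.minimum = self.maximum

cfg = TermLengths()
cfg.set_minimum(25)
print(cfg.minimum, cfg.maximum)  # 25 25
cfg.set_maximum(4)
print(cfg.minimum, cfg.maximum)  # 4 4
```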

Table 2-5: MAXIMUM TERM LENGTH impact

GENERIC TEXT index

To TEXT index: For GENERIC TEXT indexes, MAXIMUM TERM LENGTH specifies the maximum length, in characters, for terms inserted in the TEXT index.

To query terms: For GENERIC TEXT indexes, query terms longer than MAXIMUM TERM LENGTH are ignored because they cannot exist in the TEXT index.

NGRAM TEXT index*

To TEXT index: For NGRAM TEXT indexes, MAXIMUM TERM LENGTH determines the length of the n-grams that terms are broken into. An appropriate choice of length for MAXIMUM TERM LENGTH depends on the language. Typical values are 4 or 5 characters for English, and 2 or 3 characters for Chinese.

To query terms: For NGRAM TEXT indexes, query terms are broken into n-grams of length n, where n is the same as MAXIMUM TERM LENGTH. The database server uses the n-grams to search the TEXT index. Terms shorter than MAXIMUM TERM LENGTH are ignored because they do not match the n-grams in the TEXT index.

*NGRAM TEXT indexes are supported only in IN SYSTEM tables.
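The query side of an NGRAM TEXT index can be sketched in Python as below. This is a simplified simulation: the names ngrams and matches are hypothetical, N stands in for MAXIMUM TERM LENGTH, positional information is ignored, and the rule that every query n-gram must be present in the index is an assumption of this sketch rather than a description of how the server scores matches.

```python
import re

def ngrams(text, n):
    """Break each alphanumeric term into overlapping n-grams of length n."""
    grams = []
    for term in re.findall(r"[A-Za-z0-9]+", text):
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams

N = 3  # stands in for MAXIMUM TERM LENGTH
index = set(ngrams("my red table", N))

def matches(query):
    """True when the query yields n-grams and all of them are in the index."""
    grams = ngrams(query, N)
    return bool(grams) and all(g in index for g in grams)

print(matches("table"))   # True
print(matches("tables"))  # False: the n-gram 'les' is not in the index
print(matches("my"))      # False: the term is shorter than N, so it is ignored
```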

Stoplist setting (STOPLIST)

The stoplist setting specifies terms that are not indexed. The default for this setting is taken from the setting in the default text configuration object, which typically has an empty stoplist.

Table 2-6: STOPLIST impact

GENERIC TEXT index

To TEXT index: For GENERIC TEXT indexes, terms that are in the stoplist are not inserted into the TEXT index.

To query terms: For GENERIC TEXT indexes, query terms that are in the stoplist are ignored because they cannot exist in the TEXT index.

NGRAM TEXT index*

To TEXT index: For NGRAM TEXT indexes, the TEXT index does not contain the n-grams formed from the terms in the stoplist.

To query terms: Terms in the stoplist are broken into n-grams and the n-grams are used for the stoplist. Likewise, query terms are broken into n-grams and any that match n-grams in the stoplist are dropped because they cannot exist in the TEXT index.

*NGRAM TEXT indexes are supported only in IN SYSTEM tables.

Consider carefully whether to put terms into your stoplist. In particular, do not include words that contain non-alphanumeric characters, such as apostrophes or dashes. These characters act as term breakers. For example, the word you'll (which must be specified as 'you''ll', with the apostrophe doubled inside the string literal) is broken into you and ll and stored in the stoplist as these two terms. Subsequent full text searches for 'you' or 'they'll' are negatively impacted.
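The apostrophe problem can be reproduced with a few lines of Python. This is a simulation only: break_terms and surviving_query_terms are hypothetical helpers, and alphanumeric is assumed to mean ASCII letters and digits.

```python
import re

def break_terms(text):
    """Split on anything that is not an ASCII letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)

# The single stoplist entry you'll becomes two stoplist terms.
stoplist = set(break_terms("you'll"))
print(sorted(stoplist))  # ['ll', 'you']

def surviving_query_terms(query):
    """Drop query terms that appear in the stoplist."""
    return [t for t in break_terms(query) if t not in stoplist]

print(surviving_query_terms("you"))      # []
print(surviving_query_terms("they'll"))  # ['they']
```

A search for you now has no terms left, and a search for they'll is reduced to they.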

Stoplists in NGRAM TEXT indexes can cause unexpected results because the stoplist is stored in n-gram form, not as the terms you specified. For example, in an NGRAM TEXT index where MAXIMUM TERM LENGTH is 3, if you specify STOPLIST 'there', these n-grams are stored as the stoplist: the her ere. This affects queries for any term that contains the n-grams the, her, or ere.
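A short Python simulation shows how far an n-gram stoplist reaches. The helper names ngrams and surviving are hypothetical, N stands in for MAXIMUM TERM LENGTH, and the query words other and here are examples chosen for this sketch.

```python
import re

def ngrams(text, n):
    """Break each alphanumeric term into overlapping n-grams of length n."""
    grams = []
    for term in re.findall(r"[A-Za-z0-9]+", text):
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams

N = 3  # stands in for MAXIMUM TERM LENGTH
stop_grams = set(ngrams("there", N))
print(sorted(stop_grams))  # ['ere', 'her', 'the']

def surviving(query):
    """Return the query n-grams that are not in the n-gram stoplist."""
    return [g for g in ngrams(query, N) if g not in stop_grams]

print(surviving("other"))  # ['oth']
print(surviving("here"))   # []
```

Neither query word is in the stoplist as written, yet other loses two of its three n-grams and here loses all of them.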