Text configuration object settings
The following tables explain text configuration object settings,
how they affect what is indexed, and how a full text search query
is interpreted. For examples of text configuration objects and their
impact on TEXT indexes and full text searching,
see “Text configuration object setting interpretations”.
Term breaker algorithm (TERM BREAKER)
The TERM BREAKER setting specifies the algorithm used to break strings
into terms. Sybase IQ supports GENERIC (the default), which stores terms.
NGRAM term breakers, which store n-grams (an n-gram is a group of
characters of length n, where n is the value of MAXIMUM TERM LENGTH),
are supported only in IN SYSTEM tables and cannot be used
in Sybase IQ TEXT indexes.
Regardless of the term breaker you specify, the database server
records in the TEXT index the original positional
information for terms when they are inserted into the TEXT index.
In the case of n-grams, the positional information of the n-grams
is stored, not the positional information for the original terms.
Table 2-3: TERM BREAKER impact
| To TEXT index | To query terms |
| GENERIC TEXT index: When building a GENERIC TEXT index (the default), groups of alphanumeric characters appearing between non-alphanumeric characters are processed as terms by the database server. After the terms have been defined, terms that exceed the term length settings, and terms found in the stoplist, are counted but not inserted in the TEXT index. Performance on GENERIC TEXT indexes can be faster than NGRAM TEXT indexes. However, you cannot perform fuzzy searches on GENERIC TEXT indexes. | GENERIC TEXT index: When querying a GENERIC TEXT index, terms in the query string are processed in the same manner as if they were being indexed. Matching is performed by comparing query terms to terms in the TEXT index. |
| NGRAM TEXT index*: When building an NGRAM TEXT index, the database server treats as a term any group of alphanumeric characters between non-alphanumeric characters. Once the terms are defined, the database server breaks the terms into n-grams. In doing so, terms shorter than n, and n-grams that are in the stoplist, are discarded. For example, for an NGRAM TEXT index with MAXIMUM TERM LENGTH 3, the string 'my red table' is represented in the TEXT index as these n-grams: red tab abl ble. | NGRAM TEXT index*: When querying an NGRAM TEXT index, terms in the query string are processed in the same manner as if they were being indexed. Matching is performed by comparing n-grams from the query terms to n-grams from the indexed terms. |
* NGRAM TEXT indexes are supported only in IN SYSTEM tables.
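The two term breakers described in Table 2-3 can be modeled with a short Python sketch. This is an illustration only, not the database server's implementation: the function names `generic_terms` and `ngram_terms` are hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re

def generic_terms(text):
    """Model of GENERIC: runs of alphanumeric characters become terms."""
    # Assumption: only ASCII letters and digits count as alphanumeric.
    return re.findall(r"[A-Za-z0-9]+", text)

def ngram_terms(text, n):
    """Model of NGRAM: break each term into overlapping n-grams of length n."""
    grams = []
    for term in generic_terms(text):
        # A term shorter than n yields an empty range, so it is discarded.
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams

print(generic_terms("my red table"))   # ['my', 'red', 'table']
print(ngram_terms("my red table", 3))  # ['red', 'tab', 'abl', 'ble']
```

With n set to 3, the sketch reproduces the n-grams listed for 'my red table': the term my is shorter than n and contributes nothing.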
Minimum term length setting (MINIMUM TERM LENGTH)
The MINIMUM TERM LENGTH setting specifies
the minimum length, in characters, for terms inserted in the index
or searched for in a full text query. MINIMUM TERM LENGTH is
not relevant for NGRAM TEXT indexes.
MINIMUM TERM LENGTH has special implications
on prefix searching. The value of MINIMUM TERM LENGTH must
be greater than 0. If you set it higher than MAXIMUM TERM
LENGTH, then MAXIMUM TERM LENGTH is automatically
adjusted to be equal to MINIMUM TERM LENGTH.
The default for MINIMUM TERM LENGTH is
taken from the setting in the default text configuration object,
which is typically 1.
Table 2-4: MINIMUM TERM LENGTH impact
| To TEXT index | To query terms |
| GENERIC TEXT index: For GENERIC TEXT indexes, the TEXT index will not contain words shorter than MINIMUM TERM LENGTH. | GENERIC TEXT index: When querying a GENERIC TEXT index, query terms shorter than MINIMUM TERM LENGTH are ignored because they cannot exist in the TEXT index. |
| NGRAM TEXT index*: For NGRAM TEXT indexes, this setting is ignored. | NGRAM TEXT index*: The MINIMUM TERM LENGTH setting has no impact on full text queries on NGRAM TEXT indexes. |
* NGRAM TEXT indexes are supported only in IN SYSTEM tables.
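As a rough illustration of the GENERIC rows above, the following Python sketch applies the same minimum-length filter to indexed text and to a query string. The value 3 and the names `generic_terms` and `indexable_terms` are assumptions made for this example. The tokenizer is a simplified ASCII-only model, not the server's code.

```python
import re

MINIMUM_TERM_LENGTH = 3  # hypothetical value chosen for illustration

def generic_terms(text):
    """Simplified GENERIC term breaker (ASCII letters and digits only)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def indexable_terms(text, min_len=MINIMUM_TERM_LENGTH):
    """Keep only the terms that are at least min_len characters long."""
    return [t for t in generic_terms(text) if len(t) >= min_len]

# The same filter is applied when indexing and when querying.
indexed = set(indexable_terms("go to the red table"))
query = indexable_terms("to table")

print(sorted(indexed))  # ['red', 'table', 'the']
print(query)            # ['table']
```

The query term to is dropped before matching, which mirrors the table: a term shorter than the minimum cannot be in the index, so searching for it would be pointless.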
Maximum term length setting (MAXIMUM TERM LENGTH)
The MAXIMUM TERM LENGTH setting is used
differently, depending on the term breaker algorithm. The value
of MAXIMUM TERM LENGTH must be less than or equal
to 60. If you set it lower than the MINIMUM TERM LENGTH,
then MINIMUM TERM LENGTH is automatically adjusted
to be equal to MAXIMUM TERM LENGTH.
The default for this setting is taken from the setting in
the default text configuration object, which is typically 20.
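The way the two length settings adjust each other can be sketched as a pair of small functions. This is a model of the rules stated in this section only. The function names are hypothetical, and cases the text does not cover (for example, a minimum above 60) are deliberately left unmodeled.

```python
def set_minimum(min_len, max_len, new_min):
    """Return (min, max) after setting MINIMUM TERM LENGTH to new_min."""
    if new_min <= 0:
        raise ValueError("MINIMUM TERM LENGTH must be greater than 0")
    # If the new minimum exceeds the maximum, the maximum is raised to match.
    return new_min, max(max_len, new_min)

def set_maximum(min_len, max_len, new_max):
    """Return (min, max) after setting MAXIMUM TERM LENGTH to new_max."""
    if new_max > 60:
        raise ValueError("MAXIMUM TERM LENGTH must be less than or equal to 60")
    # If the new maximum is below the minimum, the minimum is lowered to match.
    return min(min_len, new_max), new_max

# Starting from the typical defaults of 1 and 20:
print(set_minimum(1, 20, 25))  # (25, 25)
print(set_minimum(1, 20, 4))   # (4, 20)
print(set_maximum(5, 20, 3))   # (3, 3)
```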
Table 2-5: MAXIMUM TERM LENGTH impact
| To TEXT index | To query terms |
| GENERIC TEXT index: For GENERIC TEXT indexes, MAXIMUM TERM LENGTH specifies the maximum length, in characters, for terms inserted in the TEXT index. | GENERIC TEXT index: For GENERIC TEXT indexes, query terms longer than MAXIMUM TERM LENGTH are ignored because they cannot exist in the TEXT index. |
| NGRAM TEXT index*: For NGRAM TEXT indexes, MAXIMUM TERM LENGTH determines the length of the n-grams that terms are broken into. An appropriate choice of length for MAXIMUM TERM LENGTH depends on the language. Typical values are 4 or 5 characters for English, and 2 or 3 characters for Chinese. | NGRAM TEXT index*: For NGRAM TEXT indexes, query terms are broken into n-grams of length n, where n is the same as MAXIMUM TERM LENGTH. The database server uses the n-grams to search the TEXT index. Terms shorter than MAXIMUM TERM LENGTH are ignored because they do not match the n-grams in the TEXT index. |
* NGRAM TEXT indexes are supported only in IN SYSTEM tables.
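The NGRAM query behavior can be illustrated with a minimal Python sketch, assuming MAXIMUM TERM LENGTH is 3. The helper names are hypothetical, and the sketch only shows which query n-grams occur in the index. It does not attempt to model how the server scores or combines them.

```python
import re

N = 3  # hypothetical MAXIMUM TERM LENGTH for this illustration

def ngrams(text, n=N):
    """Break each alphanumeric term into overlapping n-grams of length n."""
    grams = []
    for term in re.findall(r"[A-Za-z0-9]+", text):
        # A term shorter than n produces no n-grams at all.
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams

indexed = set(ngrams("my red table"))

def found_in_index(query):
    """Return the query n-grams that also occur among the indexed n-grams."""
    return [g for g in ngrams(query) if g in indexed]

print(ngrams("my"))              # []
print(found_in_index("table"))   # ['tab', 'abl', 'ble']
print(found_in_index("tables"))  # ['tab', 'abl', 'ble']
```

The query term my is shorter than n, so it yields no n-grams and cannot match anything. The query tables shares three n-grams with the indexed text, while its fourth n-gram, les, is not in the index.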
Stoplist setting (STOPLIST)
The stoplist setting specifies terms that are not indexed.
The default for this setting is taken from the setting in the default
text configuration object, which typically has an empty stoplist.
Table 2-6: STOPLIST impact
| To TEXT index | To query terms |
| GENERIC TEXT index: For GENERIC TEXT indexes, terms that are in the stoplist are not inserted into the TEXT index. | GENERIC TEXT index: For GENERIC TEXT indexes, query terms that are in the stoplist are ignored because they cannot exist in the TEXT index. |
| NGRAM TEXT index*: For NGRAM TEXT indexes, the TEXT index does not contain the n-grams formed from the terms in the stoplist. | NGRAM TEXT index*: Terms in the stoplist are broken into n-grams and the n-grams are used for the stoplist. Likewise, query terms are broken into n-grams and any that match n-grams in the stoplist are dropped because they cannot exist in the TEXT index. |
* NGRAM TEXT indexes are supported only in IN SYSTEM tables.
Consider carefully which terms to put into your stoplist.
In particular, do not include words that contain non-alphanumeric characters,
such as apostrophes or dashes. These characters act as term
breakers. For example, the word you'll (which must be specified
as 'you''ll') is broken into you and ll and stored in the stoplist
as these two terms. Subsequent full text searches for 'you' or 'they'll'
are negatively impacted, because the terms you and ll are now treated as stoplist terms.
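The effect of an apostrophe in a stoplist entry can be reproduced with a small Python model. The tokenizer below is a simplified ASCII-only stand-in for the GENERIC term breaker, and the names `generic_terms` and `searchable_terms` are hypothetical.

```python
import re

def generic_terms(text):
    """Simplified GENERIC term breaker: the apostrophe splits the word."""
    return re.findall(r"[A-Za-z0-9]+", text)

# The single stoplist entry you'll becomes two separate stoplist terms.
stoplist = set(generic_terms("you'll"))
print(sorted(stoplist))  # ['ll', 'you']

def searchable_terms(query):
    """Drop query terms that appear in the stoplist."""
    return [t for t in generic_terms(query) if t not in stoplist]

print(searchable_terms("you"))      # []
print(searchable_terms("they'll"))  # ['they']
```

A search for you loses its only term, and a search for they'll is reduced to they, even though neither was intended to be a stoplist entry.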
Stoplists in NGRAM TEXT indexes
can cause unexpected results because the stoplist that is stored
is actually in n-gram form, not the actual stoplist terms you specified.
For example, in an NGRAM TEXT index
where MAXIMUM TERM LENGTH is 3, if you specify STOPLIST 'there',
these n-grams are stored as the stoplist: the her ere. This impacts
the ability to query for any terms that contain the n-grams the,
her, and ere.
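To see how an n-gram stoplist reaches beyond the word you specified, consider this Python sketch, which assumes MAXIMUM TERM LENGTH 3 and STOPLIST 'there'. The helper names are hypothetical and the code is a model of the described behavior, not the server's implementation.

```python
import re

N = 3  # MAXIMUM TERM LENGTH assumed for this illustration

def ngrams(text, n=N):
    """Break each alphanumeric term into overlapping n-grams of length n."""
    grams = []
    for term in re.findall(r"[A-Za-z0-9]+", text):
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams

# The stoplist is stored as n-grams, not as the word 'there'.
stop_ngrams = set(ngrams("there"))
print(ngrams("there"))  # ['the', 'her', 'ere']

def surviving_ngrams(query):
    """Return the query n-grams that are not removed by the stoplist."""
    return [g for g in ngrams(query) if g not in stop_ngrams]

print(surviving_ngrams("other"))  # ['oth']
print(surviving_ngrams("here"))   # []
```

The word other keeps only one of its three n-grams, and here keeps none, although neither word was placed in the stoplist.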