SQL Anywhere provides two default text configuration objects, default_char for use with CHAR data, and default_nchar for use with NCHAR and CHAR data. Note that while default_nchar can be used with any data, character set conversion will be performed. For information about the settings for default_char and default_nchar text configuration objects, see Default text configuration objects.
You can test how a text configuration object affects term breaking using the sa_char_terms and sa_nchar_terms system procedures. See sa_char_terms system procedure, and sa_nchar_terms system procedure.
The following table explains text configuration object settings and how they impact what is indexed and how a full text search query is interpreted for text indexes using the text configuration. For examples, see Example text configuration objects.
External prefilter algorithm (PREFILTER) Prefiltering is the process of extracting text data from file types such as Word, PDF, HTML, and XML. In the context of text indexing, prefiltering allows you to extract only the data you want indexed, and avoid indexing unnecessary content such as HTML tags. For certain types of documents (for example, Microsoft Word documents), prefiltering is required to make full text indexes useful.
SQL Anywhere does not provide a built-in prefilter feature. However, you can create an external prefilter library to perform prefiltering according to your requirements, and then alter your text configuration object to point to it.
The following table explains the impact that the value of PREFILTER EXTERNAL NAME has on text indexing and on how query strings are handled:
Text indexes | Query strings |
---|---|
GENERIC and NGRAM text indexes: An external prefilter takes an input value (a document) and filters it according to the rules specified by the prefilter library. The resulting text is then passed to the term breaker before building or updating the text index. | GENERIC and NGRAM text indexes: Query strings are not passed through a prefilter, so the setting of the PREFILTER EXTERNAL NAME clause has no impact on query strings. |
See also: External prefilter libraries, and PREFILTER EXTERNAL NAME clause, ALTER TEXT CONFIGURATION statement.
The ExternalLibrariesFullText directory in your SQL Anywhere installation contains prefilter and term breaker sample code for you to explore. This directory is found under your Samples directory. For the location of your Samples directory, see Samples directory.
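To make the idea of prefiltering concrete, the following Python sketch strips HTML markup from a document and keeps only the text that would be passed on to the term breaker. It is a conceptual illustration only: the names TagStrippingPrefilter and prefilter_html are hypothetical, and this is not the external prefilter library interface that SQL Anywhere expects.

```python
from html.parser import HTMLParser


class TagStrippingPrefilter(HTMLParser):
    """Hypothetical prefilter: keeps text content and discards markup."""

    def __init__(self):
        super().__init__()
        self._chunks = []

    def handle_data(self, data):
        # Called only for text between tags, so tags are never collected.
        self._chunks.append(data)

    def text(self):
        # Collapse the extra whitespace left behind where tags were removed.
        return " ".join(" ".join(self._chunks).split())


def prefilter_html(document: str) -> str:
    """Return the text of an HTML document with all tags removed."""
    parser = TagStrippingPrefilter()
    parser.feed(document)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><body><h1>Annual report</h1><p>Sales <b>rose</b> sharply</p></body></html>"
    print(prefilter_html(page))
```

In this model only the filtered text would reach the term breaker, so tag names such as h1 or body never become indexed terms.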
Term breaker algorithm (TERM BREAKER) The TERM BREAKER setting specifies the algorithm to use for breaking strings into terms. The choices are GENERIC for storing terms, or NGRAM for storing n-grams. For GENERIC, you can use the built-in term breaker algorithm, or an external term breaker. See TERM BREAKER clause, ALTER TEXT CONFIGURATION statement.
The following table explains the impact that the value of TERM BREAKER has on text indexing and on how query strings are handled:
Text indexes | Query strings |
---|---|
| | When parsing a CONTAINS query, the database server extracts keywords and special characters from the query string and then applies the term breaker algorithm to the remaining terms. For more information about keywords and special characters in full text search, see CONTAINS search condition. |
If not defined, the default for TERM BREAKER is taken from the setting in the default text configuration object. If a term breaker is not defined in the default text configuration object, the internal term breaker is used. See Default text configuration objects.
See also: TERM BREAKER clause, ALTER TEXT CONFIGURATION statement.
Minimum term length setting (MINIMUM TERM LENGTH) The MINIMUM TERM LENGTH setting specifies the minimum length, in characters, for terms inserted in the index or searched for in a full text query. MINIMUM TERM LENGTH is not relevant for NGRAM text indexes.
MINIMUM TERM LENGTH has special implications for prefix searching. See Prefix searching.
The value of MINIMUM TERM LENGTH must be greater than 0. If you set it higher than MAXIMUM TERM LENGTH, then MAXIMUM TERM LENGTH is automatically adjusted to be equal to MINIMUM TERM LENGTH.
If not defined, the default for MINIMUM TERM LENGTH is taken from the setting in the default text configuration object, which is typically 1. See Default text configuration objects.
The following table explains the impact that the value of MINIMUM TERM LENGTH has on text indexing and on how query strings are handled:
Text indexes | Query strings |
---|---|
GENERIC text index: For GENERIC text indexes, the text index will not contain words shorter than MINIMUM TERM LENGTH. NGRAM text index: For NGRAM text indexes, this setting is ignored. | GENERIC text index: When querying a GENERIC text index, query terms shorter than MINIMUM TERM LENGTH are ignored because they cannot exist in the text index. NGRAM text index: The MINIMUM TERM LENGTH setting has no impact on full text queries on NGRAM text indexes. |
See also: MINIMUM TERM LENGTH clause, ALTER TEXT CONFIGURATION statement.
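The effect of MINIMUM TERM LENGTH on a GENERIC text index can be modeled with a short Python sketch. The function generic_terms is hypothetical, and its splitting rule (runs of ASCII letters and digits, lowercased) is a simplifying assumption rather than the server's actual term breaker.

```python
import re


def generic_terms(text, minimum_term_length=1):
    """Split text into terms and drop those shorter than the minimum.

    Assumption: a term is a run of ASCII letters or digits, compared in
    lowercase. The real GENERIC term breaker may differ.
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in terms if len(t) >= minimum_term_length]


if __name__ == "__main__":
    # The same setting is applied when indexing and when parsing a query.
    indexed = set(generic_terms("A cat sat on the mat", minimum_term_length=3))
    query = generic_terms("a mat", minimum_term_length=3)
    print(sorted(indexed))
    print(query)
    print(all(term in indexed for term in query))
```

Because the short query term is dropped before the lookup, it neither matches nor blocks a match; only the remaining terms are searched for.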
Maximum term length setting (MAXIMUM TERM LENGTH) The MAXIMUM TERM LENGTH setting is used differently depending on the term breaker algorithm.
The value of MAXIMUM TERM LENGTH must be less than or equal to 60. If you set it lower than the MINIMUM TERM LENGTH, then MINIMUM TERM LENGTH is automatically adjusted to be equal to MAXIMUM TERM LENGTH.
If not defined, the default for MAXIMUM TERM LENGTH is taken from the setting in the default text configuration object, which is typically 20. See Default text configuration objects.
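The automatic adjustment between the two length settings can be summarized in a small Python model. The class TermLengthSettings is hypothetical; the starting values 1 and 20 are the typical defaults mentioned in this section, and applying the upper limit of 60 to the minimum as well is an inference from the limit on the maximum, not something this section states.

```python
class TermLengthSettings:
    """Hypothetical model of the MINIMUM/MAXIMUM TERM LENGTH adjustment rules."""

    LIMIT = 60  # stated upper bound for MAXIMUM TERM LENGTH

    def __init__(self, minimum=1, maximum=20):
        self.minimum = minimum
        self.maximum = maximum

    def _check(self, value):
        # Assumption: both settings must fall in the range 1..60.
        if not 0 < value <= self.LIMIT:
            raise ValueError("term length must be between 1 and 60")

    def set_minimum(self, value):
        self._check(value)
        self.minimum = value
        # Raising the minimum above the maximum pulls the maximum up with it.
        if self.minimum > self.maximum:
            self.maximum = self.minimum

    def set_maximum(self, value):
        self._check(value)
        self.maximum = value
        # Lowering the maximum below the minimum pulls the minimum down with it.
        if self.maximum < self.minimum:
            self.minimum = self.maximum


if __name__ == "__main__":
    settings = TermLengthSettings()
    settings.set_minimum(25)
    print(settings.minimum, settings.maximum)
    settings.set_maximum(3)
    print(settings.minimum, settings.maximum)
```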
The following table explains the impact that the value of MAXIMUM TERM LENGTH has on text indexing and on how query strings are handled:
Text indexes | Query strings |
---|---|
GENERIC text index: For GENERIC text indexes, MAXIMUM TERM LENGTH specifies the maximum length, in characters, for terms inserted in the text index. NGRAM text index: For NGRAM text indexes, MAXIMUM TERM LENGTH determines the length of the n-grams that terms are broken into. An appropriate choice of length for n-grams depends on the language. Typical values are 4 or 5 characters for English, and 2 or 3 characters for Chinese. | GENERIC text index: For GENERIC text indexes, query terms longer than MAXIMUM TERM LENGTH are ignored because they cannot exist in the text index. NGRAM text index: For NGRAM text indexes, query terms are broken into n-grams of length n, where n is the same as MAXIMUM TERM LENGTH. Then, the database server uses the n-grams to search the text index. Terms shorter than MAXIMUM TERM LENGTH are ignored because they will not match the n-grams in the text index. Therefore, proximity searches do not work unless arguments are prefixes of length n. |
See also: MAXIMUM TERM LENGTH clause, ALTER TEXT CONFIGURATION statement.
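The following Python sketch models how an NGRAM term breaker could turn terms into n-grams of length n, where n stands for MAXIMUM TERM LENGTH. The functions ngrams and ngram_terms are hypothetical, and the word-splitting rule (runs of ASCII letters and digits, lowercased) is an assumption; the sketch is meant to show why terms shorter than n contribute nothing, not to reproduce the server's algorithm.

```python
import re


def ngrams(term, n):
    """Return every substring of length n in term, left to right."""
    # When len(term) < n the range is empty, so short terms yield no n-grams.
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def ngram_terms(text, n):
    """Break text into words, then break each word into n-grams.

    Assumption: words are runs of ASCII letters or digits, lowercased.
    """
    result = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        result.extend(ngrams(word, n))
    return result


if __name__ == "__main__":
    index = set(ngram_terms("sample of search", 4))
    print(sorted(index))

    # A query term is broken the same way before the index is searched.
    print(all(g in index for g in ngram_terms("arch", 4)))

    # A query term shorter than n produces no n-grams to search for.
    print(ngram_terms("of", 4))
```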
Stoplist setting (STOPLIST) The stoplist setting specifies the terms that must not be indexed.
If not defined, the default for this setting is taken from the setting in the default text configuration object, which typically has an empty stoplist. See Default text configuration objects.
STOPLIST impact to text index | STOPLIST impact to query terms |
---|---|
GENERIC text index: For GENERIC text indexes, terms that are in the stoplist are not inserted into the text index. NGRAM text index: For NGRAM text indexes, the text index will not contain the n-grams formed from the terms in the stoplist. | GENERIC text index: For GENERIC text indexes, query terms that are in the stoplist are ignored because they cannot exist in the text index. NGRAM text index: Terms in the stoplist are broken into n-grams and the n-grams are used for the term filtering. Likewise, query terms are broken into n-grams and any that match n-grams in the stoplist are dropped because they cannot exist in the text index. |
The settings in the text configuration object are applied to the stoplist when it is parsed. That is, the specified term breaker and the minimum and maximum term length settings are applied.
Stoplists in NGRAM text indexes can cause unexpected results because the stoplist is stored in n-gram form, not as the stoplist terms you specified. For example, in an NGRAM text index where MAXIMUM TERM LENGTH is 3, if you specify STOPLIST 'there', the following n-grams are stored as the stoplist: the, her, and ere. This affects your ability to query for any term that contains the n-grams the, her, or ere.
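The 'there' example can be reproduced with a brief Python model. The functions ngrams, stoplist_ngrams, and searchable_ngrams are hypothetical, and treating the stoplist as a space-separated string is an assumption made for illustration.

```python
def ngrams(term, n):
    """Return every substring of length n in term, left to right."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def stoplist_ngrams(stoplist, n):
    """Break each stoplist term into n-grams, as an NGRAM index would store them.

    Assumption: stoplist terms are separated by spaces.
    """
    return {g for term in stoplist.split() for g in ngrams(term, n)}


def searchable_ngrams(query_term, stoplist, n):
    """Return the n-grams of a query term that survive stoplist filtering."""
    blocked = stoplist_ngrams(stoplist, n)
    return [g for g in ngrams(query_term, n) if g not in blocked]


if __name__ == "__main__":
    print(sorted(stoplist_ngrams("there", 3)))
    # "other" was never placed in the stoplist, yet two of its three
    # n-grams are filtered out because they also occur in "there".
    print(searchable_ngrams("other", "there", 3))
```

In this model a term that was never in the stoplist still loses most of its n-grams, which is the kind of unexpected result described above.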
The restrictions that apply to string literals also apply to stoplists. For example, apostrophes must be escaped. For more information on formatting string literals, see String literals.
The Samples directory contains sample code that loads stoplists for several languages. These sample stoplists are recommended for use only on GENERIC text indexes. For the location of the Samples directory, see Samples directory.
See also: STOPLIST clause, ALTER TEXT CONFIGURATION statement.
Date, time, and timestamp format settings When a text configuration object is created, the values of the date_format, time_format, timestamp_format, and timestamp_with_time_zone_format options for the current connection are stored with the text configuration object. These option values control how DATE, TIME, and TIMESTAMP columns are formatted for the text indexes built using the text configuration object. You cannot set these option values individually for the text configuration object; they reflect the options in effect for the connection that created it. However, you can replace them with the option values of the current connection by using the SAVE OPTION VALUES clause. See How to alter a text configuration object, and the SAVE OPTION VALUES clause of the ALTER TEXT CONFIGURATION statement.
Copyright © 2010, iAnywhere Solutions, Inc. - SQL Anywhere 12.0.0