NGRAM TEXT index and fuzzy and non-fuzzy search

As of Sybase IQ 15.2 ESD #2, Sybase IQ supports NGRAM TEXT index, fuzzy search, and non-fuzzy search.


NGRAM TEXT index

NGRAM TEXT index stores the text in the column by breaking the text into NGRAMs of text value N, where N is the value given by a user. You can perform a search over an NGRAM TEXT index by matching the NGRAMs of the text value in the CONTAINS clause of the query against the stored NGRAMs in the index.

NGRAM TEXT index accommodates fuzzy searching capability over the text for both European and non-European languages.

NoteNGRAM TEXT index search is mainly useful when words are misspelled. Sybase IQ does not support searches like synonyms and antonyms.

NGRAM is built on TEXT indexes, so use text configuration object settings to define whether to use an NGRAM or GENERIC TEXT index.

For more information on text configuration object settings, see SQL Anywhere 11.0.1 > SQL Anywhere Server - SQL Usage > Querying and Modifying Data > Querying Data > Text configuration objects > Text configuration object settings.


Creating NGRAM TEXT index

For information on how to create a NGRAM TEXT index, see SQL Anywhere 11.0.1 > SQL Anywhere Server - SQL Usage > Querying and Modifying Data > Querying Data > Types of full text searches > Tutorial: Performing a fuzzy full text search.


Fuzzy search

Fuzzy search capability over a TEXT index is possible only if the TEXT index is of type NGRAM. The GENERIC TEXT index cannot handle the fuzzy search.

Here is an example of a fuzzy search over an NGRAM TEXT index. Create a table and a NGRAM TEXT index:

CREATE TEXT CONFIGURATION NGRAMTxtcfg
   FROM default_char;
ALTER TEXT CONFIGURATION NGRAMTxtcfg TERM BREAKER    NGRAM;
ALTER TEXT CONFIGURATION NGRAMTxtcfg maximum term    length 3;
CREATE TABLE t_iq(a int, b varchar(100));
CREATE TEXT INDEX TXT_IQ on t_iq(b) CONFIGURATION    NGRAMTxtcfg

Insert this data into the table:

INSERT INTO t_iq values (1,'hello this is hira ');
INSERT INTO t_iq values(2, ' book he ookw worm okwo
kwor');
INSERT INTO t_iq values(3,'Michael is a good person');
INSERT INTO t_iq values(4,'hello this is evaa');
INSERT INTO t_iq values(5,'he is a bookworm');
INSERT INTO t_iq values (6,'boo ook okw kwo wor orm');

After inserting the data, execute this query to perform fuzzy searching over an NGRAM TEXT index:

SELECT * FROM t_iq WHERE CONTAINS (b,'FUZZY "bookerm"');

The results of the query are:

a         b       
2         book he ookw worm okwo kwor
5         he is a bookworm
6         boo ook okw kwo wor orm

Fuzzy searching can be used to search for misspellings or variations of a word. To do so, use the FUZZY operator followed by a string in double quotes to find an approximate match for the string.Using the FUZZY operator is equivalent to breaking the string manually into substrings of length n and separating them with OR operators. For example, if you have a text index configured with the NGRAM term breaker and a MAXIMUM TERM LENGTH of 3, specifying 'FUZZY "500 main street" ' is equivalent to specifying '500 OR mai OR ain OR str OR tre OR ree OR eet'.The FUZZY operator is useful in a full text search that returns a score. Many approximate matches may be returned, but usually only the matches with the highest scores are meaningful.

Example 1—This query illustrates a additional letter in the fuzzy search clause.

SELECT * FROM t_iq WHERE CONTAINS (b,'FUZZY "hellow"');

The results of the query are:

a         b       
1         hello this is hira
4         hello this is evaa

Example 2—In this query, a letter is removed from the fuzzy search clause.

SELECT * FROM t_iq WHERE CONTAINS(b, 'FUZZY "hllo"');

The results of the query are:

a         b       
1         hello this is hira
4         hello this is evaa

NoteFuzzy search does not support prefix or suffix searching. For example, the search clause cannot be “v*” or “*vis”.

For more information on fuzzy search, see SQL Anywhere 11.0.1 > SQL Anywhere Server - SQL Usage > Querying and Modifying Data > Querying Data > Types of full text searches > Fuzzy searches.


Non-fuzzy search

Non-fuzzy search on NGRAM breaks the term into corresponding NGRAMs and searches for the NGRAMs in the NGRAM TEXT index.The query CONTAINS ( M.Description, 'ams' ) ct; illustrates a non-fuzzy NGRAM search over a 2GRAM index, which is semantically equal to searching query CONTAINS( M.Description, '"am ms"' ) ct;

If you search for a ‘v*’ TERM on a 2GRAM index, then v followed by any alphabet is considered as a matching 2GRAM for the searching term and is output as a result.The query CONTAINS (M.Description, ‘white whale’) ct; illustrates a non-fuzzy NGRAM search over a 3GRAM index and is semantically equal to searching query CONTAINS (M.Description, ‘”whi hit ite wha hal ale”’);The difference between NGRAM fuzzy and non-fuzzy search is that fuzzy search is a disjunction over individual GRAMS. Non-fuzzy search is a conjunction over the individual GRAMS. When GENERIC and NGRAM TEXT indexes are created on the same column, then the GENERIC TEXT index is used for a query with non-fuzzy search and the NGRAM TEXT index is used for fuzzy search.

Example 3—This query illustrates non-fuzzy search after creating a GENERIC TEXT index on the same column.

SELECT * FROM t_iq WHERE CONTAINS (b,'bookworm');

The results of the query are:

a         b       
5         he is a bookworm

Example 4—This query illustrates fuzzy search with both NGRAM and GENERIC TEXT indexes on the same column.

SELECT * FROM t_iq 
WHERE CONTAINS (b,’FUZZY “bookwerm”’);

The results of the query are:

a         b       
2         book he ookw worm okwo kwor
5         he is a bookworm
6         boo ook okw kwo wor orm

Example 4—This query illustrates the behavior of a fuzzy search phrase in a non-fuzzy search clause.

SELECT * FROM t_iq WHERE CONTAINS (b,’bookwerm’);

No result is returned for this query.