Developing and configuring customized text splitters

This section provides information for Java developers about developing, configuring, and using custom text splitters.

All document body text and textual metadata values (excluding file paths) are passed through the configured term splitter, which breaks them into individual terms. Each term that is not preserved (see “Defining the list of preserved terms”), is not a stopword (see “Defining the list of stopwords”), and is neither too short nor too long is then passed to the configured term stemmer, which reduces it to its root form. Both the term splitter and the term stemmer can be reimplemented and reconfigured where necessary.
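The filtering chain just described can be sketched in Java as follows. This is an illustration only: the method signature, the length limits, and the treatment of preserved terms (indexed without stemming) are assumptions for the sketch, not the actual Sybase Search API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class TermPipeline {
    // Illustrative length limits; the real configured limits may differ.
    static final int MIN_LEN = 2, MAX_LEN = 64;

    // Apply the filtering chain to terms already produced by a splitter.
    public static List<String> process(List<String> terms,
                                       Set<String> preserved,
                                       Set<String> stopwords,
                                       Function<String, String> stemmer) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (preserved.contains(term)) {
                out.add(term); // assumption: preserved terms bypass stemming
            } else if (!stopwords.contains(term)
                    && term.length() >= MIN_LEN
                    && term.length() <= MAX_LEN) {
                out.add(stemmer.apply(term)); // everything else is stemmed
            }
            // Stopwords and terms outside the length limits are dropped.
        }
        return out;
    }
}
```

In this sketch the stemmer is passed in as a `Function<String, String>` so that any stemming implementation can be plugged in without changing the pipeline.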

Term splitting turns extracted plain text into words; term stemming reduces words to their common roots. Both operations are language-specific. Therefore, when you know that documents and searches will be in a single language, you can customize the term splitter and term stemmer algorithms to make best use of that language.

For example, an English stemming algorithm converts “singing”, “sings”, and “singer” to the stem “sing”; however, this algorithm is not appropriate for French or Chinese.

The default splitter class, com.isdduk.text.BreakIteratorSplitter, handles all double-byte characters by using the underlying default Java class java.text.BreakIterator. The Java BreakIterator class uses punctuation and word delimiters to split single-byte languages into words. For double-byte languages, however, the Java BreakIterator class samples the glyphs in pairs and tries to determine where the ends of words are likely to be.
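The following sketch shows how word splitting with the standard java.text.BreakIterator class works; the class and demo names are illustrative, not part of Sybase Search, but the BreakIterator calls are the standard Java API:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {
    // Split text into word tokens using the stdlib BreakIterator.
    public static List<String> split(String text, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String token = text.substring(start, end);
            // BreakIterator also yields spaces and punctuation as tokens;
            // keep only tokens containing at least one letter or digit.
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("The quick, brown fox.", Locale.ENGLISH));
        // [The, quick, brown, fox]
    }
}
```

Note that BreakIterator reports every boundary, including those around spaces and punctuation, so a splitter built on it must filter out the non-word tokens, as the demo does.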

If you intend to run Sybase Search with documents containing glyph-based languages, it is recommended that you write your own custom term splitter (described in “Configuring the term splitter”). Term-splitting algorithms designed for a single language outperform the Java BreakIterator, which is designed to handle multiple languages, particularly glyph-based languages.