Indexing document stores

Indexing is the process of collecting data about the documents in a document store and storing that data in proprietary data structures, generically called indexes. After the documents in a document store are indexed, they are available to be searched.

An indexing session describes all data collected during a single pass of a document store’s indexer. Data for all documents is collected during the first indexing session; subsequent indexing sessions collect data only for new, modified, and deleted documents. Thus, the amount of data collected during two different indexing sessions can vary dramatically.

When creating a document store, you can request that Sybase Search immediately index the document store. You can also perform the following types of indexing after creating a document store:

All data collected during an indexing session is stored in the indexing session’s data buffer. The data buffer is a RAM-oriented data structure in which data is aggregated, ready to be written to an index stripe. The buffer is flushed when its memory use exceeds the maximum threshold specified in the system property omniq.index.buffer.maxMemory. The buffer shares this memory allocation with the document store’s active index stripe. See “Striping index data”.
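The flush-on-threshold behavior described above can be sketched as follows. This is an illustration only, not Sybase Search code: the class IndexBuffer, its methods, and the byte-counting rule are hypothetical, and the sketch ignores the fact that the real buffer shares its memory allocation with the active index stripe.

```python
class IndexBuffer:
    """Hypothetical in-memory buffer that flushes once a byte budget is exceeded."""

    def __init__(self, max_memory_bytes, flush):
        # max_memory_bytes stands in for omniq.index.buffer.maxMemory (assumed unit: bytes)
        self.max_memory_bytes = max_memory_bytes
        self.flush = flush  # callback that writes buffered entries to an index stripe
        self.entries = []
        self.used_bytes = 0

    def add(self, doc_id, data):
        """Buffer one document's data and flush if the threshold is exceeded."""
        self.entries.append((doc_id, data))
        self.used_bytes += len(data)
        if self.used_bytes > self.max_memory_bytes:
            self.drain()

    def drain(self):
        """Hand all buffered entries to the flush callback and reset the buffer."""
        if self.entries:
            self.flush(self.entries)
        self.entries = []
        self.used_bytes = 0


flushed_batches = []
buffer = IndexBuffer(max_memory_bytes=10, flush=flushed_batches.append)
buffer.add("a.txt", b"12345")    # 5 bytes buffered, under the threshold
buffer.add("b.txt", b"1234567")  # 12 bytes buffered, threshold exceeded, flush
buffer.add("c.txt", b"1")        # stays buffered until the next flush
```

In this sketch the first two entries are flushed together as one batch, and the third remains in the buffer.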


Viewing indexing details

On the Document Stores page, Sybase Search displays details of indexing activity for the previous indexing session and for any current indexing session, together with the details of the corresponding document store.

Table 3-3 summarizes the details of indexing activity displayed for each indexed document store.

Table 3-3: Indexing activity

Property    Value
Total       The total number of documents found
Indexable   The number of documents eligible for indexing
Selected    The number of documents selected for indexing
Skipped     The number of documents purposefully ignored
Deleted     The number of documents that have been indexed but no longer exist
New         The number of new unindexed documents found
Updated     The number of updated (changed since indexing) documents found
Unchanged   The number of indexed documents that have not changed
Failed      The number of documents that should have been indexed but were not, due to a problem
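As a rough illustration of how the New, Updated, Unchanged, and Deleted counts relate to one another, the sketch below compares the documents found in a store with those already indexed. The function classify and the use of last-modified timestamps to detect changes are assumptions made for the example; the source does not say how Sybase Search detects changes, and the Total, Indexable, Selected, Skipped, and Failed counts are not modeled.

```python
def classify(found, indexed):
    """Count documents by state.

    found   -- maps document path -> last-modified time of documents in the store
    indexed -- maps document path -> last-modified time recorded at indexing
    """
    counts = {"New": 0, "Updated": 0, "Unchanged": 0, "Deleted": 0}
    for path, modified in found.items():
        if path not in indexed:
            counts["New"] += 1        # never indexed before
        elif modified > indexed[path]:
            counts["Updated"] += 1    # changed since it was indexed
        else:
            counts["Unchanged"] += 1
    # Indexed documents that can no longer be found in the store
    counts["Deleted"] = sum(1 for path in indexed if path not in found)
    return counts


found = {"a.txt": 5, "b.txt": 9, "c.txt": 3}
indexed = {"a.txt": 5, "b.txt": 7, "d.txt": 1}
counts = classify(found, indexed)
```

Here a.txt is unchanged, b.txt is updated, c.txt is new, and d.txt is deleted, so each of the four counts is 1.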

To view indexing data, click Index Information. Sybase Search displays the data on the Index Information page. Table 3-4 summarizes the data collected during an indexing session.

Table 3-4: Index information data

Property            Value
Documents Indexed   The total number of live and deleted documents in the indexes of all index stripes
Deleted Documents   The total number of documents in the indexes of all index stripes that reference deleted documents
Live Documents      The total number of documents in the indexes of all index stripes that reference live documents
Number of Stripes   The number of index stripes the indexed data is split across
Index Stripes       The details of each index stripe that the indexed data is split across
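The totals in Table 3-4 are sums over the individual index stripes, with Documents Indexed equal to Live Documents plus Deleted Documents. A minimal sketch of that arithmetic, using a hypothetical StripeStats record that is not part of Sybase Search:

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    """Hypothetical per-stripe counters."""
    live: int
    deleted: int


def index_information(stripes):
    """Aggregate per-stripe counters into the totals shown in Table 3-4."""
    live = sum(s.live for s in stripes)
    deleted = sum(s.deleted for s in stripes)
    return {
        "Documents Indexed": live + deleted,
        "Deleted Documents": deleted,
        "Live Documents": live,
        "Number of Stripes": len(stripes),
    }


info = index_information([StripeStats(live=120, deleted=5), StripeStats(live=30, deleted=0)])
```

For these two example stripes the totals are 155 documents indexed, 150 live, and 5 deleted.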


Striping index data

Index data is transferred from the data buffer and written to active or static stripes. Whether data is written to an active or a static index stripe is decided during the indexing session. The current active stripe stores all the data collected during the indexing session if it can accommodate it; otherwise, the active index stripe is emptied into a new static stripe, and all data collected during the indexing session is stored in the new static index stripe.
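The decision described above can be sketched as a small function. The names store_session and capacity are hypothetical, and measuring capacity as a number of items is an assumption made for the example; the source does not state how Sybase Search decides whether the active stripe can accommodate the data.

```python
def store_session(active, statics, session, capacity):
    """Return the (active, statics) pair after storing one session's data.

    active   -- list of items in the active stripe, or None if there is none
    statics  -- list of static stripes, each an immutable tuple of items
    session  -- list of items collected during the indexing session
    capacity -- hypothetical item limit for the active stripe
    """
    current = active or []
    if len(current) + len(session) <= capacity:
        # The active stripe can accommodate the session data.
        return current + session, statics
    # Otherwise the active stripe is emptied into a new static stripe,
    # which also receives the session data; no active stripe remains.
    return None, statics + [tuple(current + session)]


active, statics = store_session(None, [], ["d1", "d2"], capacity=3)
first = (active, statics)
active, statics = store_session(active, statics, ["d3", "d4"], capacity=3)
second = (active, statics)
```

The first session fits and is stored in an active stripe. The second does not fit, so the active stripe's contents and the new data end up together in one static stripe.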

Active index stripes

Each document store’s collection of index stripes contains zero or one active index stripe. An active index stripe is a collection of RAM-oriented data structures: all of its data is held in RAM, and a copy is kept on disk for persistence. An active index stripe is always writable, and thus may contain data collected over numerous indexing sessions.

When an active index stripe is emptied into a static index stripe, its on-disk files are deleted and the active stripe is discarded. A new active stripe is created the next time an indexing session collects an amount of data that fits into an active index stripe.

Static index stripes

Each document store’s collection of index stripes contains zero or more static index stripes. A static index stripe is a collection of disk-oriented data structures that cannot be changed once they are written.


Viewing index stripe information

Each index stripe, along with details of its internal data structures, is listed on the Index Information page. The details cover the generic term and metadata indexes, and the data structures needed to track file system documents.

Table 3-5: Index stripe properties

Property                  Value
Root                      The location where the index stripe stores its data. The index stripe creates directories and data files here as necessary.
Term Index Segments       The number of segments into which the term indexes are divided.
Metadata Index Segments   The number of segments into which the metadata indexes are divided.
Deleted Documents         The number of deleted documents for which this stripe still holds data. This data is purged on unification.
Live Documents            The number of live documents for which this stripe holds data.

Document lexicon

Property                  Value
Segments                  The number of segments into which the document lexicon is divided.
Documents                 The number of documents in the lexicon.
ID Range                  The range of document IDs (first to last).
Last Indexed              The name of the last document indexed and the time it was added.


Unifying an index stripe

Too many index stripes can eventually cause a bottleneck; therefore, you should periodically unify the stripes into a single stripe.

To unify an index stripe

  1. From the Document Stores page, select a document store and click Index Information. The Index Information page appears and displays the index stripe details.

  2. Click Unify. The unification process runs. Sybase Search displays the progress of the unification process. Additionally, the unification process purges data marked for deletion and defragments the indexed data structures.
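Conceptually, unification merges all stripes into one and drops data marked for deletion. The sketch below illustrates only that idea. The function unify and the dictionary layout are hypothetical, document IDs are assumed to be unique across stripes, and defragmentation of the on-disk structures is not modeled.

```python
def unify(stripes):
    """Merge stripes into a single stripe, purging entries marked as deleted.

    Each stripe is a dict mapping document ID -> (data, deleted_flag).
    """
    unified = {}
    for stripe in stripes:
        for doc_id, (data, deleted) in stripe.items():
            if not deleted:
                unified[doc_id] = (data, False)  # keep live documents only
    return [unified]


stripes = [
    {1: ("alpha", False), 2: ("beta", True)},  # document 2 is marked for deletion
    {3: ("gamma", False)},
]
result = unify(stripes)
```

After unification a single stripe remains, holding documents 1 and 3; the data for deleted document 2 has been purged.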