You construct a SQL query to retrieve content and metadata from columns in a database. Each row of data represents a document. Each document requires a unique identifier (a document reference) and content (body text). Optionally, it can have a title and other metadata.
Each database document store can have only one SQL query. A single SQL query can import one or all of your database documents into a document store, provided that the documents are all in a single database and that no authentication or authorization constraints require you to make multiple queries. For example, if you must specify more than one user name and password or more than one host, then you must construct more than one SQL query. You then require a database document store for each SQL query. Documents in separate databases require their own database document stores.
When constructing an import SQL statement, the following column names (or column aliases) have specific meaning. All are not case sensitive:
DOC_REF – a unique token by which the document can be referred to for updates and deletes. A primary key column is most suitable for this.
Sybase Search supports the following SQL types:
TINYINT
SMALLINT
INTEGER
BIGINT
REAL
FLOAT
DOUBLE
CHAR
VARCHAR – You can define the maximum length of VARCHAR.
DOC_CONTENT – the text used as the body of the document. The text can be the content from TEXT or VARCHAR fields, or if used in conjunction with a content type value, it can be any of the supported MIME types.
DOC_CONTENT_TYPE – the content type (or MIME type) of the document. When content is contained in the database in a binary format, DOC_CONTENT_TYPE provides the additional information required to decode it. For example, if the document content was UTF-8 encoded, plain text, then the content type is “text/plain; charset=UTF-8”. Similarly, if the content field contains PDF bytes, the content type is “application/pdf”. When the document content is binary and no content type is specified, Sybase Search attempts to decode it as plain text using the JRE default character set.
If the content column contains a document binary and only the name or extension of the document is known (for example, Report.pdf), the name and extension can be supplied as the document content type. Sybase Search performs a MIME type for file extension lookup using its MIME Type map, and in these cases values, such as C:\Documents\Report.pdf, Report.pdf, .pdf, and pdf each represent the MIME type “application/pdf”.
DOC_LINK – a link to an external document on a file system. The link must be an absolute path, visible to Sybase Search. The document properties (where present) are extracted as metadata and Sybase Search uses the text as the document’s body text.
Sybase Search treats all other column names and aliases as metadata and saves the information with the document as its metadata. If the metadata is to be indexed, its name and type must be known by the metadata manager. You are not required to supply metadata; however, as a best practice, you should supply a document TITLE. The document TITLE is shown on the document search results page.
The following example of a SQL query shows how a recruitment agency might import their current candidate CV resumés:
SELECT ID AS DOC_REF, /*INT*/ PROFILE AS DOC_CONTENT, /*VARCHAR*/ CV AS DOC_CONTENT_2 /*BLOB*/ CV_NAME AS DOC_CONTENT_TYPE_2, FIRST_NAME + ' ' + LAST_NAME AS TITLE, PREF_SALARY /*FLOAT*/ FROM CANDIDATES WHERE LIVE=1
This example shows how the primary key column ID is used as a document reference, and how the document content (body text) is composed from both VARCHAR text in the database and a document on the file system. In this example, the MIME type of the document is not known; therefore, the original name of the document is passed to Sybase Search for it to query the appropriate MIME type. Also, a title is being constructed from the first and last name of the candidate. The preferred salary is saved as metadata.