Processing metadata values

Parsers process metadata values, which are generally received as string key/value pairs. Document body text is processed by the system term splitter and stemmer, but metadata often must be handled differently, because metadata values can be numeric or date types as well as strings. The parsers loaded by the Text Manager are referenced in the metadata field parser and query parser XML configuration files.

There are four types of parsers:

A string parser is always handled by internal classes. You can build custom numeric and date parsers and plug them into the system if necessary.

Table 4-21 shows the attributes for the Parser tag.

Table 4-21: The Parser tag

| Attribute | Default | Description |
|---|---|---|
| identifier | None | The Parser instance’s identifier. This must be a name and a unique ID separated by an underscore (_). |
| class | None | The Java implementation class. |
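The identifier convention can be illustrated with a minimal Java sketch. It assumes that the unique ID is numeric, as in the preconfigured identifiers such as float_1 and dateMs1970_4; the source only requires a name and a unique ID separated by an underscore. The class ParserIdentifierSketch and its methods are hypothetical and are not part of the product.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits an identifier such as "float_1" into name and ID.
final class ParserIdentifierSketch {

    // Assumed shape: a name, an underscore, then a numeric ID.
    // The greedy first group means the split happens at the last underscore.
    private static final Pattern IDENTIFIER = Pattern.compile("(.+)_(\\d+)");

    static String nameOf(String identifier) {
        return match(identifier).group(1);
    }

    static int idOf(String identifier) {
        return Integer.parseInt(match(identifier).group(2));
    }

    private static Matcher match(String identifier) {
        Matcher m = IDENTIFIER.matcher(identifier);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                    "Expected <name>_<id>, got: " + identifier);
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(nameOf("dateMs1970_4") + " / " + idOf("dateMs1970_4"));
    }
}
```

A check like this only validates the shape of a single identifier. Uniqueness of the ID across all configured parsers would have to be checked separately.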

Table 4-22 shows the attributes for the Param tag.

Table 4-22: The Param tag

| Attribute | Default | Description |
|---|---|---|
| name | None | The name of the parameter to pass to the parser. |
| value | None | The string value to associate with the parameter name. |

Sybase Search comes with the preconfigured parsers shown in Table 4-23, which are adequate for most common metadata types.

Table 4-23: Preconfigured parsers

| Name | Class | Parameters | Description |
|---|---|---|---|
| float_1 | com.isdduk.text.SimpleFloatParser | | Parses strings representing decimal numbers into actual decimal numbers. For example, the string “3.142” is parsed into Java float 3.142. |
| integer_2 | com.isdduk.text.IntegerParser | | Parses strings representing an integer number into an actual integer number; any fractional part is discarded. For example, both “3” and “3.142” are parsed into Java int 3. |
| dateUK_3 | com.isdduk.text.DateFormatParser | | |
| dateMs1970_4 | com.isdduk.text.Ms1970DateParser | Name: roundTo. Value: year, month, day, hour, minute, or second; any other value means that no rounding takes place. | A date parser for strings representing long integer (64-bit) numbers, which themselves represent dates as the number of milliseconds since 1 January 1970. The preconfigured instance rounds dates to the nearest day (UTC). |
| intB2KB_5 | com.isdduk.text.B2KBIntParser | | Parses strings representing byte-size numbers and converts them into kilobyte-size numbers. For instance, the string “2048” (bytes) is parsed as Java int 2 (kilobytes). |
| datePDF_6 | com.isdduk.text.PDFDateParser | Name: roundTo. Value: year, month, day, hour, minute, or second; any other value means that no rounding takes place. | Handles the PDF date format, in which dates are formatted “D:20030602143803+01'00'”. The preconfigured instance rounds dates to the nearest day (UTC). |
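The conversions described for the preconfigured parsers can be approximated with standard Java library calls. The following is a sketch only: PreconfiguredParserSketch and its methods are hypothetical stand-ins, not the com.isdduk.text classes, and they do not use the product's parser API. It assumes that a kilobyte is 1024 bytes with truncating division, that "nearest day" rounds a value exactly at midday upward, and that PDF dates arrive in the full form shown above, with seconds and a UTC offset.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-ins for the behaviour described in Table 4-23.
final class PreconfiguredParserSketch {

    private static final long MS_PER_DAY = 86_400_000L;
    private static final DateTimeFormatter PDF_LOCAL =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Integer parsing: "3" and "3.142" both yield 3 (fractional part dropped).
    static int parseInteger(String value) {
        return new BigDecimal(value.trim()).intValue();
    }

    // Bytes to kilobytes. Assumption: 1 KB = 1024 bytes, remainder discarded.
    static int bytesToKilobytes(String value) {
        return (int) (Long.parseLong(value.trim()) / 1024L);
    }

    // Milliseconds since 1 January 1970 (UTC), supplied as a decimal string.
    static Instant parseMs1970(String value) {
        return Instant.ofEpochMilli(Long.parseLong(value.trim()));
    }

    // Full-form PDF date only, for example D:20030602143803+01'00'.
    static Instant parsePdfDate(String value) {
        // Drop the "D:" prefix and the apostrophes: 20030602143803+0100
        String body = value.trim().substring(2).replace("'", "");
        LocalDateTime local = LocalDateTime.parse(body.substring(0, 14), PDF_LOCAL);
        ZoneOffset offset = ZoneOffset.of(body.substring(14));
        return local.toInstant(offset);
    }

    // Rounds to the nearest UTC midnight. Assumption: exact midday rounds up.
    static Instant roundToNearestUtcDay(Instant instant) {
        long days = Math.floorDiv(instant.toEpochMilli() + MS_PER_DAY / 2, MS_PER_DAY);
        return Instant.ofEpochMilli(days * MS_PER_DAY);
    }

    public static void main(String[] args) {
        Instant parsed = parsePdfDate("D:20030602143803+01'00'");
        System.out.println(parsed);
        System.out.println(roundToNearestUtcDay(parsed));
        System.out.println(bytesToKilobytes("2048"));
    }
}
```

Under these assumptions, the PDF date example corresponds to 13:38:03 UTC on 2 June 2003, which is past midday and therefore rounds forward to midnight at the start of 3 June. Whether the shipped parsers round this way or truncate to the start of the day is not stated in the source.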