unichar datatype

The unichar and univarchar datatypes support the UTF-16 encoding of Unicode in the Adaptive Server. These datatypes are independent of the char and varchar datatypes, but mirror their behavior.

For example, the built-in functions that operate on char and varchar also operate on unichar and univarchar. However, unichar and univarchar store only UTF-16 characters and have no connection to the default character set ID or the default sort order ID, as char and varchar do.

Each unichar/univarchar character requires two bytes of storage. The declaration of a unichar/univarchar column is the number of 16-bit Unicode values. The following example creates a table with one unichar column for 10 Unicode values, requiring 20 bytes of storage:

create table unitbl (unicol unichar(10))

The length of a unichar/univarchar column is limited by the size of a data page, as is the length of char/varchar columns.

Unicode surrogate pairs use the storage of two 16-bit Unicode values (in other words, four bytes). Be aware of this when declaring columns intended to store Unicode surrogate pairs (a pair of 16 bit values that represent a character in the range [0x010000..0x10FFFF]). By default, Adaptive Server correctly handles surrogates, and does not split the pair. Truncation of Unicode data is handled in a manner similar to that of char and varchar data.

You can use unichar expressions anywhere char expressions are used, including comparison operators, joins, subqueries, and so forth. However, mixed-mode expressions of both unichar and char are performed as unichar. The number of Unicode values that can participate in such operations is limited to the maximum size of a unichar string.

The normalization process modifies Unicode data so there is only a single representation in the database for a given sequence of abstract character (see “Introduction to the Basics,” in the Performance and Tuning Series: Basics for a discussion of normalization). Often, characters followed by combined diacritics are replaced by pre-combined forms. This allows significant performance optimizations. By default, the server assumes all Unicode data should be normalized.


Relational expressions

All relational expressions involving at least one expression of unichar or univarchar, are based on the default Unicode sort order. If one expression is unichar and the other is varchar (nvarchar, char, or nchar), the latter is implicitly converted to unichar.

Table 6-4 shows which expressions are most often used in where clauses, and where they may be combined with logical operators.

When comparing Unicode character data, “less than” means closer to the beginning of the default Unicode sort order, and “greater than” means closer to the end. “Equality” means the Unicode default sort order makes no distinction between two values (although they need not be identical). For example, the precomposed character ê must be considered equal to the combining sequence consisting of the letter e followed by U+0302. (A precomposed character is a Unicode character you can decompose into an equivalent string of several other characters.) If the Unicode normalization feature is turned on (the default), Unicode data is automatically normalized and the server never sees unnormalized data.

Table 6-4: Relational expressions

expr1 op_compare [any | all] (subquery)

The use of any or all with comparison operators and subquery expr2 , implicitly invokes min or max. For instance, “expr1> any expr2” means, in effect, “expr1> min(expr2)”.

expr1 [not] in (expression list)

expr1 [not] in (subquery)

The in operator checks for equality with each of the elements in expr2, which can be a list of constants, or the results of a subquery.

expr1 [not] between expr2 and expr3

The between operator specifies a range. It is, in effect, shorthand for “expr1 = expr2 and expr1<= expr3”.

expr1 [not] like "match_string" [escape"esc_char”]

The like operator specifies a pattern to be matched. The semantics for pattern matching with Unicode data are the same as for regular character data. If expr1 is a unichar column name, then “match_string” may be either a unichar string or a varchar string. In the latter case, an implicit conversion takes place between varchar and unichar


Join operators

Join operators appear in the same manner as comparison operators. In fact, you can use any comparison operator can be used in a join. Expressions involving at least one expression of type unichar are based on the default Unicode sort order. If one expression is of type unichar and the other type varchar (nvarchar, char, or nchar), the latter is implicitly converted to unichar.


Union operators

The union operator operates with unichar data much like it does with varchar data. Corresponding columns from individual queries must be implicitly convertible to unichar, or explicit conversion must be used.


Clauses and modifiers

When unichar and univarchar columns are used in group by and order by clauses, equality is judged according to the default Unicode sort order. This is also true when using the distinct modifier.