Using readtext with unitext Columns

When you issue readtext on a column defined for the unitext datatype, the readtext offset parameter specifies the number of bytes, or Unicode values, to skip before starting to read the unitext data.

The readtext size parameter specifies the number of bytes, or 16-bit Unicode values, to read. If you specify using bytes (the default), the offset and size values are adjusted to always start and end on the Unicode character boundaries, if necessary.

If enable surrogate processing is on, readtext truncates only on the surrogate boundary, and starting/ending positions are also adjusted accordingly and returns whole Unicode characters. For this reason, issuing readtext against a column defined for unitext may return fewer bytes than specified.

In the following example, the unitext column ut includes the string U+0101U+0041U+0042U+0043:

declare @val varbinary (16) 
select @val = textptr (ut) from unitable
where i = 1
readtext foo.ut @val 1 5

This query returns the value U+0041U+0042.

The offset position is adjusted to 2 since readtext cannot start from the second byte of a Unicode character. Unicode characters are always composed of an even number of bytes. Starting at the second byte (or ending in an odd number of bytes) shifts the result by one byte, and renders the result set inaccurate.

In the example above, the size value is adjusted to 4 since readtext cannot read the partial byte of the fourth character, U+0043.

In the following query, enable surrogate processing is enabled, and the ut column contains the string U+d800dc00U+00c2U+dbffdeffU+d800dc00:

declare @val varbinary (16) 
select @val = textptr (ut) from unitable
where i = 2
readtext foo.ut @val 1 8

This query returns the value U+00c2U+dbffdeff. The starting position is reset to 2, and the actual result size is 6 bytes rather than 8 since readtext does not break in the middle of a surrogate pair. Surrogate pairs (in this example, the first value in the range d800..dbff and the second in the range dc00..dfff) require 4-byte boundaries, and the rules of Unicode conformance for UTF-16 do not allow the division of these 4-byte characters.