Overview of character sets, encodings, and collations

Each piece of software works with a character set. A character set is a set of symbols, including letters, digits, spaces, and other symbols. An example of a character set is ISO-8859-1, also known as Latin1.

To properly represent these characters internally, each piece of software employs an encoding, also known as a character encoding. An encoding is a method by which each character is mapped onto one or more bytes of information; the byte values are conventionally written as hexadecimal numbers. An example of an encoding is UTF-8.
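
As a rough illustration of this mapping, the following Python sketch prints the bytes that two encodings assign to the same character. The helper name hex_bytes is hypothetical and chosen only for this example; it relies on Python's built-in codecs rather than on any database software.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the bytes that `encoding` assigns to `text`, written in hexadecimal.

    hex_bytes is a hypothetical helper for illustration only.
    """
    return text.encode(encoding).hex(" ")


if __name__ == "__main__":
    for encoding in ("latin-1", "utf-8"):
        # The same character can map to different bytes under different encodings.
        print(encoding, hex_bytes("\u00e9", encoding))  # U+00E9 is the letter e with acute accent
```

Under these assumptions, the character is a single byte (e9) in Latin1 but two bytes (c3 a9) in UTF-8, which is why software must know which encoding was used before it can interpret stored bytes.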

Sometimes the terms character set and encoding are used interchangeably, since the two aspects are so closely related.

A code page is one form of encoding. A code page is a mapping of characters to numeric representations, typically an integer between 0 and 255. An example of a code page is Windows code page 1252.
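
A code page lookup can be sketched in Python, which ships a codec named cp1252 for Windows code page 1252. The helper name decode_byte is hypothetical; the sketch only shows that one integer between 0 and 255 selects one character, and that the same integer may select a different character in another code page.

```python
def decode_byte(value: int, code_page: str) -> str:
    """Return the character that `code_page` assigns to a single byte value.

    decode_byte is a hypothetical helper for illustration only.
    """
    return bytes([value]).decode(code_page)


if __name__ == "__main__":
    # 0x41 is the letter A in both code pages.
    print(decode_byte(0x41, "cp1252"), decode_byte(0x41, "latin-1"))
    # 0x80 is the euro sign in code page 1252 but a control character in Latin1.
    print(repr(decode_byte(0x80, "cp1252")), repr(decode_byte(0x80, "latin-1")))
```

A few byte values are unassigned in code page 1252 (0x81, for example), and decoding them this way raises an error rather than returning a character.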

For the purposes of this documentation, the terms encoding, character encoding, character set encoding, and code page are synonymous.

Database servers, which sort characters (for example, to list names alphabetically), use a collation. A collation is a combination of a character encoding (a map between characters and their representation) and a sort order for the characters. There may be more than one sort order for each character set: for example, a case-sensitive order and a case-insensitive order, or two languages that sort the same characters in a different order.
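
The effect of having two sort orders over the same characters can be approximated in Python. This is only a sketch, not how a database server implements collations: the case-sensitive order here is plain code point order, the case-insensitive order uses str.casefold as the sort key, and both function names are hypothetical.

```python
def sort_case_sensitive(values):
    """Sort by code point, so uppercase ASCII letters come before lowercase ones."""
    return sorted(values)


def sort_case_insensitive(values):
    """Sort after folding case, so uppercase and lowercase letters compare equal."""
    return sorted(values, key=str.casefold)


if __name__ == "__main__":
    names = ["apple", "Banana", "cherry"]
    print(sort_case_sensitive(names))
    print(sort_case_insensitive(names))
```

With this input, the case-sensitive order places Banana first because uppercase B has a lower code point than lowercase a, while the case-insensitive order returns apple, Banana, cherry.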

Characters are printed or displayed on a screen using a font, which is a mapping between characters in the character set and their appearance. Fonts are handled by the operating system.

Operating systems also use a keyboard mapping to map keys or key combinations on the keyboard to characters in the character set.
