

Character Views

To understand character set standards, you must distinguish between three distinct views of the characters:

  • Character repertoire (the abstract list of characters).
  • Characters as "code points" with scalar values.
  • Characters as encoded data.

Character repertoire (the abstract list of characters)

The character repertoire is an abstract list of well over 100,000 characters found in a wide variety of scripts, including Latin, Cyrillic, Chinese, Korean, Japanese, Hebrew, and Aramaic. Other symbols, such as musical notation, are also included in the character repertoire.

Both the Unicode and GB18030 standards have a character repertoire. As new characters are added to one standard, the other standard also adds those characters, to maintain parity.

Characters as "code points" with scalar values

Note

This second character view applies only to Unicode, not to GB18030.

Each character in the character repertoire is assigned to a "code point". Each code point has a specific numerical value, called its scalar value. The scalar value is often expressed in hexadecimal.
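As a minimal sketch of the relationship between a character and its code point, the following Python snippet uses the built-in `ord` function to obtain the numerical value and formats it in the conventional `U+` hexadecimal notation. The helper name `code_point_label` is an assumption made for this illustration and is not part of either standard.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX notation for a single character."""
    # ord() returns the code point's numerical value as an integer.
    # At least four hexadecimal digits are shown, as in Unicode notation.
    return f"U+{ord(ch):04X}"


# Latin capital A, a CJK ideograph, and an emoji outside the lowest plane.
for ch in ("A", "\u4e2d", "\U0001F600"):
    print(ord(ch), code_point_label(ch))
```

The same character can therefore be referred to by its decimal value (for example, 65) or by its hexadecimal label (U+0041).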

Code points exist in a "code space". The code space consists of a range of values, which are divided across 17 planes of 65,536 (64K) values each. The planes fall into two groups:

  • Basic Multilingual Plane (plane 0, 64K in size).

    In Unicode, the hexadecimal expression of the values in this lowest plane ranges from U+0000 to U+FFFF.

  • Supplementary planes (16 additional planes, each 64K in size).

    In Unicode, the hexadecimal expression of the values in these upper planes ranges from U+10000 to U+10FFFF.

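The complete code space is 17 * 64K in size (17 * 65,536 = 1,114,112 possible code points). Of these, the 2,048 surrogate code points (U+D800 to U+DFFF) are not scalar values, which leaves 1,112,064 possible scalar values.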
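The size of the code space can be checked with simple arithmetic: 17 planes of 65,536 code points each give 1,114,112 code points. The following sketch also shows how the plane of a code point can be derived from its value. The names `plane_of` and `is_scalar_value` are hypothetical helpers chosen for this example.

```python
PLANE_SIZE = 0x10000        # 65,536 code points per plane
PLANE_COUNT = 17            # plane 0 plus 16 supplementary planes
MAX_CODE_POINT = 0x10FFFF   # last code point in the code space


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not (0 <= code_point <= MAX_CODE_POINT):
        raise ValueError(f"outside the code space: {code_point:#x}")
    # Each plane spans 0x10000 values, so the plane is the value above bit 16.
    return code_point >> 16


def is_scalar_value(code_point: int) -> bool:
    """Return True for any code point that is not a surrogate."""
    in_code_space = 0 <= code_point <= MAX_CODE_POINT
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_code_space and not is_surrogate


print(PLANE_SIZE * PLANE_COUNT)   # total number of code points
print(plane_of(0x4E2D))           # a code point in the lowest plane
print(plane_of(0x1F600))          # a code point in a supplementary plane
```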

Characters as encoded data

Each encoding form converts characters from the character repertoire to encoded data.

In GB18030, the encoded data is derived directly from the character repertoire. The concept of a scalar value as an intermediary between the character repertoire and the encoded data applies only to Unicode.
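To see what GB18030 encoded data looks like, the following sketch uses the `gb18030` codec that ships with the Python standard library. Note that Python strings are themselves sequences of Unicode code points, so this example only illustrates the resulting bytes and their variable length (one, two, or four bytes per character); it does not show how the standard itself defines the mapping. The helper name `gb18030_bytes` is an assumption made for this illustration.

```python
def gb18030_bytes(text: str) -> bytes:
    """Encode text with Python's built-in gb18030 codec."""
    return text.encode("gb18030")


# An ASCII letter, a common CJK ideograph, and a character from a
# supplementary plane, identified here by their Unicode code points.
for code_point in (0x0041, 0x4E2D, 0x20000):
    encoded = gb18030_bytes(chr(code_point))
    print(f"U+{code_point:04X}: {len(encoded)} byte(s): {encoded.hex(' ')}")
```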

In Unicode, the encoded data is derived by applying an algorithm to the scalar value.
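As an illustration of such an algorithm, the following sketch derives UTF-16 code units from a scalar value. Values in the lowest plane map to a single 16-bit code unit; higher values are split into a surrogate pair. The function name `utf16_code_units` is hypothetical, and the sketch is intended only to show the principle, not to replace a library encoder.

```python
def utf16_code_units(scalar: int) -> list[int]:
    """Return the UTF-16 code units (16-bit integers) for a scalar value."""
    if not (0 <= scalar <= 0x10FFFF) or (0xD800 <= scalar <= 0xDFFF):
        raise ValueError(f"not a Unicode scalar value: {scalar:#x}")
    if scalar < 0x10000:
        # Values in the lowest plane are stored as a single code unit.
        return [scalar]
    # Higher values: subtract 0x10000, then split the remaining 20 bits
    # into a high (leading) and a low (trailing) surrogate of 10 bits each.
    offset = scalar - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


print([f"{unit:04X}" for unit in utf16_code_units(0x4E2D)])
print([f"{unit:04X}" for unit in utf16_code_units(0x1F600)])
```

The result can be compared against a library encoder, for example by encoding the same character with Python's `utf-16-be` codec.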

Unicode defines three character encoding forms:

  • UTF-8
  • UTF-16
  • UTF-32
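The three encoding forms produce different encoded data for the same scalar value. The following sketch uses Python's standard codecs to compare them; the big-endian variants (`utf-16-be`, `utf-32-be`) are chosen here only so that no byte order mark is added to the output. The helper name `encoding_forms` is an assumption made for this example.

```python
def encoding_forms(ch: str) -> dict[str, bytes]:
    """Encode one character in each of the three Unicode encoding forms."""
    return {
        "UTF-8": ch.encode("utf-8"),       # 8-bit code units, 1 to 4 per character
        "UTF-16": ch.encode("utf-16-be"),  # 16-bit code units, 1 or 2 per character
        "UTF-32": ch.encode("utf-32-be"),  # one 32-bit code unit per character
    }


for code_point in (0x0041, 0x4E2D, 0x1F600):
    for name, data in encoding_forms(chr(code_point)).items():
        print(f"U+{code_point:04X} {name}: {data.hex(' ')} ({len(data)} bytes)")
```

UTF-32 always uses four bytes per character, whereas UTF-8 and UTF-16 use more bytes only for larger scalar values.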