cmap — Character to Glyph Index Mapping Table

Table overview

This table defines the mapping of character codes to a default glyph index. Different subtables may be defined that each contain mappings for different character encoding schemes. The table header indicates the character encodings for which subtables are present.

Regardless of the encoding scheme, character codes that do not correspond to any glyph in the font should be mapped to glyph index 0. The glyph at this location must be a special glyph representing a missing character, commonly known as .notdef.

Each subtable is in one of nine possible formats and begins with a format field indicating the format used. The first four formats (formats 0, 2, 4 and 6) were originally defined prior to Unicode 2.0. These formats allow for 8-bit single-byte, 8-bit multi-byte, and 16-bit encodings. With the introduction of supplementary planes in Unicode 2.0, the Unicode addressable code space extends beyond 16 bits. To accommodate this, three additional formats were added (formats 8, 10 and 12) that allow for 32-bit encoding schemes.

Other enhancements in Unicode led to the addition of other subtable formats. Subtable format 13 allows for an efficient mapping of many characters to a single glyph; this is useful for “last-resort” fonts that provide fallback rendering for all possible Unicode characters with a distinct fallback glyph for different Unicode ranges. Subtable format 14 provides a unified mechanism for supporting Unicode variation sequences.

Of the nine available formats, not all are commonly used today. Formats 4 or 12 are appropriate for most new fonts, depending on the Unicode character repertoire supported. Format 14 is used in many applications for support of Unicode variation sequences. Some platforms also make use of format 13 for a last-resort fallback font. Other subtable formats are not recommended for use in new fonts. Application developers, however, should anticipate that any of the formats may be used in fonts.

Note: The 'cmap' table version number remains at 0x0000 for fonts that make use of the newer subtable formats.

'cmap' Header

The Character to Glyph Index Mapping table is organized as follows.

'cmap' Header:

Type Name Description
uint16 version Table version number (0).
uint16 numTables Number of encoding tables that follow.
EncodingRecord encodingRecords[numTables]

Encoding records and encodings

The array of encoding records specifies particular encodings and the offset to the subtable for each encoding.

EncodingRecord:

Type Name Description
uint16 platformID Platform ID.
uint16 encodingID Platform-specific encoding ID.
Offset32 subtableOffset Byte offset from beginning of table to the subtable for this encoding.

The platform ID and platform-specific encoding ID in the encoding record are used to specify a particular character encoding. In the case of the Macintosh platform, a language field within the mapping subtable is also used for this purpose.

The encoding record entries in the 'cmap' header must be sorted first by platform ID, then by platform-specific encoding ID, and then by the language field in the corresponding subtable. Each platform ID, platform-specific encoding ID, and subtable language combination may appear only once in the 'cmap' table.
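As a rough illustration of the header layout, the following Python sketch reads the version, numTables and the encoding records from a byte string. It assumes big-endian field storage, as used throughout OpenType. The function names and the sample bytes are hypothetical, and the ordering check covers only platform ID and encoding ID, not the language tie-break.

```python
import struct


def parse_cmap_header(data: bytes):
    """Return (version, [(platformID, encodingID, subtableOffset), ...])."""
    version, num_tables = struct.unpack_from(">HH", data, 0)
    records = []
    for i in range(num_tables):
        # Each EncodingRecord is 8 bytes: uint16, uint16, Offset32.
        records.append(struct.unpack_from(">HHI", data, 4 + 8 * i))
    return version, records


def records_are_sorted(records) -> bool:
    """Check ordering by (platformID, encodingID) only.

    The language tie-break would require reading each subtable
    and is left out of this sketch.
    """
    keys = [(platform, encoding) for platform, encoding, _ in records]
    return keys == sorted(keys)


# Hypothetical header: two encoding records sharing one subtable at offset 20.
sample = struct.pack(">HH", 0, 2)
sample += struct.pack(">HHI", 0, 3, 20)
sample += struct.pack(">HHI", 3, 1, 20)
version, records = parse_cmap_header(sample)
```

Pointing two encoding records at the same subtable offset, as the sample does, lets a font serve both platform 0 and platform 3 from one set of mapping data.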

Apart from a format 14 subtable, all other subtables are exclusive: applications should select and use one and ignore the others. If a Unicode subtable is used (platform 0, or platform 3 / encoding 1 or 10), then a format 14 subtable using platform 0/encoding 5 can also be used alongside it to map Unicode variation sequences.

If a font includes Unicode subtables for both 16-bit encoding (typically, format 4) and also 32-bit encoding (formats 10 or 12), then the characters supported by the subtable for 32-bit encoding should be a superset of the characters supported by the subtable for 16-bit encoding, and the 32-bit encoding should be used by applications.

Fonts should not include 16-bit Unicode subtables using both format 4 and format 6; format 4 should be used. Similarly, fonts should not include 32-bit Unicode subtables using both format 10 and format 12; format 12 should be used.

If a font includes encoding records for Unicode subtables of the same format but with different platform IDs, an application may choose which to select, but should make this selection consistently each time the font is used.
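One way an application might make this selection consistently is to walk a fixed preference list. The sketch below is only an illustration: the PREFERRED order (full-repertoire encodings first, then BMP-only, Windows before Unicode platform) is an assumption, not something the specification mandates, and choose_unicode_subtable is a hypothetical helper.

```python
# Assumed preference order: 32-bit Unicode encodings before BMP-only ones.
# Each entry is (platformID, encodingID).
PREFERRED = [(3, 10), (0, 4), (3, 1), (0, 3)]


def choose_unicode_subtable(records):
    """Pick one Unicode encoding record from (platformID, encodingID, offset) tuples.

    Returns ((platformID, encodingID), offset), or None if the font has no
    Unicode subtable in the preference list.
    """
    available = {(platform, encoding): offset
                 for platform, encoding, offset in records}
    for key in PREFERRED:
        if key in available:
            return key, available[key]
    return None
```

Because the list is fixed, the same font always yields the same choice, which satisfies the consistency requirement regardless of which order is picked.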

Platform IDs

The following platform IDs are defined:

Platform ID Platform name Platform-specific encoding IDs
0 Unicode Various
1 Macintosh Script manager code
2 ISO [deprecated] ISO encoding [deprecated]
3 Windows Windows encoding
4 Custom Custom

Platform ID values 240 through 255 are reserved for user-defined platforms and shall never be assigned to a registered platform.

Unicode platform (platform ID = 0)

The following encoding IDs are defined for use with the Unicode platform:

Encoding ID Description
0 Unicode 1.0 semantics—deprecated
1 Unicode 1.1 semantics—deprecated
2 ISO/IEC 10646 semantics—deprecated
3 Unicode 2.0 and onwards semantics, Unicode BMP only
4 Unicode 2.0 and onwards semantics, Unicode full repertoire
5 Unicode variation sequences—for use with subtable format 14
6 Unicode full repertoire—for use with subtable format 13

Use of encoding IDs 0, 1 or 2 is deprecated.

Encoding ID 3 should be used in conjunction with 'cmap' subtable formats 4 or 6. Encoding ID 4 should be used in conjunction with subtable formats 10 or 12.

Unicode variation sequences supported by the font should be specified in the 'cmap' table using a format 14 subtable. A format 14 subtable must only be used under platform ID 0 and encoding ID 5; and encoding ID 5 should only be used with a format 14 subtable.

Encoding ID 6 should only be used in conjunction with 'cmap' subtable format 13; and subtable format 13 should only be used under platform ID 0 and encoding ID 6.

Macintosh platform (platform ID = 1)

Older Macintosh versions required fonts to have a 'cmap' subtable for platform ID 1. For current Apple platforms, use of platform ID 1 is discouraged. See the 'name' table chapter for details regarding encoding IDs defined for the Macintosh platform.

ISO platform (platform ID = 2)

Use of this platform ID is deprecated.

The following encoding IDs are defined for use with the ISO platform:

Code ISO encoding
0 7-bit ASCII
1 ISO 10646
2 ISO 8859-1

Windows platform (platform ID = 3)

The Windows platform supports several encodings. When creating fonts for Windows, Unicode 'cmap' subtables should always be used—platform ID 3 with encoding ID 1 or encoding ID 10. See below for additional details.

The following encoding IDs are supported on the Windows platform:

Platform ID Encoding ID Description
3 0 Symbol
3 1 Unicode BMP
3 2 ShiftJIS
3 3 PRC
3 4 Big5
3 5 Wansung
3 6 Johab
3 7 Reserved
3 8 Reserved
3 9 Reserved
3 10 Unicode full repertoire

Fonts that support only Unicode BMP characters (U+0000 to U+FFFF) on the Windows platform must use encoding 1 with a format 4 subtable. This encoding must not be used to support Unicode supplementary-plane characters.

Fonts that support Unicode supplementary-plane characters (U+10000 to U+10FFFF) on the Windows platform must use encoding 10 with a format 12 subtable.

The symbol encoding was created to support fonts with arbitrary ornaments or symbols not supported in Unicode or other standard encodings. A format 4 subtable would be used, typically with up to 224 graphic characters assigned at code positions beginning with 0xF020. This corresponds to a sub-range within the Unicode Private-Use Area (PUA), though this is not a Unicode encoding. In legacy usage, some applications would represent the symbol characters in text using a single-byte encoding, and then map 0x20 to the OS/2.usFirstCharIndex value in the font. In new fonts, symbols or characters not in Unicode should be encoded using PUA code points in a Unicode 'cmap' subtable.

See the Recommendations chapter for additional information.

Custom platform (platform ID = 4) and OTF Windows NT compatibility mapping

Platform ID 4 is a legacy platform that was created to provide compatibility of older applications with OpenType fonts that had been adapted from older Type 1 fonts. This platform is not commonly used today, and should not be used in new fonts.

ID Custom encoding
0-255 OTF Windows NT compatibility mapping

This 'cmap' platform provides a compatibility mechanism for non-Unicode applications that use the font as if it were Windows ANSI encoded. Non-Windows ANSI Type 1 fonts, such as Cyrillic and Central European fonts that Adobe shipped in the past, had “0” (Windows ANSI) recorded in the CharSet field of the .PFM file; Adobe Type Manager for Windows 9x ignored the CharSet altogether. Adobe provides this compatibility 'cmap' encoding in every OpenType font converted from a Type 1 font in which the Encoding is not StandardEncoding.

When platform ID 4 is used, the encoding ID must be set to the Windows charset value (in the range 0 to 255, inclusive) present in the .PFM file of the original Type 1 font.

If a platform ID 4, encoding ID 0 – 255 'cmap' encoding is present in an OpenType font with CFF outlines, then the OTF font driver in Windows NT will: (a) superimpose the glyphs encoded at character codes 0-255 in the encoding on the corresponding Windows ANSI (code page 1252) Unicode values in the Unicode encoding it reports to the system; (b) add Windows ANSI (CharSet 0) to the list of CharSets supported by the font; and (c) consider the value of the encoding ID to be a Windows CharSet value and add it to the list of CharSets supported by the font. Note that the 'cmap' subtable needs to use Format 0 or 6 for its subtable, and the encoding needs to be identical to the CFF’s encoding.

'cmap' subtable formats

Use of the language field in 'cmap' subtables

All 'cmap' subtable formats include a language field. The language field must be set to zero for all 'cmap' subtables whose platform IDs are other than Macintosh (platform ID 1). For 'cmap' subtables whose platform IDs are Macintosh, set this field to the Macintosh language ID of the 'cmap' subtable plus one, or to zero if the 'cmap' subtable is not language-specific. For example, a Mac OS Turkish 'cmap' subtable must set this field to 18, since the Macintosh language ID for Turkish is 17. A Mac OS Roman 'cmap' subtable must set this field to 0, since Mac OS Roman is not a language-specific encoding.
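The rule for the language field can be sketched as a small helper. The function name and its parameters are hypothetical; passing None for mac_language_id stands for a subtable that is not language-specific.

```python
MACINTOSH_PLATFORM_ID = 1


def cmap_language_field(platform_id: int, mac_language_id=None) -> int:
    """Value to store in a subtable's language field."""
    if platform_id != MACINTOSH_PLATFORM_ID:
        return 0  # must be zero on every non-Macintosh platform
    if mac_language_id is None:
        return 0  # Macintosh subtable that is not language-specific
    return mac_language_id + 1  # Macintosh language ID plus one
```

For the Turkish example above, cmap_language_field(1, 17) gives 18.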

Format 0: Byte encoding table

Format 0 was the standard mapping subtable used on older Macintosh platforms but is not required on newer Apple platforms.

'cmap' Subtable Format 0:

Type Name Description
uint16 format Format number is set to 0.
uint16 length This is the length in bytes of the subtable.
uint16 language For requirements on use of the language field, see “Use of the language field in 'cmap' subtables” in this document.
uint8 glyphIdArray[256] An array that maps character codes to glyph index values.

This is a simple 1 to 1 mapping of character codes to glyph indices. The glyph set is limited to 256. If this format is used to index into a larger glyph set, only the first 256 glyphs will be accessible.
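A format 0 lookup amounts to a single array index. The following Python sketch assumes big-endian storage; lookup_format0 and the sample subtable are hypothetical.

```python
import struct


def lookup_format0(subtable: bytes, code: int) -> int:
    """Map a single-byte character code to a glyph index."""
    fmt, _length, _language = struct.unpack_from(">HHH", subtable, 0)
    if fmt != 0:
        raise ValueError("not a format 0 subtable")
    if not 0 <= code <= 255:
        return 0  # outside the byte range: missing glyph
    # glyphIdArray starts right after the 6-byte header, one byte per code.
    return subtable[6 + code]


# Hypothetical subtable mapping only 'A' (0x41) to glyph 36.
glyph_ids = bytearray(256)
glyph_ids[0x41] = 36
sample = struct.pack(">HHH", 0, 262, 0) + bytes(glyph_ids)
```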

Format 2: High byte mapping through table

This subtable format was created for “double-byte” encodings following national character code standards used for Japanese, Chinese, and Korean characters. These code standards use a mixed 8-/16-bit encoding. This format is not commonly used today.

In these mixed 8-/16-bit encodings, certain byte values signal the first byte of a 2-byte character. (These byte values are also valid as the second byte of a 2-byte character.) In addition, even for the 2-byte characters, the mapping of character codes to glyph index values depends heavily on the first byte. Consequently, the table begins with an array that maps the first byte to a SubHeader record. For 2-byte character codes, the SubHeader is used to map the second byte’s value into a sub-range of a glyph index array (a sub-array), as described below. When processing mixed 8-/16-bit text, SubHeader 0 is special: it is used for single-byte character codes. When SubHeader 0 is used, a second byte is not needed; the single byte value is mapped through the specified sub-array.

'cmap' Subtable Format 2:

Type Name Description
uint16 format Format number is set to 2.
uint16 length This is the length in bytes of the subtable.
uint16 language For requirements on use of the language field, see “Use of the language field in 'cmap' subtables” in this document.
uint16 subHeaderKeys[256] Array that maps high bytes into the subHeaders array: value is subHeaders index × 8.
SubHeader subHeaders[ ] Variable-length array of SubHeader records.
uint16 glyphIdArray[ ] Variable-length array containing sub-arrays used for mapping the low byte of 2-byte characters.

The SubHeader record is structured as follows:

SubHeader Record:

Type Name Description
uint16 firstCode First valid low byte for this SubHeader.
uint16 entryCount Number of valid low bytes for this SubHeader.
int16 idDelta See text below.
uint16 idRangeOffset See text below.

The firstCode and entryCount values specify a subrange that begins at firstCode and has a length equal to the value of entryCount. This subrange stays within the 0-255 range of the byte being mapped. Bytes outside of this subrange are mapped to glyph index 0 (missing glyph). The offset of the byte within this subrange is then used as index into a corresponding sub-array of glyphIdArray. This sub-array is also of length entryCount. The value of the idRangeOffset is the number of bytes past the actual location of the idRangeOffset word where the glyphIdArray element corresponding to firstCode appears.

Finally, if the value obtained from the sub-array is not 0 (which indicates the missing glyph), you should add idDelta to it in order to get the glyphIndex. The value idDelta permits the same sub-array to be used for several different subheaders. The idDelta arithmetic is modulo 65536. If the result after adding idDelta to the value from the sub-array is less than zero, add 65536 to obtain a valid glyph ID.
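The format 2 steps described above can be sketched in Python as follows. This is an illustration under stated assumptions: big-endian storage, a byte is treated as a single-byte code when its subHeaderKeys entry is 0, and both lookup_format2 and the sample table built by build_sample_format2 are hypothetical.

```python
import struct

# Header (3 x uint16) followed by subHeaderKeys[256].
SUBHEADERS_START = 6 + 256 * 2


def lookup_format2(subtable: bytes, code: int) -> int:
    """Map a 1-byte or 2-byte character code to a glyph index."""
    def key_for(byte: int) -> int:
        return struct.unpack_from(">H", subtable, 6 + 2 * byte)[0]

    if not 0 <= code <= 0xFFFF:
        return 0
    if code <= 0xFF:
        if key_for(code) != 0:
            return 0  # a lead byte on its own is not a character
        sub_off, low = SUBHEADERS_START, code  # SubHeader 0
    else:
        key = key_for(code >> 8)  # key is subHeaders index x 8
        if key == 0:
            return 0  # high byte is not a lead byte
        sub_off, low = SUBHEADERS_START + key, code & 0xFF

    first_code, entry_count, id_delta, id_range_offset = struct.unpack_from(
        ">HHhH", subtable, sub_off)
    index = low - first_code
    if not 0 <= index < entry_count:
        return 0
    # idRangeOffset counts bytes from the idRangeOffset field itself,
    # which sits 6 bytes into the SubHeader.
    pos = sub_off + 6 + id_range_offset + 2 * index
    (value,) = struct.unpack_from(">H", subtable, pos)
    return (value + id_delta) & 0xFFFF if value else 0


def build_sample_format2() -> bytes:
    """Hypothetical table: single bytes 0x41-0x42, lead byte 0x81."""
    keys = [0] * 256
    keys[0x81] = 8  # SubHeader 1
    sub0 = struct.pack(">HHhH", 0x41, 2, 0, 10)
    sub1 = struct.pack(">HHhH", 0x40, 3, 5, 6)
    glyphs = struct.pack(">5H", 10, 11, 1, 0, 3)
    body = struct.pack(">256H", *keys) + sub0 + sub1 + glyphs
    return struct.pack(">HHH", 2, 6 + len(body), 0) + body
```

In the sample, the second SubHeader uses idDelta = 5, so the stored value 1 for code 0x8140 becomes glyph 6, while the stored 0 for 0x8141 stays the missing glyph.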

Format 4: Segment mapping to delta values

This is the standard character-to-glyph-index mapping subtable for fonts that support only Unicode Basic Multilingual Plane characters (U+0000 to U+FFFF).

Note: To support Unicode supplementary-plane characters, format 12 should be used.

This format is used when the character codes for the characters represented by a font fall into several contiguous ranges, possibly with holes in some or all of the ranges (that is, some of the codes in a range might not have a representation in the font). The format-dependent data is divided into three parts, which must occur in the following order:

  1. A four-word header gives parameters for an optimized search of the segment list.
  2. Four parallel arrays describe the segments (one segment for each contiguous range of codes).
  3. A variable-length array of glyph IDs (unsigned words).

'cmap' Subtable Format 4:

Type Name Description
uint16 format Format number is set to 4.
uint16 length This is the length in bytes of the subtable.
uint16 language For requirements on use of the language field, see “Use of the language field in 'cmap' subtables” in this document.
uint16 segCountX2 2 × segCount.
uint16 searchRange Maximum power of 2 less than or equal to segCount, times 2: (2**floor(log2(segCount))) * 2, where “**” is an exponentiation operator.
uint16 entrySelector Log2 of the maximum power of 2 less than or equal to segCount: log2(searchRange/2), which is equal to floor(log2(segCount)).
uint16 rangeShift segCount times 2, minus searchRange: (segCount * 2) - searchRange.
uint16 endCode[segCount] End characterCode for each segment, last=0xFFFF.
uint16 reservedPad Set to 0.
uint16 startCode[segCount] Start character code for each segment.
int16 idDelta[segCount] Delta for all character codes in segment.
uint16 idRangeOffset[segCount] Offsets into glyphIdArray or 0
uint16 glyphIdArray[ ] Glyph index array (arbitrary length)

The number of segments is specified by segCount, which is not given directly in the header but is readily derived from segCountX2. All of the other header parameters are derived from it. The searchRange value is twice the largest power of 2 that is less than or equal to segCount. For example, if segCount=39, we have the following:

segCountX2 78
searchRange 64 (= 2 × (largest power of 2 <= 39))
entrySelector 5 (= log2(32))
rangeShift 14 (= 2 × 39 - 64)

To assist in quick binary searches, the searchRange, entrySelector and rangeShift fields are included as parameters that can be used in configuring search algorithms. In particular, binary search is optimal when the number of entries is a power of two. The searchRange field provides the largest number of items that can be searched with that constraint (maximum power of two). The rangeShift field provides the remaining number of items that would also need to be searched. The entrySelector field indicates the maximum number of levels of the binary tree that will need to be entered.

In early implementations on devices with limited hardware capabilities, optimizations provided by the searchRange, entrySelector and rangeShift fields were of high importance. They have less importance on modern devices but could still be used in some implementations. However, incorrect values could potentially be used as an attack vector against some implementations. Since these values can be derived from the segCountX2 field when the file is parsed, it is strongly recommended that parsing implementations not rely on the searchRange, entrySelector and rangeShift fields in the font but derive them independently from segCountX2. Font files, however, should continue to provide valid values for these fields to maintain compatibility with all existing implementations.
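Deriving the three search fields from segCount takes only a few integer operations. The sketch below uses a hypothetical helper name, search_params, and relies on Python's int.bit_length to compute floor(log2(segCount)).

```python
def search_params(seg_count: int):
    """Return (searchRange, entrySelector, rangeShift) for a segment count."""
    if seg_count < 1:
        raise ValueError("segCount must be at least 1")
    entry_selector = seg_count.bit_length() - 1  # floor(log2(segCount))
    search_range = 2 * (1 << entry_selector)     # 2 x largest power of 2 <= segCount
    range_shift = 2 * seg_count - search_range
    return search_range, entry_selector, range_shift
```

A parser could compare these derived values against the stored fields to flag inconsistent fonts, while using only the derived ones for the actual search.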

Each segment is described by a startCode and endCode, along with an idDelta and an idRangeOffset, which are used for mapping the character codes in the segment. The segments are sorted in order of increasing endCode values, and the segment values are specified in four parallel arrays. You search for the first endCode that is greater than or equal to the character code you want to map. If the corresponding startCode is less than or equal to the character code, then you use the corresponding idDelta and idRangeOffset to map the character code to a glyph index (otherwise, the missingGlyph is returned). For the search to terminate, the final startCode and endCode values must be 0xFFFF. This segment need not contain any valid mappings. (It can just map the single character code 0xFFFF to missingGlyph). However, the segment must be present.

If the idRangeOffset value for the segment is not 0, the mapping of character codes relies on glyphIdArray. The character code offset from startCode is added to the idRangeOffset value. This sum is used as an offset from the current location within idRangeOffset itself to index out the correct glyphIdArray value. This obscure indexing trick works because glyphIdArray immediately follows idRangeOffset in the font file. The C expression that yields the glyph index is:

glyphId = *(idRangeOffset[i]/2
            + (c - startCode[i])
            + &idRangeOffset[i])

The value c is the character code in question, and i is the segment index in which c appears. If the value obtained from the indexing operation is not 0 (which indicates missingGlyph), idDelta[i] is added to it to get the glyph index. The idDelta arithmetic is modulo 65536.

If the idRangeOffset is 0, the idDelta value is added directly to the character code offset (i.e. idDelta[i] + c) to get the corresponding glyph index. Again, the idDelta arithmetic is modulo 65536. If the result after adding idDelta[i] + c is less than zero, add 65536 to obtain a valid glyph ID.

As an example, the variant part of the table to map characters 10-20, 30-90, and 153-480 onto a contiguous range of glyph indices may look like this:

segCountX2: 8
searchRange: 8
entrySelector: 2
rangeShift: 0
endCode: 20 90 480 0xffff
reservedPad: 0
startCode: 10 30 153 0xffff
idDelta: -9 -18 -80 1
idRangeOffset: 0 0 0 0

This table yields the following mappings:

10 ⇒ 10 - 9 = 1
20 ⇒ 20 - 9 = 11
30 ⇒ 30 - 18 = 12
90 ⇒ 90 - 18 = 72
153 ⇒ 153 - 80 = 73
480 ⇒ 480 - 80 = 400
0xffff ⇒ 0

Note that the delta values could be reworked so as to reorder the segments.
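The lookup procedure and the example table above can be sketched in Python. This is a minimal illustration assuming big-endian storage; lookup_format4 and build_example are hypothetical names, and the stored searchRange, entrySelector and rangeShift fields are ignored in favor of a standard-library binary search.

```python
import struct
from bisect import bisect_left


def lookup_format4(subtable: bytes, c: int) -> int:
    """Map a 16-bit character code to a glyph index."""
    _fmt, _length, _language, seg_count_x2 = struct.unpack_from(">HHHH", subtable, 0)
    seg_count = seg_count_x2 // 2
    end_off = 14                              # after the seven uint16 header fields
    start_off = end_off + seg_count_x2 + 2    # skip reservedPad
    delta_off = start_off + seg_count_x2
    range_off = delta_off + seg_count_x2

    end_codes = struct.unpack_from(f">{seg_count}H", subtable, end_off)
    i = bisect_left(end_codes, c)             # first endCode >= c
    if i == seg_count:
        return 0
    (start,) = struct.unpack_from(">H", subtable, start_off + 2 * i)
    if start > c:
        return 0                              # c falls in a gap between segments
    (id_delta,) = struct.unpack_from(">h", subtable, delta_off + 2 * i)
    field_pos = range_off + 2 * i             # location of idRangeOffset[i]
    (id_range_offset,) = struct.unpack_from(">H", subtable, field_pos)
    if id_range_offset == 0:
        return (c + id_delta) & 0xFFFF        # arithmetic is modulo 65536
    pos = field_pos + id_range_offset + 2 * (c - start)
    if pos + 2 > len(subtable):
        return 0                              # malformed offset: treat as missing
    (glyph,) = struct.unpack_from(">H", subtable, pos)
    return (glyph + id_delta) & 0xFFFF if glyph else 0


def build_example() -> bytes:
    """Encode the example table for characters 10-20, 30-90 and 153-480."""
    body = struct.pack(">4H", 20, 90, 480, 0xFFFF)      # endCode
    body += struct.pack(">H", 0)                        # reservedPad
    body += struct.pack(">4H", 10, 30, 153, 0xFFFF)     # startCode
    body += struct.pack(">4h", -9, -18, -80, 1)         # idDelta
    body += struct.pack(">4H", 0, 0, 0, 0)              # idRangeOffset
    header = struct.pack(">7H", 4, 14 + len(body), 0, 8, 8, 2, 0)
    return header + body
```

With this table, lookup_format4(build_example(), 153) evaluates 153 - 80 and gives 73, and a code in a gap such as 25 maps to glyph 0.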

Format 6: Trimmed table mapping

Format 6 was designed to map 16-bit characters to glyph indexes when the character codes for a font fall into a single contiguous range.

'cmap' Subtable Format 6:

Type Name Description
uint16 format Format number is set to 6.
uint16 length This is the length in bytes of the subtable.
uint16 language For requirements on use of the language field, see “Use of the language field in 'cmap' subtables” in this document.
uint16 firstCode First character code of subrange.
uint16 entryCount Number of character codes in subrange.
uint16 glyphIdArray[entryCount] Array of glyph index values for character codes in the range.

The firstCode and entryCount values specify a subrange (beginning at firstCode, length = entryCount) within the range of possible character codes. Codes outside of this subrange are mapped to glyph index 0. The offset of the code (from the first code) within this subrange is used as index to the glyphIdArray, which provides the glyph index value.
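A format 6 lookup can be sketched as a bounds check followed by one array read. The code assumes big-endian storage; lookup_format6 and the sample data are hypothetical.

```python
import struct


def lookup_format6(subtable: bytes, code: int) -> int:
    """Map a 16-bit character code through a trimmed table."""
    _fmt, _length, _language, first_code, entry_count = struct.unpack_from(
        ">HHHHH", subtable, 0)
    index = code - first_code
    if not 0 <= index < entry_count:
        return 0  # outside the subrange: missing glyph
    # glyphIdArray follows the 10-byte header, two bytes per entry.
    return struct.unpack_from(">H", subtable, 10 + 2 * index)[0]


# Hypothetical subtable covering codes 0x30-0x32 with glyphs 17, 18, 19.
sample = struct.pack(">HHHHH", 6, 16, 0, 0x30, 3) + struct.pack(">3H", 17, 18, 19)
```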

Format 8: mixed 16-bit and 32-bit coverage

Subtable format 8 was designed to support Unicode supplementary-plane characters in UTF-16 encoding, though it is not commonly used. Format 8 is similar to format 2 in that it provides for mixed-length character codes. Instead of allowing for 8- and 16-bit character codes, however, it allows for 16- and 32-bit character codes.

If a font contains Unicode supplementary-plane characters (U+10000 to U+10FFFF), then it is likely that it will also include Unicode BMP characters (U+0000 to U+FFFF). Hence, there is a need to map a mixture of 16-bit and 32-bit character codes. A simplifying assumption is made: namely, that there are no 32-bit character codes which share the same first 16 bits as any 16-bit character code. (Since the Unicode code space extends only to U+10FFFF, a potential conflict exists only for characters U+0000 to U+0010, which are non-printing control characters.) This means that the determination as to whether a particular 16-bit value is a standalone character code or the start of a 32-bit character code can be made by looking at the 16-bit value directly, with no further information required.

'cmap' Subtable Format 8:

Type Name Description
uint16 format Subtable format; set to 8.
uint16 reserved Reserved; set to 0
uint32 length Byte length of this subtable (including the header)
uint32 language For requirements on use of the language field, see “Use of the language field in 'cmap' subtables” in this document.
uint8 is32[8192] Tightly packed array of bits (8K bytes total) indicating whether the particular 16-bit (index) value is the start of a 32-bit character code
uint32 numGroups Number of groupings which follow
SequentialMapGroup groups[numGroups] Array of SequentialMapGroup records.

Each sequential map group record specifies a character range and the starting glyph ID mapped from the first character. Glyph IDs for subsequent characters follow in sequence.

SequentialMapGroup Record:

Type Name Description
uint32 startCharCode First character code in this group; note that if this group is for one or more 16-bit character codes (which is determined from the is32 array), this 32-bit value will have the high 16-bits set to zero
uint32 endCharCode Last character code in this group; same condition as listed above for the startCharCode
uint32 startGlyphID Glyph index corresponding to the starting character code

A few notes here. The endCharCode is used, rather than a count, because comparisons for group matching are usually done on an existing character code, and having the endCharCode be there explicitly saves the necessity of an addition per group. Groups must be sorted by increasing startCharCode. A group’s endCharCode must be less than the startCharCode of the following group, if any.

To determine if a particular word (cp) is the first half of a 32-bit code point, one can use an expression such as ( is32[ cp / 8 ] & ( 1 << ( 7 - ( cp % 8 ) ) ) ). If this is non-zero, then the word is the first half of a 32-bit code point.
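The same bit test can be written in Python. The helper names below are hypothetical, and mark_32bit_start exists only to build sample data for the sketch.

```python
def starts_32bit_code(is32: bytes, cp: int) -> bool:
    """True if the 16-bit word cp is the first half of a 32-bit code."""
    # Bits are packed most-significant first within each byte.
    return bool(is32[cp // 8] & (1 << (7 - (cp % 8))))


def mark_32bit_start(is32: bytearray, cp: int) -> None:
    """Set the bit for cp in a mutable 8192-byte is32 array."""
    is32[cp // 8] |= 1 << (7 - (cp % 8))


# Hypothetical array in which only the high word 0x0001 starts a 32-bit code.
sample = bytearray(8192)
mark_32bit_start(sample, 0x0001)
```

In this sample the bit for word 0x0001 lands in byte 0 with mask 0x40, so a reader encountering the 16-bit value 0x0001 knows to consume one more word.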

0 is not a special value for the high word of a 32-bit code point. A font must not have both a glyph for the code point 0x0000 and glyphs for code points with a high word of 0x0000.

The presence of the packed array of bits indicating whether a particular 16-bit value is the start of a 32-bit character code is useful even when the font contains no glyphs for a particular 16-bit start value. This is because the system software often needs to know how many bytes ahead the next character begins, even if the current character maps to the missing glyph. By including this information explicitly in this table, no “secret” knowledge needs to be encoded into the OS.

Although this format was created to support Unicode supplementary-plane characters, it is not widely supported or used. Also, no character encoding other than Unicode uses mixed 16-/32-bit characters. The use of this format is discouraged.

Format 10: Trimmed array

Subtable format 10 was designed to support Unicode supplementary-plane characters, though it is not commonly used. Format 10 is similar to format 6 in that it defines a trimmed array for a tight range of character codes. It differs, however, in that it uses 32-bit character codes.

'cmap' Subtable Format 10:

Type Name Description
uint16 format Subtable format; set to 10.
uint16 reserved Reserved; set to 0
uint32 length Byte length of this subtable (including the header)
uint32 language For requirements on use of the language field, see “Use of the language field in 'cmap' subtables” in this document.
uint32 startCharCode First character code covered
uint32 numChars Number of character codes covered
uint16 glyphIdArray[] Array of glyph indices for the character codes covered

This format is not widely used and is not supported on Windows platforms. It would be most suitable for fonts that support only a contiguous range of Unicode supplementary-plane characters, but such fonts are rare.
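For comparison with format 6, a format 10 lookup differs mainly in header size and field widths. The sketch below assumes big-endian storage; lookup_format10 and the sample data are hypothetical.

```python
import struct


def lookup_format10(subtable: bytes, code: int) -> int:
    """Map a 32-bit character code through a trimmed array."""
    _fmt, _reserved, _length, _language, start_char_code, num_chars = (
        struct.unpack_from(">HHIIII", subtable, 0))
    index = code - start_char_code
    if not 0 <= index < num_chars:
        return 0
    # glyphIdArray follows the 20-byte header, two bytes per entry.
    return struct.unpack_from(">H", subtable, 20 + 2 * index)[0]


# Hypothetical subtable covering U+1F600 to U+1F602 with glyphs 500 to 502.
sample = struct.pack(">HHIIII", 10, 0, 26, 0, 0x1F600, 3)
sample += struct.pack(">3H", 500, 501, 502)
```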

Format 12: Segmented coverage

This is the standard character-to-glyph-index mapping subtable for fonts supporting Unicode character repertoires that include supplementary-plane characters (U+10000 to U+10FFFF).

Fonts that include a format 12 subtable can also include a format 4 subtable for compatibility with older applications. This is not required, however. See the Recommendations chapter for additional information.

Format 12 is similar to format 4 in that it defines segments for sparse representation. It differs, however, in that it uses 32-bit character codes.

'cmap' Subtable Format 12:

Type Name Description
uint16 format Subtable format; set to 12.
uint16 reserved Reserved; set to 0
uint32 length Byte length of this subtable (including the header)
uint32 language For requirements on use of the language field, see “Use of the language field in 'cmap' subtables” in this document.
uint32 numGroups Number of groupings which follow
SequentialMapGroup groups[numGroups] Array of SequentialMapGroup records.

The sequential map group record has the same format as that used for the format 8 subtable. The qualifications regarding 16-bit character codes do not apply here, however, since character codes are uniformly 32-bit.

SequentialMapGroup Record:

Type Name Description
uint32 startCharCode First character code in this group
uint32 endCharCode Last character code in this group
uint32 startGlyphID Glyph index corresponding to the starting character code

Groups must be sorted by increasing startCharCode. A group’s endCharCode must be less than the startCharCode of the following group, if any. The endCharCode is used, rather than a count, because comparisons for group matching are usually done on an existing character code, and having the endCharCode be there explicitly saves the necessity of an addition per group.
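Because groups are sorted and non-overlapping, a binary search over them is straightforward. The following sketch assumes big-endian storage; lookup_format12 and build_sample_format12 are hypothetical names.

```python
import struct


def lookup_format12(subtable: bytes, code: int) -> int:
    """Map a 32-bit character code using sequential map groups."""
    _fmt, _reserved, _length, _language, num_groups = struct.unpack_from(
        ">HHIII", subtable, 0)
    lo, hi = 0, num_groups - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        # Groups start after the 16-byte header; each is 12 bytes.
        start, end, start_glyph = struct.unpack_from(">III", subtable, 16 + 12 * mid)
        if code < start:
            hi = mid - 1
        elif code > end:
            lo = mid + 1
        else:
            # Glyph IDs follow in sequence from the first character.
            return start_glyph + (code - start)
    return 0


def build_sample_format12() -> bytes:
    """Hypothetical table: U+0041-U+005A from glyph 36, U+1F600-U+1F603 from 500."""
    groups = struct.pack(">III", 0x41, 0x5A, 36)
    groups += struct.pack(">III", 0x1F600, 0x1F603, 500)
    return struct.pack(">HHIII", 12, 0, 16 + len(groups), 0, 2) + groups
```

Note that each comparison uses startCharCode and endCharCode directly, with the subtraction performed only once a matching group is found.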

Format 13: Many-to-one range mappings

This subtable provides for situations in which the same glyph is used for hundreds or even thousands of consecutive characters spanning across multiple ranges of the code space. This subtable format may be useful for “last resort” fonts, although these fonts may use other suitable subtable formats as well. (For “last resort” fonts, see also the 'head' table flags, bit 14.)

Note: Subtable format 13 has the same structure as format 12; it differs only in the interpretation of the startGlyphID/glyphID fields.

'cmap' Subtable Format 13:

Type Name Description
uint16 format Subtable format; set to 13.
uint16 reserved Reserved; set to 0
uint32 length Byte length of this subtable (including the header)
uint32 language For requirements on use of the language field, see “Use of the language field in 'cmap' subtables” in this document.
uint32 numGroups Number of groupings which follow
ConstantMapGroup groups[numGroups] Array of ConstantMapGroup records.

The constant map group record has the same structure as the sequential map group record, with start and end character codes and a mapped glyph ID. However, the same glyph ID applies to all characters in the specified range rather than sequential glyph IDs.

ConstantMapGroup Record:

Type Name Description
uint32 startCharCode First character code in this group
uint32 endCharCode Last character code in this group
uint32 glyphID Glyph index to be used for all the characters in the group’s range.
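The difference from format 12 shows up only in the final step of a lookup. The sketch below parses the groups into a list first and then searches with the standard bisect module; the function names and sample ranges are hypothetical, and big-endian storage is assumed.

```python
import struct
from bisect import bisect_right


def parse_constant_groups(subtable: bytes):
    """Return [(startCharCode, endCharCode, glyphID), ...] from a format 13 subtable."""
    (num_groups,) = struct.unpack_from(">I", subtable, 12)
    return [struct.unpack_from(">III", subtable, 16 + 12 * i)
            for i in range(num_groups)]


def lookup_format13(groups, code: int) -> int:
    """Every character in a group maps to the same glyph ID."""
    starts = [group[0] for group in groups]
    i = bisect_right(starts, code) - 1  # last group starting at or before code
    if i >= 0 and code <= groups[i][1]:
        return groups[i][2]             # no (code - start) term, unlike format 12
    return 0


# Hypothetical last-resort style data: one glyph per range.
body = struct.pack(">III", 0x0000, 0x007F, 2) + struct.pack(">III", 0x0400, 0x04FF, 5)
sample = struct.pack(">HHIII", 13, 0, 16 + len(body), 0, 2) + body
```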

Format 14: Unicode variation sequences

Subtable format 14 specifies the Unicode variation sequences (UVSes) supported by the font. A variation sequence, according to the Unicode Standard, comprises a base character followed by a variation selector. For example, <U+82A6, U+E0101>.

This subtable format must only be used under platform ID 0 and encoding ID 5.

The subtable partitions the UVSes supported by the font into two categories: “default” and “non-default” UVSes. Given a UVS, if the glyph obtained by looking up the base character of that sequence in the Unicode 'cmap' subtable (i.e., the BMP subtable or BMP + supplementary-planes subtable) is the glyph to use for that sequence, then the sequence is a “default” UVS. Otherwise, it is a “non-default” UVS, and the glyph to use for that sequence is specified in the format 14 subtable itself.

The example at the bottom of the page shows how a font vendor can use format 14 for a JIS-2004-aware font.

'cmap' Subtable Format 14:

Type Name Description
uint16 format Subtable format; set to 14.
uint32 length Byte length of this subtable (including this header)
uint32 numVarSelectorRecords Number of VariationSelector records.
VariationSelector varSelector[numVarSelectorRecords] Array of VariationSelector records.

Each VariationSelector record specifies a variation selector character, and offsets to “default” and “non-default” tables used to map variation sequences using that variation selector.

VariationSelector Record:

Type Name Description
uint24 varSelector Variation selector
Offset32 defaultUVSOffset Offset from the start of the format 14 subtable to Default UVS table. May be 0.
Offset32 nonDefaultUVSOffset Offset from the start of the format 14 subtable to Non-Default UVS table. May be 0.

The VariationSelector records are sorted in increasing order of varSelector. No two records may have the same varSelector value.

A VariationSelector record and the data its offsets point to specify those UVSes supported by the font for which the variation selector is the varSelector value of the record. The base characters of the UVSes are stored in the tables pointed to by the offsets. The UVSes are partitioned by whether they are default or non-default UVSes.

Glyph IDs to be used for non-default UVSes are specified in the Non-Default UVS table.
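As an illustration of the layout above, the following sketch reads the subtable header and the VariationSelector records from a byte string. It assumes big-endian data (as elsewhere in OpenType) and that `subtable` starts at the format field; the names `VariationSelectorRecord` and `parse_format14_header` are hypothetical, and the validation it performs is one possible policy rather than a requirement of this specification.

```python
import struct
from typing import List, NamedTuple


class VariationSelectorRecord(NamedTuple):
    var_selector: int
    default_uvs_offset: int       # 0 means no Default UVS table
    non_default_uvs_offset: int   # 0 means no Non-Default UVS table


HEADER_SIZE = 10   # uint16 format + uint32 length + uint32 numVarSelectorRecords
RECORD_SIZE = 11   # uint24 varSelector + two Offset32 fields


def parse_format14_header(subtable: bytes) -> List[VariationSelectorRecord]:
    """Parse the header and VariationSelector records of a format 14 subtable."""
    fmt, length, num_records = struct.unpack_from(">HII", subtable, 0)
    if fmt != 14:
        raise ValueError(f"expected format 14, got {fmt}")
    if length > len(subtable):
        raise ValueError("declared length exceeds available data")
    if HEADER_SIZE + num_records * RECORD_SIZE > length:
        raise ValueError("records do not fit in the declared length")

    records: List[VariationSelectorRecord] = []
    pos = HEADER_SIZE
    for _ in range(num_records):
        # uint24 has no struct format code, so decode it by hand.
        var_selector = int.from_bytes(subtable[pos:pos + 3], "big")
        default_off, non_default_off = struct.unpack_from(">II", subtable, pos + 3)
        # Records must be strictly increasing in varSelector (no duplicates).
        if records and var_selector <= records[-1].var_selector:
            raise ValueError("VariationSelector records are not sorted")
        records.append(
            VariationSelectorRecord(var_selector, default_off, non_default_off)
        )
        pos += RECORD_SIZE
    return records
```

Because the records are sorted by varSelector, a reader could locate a given selector with a binary search over the returned list instead of a linear scan.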

Default UVS table

A Default UVS table is simply a range-compressed list of Unicode scalar values, representing the base characters of the default UVSes which use the varSelector of the associated VariationSelector record.

DefaultUVS Table:

Type          Name                           Description
uint32        numUnicodeValueRanges          Number of Unicode character ranges.
UnicodeRange  ranges[numUnicodeValueRanges]  Array of UnicodeRange records.

Each Unicode range record specifies a contiguous range of Unicode values.

UnicodeRange Record:

Type    Name               Description
uint24  startUnicodeValue  First value in this range.
uint8   additionalCount    Number of additional values in this range.

For example, the range U+4E4D – U+4E4F (3 values) will set startUnicodeValue to 0x004E4D and additionalCount to 2. A singleton range will set additionalCount to 0.
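The range compression can be sketched from the encoder's side. The function below, whose name `compress_to_unicode_ranges` is hypothetical, turns a collection of base code points into (startUnicodeValue, additionalCount) pairs. Because additionalCount is a uint8, a run longer than 256 values has to be split across several records; how such a run is split is a choice of this sketch.

```python
from typing import Iterable, List, Tuple


def compress_to_unicode_ranges(code_points: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (startUnicodeValue, additionalCount) pairs covering code_points."""
    ranges: List[Tuple[int, int]] = []
    for cp in sorted(set(code_points)):
        if ranges:
            start, additional = ranges[-1]
            # Extend the current range if cp is the next contiguous value
            # and the uint8 additionalCount field still has room.
            if cp == start + additional + 1 and additional < 255:
                ranges[-1] = (start, additional + 1)
                continue
        # Otherwise start a new singleton range.
        ranges.append((cp, 0))
    return ranges
```

For the range U+4E4D to U+4E4F this yields a single pair with start 0x4E4D and an additional count of 2, matching the example above.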

The sum (startUnicodeValue + additionalCount) must not exceed 0xFFFFFF.

The Unicode Value Ranges are sorted in increasing order of startUnicodeValue. The ranges must not overlap; i.e., (startUnicodeValue + additionalCount) must be less than the startUnicodeValue of the following range (if any).

All code points listed in the ranges array should have corresponding entries in a Unicode 'cmap' subtable. Applications could encounter fonts in which this is not the case, however.
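A reader for the Default UVS table might look like the following sketch. The function names `parse_default_uvs` and `is_default_uvs` are hypothetical; `offset` is assumed to be the defaultUVSOffset of a VariationSelector record, applied from the start of the format 14 subtable. Rejecting unsorted or overlapping ranges with an exception is one possible policy, since an application may prefer to tolerate malformed fonts.

```python
import bisect
import struct
from typing import List, Tuple


def parse_default_uvs(subtable: bytes, offset: int) -> List[Tuple[int, int]]:
    """Return the ranges of a Default UVS table as inclusive (first, last) pairs."""
    (num_ranges,) = struct.unpack_from(">I", subtable, offset)
    pos = offset + 4
    if pos + 4 * num_ranges > len(subtable):
        raise ValueError("Default UVS table runs past the end of the data")

    ranges: List[Tuple[int, int]] = []
    for _ in range(num_ranges):
        start = int.from_bytes(subtable[pos:pos + 3], "big")  # uint24
        additional = subtable[pos + 3]                        # uint8
        last = start + additional
        if last > 0xFFFFFF:
            raise ValueError("range exceeds 0xFFFFFF")
        # Sorted and non-overlapping: the previous range must end before this one starts.
        if ranges and ranges[-1][1] >= start:
            raise ValueError("ranges are unsorted or overlap")
        ranges.append((start, last))
        pos += 4
    return ranges


def is_default_uvs(ranges: List[Tuple[int, int]], base: int) -> bool:
    """Binary-search the sorted ranges for the base character."""
    # Index of the last range whose start is <= base, or -1 if there is none.
    i = bisect.bisect_right(ranges, (base, 0xFFFFFF)) - 1
    return i >= 0 and ranges[i][0] <= base <= ranges[i][1]
```

When `is_default_uvs` returns true, the glyph for the sequence is the one that the Unicode 'cmap' subtable maps the base character to.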

Non-Default UVS table

A Non-Default UVS table is a list of pairs of Unicode scalar values and glyph IDs. The Unicode values represent the base characters of all non-default UVSes which use the varSelector of the associated VariationSelector record, and the glyph IDs specify the glyph IDs to use for the UVSes.

NonDefaultUVS Table:

Type        Name                         Description
uint32      numUVSMappings               Number of UVS mappings that follow.
UVSMapping  uvsMappings[numUVSMappings]  Array of UVSMapping records.

Each UVSMapping record provides a glyph ID mapping for one base Unicode character, when that base character is used in a variation sequence with the current variation selector.

UVSMapping Record:

Type    Name          Description
uint24  unicodeValue  Base Unicode value of the UVS.
uint16  glyphID       Glyph ID of the UVS.

The UVS Mappings are sorted in increasing order of unicodeValue. No two mappings in this table may have the same unicodeValue.

Typically, code points listed in the uvsMappings array will have corresponding entries in a Unicode 'cmap' subtable. This is not required, however. For example, it might not be the case if a font is intended for use with content in which a given Unicode character only occurs in a variation sequence.
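The Non-Default UVS table can be read in a similar way. In the sketch below, the function names `parse_non_default_uvs` and `lookup_non_default` are hypothetical, and `offset` is assumed to be the nonDefaultUVSOffset of a VariationSelector record. Keeping base characters and glyph IDs in two parallel lists is a choice of this sketch that makes binary search with the standard `bisect` module straightforward.

```python
import bisect
import struct
from typing import List, Optional, Tuple


def parse_non_default_uvs(subtable: bytes, offset: int) -> Tuple[List[int], List[int]]:
    """Return parallel lists (base characters, glyph IDs) from a Non-Default UVS table."""
    (num_mappings,) = struct.unpack_from(">I", subtable, offset)
    pos = offset + 4
    if pos + 5 * num_mappings > len(subtable):
        raise ValueError("Non-Default UVS table runs past the end of the data")

    bases: List[int] = []
    glyph_ids: List[int] = []
    for _ in range(num_mappings):
        base = int.from_bytes(subtable[pos:pos + 3], "big")       # uint24 unicodeValue
        (glyph_id,) = struct.unpack_from(">H", subtable, pos + 3)  # uint16 glyphID
        # Mappings must be strictly increasing in unicodeValue (no duplicates).
        if bases and base <= bases[-1]:
            raise ValueError("UVS mappings are not sorted")
        bases.append(base)
        glyph_ids.append(glyph_id)
        pos += 5
    return bases, glyph_ids


def lookup_non_default(bases: List[int], glyph_ids: List[int], base: int) -> Optional[int]:
    """Return the glyph ID mapped to base, or None if there is no mapping."""
    i = bisect.bisect_left(bases, base)
    if i < len(bases) and bases[i] == base:
        return glyph_ids[i]
    return None
```

A `None` result only means that this table has no mapping for the base character; the sequence may still be listed as a default UVS for the same variation selector.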

Example

Here is an example of how a format 14 'cmap' subtable may be used in a font that is aware of JIS-2004 variant glyphs. The CIDs (character IDs) in this example refer to those in the Adobe Character Collection “Adobe-Japan1” and can be assumed to be identical to the glyph IDs in the font in our example.

JIS-2004 changed the default glyph variants for some of its code points. For example:

JIS-90: U+82A6 ⇒ CID 1142
JIS-2004: U+82A6 ⇒ CID 7961

Both of these glyph variants are supported through the use of Unicode variation sequences, as the following examples from Unicode’s UVS registry show:

U+82A6 U+E0100 ⇒ CID 1142
U+82A6 U+E0101 ⇒ CID 7961

If the font wants to support the JIS-2004 variants by default, it will:

  • encode glyph ID 7961 at U+82A6 in the Unicode 'cmap' subtable;
  • specify <U+82A6, U+E0101> in the UVS 'cmap' subtable’s Default UVS table (varSelector will be 0x0E0101 and defaultUVSOffset will point to data containing a 0x0082A6 Unicode value);
  • specify <U+82A6, U+E0100> ⇒ glyph ID 1142 in the UVS 'cmap' subtable’s Non-Default UVS table (varSelector will be 0x0E0100 and nonDefaultUVSOffset will point to data containing a unicodeValue 0x0082A6 and glyphID 1142).

If, however, the font wants to support the JIS-90 variants by default, it will:

  • encode glyph ID 1142 at U+82A6 in the Unicode 'cmap' subtable;
  • specify <U+82A6, U+E0100> in the UVS 'cmap' subtable’s Default UVS table;
  • specify <U+82A6, U+E0101> ⇒ glyph ID 7961 in the UVS 'cmap' subtable’s Non-Default UVS table.
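To make the first case concrete (JIS-2004 variants by default), the sketch below assembles the bytes of a format 14 subtable containing just these two sequences. The function name `build_jis2004_default_subtable` and the helper `u24` are hypothetical. Placing the Non-Default UVS table before the Default UVS table is an arbitrary choice of this sketch; the records locate both tables by offset.

```python
import struct


def u24(value: int) -> bytes:
    """Encode a value as a big-endian uint24."""
    return value.to_bytes(3, "big")


def build_jis2004_default_subtable() -> bytes:
    header_size = 10   # format + length + numVarSelectorRecords
    record_size = 11   # varSelector + defaultUVSOffset + nonDefaultUVSOffset

    # Non-Default UVS table for U+E0100: <U+82A6, U+E0100> => glyph ID 1142.
    non_default_table = struct.pack(">I", 1) + u24(0x82A6) + struct.pack(">H", 1142)
    # Default UVS table for U+E0101: one singleton range at U+82A6.
    default_table = struct.pack(">I", 1) + u24(0x82A6) + bytes([0])

    # Both tables follow the two VariationSelector records.
    non_default_offset = header_size + 2 * record_size
    default_offset = non_default_offset + len(non_default_table)
    length = default_offset + len(default_table)

    # Records sorted by varSelector; an offset of 0 means "no table".
    records = (
        u24(0xE0100) + struct.pack(">II", 0, non_default_offset)
        + u24(0xE0101) + struct.pack(">II", default_offset, 0)
    )
    header = struct.pack(">HII", 14, length, 2)
    return header + records + non_default_table + default_table
```

The Unicode 'cmap' subtable of such a font would separately map U+82A6 to glyph ID 7961. For the JIS-90 case, the roles of the two selectors are swapped and the explicit glyph ID becomes 7961.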