There is a class of legacy Arabic fonts that rely on poorly documented mechanisms and on legacy shaping implementations in Windows. This topic provides some details about this class of fonts.
These fonts date to the 1990s. These are TrueType fonts, in the sense that they are .ttf files that, in general, follow the TrueType spec. However, they make use of undocumented details not defined in the TrueType or OpenType specs, and do not conform to current specifications for how to implement Unicode fonts for Arabic.
Font encoding and character set declarations
These fonts use the Windows Symbol encoding: they have a cmap subtable for platform ID 3, encoding ID 0. By itself, the Windows Symbol encoding implies no interoperable character semantics, only font-specific semantics. Contrary to what would normally be expected, the cmap subtable does not establish Arabic character semantics. Instead, the semantics are declared using special values within the OS/2 table that are not documented in the TrueType or OpenType specification.
Note that some of these fonts may also include cmap subtables for other platforms and encodings. Those should not be used, however. The fonts were created in an era in which single-byte encodings were used, and these alternate cmap subtables are likely to be redefinitions of the declared encodings. For example, the Royal Arabic font includes a Mac Roman cmap subtable, but the glyphs that are mapped do not match the Mac Roman character set.
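As a minimal sketch of how a consumer might pick out only the Windows Symbol subtable and ignore the others, the following C function scans the encoding records at the start of a raw cmap table (a 16-bit version, a 16-bit record count, then 8-byte records holding platform ID, encoding ID and a 32-bit offset, all big-endian). The function name `find_symbol_cmap_offset` and the convention of returning 0 on failure are illustrative assumptions, not part of any Windows API.

```c
#include <stddef.h>
#include <stdint.h>

/* Read big-endian integers, as stored in TrueType/OpenType tables. */
static uint16_t read_u16be(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_u32be(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: return the offset (relative to the start of the cmap
 * table) of the platform 3 / encoding 0 subtable, or 0 if there is none or
 * the data is truncated. All other encoding records are skipped.
 */
uint32_t find_symbol_cmap_offset(const uint8_t *cmap, size_t len) {
    if (cmap == NULL || len < 4) {
        return 0;
    }
    uint16_t num_tables = read_u16be(cmap + 2);
    for (uint16_t i = 0; i < num_tables; i++) {
        size_t rec = 4 + (size_t)i * 8;
        if (rec + 8 > len) {
            return 0; /* record would run past the end of the buffer */
        }
        if (read_u16be(cmap + rec) == 3 && read_u16be(cmap + rec + 2) == 0) {
            uint32_t off = read_u32be(cmap + rec + 4);
            return off < len ? off : 0;
        }
    }
    return 0;
}
```

A caller would pass the bytes of the font's cmap table and then parse the subtable found at the returned offset.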
These fonts include a version 0 OS/2 table. Version 0 includes the fsSelection field, which is a uint16 value with various bits defined as flags. Only a limited set of bits (bits 0 to 6) are defined; the remaining bits are documented as reserved and to be set to 0. In this class of fonts, however, values are set in the reserved, upper byte of the fsSelection field. These non-standardized values are how the character semantics are declared.
The values set in the upper byte of the fsSelection field are not bit flags, as otherwise used for the fsSelection field. Instead, the upper byte is set to one of the following constant values:
#define ARABIC_CHARSET_SIMPLIFIED 178
#define ARABIC_CHARSET_TRADITIONAL 179
The first of these corresponds to the ARABIC_CHARSET constant defined in wingdi.h for use in the lfCharSet member of the LOGFONT structure. The second is used in some fonts, but is not defined in wingdi.h.
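The check described above can be sketched in C as follows. This is only an illustration: the function name `legacy_arabic_charset` and the convention of returning 0 for "not a legacy Arabic declaration" are assumptions, and the caller is assumed to have already read fsSelection from the OS/2 table.

```c
#include <stdint.h>

#define ARABIC_CHARSET_SIMPLIFIED  178
#define ARABIC_CHARSET_TRADITIONAL 179

/*
 * Hypothetical helper: inspect the upper byte of fsSelection and return the
 * legacy charset constant it holds, or 0 if the byte is neither of the two
 * values used by this class of fonts.
 */
int legacy_arabic_charset(uint16_t fs_selection) {
    unsigned upper = (fs_selection >> 8) & 0xFFu;
    if (upper == ARABIC_CHARSET_SIMPLIFIED ||
        upper == ARABIC_CHARSET_TRADITIONAL) {
        return (int)upper;
    }
    return 0;
}
```

The lower byte, which carries the standard flag bits, is ignored here, so a value such as 0xB240 still reports the Simplified declaration.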
Note: There is a correspondence between CHARSET values in Windows GDI and code pages, and code pages are referenced in the ulCodePageRange fields of the OS/2 table (version 1 and later). However, the ulCodePageRange fields are used to indicate logical character sets that are supported in the font, but say nothing about actual character encodings used in the font. For this class of fonts, the ulCodePageRange fields are not relevant.
Note: The upper byte of the fsSelection field is known to have been used in a similar way in legacy Hebrew and Thai fonts as well.
The nature of the font encodings
In terms of Unicode characters, the Arabic characters supported by this class of fonts are a subset of the range U+0620 to U+065F.
Each of these legacy font encodings is a presentation-form encoding. That is, all presentation-form glyphs are mapped directly by some character code in the cmap table. The presentation forms will include the basic contextual shapes of Arabic letters: isolate, initial, medial, final. In addition, there are other presentation forms for certain ligatures; these are not documented here in detail.
Note: Nothing in the fonts themselves determines how text displayed with these fonts is encoded. Some legacy applications might generate documents in which the contextual forms of Arabic letters are represented directly. Some might generate documents with Arabic text encoded in visual rather than logical order. Others might encode Arabic text in logical order using a single character code per Arabic letter, with the software resolving visual order and selecting contextual forms and ligatures from the font using character codes of the presentation-form encoding.
As noted above, the fonts use Windows Symbol cmap subtables. In most fonts that use Windows Symbol encoding, the character index values are in the range 0xF020 to 0xF0FF. In this class of fonts, however, different ranges are assumed for the two encoding declarations:
- ARABIC_CHARSET_SIMPLIFIED: 0xF100 to 0xF1FF
- ARABIC_CHARSET_TRADITIONAL: 0xF200 to 0xF2FF
These ranges are reflected in the usFirstCharIndex and usLastCharIndex fields of the fonts’ OS/2 tables.
Note: In legacy, single-byte applications, these 16-bit ranges would be mapped to a single-byte code point range 0x00 to 0xFF in documents. For example, when ARABIC_CHARSET_SIMPLIFIED is set in the font’s OS/2 table, the code 0x45 in a document would be mapped to 0xF145 when searching in the font’s cmap table.
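The byte-to-cmap mapping in the note above amounts to combining the document byte with the base of the range for the declared charset. A minimal sketch, assuming the charset has already been determined from the OS/2 table (the function name `legacy_cmap_code` is hypothetical):

```c
#include <stdint.h>

#define ARABIC_CHARSET_SIMPLIFIED  178
#define ARABIC_CHARSET_TRADITIONAL 179

/*
 * Hypothetical helper: map a single-byte document code to the 16-bit code
 * used for lookup in the Windows Symbol cmap subtable. Returns 0 for an
 * unrecognized charset value.
 */
uint16_t legacy_cmap_code(int charset, uint8_t byte_code) {
    switch (charset) {
    case ARABIC_CHARSET_SIMPLIFIED:
        return (uint16_t)(0xF100u | byte_code);
    case ARABIC_CHARSET_TRADITIONAL:
        return (uint16_t)(0xF200u | byte_code);
    default:
        return 0;
    }
}
```

With the Simplified declaration, `legacy_cmap_code(ARABIC_CHARSET_SIMPLIFIED, 0x45)` yields 0xF145, matching the example in the note.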
Legacy encoding details
This section documents the semantic interpretation of code points in these legacy encodings. This is expressed as a mapping from Unicode characters to corresponding legacy codes for different contextual forms (if relevant).
Legacy Traditional Arabic encoding
The mapping from Unicode to the legacy Traditional Arabic encoding is provided in two tables:
The first table covers Arabic letters, which have joining presentation form variants. The legacy encoding includes ligature forms for certain Arabic letter sequences as separate characters; separate rows are included in the table for each letter sequence. The column headings indicate whether there are left or right connecting strokes: "initial" implies a left connection, "medial" implies left and right connections, "final" implies a right connection, and "isolate" implies no connections.
The second table covers other Arabic-script characters and joining controls. This includes Arabic combining marks. The legacy encoding includes ligature forms for certain mark combinations as separate characters; separate rows are included for each mark combination.
For most marks or mark combinations, the legacy encoding has high and low positional variants encoded as separate characters. (This is similar to use of the 'mset' OpenType feature to substitute positional-variant glyphs, except in the legacy encoding the substitution is done for pre-determined cases and handled by character codes.) In the second table, these positional-variant characters are listed in the same row.
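The relationship between connecting strokes and the four column headings of the first table can be written as a small decision function. This is a sketch only: deciding whether a letter actually connects to its neighbours is the job of the shaping software and is assumed to have been done by the caller, and the names `ArabicForm` and `select_form` are illustrative.

```c
typedef enum {
    FORM_ISOLATE,
    FORM_INITIAL,
    FORM_MEDIAL,
    FORM_FINAL
} ArabicForm;

/*
 * Hypothetical helper: choose the contextual form from the connecting
 * strokes a letter has. connects_left / connects_right are nonzero when
 * the letter has a connecting stroke on that side.
 */
ArabicForm select_form(int connects_left, int connects_right) {
    if (connects_left && connects_right) {
        return FORM_MEDIAL;   /* left and right connections */
    }
    if (connects_left) {
        return FORM_INITIAL;  /* left connection only */
    }
    if (connects_right) {
        return FORM_FINAL;    /* right connection only */
    }
    return FORM_ISOLATE;      /* no connections */
}
```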
Arabic letters and letter sequences (ligatures)
Unicode | Initial | Medial | Final | Isolate |
---|---|---|---|---|
0621 hamza | | | | F2D5 |
0622 alef with madda above | | | F246 | F245 |
0623 alef with hamza above | | | F244 | F243 |
0624 waw with hamza above | | | F2DB | F2DA |
0625 alef with hamza below | | | F248 | F247 |
0626 yeh with hamza above | F2D6 | F2D7 | F2D8 | F2D9 |
0627 alef | | | F242 | F241 |
0628 beh | F249 | F24A | F24B | F24C |
0628 + 062C beh jeem | F280 | |||
0628 + 062D beh hah | F281 | |||
0628 + 062E beh khah | F282 | |||
0628 + 0631 beh reh | F215 | |||
0628 + 0645 beh meem | F296 | F202 | ||
0628 + 0646 beh noon | F292 | |||
0628 + 064A beh yeh | F21D | |||
0629 teh marbuta | | | F2D2 | F2D1 |
062A teh | F24D | F24E | F24F | F250 |
062A + 062C teh jeem | F283 | |||
062A + 062D teh hah | F284 | |||
062A + 062E teh khah | F285 | |||
062A + 0631 teh reh | F216 | |||
062A + 0645 teh meem | F297 | F203 | ||
062A + 0646 teh noon | F293 | |||
062A + 064A teh yeh | F21E | |||
062B theh | F251 | F252 | F253 | F254 |
062B + 0645 theh meem | F204 | |||
062C jeem | F255 | F256 | F257 | F258 |
062C + 0645 jeem meem | F29A | |||
062D hah | F259 | F25A | F25C | F260 |
062D + 0645 hah meem | F29B | |||
062E khah | F261 | F262 | F263 | F264 |
062E + 0645 khah meem | F29C | |||
062F dal | | | F266 | F265 |
0630 thal | | | F268 | F267 |
0631 reh | | | F26A | F269 |
0632 zain | | | F26C | F26B |
0633 seen | F26D | F26E | F26F | F270 |
0633 + 0645 seen meem | F218 | |||
0634 sheen | F271 | F272 | F273 | F274 |
0634 + 0645 sheen meem | F219 | |||
0635 sad | F275 | F276 | F277 | F278 |
0636 dad | F279 | F27A | F27C | F27E |
0637 tah | F27F | F2F1 | F2A1 | F2A2 |
0638 zah | F2A3 | F2A4 | F2A5 | F2A6 |
0639 ain | F2A7 | F2A8 | F2A9 | F2AA |
063A ghain | F2AB | F2AC | F2AD | F2AE |
0641 feh | F2AF | F2B0 | F2B1 | F2B2 |
0641 + 064A feh yeh | F29F | |||
0642 qaf | F2B3 | F2B4 | F2B5 | F2B6 |
0643 kaf | F2B7 | F2B8 | F2B9 | F2BA |
0644 lam | F2BB | F2BC | F2BD | F2BE |
0644 + 0622 lam alef madda above | | | F2E1 | F2E0 |
0644 + 0623 lam alef hamza above | | | F2DF | F2DE |
0644 + 0625 lam alef hamza below | | | F2E3 | F2 |
0644 + 0627 lam alef | | | F2DD | F2DC |
0644 + 062C lam jeem | F286 | F212 | ||
0644 + 062D lam hah | F287 | F213 | ||
0644 + 062E lam khah | F288 | F214 | ||
0644 + 0644 + 0647 lam lam heh | F201 | |||
0644 + 0645 lam meem | F29D | F205 | ||
0644 + 0645 + 062C lam meem jeem | F211 | |||
0644 + 0645 + 062D lam meem hah | F210 | |||
0644 + 0647 lam heh | F21A | |||
0644 + 0649 lam alef maksura | F295 | |||
0644 + 064A lam yeh | F21C | |||
0645 meem | F2BF | F2C0 | F2C1 | F2C2 |
0645 + 062C meem jeem | F289 | |||
0645 + 062D meem hah | F28A | |||
0645 + 062E meem khah | F28B | |||
0645 + 0645 meem meem | F29E | |||
0646 noon | F2C3 | F2C4 | F2C5 | F2C6 |
0646 + 062C noon jeem | F28C | |||
0646 + 062D noon hah | F28D | |||
0646 + 062E noon khah | F28E | |||
0646 + 0645 noon meem | F298 | F206 | ||
0646 + 064A noon yeh | F21F | |||
0647 heh | F2C7 | F2C8 | F2C9 | F2CA |
0648 waw | | | F2CC | F2CB |
0649 alef maksura | | | F2D3 | F2D4 |
064A yeh | F2CD | F2CE | F2CF | F2D0 |
064A + 062C yeh jeem | F28F | |||
064A + 062D yeh hah | F290 | |||
064A + 062E yeh khah | F291 | |||
064A + 0631 yeh reh | F217 | |||
064A + 0645 yeh meem | F299 | |||
064A + 0646 yeh noon | F294 |
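One way to hold rows of this table in code is an array of records keyed by Unicode value. The sketch below includes only a handful of letters that have all four forms listed above; the type `TradArabicRow`, the array `kTradRows` and the function `trad_lookup` are hypothetical names, and a real implementation would also need the remaining letters and the ligature sequences.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t unicode;
    uint16_t initial;
    uint16_t medial;
    uint16_t final_form;
    uint16_t isolate;
} TradArabicRow;

/* Excerpt only: values copied from the four-form rows of the table above. */
static const TradArabicRow kTradRows[] = {
    {0x0628, 0xF249, 0xF24A, 0xF24B, 0xF24C}, /* beh  */
    {0x062A, 0xF24D, 0xF24E, 0xF24F, 0xF250}, /* teh  */
    {0x0633, 0xF26D, 0xF26E, 0xF26F, 0xF270}, /* seen */
    {0x0644, 0xF2BB, 0xF2BC, 0xF2BD, 0xF2BE}, /* lam  */
    {0x0645, 0xF2BF, 0xF2C0, 0xF2C1, 0xF2C2}, /* meem */
};

/* Linear search; returns NULL when the letter is not in the excerpt. */
const TradArabicRow *trad_lookup(uint16_t unicode) {
    size_t count = sizeof kTradRows / sizeof kTradRows[0];
    for (size_t i = 0; i < count; i++) {
        if (kTradRows[i].unicode == unicode) {
            return &kTradRows[i];
        }
    }
    return NULL;
}
```

For example, `trad_lookup(0x0628)->medial` gives 0xF24A, the medial form of beh.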
Other characters and character sequences (ligatures)
Unicode | Legacy encoding |
---|---|
060C Arabic comma | F22C |
061B Arabic semicolon | F23B |
061F Arabic question mark | F23F |
0640 tatweel | F25F |
064B fathatan | F2E7 (high) or F2F5 (extra high) |
064C dammatan | F2E8 (high) or F2F6 (extra high) |
064D kasratan | F2EB (low) or F2F9 (extra low) |
064E fatha | F2E4 (high) or F2F2 (extra high) |
064F damma | F2E5 (high) or F2F3 (extra high) |
0650 kasra | F2EA (low) or F2F8 (extra low) |
0651 shadda | F2E9 (high) or F2F7 (extra high) |
0651 + 064B shadda fathatan | F2EE (high) or F2FC (extra high) |
0651 + 064C shadda dammatan | F2EF (high) or F2FD (extra high) |
0651 + 064D shadda kasratan | F2FF |
0651 + 064E shadda fatha | F2EC (high) or F2FA (extra high) |
0651 + 064F shadda damma | F2ED (high) or F2FB (extra high) |
0651 + 0650 shadda kasra | F2F0 (high) or F2FE (extra high) |
0652 sukun | F2E6 (high) or F2F4 (extra high) |
0660 Arabic-Indic digit zero | F230 |
0661 Arabic-Indic digit one | F231 |
0662 Arabic-Indic digit two | F232 |
0663 Arabic-Indic digit three | F233 |
0664 Arabic-Indic digit four | F234 |
0665 Arabic-Indic digit five | F235 |
0666 Arabic-Indic digit six | F236 |
0667 Arabic-Indic digit seven | F237 |
0668 Arabic-Indic digit eight | F238 |
0669 Arabic-Indic digit nine | F239 |
066B Arabic decimal separator | F25E |
200C zero width non-joiner | F20C |
200D zero width joiner | F20D |
200E left-to-right mark | F20E |
200F right-to-left mark | F20F |
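The paired positional variants for single marks can be represented as two codes per mark. The following sketch copies the single-mark rows from the table above; the names `TradMarkRow`, `kTradMarks` and `trad_mark_code` are hypothetical, and the rule for when the "extra" variant should be chosen is not specified here, so it is left to the caller as a flag.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t unicode;
    uint16_t normal; /* "high" or "low" variant */
    uint16_t extra;  /* "extra high" or "extra low" variant */
} TradMarkRow;

/* Single marks only; shadda combinations are omitted from this excerpt. */
static const TradMarkRow kTradMarks[] = {
    {0x064B, 0xF2E7, 0xF2F5}, /* fathatan */
    {0x064C, 0xF2E8, 0xF2F6}, /* dammatan */
    {0x064D, 0xF2EB, 0xF2F9}, /* kasratan */
    {0x064E, 0xF2E4, 0xF2F2}, /* fatha    */
    {0x064F, 0xF2E5, 0xF2F3}, /* damma    */
    {0x0650, 0xF2EA, 0xF2F8}, /* kasra    */
    {0x0651, 0xF2E9, 0xF2F7}, /* shadda   */
    {0x0652, 0xF2E6, 0xF2F4}, /* sukun    */
};

/* Returns the legacy code for a mark, or 0 if it is not in the excerpt. */
uint16_t trad_mark_code(uint16_t unicode, int use_extra) {
    size_t count = sizeof kTradMarks / sizeof kTradMarks[0];
    for (size_t i = 0; i < count; i++) {
        if (kTradMarks[i].unicode == unicode) {
            return use_extra ? kTradMarks[i].extra : kTradMarks[i].normal;
        }
    }
    return 0;
}
```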
Legacy Simplified Arabic encoding
The mapping from Unicode to the legacy Simplified Arabic encoding is presented by means of a code snippet. Four arrays are used for isolate, initial, medial and final forms. This mapping logically assumes four different contextual shapes for every Arabic character, though in some cases the same presentation-form code point is mapped for more than one context — for example, one legacy presentation form code point for both isolate and initial contexts.
Note: This mapping data does not reflect ligatures for Arabic letter combinations that are directly encoded in the legacy Simplified Arabic encoding.
#define NUM_ARABIC_LETTER_TABLES 0x0004
#define U_ARABIC_SCRIPT_COUNT 0x40
/************************************************************************************
S I M P L I F I E D A R A B I C
*************************************************************************************/
// index for invalid glyph is 0x00
// These are based on starting index 0xF100
const USHORT cpOldTTFSimpArabicShapes[NUM_ARABIC_LETTER_TABLES][U_ARABIC_SCRIPT_COUNT] =
{
// Isolate Shapes
{
0x00 , 0xad , 0x45 , 0x43 , 0xbb , 0x47 , 0xba , 0x41 , // 0x620
0x4a , 0xa9 , 0x4c , 0x4e , 0x51 , 0x54 , 0x57 , 0x58,
0x59 , 0x5a , 0x60 , 0x62 , 0x64 , 0x66 , 0x68 , 0x69 , // 0x630
0x6a , 0x6e , 0x72 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
0x5f , 0x75 , 0x78 , 0x7a , 0x7c , 0x7e , 0xe1 , 0xa4 , // 0x640
0xa5 , 0xac , 0xa8 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,
0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , // 0x650
0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
},
// Initial shapes
{
0x00 , 0xad , 0x45 , 0x43 , 0xbb , 0x47 , 0xae , 0x41 , // 0x620
0x49 , 0xa9 , 0x4b , 0x4d , 0x4f , 0x52 , 0x55 , 0x58,
0x59 , 0x5a , 0x60 , 0x61 , 0x63 , 0x65 , 0x67 , 0x69 , // 0x630
0x6a , 0x6b , 0x6f , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
0x5f , 0x73 , 0x76 , 0x79 , 0x7b , 0x7d , 0x7f , 0xa1 , // 0x640
0xa5 , 0xac , 0xa6 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,
0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , // 0x650
0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
},
// Medial shapes
{
0x00 , 0xad , 0x46 , 0x44 , 0xbb , 0x48 , 0xae , 0x42 , // 0x620
0x49 , 0xa9 , 0x4b , 0x4d , 0x4f , 0x52 , 0x55 , 0x58,
0x59 , 0x5a , 0x60 , 0x61 , 0x63 , 0x65 , 0x67 , 0x69 , // 0x630
0x6a , 0x6c , 0x70 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
0x5f , 0x74 , 0x77 , 0x79 , 0x7b , 0x7d , 0x7f , 0xa2 , // 0x640
0xa5 , 0xac , 0xa6 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,
0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , // 0x650
0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
},
// Final shapes
{
0x00 , 0xad , 0x46 , 0x44 , 0xbb , 0x48 , 0xaf , 0x42 , // 0x620
0x4a , 0xaa , 0x4c , 0x4e , 0x50 , 0x53 , 0x56 , 0x58,
0x59 , 0x5a , 0x60 , 0x62 , 0x64 , 0x66 , 0x68 , 0x69 , // 0x630
0x6a , 0x6d , 0x71 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
0x5f , 0x75 , 0x78 , 0x7a , 0x7c , 0x7e , 0xe1 , 0xa3 , // 0x640
0xa5 , 0xab , 0xa7 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,
0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , // 0x650
0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
},
};
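A lookup against an array laid out like the one above might be sketched as follows. To stay self-contained, the example uses a two-letter excerpt (`kSimpExcerpt`, with the alef and beh entries copied from the data above) rather than the full array, and it uses `uint16_t` on the assumption that it matches the Windows `USHORT` type. The function name `simp_arabic_code` and the use of 0 as the "no glyph" result are illustrative choices based on the comments in the snippet.

```c
#include <stdint.h>

#define SIMP_FIRST_CP 0x0620u /* first Unicode value covered by each row */
#define SIMP_COUNT    0x40u   /* entries per row, as U_ARABIC_SCRIPT_COUNT */
#define SIMP_BASE     0xF100u /* starting index noted in the snippet */

enum { FORM_ISOLATE = 0, FORM_INITIAL = 1, FORM_MEDIAL = 2, FORM_FINAL = 3 };

/* Excerpt for illustration: alef (U+0627) and beh (U+0628) only. */
static const uint16_t kSimpExcerpt[4][SIMP_COUNT] = {
    [FORM_ISOLATE] = { [0x07] = 0x41, [0x08] = 0x4a },
    [FORM_INITIAL] = { [0x07] = 0x41, [0x08] = 0x49 },
    [FORM_MEDIAL]  = { [0x07] = 0x42, [0x08] = 0x49 },
    [FORM_FINAL]   = { [0x07] = 0x42, [0x08] = 0x4a },
};

/*
 * Hypothetical helper: return the 16-bit cmap code for a Unicode character
 * in the given contextual form, or 0 when the character is out of range or
 * the table entry is 0x00 (the invalid-glyph index).
 */
uint16_t simp_arabic_code(const uint16_t shapes[4][SIMP_COUNT],
                          int form, uint32_t cp) {
    if (form < FORM_ISOLATE || form > FORM_FINAL) {
        return 0;
    }
    if (cp < SIMP_FIRST_CP || cp >= SIMP_FIRST_CP + SIMP_COUNT) {
        return 0;
    }
    uint16_t entry = shapes[form][cp - SIMP_FIRST_CP];
    return entry != 0 ? (uint16_t)(SIMP_BASE + entry) : 0;
}
```

With the excerpt, the initial form of beh resolves to 0xF149 and the final form of alef to 0xF142; the same function could be pointed at the complete array if its element type is compatible.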