CHM Localization and Unicode issues - dbcsFix.exe
In this thread (https://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2139818&SiteID=1) we discussed the localization issues in CHM. With the September Sandcastle release we have addressed the Unicode issues for CHMs built in East Asian languages. Here’s the problem:
1. For localized CHMs the sources need to be in ANSI if the language’s characters don’t all map to Western-1252.
2. If you compile ANSI sources, the HTML Help compiler assumes that the HTML codepage is your current system codepage. If your system is set to EN-US (like my system), the resulting CHM contains incorrect characters, unless you change your system settings (which requires a reboot).
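To make the second point concrete, here is a minimal sketch of what happens when bytes written in one ANSI codepage are read with another. The sample string and the choice of big5 versus Windows-1252 are illustrative assumptions, not something taken from the HTML Help compiler itself:

```python
# Illustration only: text saved in the big5 codepage but interpreted
# as Windows-1252, which is what an EN-US system codepage implies.
text = "中文"                      # sample Traditional Chinese text (assumed)
raw = text.encode("big5")          # the ANSI bytes stored in the HTML file

garbled = raw.decode("cp1252")     # wrong codepage: each byte becomes one Western character
restored = raw.decode("big5")      # right codepage: the original text comes back

print(repr(garbled))
print(repr(restored))
```

The bytes on disk are identical in both cases; only the codepage used to interpret them changes.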
To resolve this, one must:
1. Note the codepage in the source HTML (a META tag).
2. Re-encode all files in ANSI, using the appropriate code page.
3. Trick the OS into reporting a current codepage that is different from what it really is.
4. Compile.
Here’s our solution:
1. Have ChmBuilder write the codepage as UTF-8 into the HTML as it generates them (they actually are UTF-8 at this point).
2. Re-encode the files using dbcsFix.exe. DBCS stands for Double Byte Character Set, and we use this program to convert UTF-8 to ANSI. While doing so, substitute the actual codepage (e.g., big5) for what was initially written (UTF-8).
3. Wrap the call to HHC.exe in a call to MS APPLocale or SbAppLocale.exe, passing in the appropriate LCID.
dbcsFix.exe Details:
dbcsFix.exe attempts to work around limitations in the CHM compiler regarding character encodings and representations. Specifically:
1. Replaces some characters with ASCII equivalents, as follows:
Char name | utf8 (hex) | ascii
Non-breaking space | \xC2\xA0 | " " (for all languages except Japanese)
Non-breaking hyphen | \xE2\x80\x91 | "-"
En dash | \xE2\x80\x93 | "-"
Left curly single quote | \xE2\x80\x98 | "'"
Right curly single quote | \xE2\x80\x99 | "'"
Left curly double quote | \xE2\x80\x9C | "\""
Right curly double quote | \xE2\x80\x9D | "\""
Horizontal ellipsis | \xE2\x80\xA6 (U+2026) | "..."
After this step, no further work is done when LCID == 1033.
2. Replaces some characters with named entities, as follows:
Char name | utf8 (hex) | named entity
Copyright | \xC2\xA9 | &copy;
Registered trademark | \xC2\xAE | &reg;
Em dash | \xE2\x80\x94 | &mdash;
Trademark | \xE2\x84\xA2 | &trade;
3. Replaces the default "CHARSET=UTF-8" setting in the HTML generated by ChmBuilder with "CHARSET=" + the proper value for the specified LCID, as determined by the application's .config file.
4. Re-encodes all input HTML from their current encoding (UTF-8, as output by ChmBuilder) to the correct encoding for the specified LCID.
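The four steps above can be sketched in a few lines of Python. This is not the dbcsFix.exe source, only an illustration of the described behavior. The `CHARSETS` table, the function name `fix_html`, and the use of `xmlcharrefreplace` for unmappable characters are assumptions made for the sketch; the real tool reads its LCID mapping from its .config file, which is not shown in this post.

```python
import re

# Hypothetical LCID -> (HTML charset label, Python codec) table.
# The real mapping lives in the tool's .config file.
CHARSETS = {
    1028: ("big5", "cp950"),            # Chinese (Traditional)
    1041: ("shift_jis", "cp932"),       # Japanese
    1042: ("ks_c_5601-1987", "cp949"),  # Korean
    2052: ("gb2312", "cp936"),          # Chinese (Simplified)
}

# Step 1: characters replaced with ASCII equivalents.
ASCII_SUBSTITUTES = {
    "\u2011": "-", "\u2013": "-",       # non-breaking hyphen, en dash
    "\u2018": "'", "\u2019": "'",       # curly single quotes
    "\u201c": '"', "\u201d": '"',       # curly double quotes
    "\u2026": "...",                    # horizontal ellipsis
}

# Step 2: characters replaced with named entities.
ENTITY_SUBSTITUTES = {
    "\u00a9": "&copy;", "\u00ae": "&reg;",
    "\u2014": "&mdash;", "\u2122": "&trade;",
}

def fix_html(utf8_bytes, lcid):
    """Return the re-encoded bytes for one HTML file (illustrative sketch)."""
    text = utf8_bytes.decode("utf-8")

    # Step 1: ASCII substitutions; the non-breaking space is kept for Japanese.
    for char, replacement in ASCII_SUBSTITUTES.items():
        text = text.replace(char, replacement)
    if lcid != 1041:
        text = text.replace("\u00a0", " ")
    if lcid == 1033:
        return text.encode("utf-8")     # no further work for EN-US

    # Step 2: named entities.
    for char, entity in ENTITY_SUBSTITUTES.items():
        text = text.replace(char, entity)

    # Step 3: swap the declared charset for the one matching the LCID.
    label, codec = CHARSETS[lcid]
    text = re.sub(r"(charset\s*=\s*)utf-8",
                  lambda m: m.group(1) + label, text, flags=re.IGNORECASE)

    # Step 4: re-encode. Writing unmappable characters as numeric
    # references is a choice of this sketch, not documented tool behavior.
    return text.encode(codec, errors="xmlcharrefreplace")
```

Calling `fix_html(data, 1041)` on the UTF-8 bytes of a page would return Shift-JIS bytes whose META tag declares `shift_jis`, which is the state the CHM compiler needs to see when it runs under a Japanese locale.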
USAGE:
dbcsFix.exe [-d=Directory] [-l=LCID]
-d is the directory containing CHM input files (e.g., HHP file). For example, 'C:\DocProject\Output\Chm'. Default is the current directory.
-l is the language code ID in decimal. For example, '1033'. Default is '1033' (for EN-US).
Usage is also available with -?
After processing the inputs with dbcsFix.exe, the call to the CHM compiler must be made while the system locale matches the LCID that was passed to this tool. This can be done by changing your system settings via the control panel, by using MS APPLocale, or by using SbAppLocale.exe. With SbAppLocale.exe, the call should be similar to:
SbAppLocale.exe $(LCID) "%PROGRAMFILES%\HTML Help Workshop\hhc.exe" Path\Project.HHp
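If the build is driven from a script rather than a project file, the same command line can be assembled programmatically. The sketch below only builds the argument list. The function name `build_compile_command` and the fallback Program Files path are assumptions for illustration, and SbAppLocale.exe is assumed to be on the PATH:

```python
import ntpath
import os

def build_compile_command(lcid, project_path, sbapplocale="SbAppLocale.exe"):
    """Build the SbAppLocale-wrapped hhc.exe command line as an argument list."""
    # Fall back to the conventional location if PROGRAMFILES is not set (assumption).
    program_files = os.environ.get("PROGRAMFILES", r"C:\Program Files")
    # ntpath joins with backslashes regardless of the platform running the script.
    hhc = ntpath.join(program_files, "HTML Help Workshop", "hhc.exe")
    return [sbapplocale, str(lcid), hhc, project_path]

command = build_compile_command(1041, r"C:\DocProject\Output\Chm\Project.hhp")
print(command)
```

On a Windows machine the resulting list can be passed to `subprocess.run(command)`. The LCID given here should be the same one that was passed to dbcsFix.exe with -l.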
REFERENCES:
Here are some useful links about Unicode and general encoding issues:
- https://www.unicode.org/faq/basic_q.html [<--THIS ONE IS A MUST-READ]
- https://www.unicode.org/standard/WhatIsUnicode.html
- https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
- https://www.unicode.org/reports/tr22/
- https://www.w3.org/Protocols/rfc1341/7_1_Text.html
- https://www.cl.cam.ac.uk/~mgk25/unicode.html
- https://www.columbia.edu/kermit/utf8.html
- https://en.wikipedia.org/wiki/Character_encoding
- https://en.wikipedia.org/wiki/Windows_code_pages
- https://en.wikipedia.org/wiki/Unicode
- https://en.wikipedia.org/wiki/ISO_8859
- https://en.wikipedia.org/wiki/Windows-1252
- https://www.microsoft.com/globaldev/reference/sbcs/1252.mspx
Sincere thanks to my colleagues John Carl and Justin Russell for developing dbcsFix.exe. Cheers.
Anand..
Comments
Anonymous
September 29, 2007
Thanks for the info on dbcsFix. And BTW, nice job guys on the dnrTV video - I learned some stuff from it :)
Anonymous
October 01, 2007
I am excited to announce the availability of September 2007 version for Sandcastle. The latest version
Anonymous
September 02, 2010
Please note that MS AppLocale has moved to this location www.microsoft.com/.../AppLocale.aspx
Anonymous
September 02, 2010
Oh I take that back sorry. That page also has a bad link. So where has MS AppLocale gone? Maybe deleted because of DoJ -- Like other stuff that quietly disappears because MS don't have the rights to make the source available?