Introduction to MLang
MLang implements a set of services designed to help internationalize software that works with Internet data. More specifically, MLang helps solve problems that today's multilingual environment presents to software. This article describes the services provided by the MLang Component Object Model (COM) object.
- Internet Protocols
- About MLang
- MLang Dependencies
- MLang Architecture
- About Character Sets
- The MIME Database
- Related MLang APIs
- Converting Between Character Sets
- Code Page Detection
- Font Linking
- Line Breaking
- Related Links
MLang offers APIs for several areas of Internet software internationalization. These include a rich set of charset conversion APIs, the most important of which translate between the current Internet character sets and Unicode. Developers can use these APIs to base their applications entirely on Unicode and avoid dependencies on particular Internet character sets. MLang also provides two code page detection methods that automatically determine the languages and code pages in which data is written. MLang maps between a MIME charset identifier such as "gb2312" and the unsigned integer code page identifier that can be passed into MLang APIs or into standard Microsoft Win32 APIs. MLang also reports the charset and font support for specific scripts that is available on the computer running the software. This script and font information can be used to implement font linking, which is a general way to render an arbitrary Unicode character on a system. Finally, MLang allows an application to perform globalized line breaking, and it maps between language identifiers in RFC1766 format and Windows locale identifiers (LCIDs).
Internet Protocols
The two major protocols used by MLang are MIME and RFC1766. For more information on these protocols, see the Request for Comments (RFC) documents that cover their specifications. The following documents on these protocols can be found on the Internet.
- Tags for the Identification of Languages (RFC1766)
- MIME, Part one: Format of Internet Message Bodies (RFC2045)
- MIME, Part two: Media Types (RFC2046)
- MIME, Part three: Message Header Extensions for Non-ASCII Text (RFC2047)
- MIME, Part four: Registration Procedures (RFC2048)
- MIME, Part five: Conformance Criteria and Examples (RFC2049)
About MLang
MLang is implemented as an in-process server and is distributed as Mlang.dll only with Microsoft Internet Explorer 4.0 or later. This section provides the most important information needed to use MLang and includes a brief introduction to the overall structure of the MLang COM objects and interfaces.
MLang Dependencies
To compile programs that use the functionality provided in Mlang.dll, make sure the Mlang.h header file is in the include directory of your C/C++ compiler. In addition, to use the conversion methods provided by MLang, the National Language Support (.nls) files corresponding to the code pages you wish to use must be installed on your system.
The Component Object Library
This overview assumes you have knowledge of OLE and COM programming topics. In particular, before an application can use MLang's features, it must initialize the Component Object Library by calling the CoInitialize function. Every successful call to CoInitialize must be balanced by a corresponding call to CoUninitialize, typically when the application shuts down. CoUninitialize closes the COM library on the current thread, unloads the DLLs that the thread loaded, and frees the other resources that the thread maintains.
Unicode
One of the most useful operations MLang provides is converting strings between local character sets and Unicode. Unicode is a standard for representing characters; Windows stores Unicode text in a 16-bit encoding. Unicode characters are called "wide characters" on Windows because they are built from 16-bit units: most characters occupy a single unit, and characters outside the Basic Multilingual Plane occupy two units (a surrogate pair). This design allows Unicode to cover the characters of virtually every written language. (For a complete list of the characters and symbols Unicode supports, see The Unicode Home Page.) Microsoft provides support for Unicode in a number of different ways. Windows 2000 runs in Unicode, as opposed to a multibyte encoding scheme, and therefore supports Unicode versions of Microsoft Foundation Classes (MFC) functions. Windows 95 and Windows 98 do not support Unicode versions of MFC functions but are able to run applications that use Unicode.
MLang Architecture
The primary MLang interface is the IMultiLanguage2 interface, which provides a set of charset conversion methods, code page detection methods, and a number of methods that gather information on code pages, locales, and character sets. IMultiLanguage2 also provides access to all the interfaces and objects exposed by MLang. It should be created first and used to obtain the other MLang interfaces as they are needed. The IEnumCodePage, IEnumRfc1766, and IMLangConvertCharset interfaces are acquired through the IMultiLanguage2::EnumCodePages, IMultiLanguage2::EnumRfc1766, and IMultiLanguage2::CreateConvertCharset methods, respectively. Callers that retrieve these interfaces through these methods must release each one individually; the interfaces are not freed simply by releasing IMultiLanguage2.
About Character Sets
A character set is a group of characters from a given language. For example, the ASCII character set is the standard United States-English character set. MLang provides a number of APIs to help you use multiple character sets, including APIs that perform conversion to Unicode and font linking.
The following terms and definitions pertain to character sets. Understanding them will help you better understand MLang methods.
An encoding is a mapping of a character to a sequence of bits. All encodings except Unicode are called multibyte encodings.
A charset is the application of an encoding to each character in a character set. In other words, it is a character set in which every character has been assigned a unique numeric value.
A code page is a unique physical implementation of a charset. A code page is usually identified by an unsigned integer, although some MLang APIs require a DWORD as a code page identifier value instead.
The MIME Database
The MIME database contains detailed information about character sets, code pages, and locales. In previous versions of Windows Internet Explorer, the MIME database was kept in the registry, so clients could directly access the registry and make changes to the database. Microsoft Internet Explorer 5 stores the database in MLang itself; therefore, no changes can be made directly to the database. This new database implementation ensures that the data is stable and synchronized.
MLang uses the MIME database to gather the information necessary to convert strings between code pages, as well as to translate code page identifiers to charset names and LCIDs to RFC1766-conforming names. To accomplish this task, MLang provides the following features.
MLang defines three structures that are used to gather and contain information from the MIME database about specific code pages, character sets, and locales, respectively:
- MIMECPINFO
- MIMECSETINFO
- RFC1766INFO
MLang provides the following two interfaces that can be used to enumerate code page and locale information, respectively, from the MIME database:
- IEnumCodePage
- IEnumRfc1766
Related MLang APIs
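MLang also provides methods that obtain information from the MIME database. A number of these methods use this information to convert between RFC1766-conforming names and LCIDs, or between Internet character sets and Windows code pages. They include the following:
- IMultiLanguage2::GetCharsetInfo
- IMultiLanguage2::GetCodePageInfo
- IMultiLanguage2::GetFamilyCodePage
- IMultiLanguage2::GetLcidFromRfc1766
- IMultiLanguage2::GetNumberOfCodePageInfo
- IMultiLanguage2::GetRfc1766FromLcid
- IMultiLanguage2::GetRfc1766Info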
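As a rough illustration of the kind of mapping these methods perform, the following portable C++ sketch models two small lookup tables: one from a MIME charset name to a Windows code page identifier, and one from an RFC1766 language tag to an LCID. It does not call MLang. The function names code_page_for_charset and lcid_for_rfc1766 are hypothetical, and the tables hold only a few well-known entries rather than the contents of the MIME database.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <map>
#include <string>

// Lowercase an ASCII identifier so that lookups are case-insensitive.
inline std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical helper: MIME charset name -> Windows code page identifier.
// Returns 0 when the name is not in this small sample table.
inline unsigned code_page_for_charset(const std::string& charset) {
    static const std::map<std::string, unsigned> table = {
        {"gb2312", 936},
        {"shift_jis", 932},
        {"big5", 950},
        {"windows-1252", 1252},
        {"utf-8", 65001},
    };
    const auto it = table.find(to_lower_ascii(charset));
    if (it == table.end()) return 0;
    assert(it->second != 0);  // 0 is reserved in this sketch to mean "unknown"
    return it->second;
}

// Hypothetical helper: RFC1766 language tag -> Windows LCID.
// Returns 0 when the tag is not in this small sample table.
inline std::uint32_t lcid_for_rfc1766(const std::string& tag) {
    static const std::map<std::string, std::uint32_t> table = {
        {"en-us", 0x0409},
        {"de-de", 0x0407},
        {"fr-fr", 0x040C},
        {"ja-jp", 0x0411},
    };
    const auto it = table.find(to_lower_ascii(tag));
    if (it == table.end()) return 0;
    assert(it->second != 0);  // 0 is reserved in this sketch to mean "unknown"
    return it->second;
}
```

With these tables, code_page_for_charset("gb2312") yields 936 and lcid_for_rfc1766("en-US") yields 0x0409. An application that uses MLang would obtain such values from the methods described in this section rather than from a hand-written table.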
Converting Between Character Sets
MLang provides a number of different ways to convert strings from one code page to another. Most important is the MLang Conversion object. This object supports the IMLangConvertCharset interface and is dedicated to converting strings between a source code page and a destination code page, both of which the user must specify.
The IMultiLanguage2 interface also supports a number of conversion methods. These methods provide the same functionality as the IMLangConvertCharset methods, although they might take slightly different parameters. They are less efficient, however, when multiple conversions must be done between the same combination of source and destination code pages.
When using the conversion functionality provided by MLang, a caller should always take care to verify whether the parameters that specify the size of the source string and destination buffer are measured in a byte count or character count. In general, all APIs that are specifically dedicated to converting to or from Unicode take a character count for the Unicode string. All other strings are measured in a byte count.
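The difference between the two kinds of count can be sketched in portable C++ without calling MLang. In this sketch, std::string stands in for a multibyte (code page encoded) string and std::u16string stands in for a Unicode string of 16-bit characters. The helper names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

// Size of a multibyte (code page encoded) string, measured in bytes.
inline std::size_t size_in_bytes(const std::string& multibyte) {
    return multibyte.size();
}

// Size of a Unicode string, measured in 16-bit characters.
inline std::size_t size_in_wide_chars(const std::u16string& wide) {
    return wide.size();
}

// Bytes of storage needed for a buffer that holds `wide_chars` 16-bit characters.
inline std::size_t wide_buffer_bytes(std::size_t wide_chars) {
    // Guard against overflow when converting a character count to a byte count.
    assert(wide_chars <= std::numeric_limits<std::size_t>::max() / sizeof(char16_t));
    return wide_chars * sizeof(char16_t);
}
```

For example, the two-character Japanese string with code points U+65E5 and U+672C is the 4-byte sequence 0x93 0xFA 0x96 0x7B in Shift-JIS (code page 932), so its multibyte size is 4, while its Unicode size is 2 characters. Passing a byte count where a character count is expected, or the reverse, is a common way to overrun or underuse a destination buffer.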
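MLang provides the following conversion methods.
- IMLangConvertCharset::DoConversion
- IMLangConvertCharset::DoConversionFromUnicode
- IMLangConvertCharset::DoConversionToUnicode
- IMultiLanguage2::ConvertString
- IMultiLanguage2::ConvertStringFromUnicode
- IMultiLanguage2::ConvertStringFromUnicodeEx
- IMultiLanguage2::ConvertStringInIStream
- IMultiLanguage2::ConvertStringReset
- IMultiLanguage2::ConvertStringToUnicode
- IMultiLanguage2::ConvertStringToUnicodeEx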
Code Page Detection
Through the IMultiLanguage2 interface, MLang provides two code page detection methods, IMultiLanguage2::DetectInputCodepage and IMultiLanguage2::DetectCodepageInIStream. These methods determine the possible languages and code pages of the text data that is given by the caller, returning the results in an array of DetectEncodingInfo structures. In addition to containing a detected language and code page, these structures include two members that indicate the percentage of the data that is in the detected language, as well as the relative confidence that the structure contains the correct language and code page. To help increase the accuracy of the detected code page and language, the MLDETECTCP enumerated type is provided. These values specify the type of data that is being given to the detection method.
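One way a caller might consume the detection results is sketched below in portable C++. The DetectedEncoding structure is a stand-in that mirrors the four members of DetectEncodingInfo (nLangID, nCodePage, nDocPercent, and nConfidence). The pick_best function and its selection policy (highest confidence, with ties broken by document percentage) are assumptions made for illustration, not behavior defined by MLang.

```cpp
#include <cassert>
#include <cstddef>

// Portable stand-in for the DetectEncodingInfo structure.
struct DetectedEncoding {
    unsigned lang_id;     // mirrors nLangID
    unsigned code_page;   // mirrors nCodePage
    int doc_percent;      // mirrors nDocPercent
    int confidence;       // mirrors nConfidence
};

// Hypothetical policy: prefer the candidate with the highest confidence,
// breaking ties by the share of the data written in that language.
// Returns nullptr when the array is empty.
inline const DetectedEncoding* pick_best(const DetectedEncoding* results, int count) {
    assert(count >= 0);
    assert(results != nullptr || count == 0);
    const DetectedEncoding* best = nullptr;
    for (int i = 0; i < count; ++i) {
        const DetectedEncoding& candidate = results[i];
        if (best == nullptr ||
            candidate.confidence > best->confidence ||
            (candidate.confidence == best->confidence &&
             candidate.doc_percent > best->doc_percent)) {
            best = &candidate;
        }
    }
    return best;
}
```

The array-plus-count signature follows the way the detection methods return their results. A real caller would fill the array by calling one of the detection methods and might apply a different policy, such as rejecting candidates below a confidence threshold.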
Font Linking
Through the IMLangFontLink interface, MLang enables a client to create customized fonts that can display text from a variety of different code pages, a process known as font linking. Through font linking, MLang enables a client to output text that contains characters from a number of different languages. This functionality is especially useful when dealing with Unicode strings, which might contain characters from many character sets at once. For more information on font linking, see How to Use Font Linking.
A set of code pages is a central concept used throughout the IMLangFontLink and IMLangCodePages interfaces. A DWORD is used to represent a set of code pages, and each bit in the DWORD represents a specific code page. When a bit is set to 1, its corresponding code page is considered included in the set; if the bit is set to 0, its code page is not considered a member. Thus the DWORD 0x1e0000 would represent the code pages corresponding to the bits 0x100000, 0x80000, 0x40000, and 0x20000.
IMLangCodePages implements the set of code pages described above and provides the IMLangCodePages::CodePageToCodePages and IMLangCodePages::CodePagesToCodePage methods, which translate between the UINT code page identifier value for a single code page and the bit that represents it in a DWORD set of code pages. In addition, IMLangCodePages has methods that determine the code pages that contain a given Unicode character or the characters in a Unicode string. IMLangFontLink is derived from IMLangCodePages, so all IMLangCodePages methods are available through both interfaces.
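The set-of-code-pages representation can be modeled in a few lines of portable C++. In this sketch, std::uint32_t stands in for DWORD, and the helper names (contains, member_bits, and intersect) are hypothetical. The sketch shows only the bit arithmetic; it does not map bits to actual code page identifiers, which is the job of IMLangCodePages.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for the DWORD that represents a set of code pages.
using CodePageSet = std::uint32_t;

// True when the code page represented by the single bit `bit` is in `set`.
inline bool contains(CodePageSet set, CodePageSet bit) {
    assert(bit != 0 && (bit & (bit - 1)) == 0);  // exactly one bit must be set
    return (set & bit) != 0;
}

// List the individual bits that are set, from highest to lowest.
inline std::vector<CodePageSet> member_bits(CodePageSet set) {
    std::vector<CodePageSet> bits;
    for (int i = 31; i >= 0; --i) {
        const CodePageSet bit = CodePageSet(1) << i;
        if ((set & bit) != 0) bits.push_back(bit);
    }
    return bits;
}

// Code pages common to both sets, for example the code pages that can
// represent a run of text and that a candidate font also supports.
inline CodePageSet intersect(CodePageSet a, CodePageSet b) {
    return a & b;
}
```

Applied to the example above, member_bits(0x1e0000) returns the bits 0x100000, 0x80000, 0x40000, and 0x20000. An empty intersection between the code pages of a text run and the code pages of a font is the kind of condition under which a font-linking implementation would look for a different font.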
Line Breaking
MLang provides line-breaking functionality for console-based applications. The IMLangLineBreakConsole interface enables a user to break either Unicode or multibyte strings depending on the maximum number of columns that are desired for output. A column in this context is the size of one half-width character; a full-width character is two columns wide. For more information on line breaking, see How to Break Text Based on Locale.
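The column arithmetic can be sketched in portable C++ as follows. This is a deliberately simplified model and not the algorithm that IMLangLineBreakConsole uses: it breaks purely on column count, ignores word boundaries and locale-specific rules, and treats only a few Unicode ranges as full-width. The function names and the chosen ranges are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplifying assumption: only these ranges are treated as full-width.
inline bool is_full_width(char16_t ch) {
    return (ch >= 0x1100 && ch <= 0x115F)    // Hangul Jamo leading consonants
        || (ch >= 0x3000 && ch <= 0x30FF)    // CJK symbols, Hiragana, Katakana
        || (ch >= 0x4E00 && ch <= 0x9FFF)    // CJK Unified Ideographs
        || (ch >= 0xAC00 && ch <= 0xD7A3)    // Hangul syllables
        || (ch >= 0xFF01 && ch <= 0xFF60);   // Fullwidth forms
}

// A half-width character takes one column; a full-width character takes two.
inline std::size_t column_width(char16_t ch) {
    return is_full_width(ch) ? 2 : 1;
}

// Split `text` into lines that each occupy at most `max_columns` columns.
inline std::vector<std::u16string> break_by_columns(const std::u16string& text,
                                                    std::size_t max_columns) {
    assert(max_columns >= 2);  // a line must fit at least one full-width character
    std::vector<std::u16string> lines;
    std::u16string current;
    std::size_t used = 0;
    for (char16_t ch : text) {
        const std::size_t width = column_width(ch);
        if (used + width > max_columns) {
            // The next character does not fit, so start a new line.
            lines.push_back(current);
            current.clear();
            used = 0;
        }
        current.push_back(ch);
        used += width;
    }
    if (!current.empty()) lines.push_back(current);
    return lines;
}
```

For a string made of "ab", the two ideographs U+65E5 and U+672C, and "cd", a limit of four columns produces two lines: the first holds "ab" plus U+65E5 (1 + 1 + 2 columns), and the second holds U+672C plus "cd" (2 + 1 + 1 columns).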