Extend The Global Reach Of Your Applications With Unicode 5.0
Julie D. Allen, Michael S. Kaplan, and Cathy Wissink
This article uses the following technologies:
Unicode 5.0, Windows Vista
Most major software vendors provide a number of localized or translated products, as well as products that are capable of handling non-Latin writing systems. But there's still a long way to go. Ultimately, the goal of software internationalization is to allow the greatest number of people to communicate with each other, independent of language, writing system, or location. Simply put, people want to communicate with others, on their terms, in their own languages.
As members of the Unicode Consortium, Microsoft and other companies help the industry move closer to that goal of comprehensive language and regional support. In this article we will discuss the role Microsoft plays in this particular standards effort and, more importantly, describe the new features of Unicode 5.0 and how they are implemented in Windows Vista™.
Why Unicode Matters
Microsoft has actively participated in the Unicode Consortium for a number of reasons. First of all, consistency, stability, and interoperability of data, independent of writing system, location, and platform, are some of the key tenets of global software development. The industry needs internationalization standards, especially for encoding, that work consistently regardless of the platform they run on. These standards must also be representative of the key linguistic and cultural stakeholders they support. The Unicode Consortium develops standards with ongoing input from key stakeholders: industry, government bodies, language authorities, and related standards bodies. The existence of a single worldwide system to store, transmit, and manipulate the world's data streamlines and considerably accelerates the development and deployment of international software.
The benefits of Unicode to Microsoft became clear when adopting a single-source codebase built on Unicode reduced development time. For Windows® 2000, it took months to prepare the localized versions after the English-language product shipped. For Windows XP, this dropped to weeks.
The Unicode Consortium is also a crucial player in ensuring that the languages and cultures of the world can be represented in technology, most notably through its cooperation with the Script Encoding Initiative (see unicode.org/pending/about-sei.html). This initiative is supported by Microsoft and other software vendors via their various emerging market programs. The result from Microsoft has been the Language Interface Packs (LIPs) and Enabling Language Kits (ELKs) programs, as well as developer tools such as the Microsoft® Locale Builder and the Microsoft Keyboard Layout Creator. Also see the sidebar "ELKs and LIPs," as well as Figures 1 and 2, for a glimpse of Unicode in action.
Figure 1 Inuktitut Characters
Figure 2 Malayalam Characters
What's New in Unicode 5.0
Like earlier major versions, the Unicode Standard, Version 5.0, has been released as a book (Addison-Wesley, 2006) with associated data files. The book has been updated substantially and contains pages of new and clarified text, new and improved tables and illustrations, and more. For the first time, the Unicode Standard Annexes, which contain information about topics such as the Bidirectional Algorithm, identifiers, line-breaking properties, and text boundaries, are printed in the book. The annexes have been carefully edited and rewritten to provide improved guidance to developers and implementers.
The conformance clauses and definitions in the book have been renumbered in Version 5.0, eliminating the confusion of numbers such as C12b and D30e used in Version 4.0. (Appendix D includes a handy table showing the numbers used in earlier versions.)
Unicode 5.0 has more explicit guidelines for rendering Indic scripts. For CJK characters, the book now contains the IICore radical-stroke index, focusing on the subset of roughly 10,000 CJK characters most important for the East Asian markets. This new index makes it much easier to look up important characters in the standard; the full CJK radical-stroke index is available online for looking up rare characters. Character properties have been extended and improved.
Information about security considerations and reliability, topics of great importance to implementers, has been vastly extended. Version 5.0 is synchronized with coordinated updates to other important standards, including the Unicode Collation Algorithm, the Common Locale Data Repository, and the International Standard ISO/IEC 10646. (See the sidebar "Phishing and Unicode Security.")
Phishing and Unicode Security
Although the RFCs on Internationalized Domain Names took pains to point out the visual-spoofing problems that could arise if every Unicode character were allowed in domain names, most early adopters paid no heed to the warnings. It took actual visual spoofing with Unicode characters to get the attention of the software community, even though the risk had been called out long before in various IETF documents.
This kind of spoofing can happen when, for example, a bank Web site has the letter l in its name and a spoofer replaces it with the number 1. The characters look the same, and the casual user would never know that he is typing his bank user name and password into the Web site of a criminal. And when the characters allowed in a URL span many languages and writing systems, some of whose characters look identical to characters from other scripts, the opportunity to spoof a URL increases dramatically.
Unicode Technical Report #36, Unicode Security Considerations (unicode.org/reports/tr36) and Unicode Technical Standard #39, Unicode Security Mechanisms (unicode.org/reports/tr39) by Michel Suignard of Microsoft both explain how an application or platform supporting Unicode can avoid these issues.
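The look-alike problem described in this sidebar can be made concrete with a short Python sketch. It is only an illustration, not the mechanism defined in UTS #39: the host name bank.example is hypothetical, and the helpers script_hint, script_hints, and looks_suspicious are names invented here. The heuristic takes the first word of each letter's Unicode character name as a rough script label and flags a host name whose letters come from more than one script.

```python
import unicodedata


def script_hint(ch: str) -> str:
    """Rough script label: the first word of the Unicode character name."""
    # Assumption: for letters, names start with the script (LATIN, CYRILLIC, ...).
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_hints(hostname: str) -> set:
    """Collect the script labels of every letter in a host name."""
    return {script_hint(ch) for ch in hostname if ch.isalpha()}


def looks_suspicious(hostname: str) -> bool:
    """Flag host names whose letters come from more than one script."""
    return len(script_hints(hostname)) > 1


if __name__ == "__main__":
    genuine = "bank.example"        # hypothetical host name, all Latin
    spoofed = "b\u0430nk.example"   # U+0430 CYRILLIC SMALL LETTER A replaces "a"
    for name in (genuine, spoofed):
        print(ascii(name), sorted(script_hints(name)), looks_suspicious(name))
```

A check like this catches a Cyrillic letter hiding in a Latin name, but not the letter l versus digit 1 case from the sidebar, since both are plain ASCII characters; that case calls for confusable-character data rather than a script test.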
Version 5.0 adds new characters for Cyrillic, Greek, Hebrew, Kannada, Latin, mathematics, phonetic extensions, and symbols. It also adds several minority and historical scripts, such as Tifinagh and Phoenician. The Tifinagh script is used by more than 20 million people in Morocco and elsewhere who speak varieties of languages commonly called Berber or Amazigh. (Incidentally, the word "Tifinagh" means the Phoenician letters.) Phoenician is the script that sailed the ancient Mediterranean and is the forerunner of Arabic, Greek, Hebrew, and other scripts in modern use. The book Unicode Standard, Version 5.0 is a treasure trove of intriguing details about scripts, as well as useful implementation information. It is an indispensable reference. Figures 3 and 4 show a few more character sets.
Figure 3 Khmer Characters
Figure 4 Lao Characters
Why the Change?
In prior versions of Windows, Microsoft products lagged a few versions behind Unicode. Windows XP, for example, shipped data drawn from several Unicode versions (anywhere from Unicode 1.1 to 3.1), so that each feature using the data, whether collation, character properties, line breaking, the bidirectional algorithm, or UTF-8 conversion, had the Unicode version it needed. This mixed approach, while expedient for a shortened development cycle, led to various problems, including:
- Inadequate algorithmic support for operations such as UTF-8 conversions
- Insufficient coverage for languages that later turned out to have a business need for support
- Lack of good coverage of security-sensitive issues clarified in later versions
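The article does not spell out which UTF-8 conversion problems arose. One well-known difference between early and later versions of the standard is that later versions require converters to reject ill-formed input, such as non-shortest-form ("overlong") sequences and encoded surrogates. The Python sketch below, which relies on Python's built-in strict decoder rather than any Windows API, shows what that behavior looks like; is_well_formed_utf8 is a name made up for this example.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")  # Python's decoder is strict by default
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    samples = {
        "plain slash": b"/",
        "overlong slash": b"\xc0\xaf",       # two-byte form of U+002F, not shortest
        "encoded surrogate": b"\xed\xa0\x80",  # U+D800 pushed through UTF-8
        "e with acute": "\u00e9".encode("utf-8"),
    }
    for label, raw in samples.items():
        print(label, raw, is_well_formed_utf8(raw))
```

Rejecting the overlong form matters for security as well as correctness: a filter that looks for the byte 0x2F would miss a slash smuggled in as 0xC0 0xAF if a lenient converter later turned it back into "/".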
Such problems became more apparent when Microsoft started shipping the ELKs in order to support LIPs, beginning with Windows XP Service Pack 2 (SP2). Several of the languages on the list of potential ELKs and LIPs had to be removed from consideration because core support for Unicode properties, casing, and collation did not exist for languages such as Mongolian and Yi. Furthermore, Microsoft learned the hard way that trying to limit the supported ranges of Unicode to those considered strategic business cases for the company often proved ineffective, since yesterday's weak business cases often become tomorrow's strategic interests.
In addition, updating to the most recent version of the Unicode Standard comprehensively across Windows and the Microsoft .NET Framework will help fill some gaps in the international support expected of Microsoft products, in the standard itself, and in addressing related concerns of linguistic communities.
The customer requirement for Unicode 5.0 support on Windows Vista spanned the core services that ISVs could not extend themselves, such as casing, property, and character type values; default collation weights; and normalization support. (We did not try to ship font and rendering support of every character in Unicode 5.0, as that would be an incredibly time-consuming and difficult task, especially given the difference in targets between Unicode and Windows. The goal was to support languages as well as they could be supported with regard to the aforementioned core operations, while leaving other extensible details like typography and keyboard support to font authors and tools like the Microsoft Keyboard Layout Creator.)
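To give a feel for the kind of core data involved (names, general categories, bidirectional classes, and case mappings), here is a minimal Python sketch. It reads the Unicode Character Database bundled with the Python interpreter, not the Windows tables the article describes, and describe is a helper name invented for this illustration. The version of the data depends on the Python release, which is why the sketch prints unicodedata.unidata_version.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few core Unicode properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),   # general category, e.g. Lu, Ll, Lo
        "bidi": unicodedata.bidirectional(ch),  # bidirectional class, e.g. L, R
        "upper": ch.upper(),
        "lower": ch.lower(),
    }


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    # Latin A, Cyrillic ZHE, Hebrew ALEF, and a Tifinagh letter
    for ch in ("A", "\u0416", "\u05d0", "\u2d30"):
        print(describe(ch))
```

A character missing from the underlying data has no name, falls into the unassigned category, and has no case mapping, which is the situation a platform is in when its tables predate the script.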
Given that version support in Windows and the .NET Framework lay somewhere between Unicode 1.1 and Unicode 3.1, describing everything that changed for Microsoft with the move to Unicode 5.0 is complicated. The support of new languages and scripts is perhaps the easiest part for people to understand in terms of updates to the standard. However, the new version of the standard has many other must-have features that customers request in terms of Unicode support. Specifically on Windows, the following Unicode-related data has been updated and these additions have been included:
- Enhanced security features in Unicode.
- Updates for normalization and internationalized domain names.
- Latest version support for simple casing and other common operations.
- Support in collation beyond the simple code point ordering for Extension B that was added to Windows XP, needed for both Japanese/JIS X 0213 and Hong Kong/HKSCS because of the many new characters. This item is less a feature of Unicode than of Microsoft's own collation efforts; Windows collation now follows the example of the Unicode Collation Algorithm and adds the characters in the latest version of Unicode to its default table.
- Support for mathematics in Unicode.
- Stronger model for scripts in South Asia and Southeast Asia, improved due to implementers' input in recent years.
- A great deal of work on the conformance model for the standard that makes Unicode as a whole more stable and consistent due to better validation of the data and properties provided by the standard.
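The link between normalization and domain names in the list above can be illustrated with a small Python sketch. It uses Python's unicodedata module and its built-in "idna" codec (which implements the 2003 IDNA rules), not the Windows Vista APIs; the host name and the function to_ascii_hostname are hypothetical and exist only for this example.

```python
import unicodedata


def to_ascii_hostname(hostname: str) -> str:
    """Normalize to NFC, then convert each label with Python's IDNA codec."""
    composed = unicodedata.normalize("NFC", hostname)
    return composed.encode("idna").decode("ascii")


if __name__ == "__main__":
    precomposed = "b\u00fccher.example"   # u with diaeresis as one code point
    decomposed = "bu\u0308cher.example"   # u followed by COMBINING DIAERESIS

    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
    print(to_ascii_hostname(precomposed))
    print(to_ascii_hostname(decomposed))
```

The two spellings look identical on screen but compare unequal as raw strings; only after normalization do they map to the same ASCII form, which is why up-to-date normalization data is a prerequisite for handling such names consistently.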
Because extensibility of language and culture support, as well as consistent Unicode implementation, were important customer requirements for Windows Vista, it was crucial to develop an internationalization standards strategy and implementation plan that provided integration of the most recent version of the Unicode Standard across the platform.
While Microsoft is one of the early adopters of Unicode 5.0, other industry players are also already using this release or plan to upgrade in the near future. Adoption of this latest version of the standard benefits a wide spectrum of people, including developers, ISVs, and end users. All of the new Unicode features beneficial to Microsoft products apply just as significantly to others in the industry. Some obvious technical benefits include more software that leverages compatible UTF-8 conversions and more browsers that add anti-phishing capabilities.
ELKs and LIPs
Enabling Language Kits (ELKs) and Language Interface Packs (LIPs) were developed in response to the length of time it was taking to enable new languages for Windows.
ELKs allow installation on a per-locale basis of all the components that a locale needs to be supported, including new locale information, new keyboards, new sorts, new fonts, and new shaping engines. And all of this work is done without waiting for a major system release.
LIPs have a very different goal. They represent a localization framework to help mitigate the expense and time required to localize Windows into another language. Each LIP is a lightweight localization that translates the most commonly seen strings, following the 80/20 rule: it localizes the 20 percent of strings that users see 80 percent of the time.
For Microsoft, the business case for upgrading to Unicode 5.0 was clear and compelling. Other organizations interested in supporting the Unicode effort should consider the many benefits of joining the Unicode Consortium and being actively involved in important projects. Upgrading to Unicode 5.0 provides an even more comprehensive set of characters for languages, but perhaps even more significantly, it makes for a safer and more functional Internet for everyone around the world, including those who never venture far beyond the world of ASCII.
Julie D. Allen is Senior Editor and Project Manager with the Unicode Consortium. She has been involved with the Unicode Standard since Version 2.1. She has a Ph.D. in Germanics from the University of Washington, and honed her writing skills with an earlier degree in English literature.
Michael S. Kaplan is a Technical Lead at Microsoft, centering on Collation, Keyboards, and Locales. He was the principal developer for both the Microsoft Layer for Unicode on Windows 9x and the Microsoft Keyboard Layout Creator. He is the author of Internationalization with Visual Basic (Sams Publishing, 2000). His blog gets new posts about internationalization at blogs.msdn.com/michkap.
Cathy Wissink is a Group Program Manager at Microsoft, working in the Global Platform and Technology Services group. Cathy was a linguist intern on the Windows NT 3.1 team in 1991. Many Windows projects later, she joined Microsoft in 2000. She holds an MA and a PhC (Candidate of Philosophy Certificate) from the University of Washington in Germanic Linguistics, and an MEd in Adult Education and Training from Seattle University.