CLR Inside Out
This column is based on a prerelease version of Silverlight. All information is subject to change.
Most programs present some sort of data to users. Clearly applications with a GUI, such as a Windows Presentation Foundation (WPF) or Windows Forms application, have been designed to provide information to users. However, even simple console applications display text. Globalization is the process of making sure that your application can handle data that comes from different cultures. This can be as simple as making sure your calendar title is wide enough to handle Hawaiian month names or using the negative number format that your users expect.
Globalization is different from localization, which is actually translating resources from your application into different languages. Your application needs to be aware of globalization even if it is only in one language.
Here's an example of how there can be variations, even in an application that has not been translated. Certain types of data, such as dates, can be formatted in a variety of ways. Figure 1 lists just a few formats that use English names for months and days of the week. All the entries represent the same piece of data. But users have certain expectations about date formats, and they can indicate their preference via settings in the control panel. So if your program makes a call to DateTime.Now.ToString, any number of strings could be returned. If you have not specified a format, ToString automatically picks up the user's preferences.
Figure 1 Some Date Formats
The date string "1/2/03" is meaningless unless you know what each part of the string represents. You would think that knowing the format string would help you interpret the meaning of the data. But even that is no guarantee. Not everyone uses the same calendar. In fact, the System.Globalization namespace currently supports 15 different calendars. So the string "6/26/1980" interpreted as "M/d/yyyy" might represent different days in the Gregorian calendar and the Julian calendar. (There are a number of Web sites that can convert dates to different calendars.)
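To make the calendar point concrete, here is a minimal sketch that reads the same year, month, and day fields in two calendars. The class and method names (CalendarDemo, Interpret, DaysApart) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Globalization;

public static class CalendarDemo
{
    // Builds a DateTime from year/month/day fields as understood by the given calendar.
    public static DateTime Interpret(int year, int month, int day, Calendar calendar)
    {
        return new DateTime(year, month, day, calendar);
    }

    // Number of days between the Julian and Gregorian readings of the same fields.
    public static int DaysApart(int year, int month, int day)
    {
        DateTime gregorian = Interpret(year, month, day, new GregorianCalendar());
        DateTime julian = Interpret(year, month, day, new JulianCalendar());
        return (julian - gregorian).Days;
    }

    public static void Main()
    {
        // "6/26/1980" read as a Julian date falls 13 days after the Gregorian reading.
        Console.WriteLine(DaysApart(1980, 6, 26)); // 13
    }
}
```

The same three numbers therefore name two different days, which is why a format string alone is not enough to interpret a stored date.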
Interestingly enough, the same interpretation issues apply to numbers. Some users might expect the decimal separator to be a period, while others are used to seeing a comma. So when one user expects 1.23, another might expect 1,23. It gets even more complicated when you're dealing with negative numbers. If you look at System.Globalization.NumberFormatInfo.NumberNegativePattern, you will see that the Microsoft .NET Framework stores 5 patterns for the placement of the negative sign. And that's with only one symbol. It stores 16 different patterns for negative currency values in NumberFormatInfo.CurrencyNegativePattern.
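As a rough illustration of these settings, the following sketch formats one negative value with different decimal separators and negative patterns. It clones the invariant NumberFormatInfo so that the output does not depend on the machine's settings; NumberPatternDemo and Format are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class NumberPatternDemo
{
    // Formats a value with a chosen decimal separator and negative-number pattern.
    public static string Format(double value, string decimalSeparator, int negativePattern)
    {
        // Clone the invariant settings so the shared, read-only instance is not modified.
        NumberFormatInfo nfi = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        nfi.NumberDecimalSeparator = decimalSeparator;
        nfi.NumberNegativePattern = negativePattern; // valid values are 0 through 4
        return value.ToString("N2", nfi);
    }

    public static void Main()
    {
        Console.WriteLine(Format(-1.23, ".", 1)); // -1.23
        Console.WriteLine(Format(-1.23, ",", 1)); // -1,23
        Console.WriteLine(Format(-1.23, ".", 0)); // (1.23)
        Console.WriteLine(Format(-1.23, ".", 3)); // 1.23-
    }
}
```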
As you might guess, when storing data as strings you should always explicitly indicate the format you wish to follow and use that same format for both the ToString and Parse methods. Remember, some strings can be ambiguous without context. Of course, all this fancy formatting matters only for display purposes; it is a best practice to pick an invariant representation for stored data that will only be read by a machine. The Framework provides a mechanism for this with CultureInfo.InvariantCulture.
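One way to follow this advice is sketched below: the same explicit format and CultureInfo.InvariantCulture are used for both writing and reading. The choice of the round-trip format specifiers ("o" for dates, "R" for doubles) is an assumption made for this example, and InvariantStorageDemo and its methods are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class InvariantStorageDemo
{
    // Writes a date in the round-trip format, independent of user settings.
    public static string Store(DateTime value)
    {
        return value.ToString("o", CultureInfo.InvariantCulture);
    }

    // Reads the date back with the same format and culture used to write it.
    public static DateTime LoadDate(string stored)
    {
        return DateTime.ParseExact(stored, "o", CultureInfo.InvariantCulture,
                                   DateTimeStyles.RoundtripKind);
    }

    public static string Store(double value)
    {
        return value.ToString("R", CultureInfo.InvariantCulture);
    }

    public static double LoadDouble(string stored)
    {
        return double.Parse(stored, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        DateTime date = new DateTime(1980, 6, 26, 0, 0, 0, DateTimeKind.Utc);
        string text = Store(date);
        Console.WriteLine(text);                   // 1980-06-26T00:00:00.0000000Z
        Console.WriteLine(LoadDate(text) == date); // True
        Console.WriteLine(Store(1.23));            // 1.23
    }
}
```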
With so many options for formatting, how are you to know which one to choose? Fortunately, the .NET Framework has an infrastructure in place for formatting data for display. As you've seen, user preferences can be followed automatically. But how does the Framework decide what to use when a user hasn't specified?
Taking a closer look at the System.Globalization namespace can help answer this question. The contents of System.Globalization are used to describe information about a culture. We already know that it contains 15 calendars and 16 ways to format negative currency values. In addition, it has other culture-specific properties for time and numbers, as well as classes for storing information about languages and regions. It also contains a mechanism to organize and group this information for use. That mechanism is the CultureInfo class, which stores information about a culture, including number formatting, date formatting, and calendar, as well as names for the culture and information about the writing system.
CultureInfo objects are created using an identifier for a particular culture. For example, in order to create an object that contains information about Swedish as it is used in Finland, you would specify the name "sv-FI" in the constructor. The Framework includes information on many cultures, and Windows also has data for a variety of locales. In addition to this, users can create their own custom culture.
CultureInfo implements IFormatProvider, which is a type that ToString can accept as an argument for dates and numbers. So to go back to the original DateTime.Now.ToString example, if you wanted to get the date formatted with the default format associated with the sv-FI culture, you could write the following:
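A minimal sketch of such a call, assuming the sv-FI culture data is available on the machine where it runs (the exact output depends on that data and on any user overrides). SwedishFinlandDemo and FormatDefault are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class SwedishFinlandDemo
{
    // Formats a date with the default format of the named culture.
    public static string FormatDefault(DateTime value, string cultureName)
    {
        CultureInfo ci = new CultureInfo(cultureName);
        return value.ToString(ci); // CultureInfo is passed as the IFormatProvider
    }

    public static void Main()
    {
        Console.WriteLine(FormatDefault(DateTime.Now, "sv-FI"));
    }
}
```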
If there were no user overrides for sv-FI and the date in question happened to be "6/26/1980" again, the date would print as "26.6.1980" followed by the time string. If you wanted to get the long date format associated with sv-FI, you could write:
CultureInfo ci = new CultureInfo("sv-FI");
Console.WriteLine(DateTime.Now.ToString(ci.DateTimeFormat.LongDatePattern, ci));
This would yield the string "den 26 juni 1980." Note that the LongDatePattern includes only the date portion of the DateTime and omits the time. These examples work because the formatting information stored in the CultureInfo object is used.
That explains how ToString works when specific formatting information is passed in. But as you saw at the very beginning, there are overloads of ToString that take no formatting information at all. Yet the data is still formatted. Where does the information come from?
One of the static properties on the CultureInfo class is CurrentCulture. This returns the culture that is associated with the current thread. This property can be set via Thread.CurrentThread.CurrentCulture. When threads are created, they default to the current culture of the operating system. If no format information is specified for a ToString or Parse operation, the information in CultureInfo.CurrentCulture is used.
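The effect of the thread's current culture can be sketched as follows. To keep the result independent of the machine, the example builds a culture by cloning InvariantCulture and changing its decimal separator; that custom culture, and the names CurrentCultureDemo, FormatUnder, and CommaDecimalCulture, are assumptions made for illustration.

```csharp
using System;
using System.Globalization;
using System.Threading;

public static class CurrentCultureDemo
{
    // Formats a value with no explicit provider while the given culture is current.
    public static string FormatUnder(CultureInfo culture, double value)
    {
        CultureInfo previous = Thread.CurrentThread.CurrentCulture;
        try
        {
            Thread.CurrentThread.CurrentCulture = culture;
            return value.ToString(); // no provider: CurrentCulture is consulted
        }
        finally
        {
            Thread.CurrentThread.CurrentCulture = previous; // restore the original setting
        }
    }

    // A writable copy of the invariant culture that uses a comma as the decimal separator.
    public static CultureInfo CommaDecimalCulture()
    {
        CultureInfo culture = (CultureInfo)CultureInfo.InvariantCulture.Clone();
        culture.NumberFormat.NumberDecimalSeparator = ",";
        return culture;
    }

    public static void Main()
    {
        Console.WriteLine(FormatUnder(CultureInfo.InvariantCulture, 1.5)); // 1.5
        Console.WriteLine(FormatUnder(CommaDecimalCulture(), 1.5));        // 1,5
    }
}
```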
Culture information in general is subject to change. A country might adopt the Euro or change time zone or add another date format. So even if you specify a culture, the data might change between uses. Remember, the globalization information is for display only. If you need to store data as strings, the InvariantCulture property on CultureInfo is a good choice.
Localization with the CLR
What if you want to localize your application to more than one language? Doing this might involve all of the data display functionality handled in System.Globalization, as well as the translation of different parts of your application. You might also want to display different images for different cultures. One way that you could go about doing this would be to make multiple copies of your application, one for each language to which you want to localize your product.
This is quite inefficient, though. Fortunately, the Framework provides a way to abstract these localizable resources via classes found in the System.Resources namespace. The main classes for creating, reading, and using resources are ResourceWriter, ResourceReader, and ResourceManager.
The ResourceWriter class allows you to store resources as pairs of names and values. When you create an instance of ResourceWriter, it is for a particular file. The naming convention for these files is [basename].[culturename].resources. "Basename" is used for organizing your resources and is the name of an application or a class. You can use the CultureInfo.Name property to find the culture name.
U.S. English resources for the sample application MySampleApplication, for example, would be put in the file "MySampleApplication.en-US.resources". In addition, apps should have a default resource file for the neutral culture named [basename].resources. It is generally a good idea to package your resource files in satellite assemblies so that you can version the localization data separately from the main assembly. However, you should always include one set of resources, the neutral resources, in the main assembly so that resource lookup will always have at least one set of resources to find.
The ResourceReader class allows you to enumerate the name/value pairs from a resource file. However, the most commonly performed task is to look up a particular resource for a particular culture. This is accomplished via the ResourceManager class. You can access individual resources with the GetObject and GetString methods, or you can load all the resources for a particular culture into a Hashtable using the GetResourceSet method.
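The writing and enumerating steps might look like the sketch below, which stores string resources in a standalone .resources file and reads them back. It assumes a runtime where ResourceWriter is available and uses hypothetical names (ResourceFileDemo, Write, ReadAll) and a hypothetical temporary file location.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Resources;

public static class ResourceFileDemo
{
    // Stores name/value pairs in a .resources file.
    public static void Write(string path, IDictionary<string, string> entries)
    {
        using (ResourceWriter writer = new ResourceWriter(path))
        {
            foreach (KeyValuePair<string, string> entry in entries)
            {
                writer.AddResource(entry.Key, entry.Value);
            }
        } // disposing the writer generates the file and closes it
    }

    // Enumerates every name/value pair in a .resources file.
    public static Dictionary<string, string> ReadAll(string path)
    {
        Dictionary<string, string> result = new Dictionary<string, string>();
        using (ResourceReader reader = new ResourceReader(path))
        {
            foreach (DictionaryEntry entry in reader)
            {
                result[(string)entry.Key] = (string)entry.Value;
            }
        }
        return result;
    }

    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "MySampleApplication.en-US.resources");
        Write(path, new Dictionary<string, string> { { "greeting", "hi" } });
        foreach (KeyValuePair<string, string> pair in ReadAll(path))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
        File.Delete(path);
    }
}
```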
Just like the globalization information, resources are stored by culture. And just as globalization can be configured with a default culture, the ResourceManager uses CultureInfo.CurrentUICulture for its default culture. Again, this is a property of the current thread. CurrentCulture and CurrentUICulture often match, but they regulate different things. CurrentCulture controls the formatting of data, while CurrentUICulture determines which resources are loaded.
The framework carries around most of the data needed by classes in System.Globalization, but an application controls its own resources. If an application does not localize to a particular culture, the resources will not be available in that culture. But in general, you do not want resource lookup to fail. That is why you include the neutral resources in the main assembly in the first place—so that the application always has a set of resources to fall back to.
One property of CultureInfo that hasn't been mentioned is the Parent property. Culture information is arranged in a hierarchy where a neutral culture, that is, a culture not associated with a particular region, is the parent of a culture associated with a particular region. For example, the parent of en-US, en-CA, and en-GB is the neutral culture en. If a resource for a culture is not found, the ResourceManager will probe the chain of parents until one is found. This will end with the InvariantCulture, the resources embedded into the main assembly.
The hierarchy of cultures allows you to factor common resources out of specific cultures and into neutral cultures. Let's say as an example that I want to localize my application into en-US, en-CA, and en-GB. Many of the resources that are needed for these cultures are likely the same with only a few differing. In this case, I can store all of the common resources in a file for the neutral culture en, putting only the differences in the resource files for the specific cultures.
If you had the resources that are illustrated in Figure 2 and tried to look up the resource corresponding to the name "greeting," en-US would return "hi" while en-GB and en-CA would both fall back to the en resource "hello."
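The fallback behavior can be tried out with loose .resources files and a file-based ResourceManager, as in the sketch below. This is only an approximation of the setup described here: a real application would normally embed the neutral resources in the main assembly and ship the rest in satellite assemblies. The default greeting value, the directory layout, and the names ResourceFallbackDemo, CreateFiles, Lookup, and ParentChain are all hypothetical, and the example assumes the named cultures are available on the machine.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Resources;

public static class ResourceFallbackDemo
{
    const string BaseName = "MySampleApplication";

    // Writes one .resources file holding a single "greeting" entry.
    static void WriteGreeting(string dir, string fileName, string greeting)
    {
        using (ResourceWriter writer = new ResourceWriter(Path.Combine(dir, fileName)))
        {
            writer.AddResource("greeting", greeting);
        }
    }

    // Creates default, en, and en-US resource files; en-CA and en-GB get no file.
    public static void CreateFiles(string dir)
    {
        WriteGreeting(dir, BaseName + ".resources", "hello (default)"); // hypothetical value
        WriteGreeting(dir, BaseName + ".en.resources", "hello");
        WriteGreeting(dir, BaseName + ".en-US.resources", "hi");
    }

    // Looks up "greeting" for a culture, letting the ResourceManager walk the parents.
    public static string Lookup(string dir, string cultureName)
    {
        ResourceManager rm = ResourceManager.CreateFileBasedResourceManager(BaseName, dir, null);
        try { return rm.GetString("greeting", new CultureInfo(cultureName)); }
        finally { rm.ReleaseAllResources(); } // closes the files opened during lookup
    }

    // Lists a culture and its parents, ending at the invariant culture.
    public static List<string> ParentChain(string cultureName)
    {
        List<string> names = new List<string>();
        CultureInfo culture = new CultureInfo(cultureName);
        while (culture.Name.Length > 0)
        {
            names.Add(culture.Name);
            culture = culture.Parent;
        }
        names.Add("(invariant)");
        return names;
    }

    public static void Main()
    {
        string dir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(dir);
        CreateFiles(dir);
        Console.WriteLine(string.Join(" -> ", ParentChain("en-US")));
        Console.WriteLine(Lookup(dir, "en-US")); // specific resource
        Console.WriteLine(Lookup(dir, "en-CA")); // falls back to en
        Console.WriteLine(Lookup(dir, "fr-FR")); // falls back to the default file
        Directory.Delete(dir, true);
    }
}
```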
Figure 2 Resource Hierarchy
Working with Text
Text introduces a whole new set of complications. For instance, how do you actually represent all of the characters you might need for translating your text into other languages? Strings are represented in the .NET Framework by the System.String class. Fortunately, strings in the CLR use Unicode, specifically the UTF-16 encoding, so they can represent any character defined in the Unicode Standard.
The System.Text namespace includes some classes that can be used for encoding Unicode characters into bytes and decoding bytes into characters. This allows you to translate between different Unicode encodings if you need to.
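As a small illustration, the sketch below encodes a string as UTF-16 bytes, converts those bytes to UTF-8, and decodes them again. EncodingDemo and Utf16ToUtf8 are hypothetical names; the Encoding members used are part of System.Text.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Re-encodes UTF-16 (little-endian) bytes as UTF-8 bytes.
    public static byte[] Utf16ToUtf8(byte[] utf16Bytes)
    {
        return Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
    }

    public static void Main()
    {
        string text = "caf\u00E9"; // "cafe" with an acute accent on the e
        byte[] utf16 = Encoding.Unicode.GetBytes(text); // two bytes per character here
        byte[] utf8 = Utf16ToUtf8(utf16);               // the accented letter takes two bytes

        Console.WriteLine(utf16.Length);                          // 8
        Console.WriteLine(utf8.Length);                           // 5
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == text); // True
    }
}
```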
While Unicode is the current standard, there are some previous encodings that represented only characters used in a particular language or region. These encodings are known as "code pages." If you need to work with text encoded with one of these old code pages, the System.Text classes will allow you to work with those as well. In general, this is only necessary for working with legacy applications. New applications should use the Unicode standard.
Some strings might just be printed on the console or in a label, but often you will want to perform some action with a string. Perhaps a list of strings would be easier to read if it were sorted in the display. Maybe you want to convert characters in the string to a different case. These actions may seem simple enough, but looks can be deceiving. There is a fundamental problem in word sorting that might not immediately be apparent. In English, for example, there is a 26-character alphabet with a fixed ordering. But the characters have multiple cases. How do you sort instances of the same word with different casing?
The Framework has three modes of string comparison: ordinal, word, and string. The ordinal comparison looks at the numeric value of each character to compare, which means that ordering based on String.CompareOrdinal would consider "alphabet" greater than "Alphabet." Word comparison is the default. It is culture sensitive and might treat certain non-alphanumeric characters, such as the hyphen, as a special case. It can give them a small weight or consideration so that, for example, "a-lphabet" and "alphabet" sort near each other. String comparison is also culture sensitive but has no special cases, and the non-alphanumeric characters all sort before the alphanumeric ones.
If you were to do a culture-sensitive comparison for en-US using String.Compare, you would see that "Alphabet" is greater than "alphabet." If you were to do an ordinal comparison ignoring case, they would be equal. System.Globalization.CompareOptions includes nine different options that indicate which parameters to take into account when sorting.
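These comparisons can be tried with a sketch like the following. The ordinal results are fixed by the character values; the culture-sensitive result depends on the sort data available on the machine, so no output is shown for it. ComparisonDemo and its methods are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class ComparisonDemo
{
    // Compares by the numeric value of each character.
    public static int Ordinal(string a, string b)
    {
        return Math.Sign(String.CompareOrdinal(a, b));
    }

    // Case-sensitive, culture-sensitive comparison using the named culture's rules.
    public static int Cultural(string a, string b, string cultureName)
    {
        return Math.Sign(String.Compare(a, b, false, new CultureInfo(cultureName)));
    }

    // Ordinal equality that ignores case.
    public static bool EqualOrdinalIgnoreCase(string a, string b)
    {
        return String.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Ordinal("alphabet", "Alphabet"));                // 1
        Console.WriteLine(EqualOrdinalIgnoreCase("alphabet", "Alphabet")); // True
        // Result depends on the culture data; the text above expects a positive value.
        Console.WriteLine(Cultural("Alphabet", "alphabet", "en-US"));
    }
}
```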
Each culture has its own set of characters and sort order. Some have more than one. For example, Spanish (Spain), es-ES, uses International sorting by default, but can use the Traditional sort order as an alternative. Chinese (China), zh-CN, sorts by pronunciation by default, but can also sort by stroke count. Each CultureInfo object in the Framework has a CompareInfo property that is used for comparing strings. The Framework stores this data.
Generally speaking, the culture-sensitive ordering is much better for sorting and displaying. But when it comes to testing equality of strings, especially in certain cases when there might be security implications, ordinal comparison is definitely the best choice. Ordinal comparison looks only at the value of the compared characters; therefore, it is consistent while culture-sensitive comparisons may have different results based on the culture that is used.
One of the most commonly cited problems of using culture-sensitive string comparisons for security purposes is known as "the Turkish I problem." In most Latin alphabets, the character i (\u0069) is simply the lowercase version of the character I (\u0049). Most people using those alphabets have no idea that there could be any variation in this, so it's just considered to be a default. However, in Turkish (tr-TR) that mapping is incorrect. The uppercase version of the character i is the character İ (\u0130), or an uppercase I with a dot. Similarly, the lowercase version of the character I is the character ı (\u0131), or a lowercase i without a dot.
Consider the case where you want to check to see if a URI (Uniform Resource Identifier) starts with the string "FILE:". You want to do this in a case-insensitive way to make sure words such as "file:" don't get past your filter. If you compare "file:" to "FILE:" using the en-US culture and ignoring case, they will be equal. But if the tr-TR culture is used, they will not be equal. String.Compare defaults to the CurrentCulture.
So for such situations, you should use StringComparison.OrdinalIgnoreCase. If you are just testing equality, an even better option is String.Equals, which is ordinal by default and accepts StringComparison.OrdinalIgnoreCase when case should be ignored. Figure 3 gives you an idea of the results of ToUpper and ToLower for the letter I. For more information on comparing strings, please see "New Recommendations for Using Strings in Microsoft .NET 2.0".
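The URI check described above might be sketched as follows. The culture-sensitive variant is included only to show the pitfall: with Turkish culture data present it is expected to miss "file:", as explained above, while the ordinal variant gives the same answer everywhere. UriSchemeDemo and its method names are hypothetical.

```csharp
using System;
using System.Globalization;

public static class UriSchemeDemo
{
    // Risky: the answer depends on the casing rules of the named culture.
    public static bool StartsWithFileCultural(string uri, string cultureName)
    {
        return uri.StartsWith("FILE:", true, new CultureInfo(cultureName));
    }

    // Preferred for security checks: culture-independent, case-insensitive.
    public static bool StartsWithFileOrdinal(string uri)
    {
        return uri.StartsWith("FILE:", StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        string uri = "file:///C:/secret.txt"; // hypothetical input
        Console.WriteLine(StartsWithFileOrdinal(uri));           // True
        Console.WriteLine(StartsWithFileCultural(uri, "en-US")); // depends on culture data
        Console.WriteLine(StartsWithFileCultural(uri, "tr-TR")); // depends on culture data
    }
}
```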
Figure 3 Results of ToUpper and ToLower
Globalization is especially important to Silverlight applications. With a desktop application, you can know exactly who your target audience is. But a Silverlight application on the Web can be seen by anyone. Silverlight has much of the same globalization infrastructure as the desktop Framework, but it gets most of its data from the underlying operating system. This helps a Silverlight application running on Windows look more like a Windows-based application and a Silverlight application running on Mac OS look more like a Mac OS application.
Many of the differences between the desktop CLR and Silverlight (CoreCLR) stem from the fact that for speed and ease of download, a smaller runtime was needed for Silverlight. As such, Silverlight gets much more information from the underlying operating system rather than from carrying it all around in the CoreCLR. For instance, the CoreCLR doesn't have the sorting tables stored, so it only has access to those that the operating system has. Additionally, Silverlight only uses Unicode and does not have the legacy code pages. It also gets culture information from the operating system. This means that you really have to be prepared to handle a wide variety of data.
It has always been the case that data might not be available if it relied on a custom culture or an OS culture. But with all of the data coming from the OS, a single Silverlight application is more likely to run into different data. It is a good idea to make sure your target OS supports the UI cultures you want to use and to make sure your application behaves in a reasonable manner when the requested culture is not present.
Culture data is baked into the .NET Framework, making it difficult to update at the same frequency at which the data changes. This leads to data that can be out of date. It can also lead to the situation in which native applications and managed applications display different data. In Silverlight, you get consistency between the two, and applications can take advantage of updated OS information.
Using OS data also allows you to take advantage of more information. For example, in the desktop version of the .NET Framework, every CultureInfo has a single-parent culture, and to look up resources, all the ResourceManager has to do is simply walk the parent chain.
Take an application with de-DE resources and fr-FR resources. Now let's say that you have a user with es-ES as the primary culture but who also knows German and could read de-DE resources. The ResourceManager would fall back to the invariant culture even though resources existed that the user could understand. Most modern operating systems allow users to specify multiple language preferences. In Silverlight, we modified the ResourceManager to follow those preferences before falling back to the invariant culture.
Silverlight also contains a few differences in string comparison that are related to security rather than size. The default for most of the operations that involve string comparison in the desktop Framework is CurrentCulture. In Silverlight, the operations that involve partial matching, which includes both String.IndexOf and String.StartsWith, now use ordinal comparison by default. Compare and CompareTo still use CurrentCulture for the default, though you should really think about whether or not you are using the correct overload for the specific behavior you want. ToUpper and ToLower now use the InvariantCulture by default. Equality still uses ordinal comparison.
If you would like more information about building world-ready applications for Silverlight and Windows, good places to look include the top-level MSDN pages for System.Globalization, System.Resources, and System.Text. You can find links to more detailed information there, as well as helpful example code. Finally, the Go Global Developer Center is also an excellent resource for information about building international applications.
Send your questions and comments to email@example.com.
Melitta Andersen is a Program Manager on the Base Class Libraries team of the CLR. She mainly works on numerics, collections, and some of the infrastructure detailed in this column.