Storage and Performance Effects of Unicode
SQL Server 2005 stores Unicode data by using the UCS-2 encoding scheme. Under this mechanism, all Unicode characters are stored by using 2 bytes.
The difference in storing character data between Unicode and non-Unicode depends on whether non-Unicode data is stored by using double-byte character sets. All non-East Asian languages and the Thai language store non-Unicode characters in single bytes. Therefore, storing these languages as Unicode uses two times the space that is used specifying a non-Unicode code page. On the other hand, the non-Unicode code pages of many other Asian languages specify character storage in double-byte character sets (DBCS). Therefore, for these languages, there is almost no difference in storage between non-Unicode and Unicode.
The following table shows the non-Unicode code pages that specify character data storage in double-byte character sets.
Language | Code page |
---|---|
Simplified Chinese |
936 |
Traditional Chinese |
950 |
Japanese |
932 |
Korean |
949 |
The effect of Unicode data on performance is complicated by a variety of factors that include the following:
- The difference between Unicode sorting rules and non-Unicode sorting rules
- The difference between sorting double-byte and single-byte characters
- Code page conversion between client and server
SQL Server performs string comparisons of non-Unicode data defined with a Windows collation by using Unicode sorting rules. Because these rules are much more complex than non-Unicode sorting rules, they are more resource-intensive. So, although Unicode sorting rules are frequently more expensive, there is generally little difference in performance between Unicode data and non-Unicode data defined with a Windows collation.
The only case when SQL Server uses non-Unicode sorting rules is on non-Unicode data that is defined by using SQL collation. Sorts and scans in this instance are generally faster than when Unicode sorting rules apply. Unicode sorting rules apply to all Unicode data, defined by using either a Windows collation or SQL collation.
Of secondary importance, sorting lots of Unicode data can be slower than non-Unicode data, because the data is stored in double bytes. On the other hand, sorting Asian characters in Unicode is faster than sorting Asian DBCS data in a specific code page, because DBCS data is actually a mixture of single-byte and double-byte widths, while Unicode characters are fixed-width.
Other performance issues are primarily determined by the issue of converting the encoding mechanism between an instance of SQL Server and the client. Generally, the effects on performance of client/server code-page conversion are negligible. Nevertheless, you should understand what is occurring at this layer.
The ODBC API, version 3.6 or earlier, and the DB-Library API do not recognize Unicode. For clients that use data access methods defined by these APIs, resources are used to implicitly convert Unicode data to the client code page. Also, there is a risk of data corruption on the client side when the client code page does not recognize certain Unicode characters.
Later versions of ODBC, starting with Microsoft Data Access Components version 2.7 that was included with SQL Server version 7.0, and OLE DB and ADO are Unicode aware and assume a UCS-2 encoding mechanism. Therefore, if the application is Unicode enabled, there are no conversion issues when you work strictly with Unicode data from an instance of SQL Server. If a client is using a Unicode-enabled API but the data storage mechanism in the instance of SQL Server is not Unicode, there are no conversion issues. However, there is a risk that any data insert or update operations will be corrupted if the code points for any character cannot be mapped to the SQL Server code page.
Unicode Best Practices
Deciding whether to store non-DBCS data as Unicode is generally determined by an awareness of the effects on storage, and about how much sorting, conversion, and possible data corruption might happen during client interactions with the data. Sorting and conversion may affect performance, depending on where it occurs. However, for most applications the effect is negligible. Databases with well-designed indexes are especially unlikely to be affected. However, data corruption will affect not only the integrity of an application and database, but also the business as a whole. Considering this trade-off, storing character data in a specific code page may make sense if both of the following are true:
- Conserving storage space is an issue, because of hardware limitations. Or, you are performing frequent sorts of lots of data, and testing indicates that a Unicode storage mechanism severely affects performance.
- You are sure the code pages of all clients accessing this data match yours, and that this situation will not unexpectedly change.
Most of the time, the decision to store character data, even non-DBCS data, in Unicode should be based more on business needs instead of performance. In a global economy that is encouraged by rapid growth in Internet traffic, it is becoming more important than ever to support client computers that are running different locales. Additionally, it is becoming increasingly difficult to pick a single code page that supports all the characters required by a worldwide audience.
See Also
Concepts
Collation and Code Page Architecture
Collation Types