
Why do I write so many collation topics?

   You may wonder why I keep writing about collation in my blog. I have found that our Books Online topics related to collation are not clear enough, and they sometimes confuse readers. I intend to present my ideas about collation in a series of blog posts, and in the end I want to group these topics together to draw a clear picture of collation. My posts will try to answer the following questions in detail:

  • How should I choose a collation?
  • What is the difference between collation A and collation B?
  • How does collation affect query behavior, such as string functions and variable assignment?
  • What are the best practices for using SQL Server collations?

 Again, I strongly encourage you to send me comments or collation-related questions. If I cannot answer one, I will find a language expert to answer it, and I will share the answer here as well.

Comments

  • Anonymous
    March 30, 2009
    SQL Server 2005 (90) versus 2008 (100) binary collations (BIN2, etc.): I cannot find any information detailing the internals of collations and the actual differences between versions of collations. I would like to know what the actual internals are, and what the difference is between Latin1_General_BIN2 and Latin1_General_100_BIN2. Can you help? Also, if you can, where did you get the information? I have trawled the net and cannot find what I need.

    I noticed something odd: Latin1_General_BIN on an NVARCHAR column ordered characters as NULL, 32, 0, 1, 2, 3, ... (SPACE, 32, straight after NULL), but on a VARCHAR column it ordered NULL, 0, 1, 2, etc. Do you have any explanation for this? The BIN2 collation orders as I expected for both NVARCHAR and VARCHAR, i.e. NULL, 0, 1, 2, 3, etc. Is the BIN2 collation surrogate-character friendly (treating a UTF-16 surrogate pair as two UCS-2 characters out of the valid range), or is it fully aware? The Microsoft description of the differences between BIN and BIN2 appears to mask the fact that BIN comparison of multiple NVARCHAR characters for ordering is, in my tests, broken; hidden away is some reference to comparing the first character as 16 bits and then each following byte of the characters, but Intel endianness will order these incorrectly. Is my understanding correct?

    I think my application's database should be created using a BIN2 collation. I feel that my application will then by default match SQL Server's equality tests on Unicode data, and I can set a separate collation for any special case. I will perform any localised ordering within the application. The only thing I cannot decide is the default collation for the server instance. Are there any issues with setting this to BIN2, apart from compatibility with existing databases and the case sensitivity of object names? Thanks, Justin...

  • Anonymous
    March 31, 2009
    Hello, Justin. One quick answer: all BIN2 collations are the same in terms of sorting and comparison, no matter which version they were introduced in. The same is true for all BIN collations as well. I will follow up on your questions later, and I will also post a blog entry about the binary collations.
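The BIN-versus-BIN2 ordering oddity Justin observed can be sketched outside SQL Server. The following is a hedged Python illustration of the documented comparison rules (a legacy BIN collation compares the first NVARCHAR character as a 16-bit unit and the remaining data byte by byte in little-endian storage order, while BIN2 compares 16-bit code units throughout); the key functions are illustrative names of my own, not SQL Server APIs:

```python
def bin_style_key(s: str):
    # Legacy ..._BIN on NVARCHAR: first character compared as a 16-bit
    # unit, the rest compared byte-by-byte in little-endian order.
    if not s:
        return (-1, b"")
    return (ord(s[0]), s[1:].encode("utf-16-le"))

def bin2_style_key(s: str):
    # ..._BIN2: every UTF-16 code unit compared as a 16-bit number.
    b = s.encode("utf-16-le")
    return tuple(int.from_bytes(b[i:i + 2], "little")
                 for i in range(0, len(b), 2))

# U+00FF (255) < U+0100 (256), so BIN2 puts "a\u00ff" first; but the
# little-endian bytes of U+0100 (00 01) compare below those of
# U+00FF (ff 00), so the legacy BIN rule flips the order after the
# first character.
words = ["a\u0100", "a\u00ff"]
print(sorted(words, key=bin2_style_key))  # code-unit order
print(sorted(words, key=bin_style_key))   # legacy byte order
```

This is exactly the "first character as 16 bits, then byte-by-byte" behavior the question refers to: once the first character ties, Intel byte order takes over and the results stop matching code-unit order.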

  • Anonymous
    June 24, 2009
    If the default collation on a Unicode column is Latin1_General for English, what sorting order is applied to other languages in the same column, e.g., Chinese or Spanish? Are they sorted by Unicode code points (binary order)? Thanks.

  • Anonymous
    June 24, 2009
    Hello, Intangible. It is a common misunderstanding that a Latin1_General collation can only sort Latin-1 characters correctly. Unicode contains characters from a wide range of languages. If the sorting rules of two languages do not conflict, it is possible to sort a mix of characters from both languages correctly. For example, Latin1_General can also sort German correctly. As for Chinese and Spanish, they are sorted linguistically. However, since Chinese and Spanish each have two sorting algorithms, we have to pick one for the Latin1_General collation. Chinese has an additional issue: its characters are re-used by Japanese and Korean. In other words, you may not know whether a character is Chinese or Korean, which makes choosing the correct sorting order for these characters much harder.
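The gap between binary (code-point) order and the linguistic order a collation such as Latin1_General_CI_AS applies can be seen in a small, hedged Python sketch. Real Windows collations use full Unicode weight tables; the crude_linguistic_key below is only an illustrative stand-in (strip accents, ignore case) to show why the two orders diverge:

```python
import unicodedata

words = ["Zebra", "apple", "\u00e4hnlich"]  # "ähnlich" (German)

# Binary order compares raw code points: uppercase 'Z' (U+005A) sorts
# before lowercase 'a' (U+0061), and '\u00e4' lands after both.
binary_order = sorted(words)

def crude_linguistic_key(s: str):
    # Illustrative stand-in for collation weights: drop combining
    # accents, then fold case. Real collations use per-language
    # weight tables, not this shortcut.
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()

linguistic_order = sorted(words, key=crude_linguistic_key)

print(binary_order)      # ['Zebra', 'apple', 'ähnlich']
print(linguistic_order)  # ['ähnlich', 'apple', 'Zebra']
```

The same contrast is what separates a BIN/BIN2 collation (code-point order) from a linguistic one in SQL Server.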

  • Anonymous
    June 25, 2009
    The comment has been removed

  • Anonymous
    June 26, 2009
    Hello, Intangible. When you use an explicit COLLATE clause, you will see no performance difference if there is no index defined on that column; otherwise, you will not be able to use the index defined on the column. Latin1_General_CI_AS or Latin1_General_100_CI_AS can sort more than 50 languages. I will describe the reason behind this, and the languages supported by this collation, in a later post. You noticed that we have French_CI_AS and Turkish_CI_AS collations. The reason is that these languages have sorting conflicts with Latin1_General, which have to be resolved by introducing new collations. Please let me know if you have further questions.
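Why an explicit COLLATE clause defeats an index can be sketched with a hedged analogy in Python (an analogy only, not SQL Server internals): an index is a list kept sorted under one collation, and a lookup under a different collation cannot binary-search that order:

```python
import bisect

# Think of an index as a list kept sorted under ONE collation.
# Here it is sorted case-sensitively (binary order).
index = sorted(["Apple", "apple", "Banana", "banana"])

# A seek under the SAME ordering works: bisect jumps straight to it.
pos = bisect.bisect_left(index, "apple")
assert index[pos] == "apple"

# A case-insensitive lookup (a different "collation") cannot rely on
# bisect: equal values are no longer contiguous in this order, so the
# engine has to scan every entry instead of seeking.
matches = [w for w in index if w.casefold() == "apple"]
print(matches)  # both casings, found only by a full scan
```

The scan in the last step is the analogue of the index scan (or table scan) you get when the query's COLLATE clause does not match the collation the index was built with.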

  • Anonymous
    July 11, 2009
    Good blog posts about Unicode and collations. There is one question I cannot find an answer to, though. I have the following table:

    CompanyID   varchar(20) COLLATE Latin1_General_CI_AS
    CompanyName varchar(60) COLLATE Latin1_General_CI_AS

    This table definition is used in an English database as well as in Russian and Chinese ones. So the data in the table is not saved "correctly", but it can be displayed correctly in Chinese. Now we are going to change the table to the following:

    CompanyID   nvarchar(20) COLLATE Latin1_General_CI_AS
    CompanyName nvarchar(60) COLLATE Latin1_General_CI_AS

    How can we convert the Russian and Chinese characters saved in the databases to correct Unicode (nvarchar) characters? We have experimented with the MultiByteToWideChar function, and it seems to work for Chinese. Are there any other options? What about performance?

  • Anonymous
    July 17, 2009
    Hello, Leif. The varchar data type is always associated with a code page, and it can only store the characters defined in that code page. For example, the code page of the Latin1_General_CI_AS collation is 1252, which is Western European. You can look at the code-point table here: http://en.wikipedia.org/wiki/Windows-1252.

    However, you say you stored Chinese and Russian characters in the varchar column. It seems impossible, but it really is possible. The reason is that varchar(10) is just 10 bytes, and you can store any binary sequence in it, no matter what the encoding is. I believe this was achieved by binding the parameter as a varbinary data type in the client program and sending the binary sequence to the server directly.

    Note that if the real encoding of the column (encoding and code page are the same concept) is different from the code page associated with the column, you will have trouble when you issue queries against that data. Since SQL Server assumes the code page is 1252 but you stored data in another encoding, you will observe that SQL Server returns characters you cannot read, or sometimes a question mark (?), which indicates the data is corrupt.

    So in your case, since you have already inserted data with a different encoding into the varchar column, you need to first identify whether the data was corrupted. Second, find the real encoding of the column; from your description, it looks like an encoding that can store both Russian and Chinese characters, which might be either UTF-8 or UTF-16. Then call MultiByteToWideChar with that code page. Note that if your data is already corrupted in the database, you may have trouble converting it to the nvarchar type.
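The round trip this answer describes can be sketched in Python. This is a hedged illustration: the UTF-8 assumption and the sample string are mine, and in a Win32 client the decode step corresponds to calling MultiByteToWideChar with the column's real code page (e.g. CP_UTF8):

```python
# Hypothetical scenario: the client wrote UTF-8 bytes straight into a
# varchar column whose collation claims code page 1252.
original = "Компания 中文"                    # Russian + Chinese text
stored_bytes = original.encode("utf-8")       # what the column really holds

# What a cp1252 reader (SQL Server's assumption) sees: mojibake.
mojibake = stored_bytes.decode("cp1252", errors="replace")
assert mojibake != original

# The fix: decode with the REAL encoding first, then re-encode as
# UTF-16 for the nvarchar column (the form MultiByteToWideChar
# produces on Windows).
recovered = stored_bytes.decode("utf-8")
assert recovered == original
utf16_payload = recovered.encode("utf-16-le")
```

The key point the sketch makes is that the bytes themselves are intact; only the code-page label on the column is wrong, so recovery is a decode with the true encoding followed by a store into nvarchar.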

  • Anonymous
    June 28, 2010
    I have a question that I have not been able to get answered yet, and the Microsoft documentation seems confusing, at least to me. Obviously I have googled it, but I cannot find a real answer: why should I choose Windows collations instead of SQL collations? What are the real advantages of using Windows collations over SQL collations? I know that they handle Unicode and non-Unicode data differently, but in real life, what is the difference? Thank you in advance!