SQL collation and performance

Recently a colleague asked me if SQL collations have any impact on performance. We also hit upon another question: do local variables (DECLARE @somevar <datatype>) have a collation associated with them and if so how is it controlled?

Let us take some simple examples to demonstrate these points and derive conclusions.

Create databases

First, we are setting up 2 databases (inspired by SQL Books Online, these are aptly named GreekDB and LatinDB Smile).

— Setup databases first
create database greekdb
collate greek_ci_as
go

create database latindb
collate latin1_general_cs_as
go

Create tables with dummy data

Next, we are creating some tables in them, inserting data and then building indexes. First, we do the GreekDB ones:

use greekdb
create table greektable
(
    greekstr varchar(50)
)

insert greektable values (‘hello world’)

create index nc_greek
on greektable (greekstr)

Then the LatinDB ones:

use latindb
create table latintable
(
    latinstr varchar(50)
)

insert latintable values (‘hello world’)

create index nc_latin
on latintable (latinstr)

Query the tables

We start by querying these tables and observing the execution plans (I am using text showplan for ease of copying into this blog post.)

Query 1: Best case scenario

use greekdb
select *
from greektable
where greekstr = ‘Hello world’

Here is the execution plan:

  |–Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr]=[@1]) ORDERED FORWARD)

Observations: Fairly obvious results, no collation mismatch, everything is fine, index gets used to seek.

Query 2: Cross database query with collation mismatch

use latindb
select *
from greekdb.dbo.greektable
where greekstr = ‘Hello world’

Here is the execution plan:

  |–Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr]=CONVERT_IMPLICIT(varchar(8000),[@1],0)) ORDERED FORWARD)

Observations: Before you read further, keep in mind that USE DATABASE statement, referencing LatinDB. The rest of the query operates on greekdb. Looking at the execution plan, the implicit conversion of the string literal is interesting. String literals for a batch inherit the database collation (click on the previous link and check the Remarks section) if a COLLATE clause is not specified. That is what we are seeing here, the string literal ‘hello world’ defaults to the collation of the database, which is in this case is specified as LatinDB. Due to this, the string literal is converted to the collation of the column. The index seek can still be used because the column is not being converted.

Query 3: Usage of COLLATE to specify string literal collation

use latindb
select *
from greekdb.dbo.greektable
  where greekstr = ‘Hello world’ COLLATE greek_ci_as

Here is the execution plan:

  |–Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr]=CONVERT(varchar(8000),[@1],0)) ORDERED FORWARD)

Observations: You see in this case that specifying COLLATE has no effect on the query plan, except that the CONVERT_IMPLICIT changes to a CONVERT (explicit.) The string literal still is coerced into the new collation (GREEK_CI_AS.) Please refer to the collation precedence rules for details (click on the previous link and check the ‘Collation rules’ section) on this.

Query 4: Explicit-explicit coercion is not allowed

select * from greekdb.dbo.greektable G
join latindb.dbo.latintable L
on G.greekstr = L.latinstr

Result: Msg 468, Level 16, State 9, Line 4
Cannot resolve the collation conflict between “Latin1_General_CS_AS” and “Greek_CI_AS” in the equal to operation

Observations: This should be obvious from the previous collation rules (if you clicked on the previous link).

Query 5: Usage of COLLATE to specify column collation

use latindb
select * from greekdb.dbo.greektable G
join latindb.dbo.latintable L
on G.greekstr collate latin1_general_cs_as = L.latinstr

Here is the execution plan:

|–Nested Loops(Inner Join, OUTER REFERENCES:([Expr1006]))
     |–Compute Scalar(DEFINE:([Expr1006]=CONVERT(varchar(50),[greekdb].[dbo].[greektable].[greekstr] as [G].[greekstr],0)))
     |    |–Index Scan(OBJECT:([greekdb].[dbo].[greektable].[nc_greek] AS [G]))
     |–Index Seek(OBJECT:([latindb].[dbo].[latintable].[nc_latin] AS [L]), SEEK:([L].[latinstr]=[Expr1006]) ORDERED FORWARD)

Observations: This query which is almost the same as previous one, works, while the previous errors out. The reason should be obvious: the explicit conversion of the Greek table columns to the Latin collation enables this to work. While this sounds good, there is potential for data loss. From this link, you can see the dreaded words: ‘Data loss during code page translations is not reported.’

Query 6: Explicit collation conversion on column

use greekdb
select *
from greekdb.dbo.greektable
where greekstr collate latin1_general_cs_ai = ‘Hello world’

Here is the execution plan:

  |–Index Scan(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]),  WHERE:(CONVERT(varchar(50),[greekdb].[dbo].[greektable].[greekstr],0)=CONVERT_IMPLICIT(varchar(8000),[@1],0)))

Observations: In this query, which is kind of the same to the previous query, we convert the Greek column to a 3rd collation (accent-insensitive version of the LatinDb database collation.) The result is that underlying index is not used, causing a scan of the table. The other interesting thing to observe is the collation conversion of the string literal as well, which allows this comparison to happen.

Query 7: COLLATE specified with same collation

use greekdb
select *
from greekdb.dbo.greektable
where greekstr collate greek_ci_as = ‘Hello world’

Here is the execution plan:

  |–Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr]=[@1]) ORDERED FORWARD)

Observations: This is the ‘best case’ all over again. The explicit COLLATE clause is effectively ignored by the engine. The index is used.

Query 8: Data type mismatch in case insensitive collation database

use greekdb
select *
from greekdb.dbo.greektable
where greekstr = N’Hello world’

Here is the execution plan:

  |–Nested Loops(Inner Join, OUTER REFERENCES:([Expr1006], [Expr1007], [Expr1005]))
       |–Compute Scalar(DEFINE:(([Expr1006],[Expr1007],[Expr1005])=GetRangeThroughConvert([@1],[@1],(62))))
       |    |–Constant Scan
       |–Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr] > [Expr1006] AND [greekdb].[dbo].[greektable].[greekstr] < [Expr1007]),  WHERE:(CONVERT_IMPLICIT(nvarchar(50),[greekdb].[dbo].[greektable].[greekstr],0)=[@1]) ORDERED FORWARD)

Observations: This case is more complex. It involves specifying a string literal of type NVARCHAR, which is higher in precedence compared to our column data type of VARCHAR. The most important thing here is that the column ‘greekstr’ has to be implicitly converted to NVARCHAR(50). Also of note is that one row is returned by the SELECT query. The other thing of great interest is the usage of the GetRangeThroughConvert() function. Some details about this internal function are here and here, essentially you want to think of this as an optimization applied when SQL has to deal with an implicit conversion in the predicate, and enables the index to be ‘seeked’ (though there is a lot of work done before that seek can happen.)

Query 9: Data type mismatch in case-sensitive collation database

use latindb
select *
from latindb.dbo.latintable
where latinstr = N’Hello world’

Here is the execution plan:

|–Nested Loops(Inner Join, OUTER REFERENCES:([Expr1006], [Expr1007], [Expr1005]))
     |–Compute Scalar(DEFINE:(([Expr1006],[Expr1007],[Expr1005])=GetRangeThroughConvert([@1],[@1],(62))))
     |    |–Constant Scan
     |–Index Seek(OBJECT:([latindb].[dbo].[latintable].[nc_latin]), SEEK:([latindb].[dbo].[latintable].[latinstr] > [Expr1006] AND [latindb].[dbo].[latintable].[latinstr] < [Expr1007]),  WHERE:(CONVERT_IMPLICIT(nvarchar(50),[latindb].[dbo].[latintable].[latinstr],0)=[@1]) ORDERED FORWARD)

Observations: On the face of it, this query is very similar to the previous query. The main difference is that we are operating on a case-sensitive database. No rows are returned. Similar to the previous case, you see the effect of the GetRangeThroughConvert() internal function to enable the seek despite the data type mismatch.

Query 10: Case-insensitive collation specified

use latindb
select *
from latindb.dbo.latintable
where latinstr collate latin1_general_ci_as = N’Hello world’

Here is the execution plan:

  |–Index Scan(OBJECT:([latindb].[dbo].[latintable].[nc_latin]),  WHERE:(CONVERT_IMPLICIT(nvarchar(50),CONVERT(varchar(50),[latindb].[dbo].[latintable].[latinstr],0),0)=CONVERT_IMPLICIT(nvarchar(4000),[@1],0)))

Observations: This query is very similar to the previous ones, except that it has an explict COLLATE applied to the column. The query does return one row. You can see that not only is the latinstr column being converted (explicit CONVERT) to the same collation, it is later being implicitly converted to NVARCHAR. The other interesting aspect is that the string literal is also being converted to the same collation, which is necessary for a correct result.

Query 11: Effect of database collation on variable declaration

use greekdb — now this becomes the database whose collation will be used for all variables
declare @mygreekvar varchar(50)

use latindb
declare @mylatinvar varchar(50)  — you would think this is latin1_general_cs_as. Let’s see later.

set @mygreekvar = ‘hello world’

set @mylatinvar = ‘hello world’

if (@mygreekvar = @mylatinvar)
    select ‘they are the same’
else
    select ‘different’

use greekdb
select *
from greektable
where greekstr  = @mygreekvar

use latindb
select *
from latindb.dbo.latintable
where latinstr = @mylatinvar

Execution plan for the first SELECT:

  |–Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr]=[@mygreekvar]) ORDERED FORWARD)

Execution plan for the second SELECT:

  |–Table Scan(OBJECT:([latindb].[dbo].[latintable]), WHERE:([latindb].[dbo].[latintable].[latinstr]=CONVERT_IMPLICIT(varchar(50),[@mylatinvar],0)))

Observations: Hopefully by now, you should have figured this one out. The collation applied to variables is actually that of the last USE database prior to the first DECLARE. Read more about that behavior here (opens up to Microsoft Connect.) So the @mylatinvar variable has to be converted to suit the latintable collation.

The key point I wish to call out here is that the usage of the COLLATE clause is potentially (see point a) below for why) going to cause a conversion of the column. The effect will depend on two things:

a. Whether the target collation is the same as the column collation or not. If the target is the same then there is no scalar conversion required.

AND

b. Whether this COLLATE clause is present in the predicate (such as WHERE clause or JOIN predicate). If it is in the predicate then it can cause a scan (if condition a) above has also been satisfied.)

c. ALSO

d. If the COLLATE is present in the output list (not in the predicate) and if condition a) has been satisfied then it will cause a scalar conversion operator to be introduced into the plan. Depending on how many such conversions (rows * columns) it may cause additional CPU overhead.

To summarize

PHEW! This is a rather long post, but exposes some important things:

  1. Collation mismatches can be dangerous – worst case, they can cause data loss or errors to occur (if coercion is disallowed) or in many cases can (at the very least) cause performance problems.
  2. The usage of the COLLATE clause can introduce a potential performance penalty. If this clause is applied on top of a table column inside a predicate, it may mean that any indexes might not be used (or we may have to do a lot of work using the GetRangeThroughConvert() function.)
  3. Lastly, if you are using cross-database references in a batch, beware of the collation assigned to each local variable – they inherit the same database collation of the last USE database prior to the first DECLARE.

Thank you for reading. Questions, comments are most welcome!

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.