varchar vs nvarchar

Question

varchar vs nvarchar

SQL Guy10 1

as a general rule, should you use varchar instead of nvarchar, if you won't ever be storing foreign language characters?

CathyJi-MSFT 22,396 Reputation points Microsoft External Staff

2021-01-25T02:08:28.53+00:00

Hi @SQL Guy10 ,

We have not received a response from you. Did the replies could help you? If the response helped, do "Accept Answer". By doing so, it will benefit all community members who are having this similar issue. Your contribution is highly appreciated.

Best regards,
Cathy

4 answers

Your answer

CathyJi-MSFT 22,396 Reputation points Microsoft External Staff

2021-01-25T02:08:28.53+00:00

Hi @SQL Guy10 ,

We have not received a response from you. Did the replies could help you? If the response helped, do "Accept Answer". By doing so, it will benefit all community members who are having this similar issue. Your contribution is highly appreciated.

Best regards,
Cathy

Answer 1

CathyJi-MSFT 22,396 Microsoft External Staff

Hi @SQL Guy10 ,

If we know that data to be stored in the column or variable doesn’t have any Unicode characters, we can use varchar. But some experts recommends nvarchar always because: since all modern operating systems and development platforms use Unicode internally, using nvarchar rather than varchar, will avoid encoding conversions every time you read from or write to the database.

The major difference between varchar vs nvarchar

Nvarchar stores UNICODE data. If you have requirements to store UNICODE or multilingual data, nvarchar is the choice. Varchar stores ASCII data and should be your data type of choice for normal use.
For nvarchar, when using characters defined in the Unicode range 0-65,535, one character can be stored per each byte-pair, however, in higher Unicode ranges (65,536-1,114,111) one character may use two byte-pairs. Whereas varchar only uses 1 byte.

Advantages and Disadvantages of varchar and nvarchar in SQL Server.

By the way, starting with SQL Server 2019 (15.x), consider using a UTF-8 enabled collation to support Unicode and minimize character conversion issues.

Please refer to below links to get more information.

Difference Between Sql Server VARCHAR and NVARCHAR Data Type
What is the difference between varchar and nvarchar?
SQL Server differences of char, nchar, varchar and nvarchar data types

Best regards,
Cathy

If the response is helpful, please click "Accept Answer" and upvote it, thank you.
Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

tibor_karaszi@hotmail.com 4,316 Reputation points

2021-01-19T07:58:58.82+00:00

"By the way, starting with SQL Server 2019 (15.x), consider using a UTF-8 enabled collation to support Unicode and minimize character conversion issues."

I would not go this route, unless I have some very special circumstances. I strongly encourage anyone considering using UTF-8 to first read this: https://www.sqlservercentral.com/blogs/native-utf-8-support-in-sql-server-2019-savior-or-false-prophet
Solomon Rutzky 31 Reputation points

2021-01-20T06:35:52.24+00:00

Hello @tibor_karaszi@hotmail.com (and @CathyJi-MSFT ). Thanks for referring to my post 😺. Unfortunately, the SQL Server Central version of it is woefully outdated as their blog mirroring no longer refreshes when the original is updated. For a lot of posts that doesn't matter so much. But for this one there have been many updates as I modified it to keep up with their changes across the CTPs through RTM, etc. Please review the source post which might have a slightly different recommendation as compared to the older SQL Server Central version: Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?

They ended up improving it a bit so it's not so bad now, but still best in a couple of scenarios. Please see my answer below for details.
tibor_karaszi@hotmail.com 4,316 Reputation points

2021-01-20T10:39:37.22+00:00

Thanks @SolomonRutzky-6202 . I didn't know you have two versions of the article out there. I do point to your article on a couple of my classes, and that is indeed your QuantumLeap link. For above post, I googled to find your blog quickly, and missed that it wasn't the same as I point to in my classes. Good to know there are two out there (and which to point to).
Solomon Rutzky 31 Reputation points

2021-02-15T23:07:40.743+00:00

Hello @tibor_karaszi@hotmail.com . Thank you very much for sharing that. I'm honoured that you use my content in some of your classes. Of course, now I feel bad that I have updates for that article (and a couple of others) that I've been needing to post for almost a year. But, 2020 / COVID being what it was/is, well, let's just say that it's not been the most productive year. Still, I'll try to get those updates posted soon.

Answer 2

Martin Cairney 2,261

If you never need to make use of the extended characters available in UNICODE then yes, it does make sense to only use VARCHAR.

However, if you are only using characters in the UNICODE range 0-65535 then they use the same storage space as VARCHAR.

See the discussion about NCHAR and NVARCHAR

Remember also that if you use UNICODE then you MUST always prefix strings with "N" otherwise you will introduce an implicit conversion and also affect the SARGability of your predicates.

tibor_karaszi@hotmail.com 4,316 Reputation points

2021-01-19T08:05:21.777+00:00
Martin,

I'm confused by your below statement:

"However, if you are only using characters in the UNICODE range 0-65535 then they use the same storage space as VARCHAR."

Above is, AFAIK, true if you use row compression and the data isn't off-row. But without row compression, nvarchar uses double the length compared to varchar:

USE tempdb /* DROP TABLE IF EXISTS t4 DROP TABLE IF EXISTS t5 */ CREATE TABLE t4(c1 int IDENTITY, c2 varchar(40)) CREATE TABLE t5(c1 int IDENTITY, c2 nvarchar(40)) INSERT INTO t4 SELECT TOP(50000) REPLICATE('a', 40) FROM sys.columns AS a CROSS JOIN sys.columns AS b INSERT INTO t5 SELECT TOP(50000) REPLICATE('a', 40) FROM sys.columns AS a CROSS JOIN sys.columns AS b EXEC sp_spaceused 't4' EXEC sp_spaceused 't5'

Did I misunderstand you?
Martin Cairney 2,261 Reputation points

2021-01-19T21:55:49.277+00:00

Hi @tibor_karaszi@hotmail.com - see the link to the Docs page in my post in the Remarks section

"when using characters defined in the Unicode range 0-65,535, one character can be stored per each byte-pair. However, in higher Unicode ranges (65,536-1,114,111) one character may use two byte-pairs."

There isn't mention of compression there, but after reading it again, it may also depend on the collation that you have set. I haven't experimented to see what differences there are.

On the test however, I would use a variables defined as VARCHAR(x) and NVARCHAR(x) and compare the length of that for the same string. I thik for the page storage you would need to dig into DBCC PAGE and see exactly what has been written to validate that test.
Erland Sommarskog 121.7K Reputation points MVP Volunteer Moderator

2021-01-19T23:15:01.167+00:00

"when using characters defined in the Unicode range 0-65,535, one character can be stored per each byte-pair.

A byte pair is two bytes per character. In varchar, each character in the ASCII range takes up one byte each. For characters beyond the ASCII range, it depends on the code page for the collation. But for collations for languages using Latin, Cyrillic or Greek script but based on UTF-8, that is one byte for every character.

It is different if you use UTF-8, or a collation for languages like Thai or Chinese.
Solomon Rutzky 31 Reputation points

2021-01-20T06:21:52.833+00:00

"characters in the UNICODE range 0-65535...use the same storage space as VARCHAR" is incorrect. NVARCHAR uses 2 bytes per BMP character (0-65535) and 4 bytes per Supplementary Character (65536-1114111). For details please see my post: How Many Bytes Per Character in SQL Server.

Using the N prefix is great advice, but it's mainly to avoid data loss (when Unicode chars not available in the code page associated with the current database's default collation get converted to either ? or a "best fit" mapping of something similar). The performance issue is specific to a VARCHAR column using a SQL Server collation compared to an NVARCHAR value. For details please see my post: Impact on Indexes When Mixing VARCHAR and NVARCHAR Types.
tibor_karaszi@hotmail.com 4,316 Reputation points

2021-01-20T10:35:04.83+00:00
Sorry if I state the obvious (after Erland's post), but the key word you probably overlooked is pair. I.e. a byte pair is indeed two bytes. :-)

As for a T-SQL example not using a table, here goes:

SELECT DATALENGTH('Microsoft SQL Server '), DATALENGTH(N'Microsoft SQL Server ');

Result

----------- ----------- 25 50
tibor_karaszi@hotmail.com 4,316 Reputation points

2021-01-20T10:43:22.087+00:00

Hi again Solomon. Did you also miss the word "pair" in that MS statement? I.e., "Unicode range 0-65,535, one character can be stored per each byte-pair. So unless I am out-of-the-ordinary confused, you seem to say the same thing as the documentation: that range uses two bytes.

The fact that both you and Martin (seem to have) missed that "pair" does indeed reflect that there's room for improvement in how MS phrases things, though!
Solomon Rutzky 31 Reputation points

2021-01-20T23:39:20.117+00:00

Hello. No, I definitely did not miss that word ;-). I was actually responding to Martin, quoting part of this answer. However, my comment was long and I had to cut parts out to fit within the limit (which is not the 1000 that it claims to be, but closer to 985). I updated my comment earlier today to include more of the quote so it would be clearer what I was responding to. Yes, I'm saying the same thing as the documentation, except I prefer to use number of bytes for everything as it's not only clearer, but is consistent when speaking of the other options.

And yes, there are several areas of the documentation that need improvement, but only so much time in a day, and even less these days :-(

Answer 3

Erland Sommarskog 121.7K MVP Volunteer Moderator

as a general rule, should you use varchar instead of nvarchar, if you won't ever be storing foreign language characters?

The problem is: can you really be sure. All of a sudden your business expands and the need arises.

If you start with varchar and need to change later - it will be painful. UTF-8 collations can save you, but not all the way since you need review column lengths.

Answer 4

General

It depends mainly on:

What do you consider "foreign language characters"?
How guaranteed are you that this requirement won't change?

Regarding "foreign language characters", will something like emojis every be an option? Meaning, are you thinking in terms of "no characters outside of the local alphabet and punctuation", or "alphabet and punctuation are local, but emojis might get thrown in"?

For me, I would say that "codes" usually have a stable range of characters and can safely be VARCHAR. For example, postal codes, airline confirmation codes, credit card authorization/confirmation codes, ISO country codes, etc.

But names, whether for people or places, or even products, etc, can increasingly contain "foreign language characters". Even if it's reasonable to expect local street names to contain only local language characters, if you are dealing with a list of customers and the business isn't 100% local, then you can't guarantee only local language characters. So for names, etc it is better to go with NVARCHAR.

UTF-8

UTF-8, introduced in SQL Server 2019, is a tricky option as it helps some scenarios but hurts others. In general, it makes sense to use under one of the following conditions:

You are dealing with app code that submits SQL using VARCHAR and/or string literals without the upper-case N prefix (whether it is a legacy app or an app that works with one or more non-SQL Server RDBMSs that allow for Unicode in VARCHAR and string literals missing the upper-case N prefix).
Your data is primarily US English characters (i.e. standard ASCII / code points 1 - 127) but you either have a small amount of data that isn't standard ASCII or there's at least potential to get non-standard ASCII data, and a significant portion of the data is in NVARCHAR(MAX) columns (i.e. can't benefit from Unicode Compression that is used in Row Compression but doesn't work on NVARCHAR(MAX) ).

For full details, please see my post: Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?

Bytes per Character

Now, how many bytes are used per character is also tricky as it depends on the datatype and collation. In general:

VARCHAR can be either 1 or 2, though when using a UTF-8 collation it can be anywhere from 1 to 4.
NVARCHAR can be either 2 or 4.

For full details, please see my post: How Many Bytes Per Character in SQL Server: a Completely Complete Guide

tibor_karaszi@hotmail.com 4,316 Reputation points

2021-01-20T10:49:57.343+00:00

Third comment - last one for now :-) For your point 2 above, it seems you are simplifying things and as a blanket statement perhaps saying that nvarchar with row compression is preferred? I.e., without qualifying all your points in the recommendation section of your article...
Solomon Rutzky 31 Reputation points

2021-02-16T06:21:50.047+00:00

Hi @tibor_karaszi@hotmail.com . I was missing a qualification that I just added to my answer (for point #2): the real issue is NVARCHAR(MAX) since that datatype can't make use of Unicode Compression. And in fact, it only benefits from Page Compression when the values are stored in-row. It would be wonderful if Unicode Compression worked in off-row NVARCHAR data (i.e. where one would have the most data that would benefit the most from compression, as opposed to short values in NVARCHAR(1 - 4000) columns), so please vote for: unicode compression nvarchar(max).
tibor_karaszi@hotmail.com 4,316 Reputation points

2021-02-16T09:34:59.637+00:00

Hi @Solomon Rutzky . Voted! :-)

Share via

varchar vs nvarchar

4 answers

Your answer