Unexpected truncation error when inserting from UTF8 to non-UTF8 collation

Question

Ben 30

I believe I am hitting a bug when moving data from a UTF8 to a non-UTF8 collation in SQL Server 2022 CU13 (running under Linux, though I am assuming for now that this is unrelated).

The issue is that MSSQL appears to determine the width of a string for insertion into a column using its source (UTF8) size, not its destination size. Yes, I understand string sizes are now in bytes, not characters, and that's the issue: the size of the string appears to be calculated based on the encoding in the source collation (UTF8), and an error is then generated because the destination column can't hold that many bytes. However, once converted to the destination column's collation (i.e. its associated character-set encoding), the string requires fewer bytes and would fit, if not for the false error blocking it...

Note that starting with the same data in nvarchar and inserting it into the same destination column works fine. Also it can be verified that the data fits in the destination column by inserting it into a wider column with the same collation as the original destination and then measuring its size after insertion.

Below is SQL to reproduce. You'll see it's mostly variations on the same thing up until the last steps, where it generates the error when trying to insert a single GBP symbol (U+00A3) into a varchar(1) COLLATE [Windows-1252] column. While this symbol requires 2 bytes in UTF-8, it only requires 1 byte in the Windows-1252 code page, so it should fit in the 1-byte destination column, yet the insert errors if the source of that character is a UTF-8 encoded column.

SET NOCOUNT ON;

DROP TABLE IF EXISTS A1252;
DROP TABLE IF EXISTS AUCS2;
DROP TABLE IF EXISTS AUTF8;
DROP TABLE IF EXISTS A1252b;
GO

CREATE TABLE A1252 (s varchar(1) COLLATE SQL_Latin1_General_CP1_CI_AS);
CREATE TABLE AUCS2 (s nvarchar(1) COLLATE Latin1_General_100_BIN2_UTF8);
CREATE TABLE AUTF8 (s varchar(2) COLLATE Latin1_General_100_BIN2_UTF8);
CREATE TABLE A1252b (s varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS);

INSERT INTO A1252
SELECT NCHAR(0x0024) as s; --USD symbol

INSERT INTO AUCS2
SELECT NCHAR(0x0024) as s; --USD symbol

INSERT INTO AUTF8
SELECT NCHAR(0x0024) as s; --USD symbol

--This will show a USD symbol is 1 character, 1 byte long in Windows-1252
SELECT s AS s_asWin1252
,LEN(s) LenOf_s_asWin1252
,DATALENGTH(s) as OctetLenOf_s_asWin1252
FROM A1252;

--This will show a USD symbol is 1 character, 2 bytes long in UCS-2
SELECT s AS s_asUCS2
,LEN(s) LenOf_s_asUCS2
,DATALENGTH(s) as OctetLenOf_s_asUCS2
FROM AUCS2;

--This will show a USD symbol is 1 character, 1 byte long in UTF-8
SELECT s AS s_asUTF8
,LEN(s) LenOf_s_asUTF8
,DATALENGTH(s) as OctetLenOf_s_asUTF8
FROM AUTF8;


--All good up to here, now let's try moving the data around

TRUNCATE TABLE A1252;

INSERT INTO A1252
SELECT s
FROM AUCS2;


TRUNCATE TABLE A1252;

INSERT INTO A1252
SELECT s
FROM AUTF8;


TRUNCATE TABLE AUCS2;

INSERT INTO AUCS2
SELECT s
FROM AUTF8;


TRUNCATE TABLE AUTF8;

INSERT INTO AUTF8
SELECT s
FROM AUCS2;


TRUNCATE TABLE A1252;

INSERT INTO A1252
SELECT s
FROM AUTF8;


--All good to here and confirm character and byte lengths as expected:
SELECT s AS s_asWin1252
,LEN(s) LenOf_s_asWin1252
,DATALENGTH(s) as OctetLenOf_s_asWin1252
FROM A1252;

SELECT s AS s_asUCS2
,LEN(s) LenOf_s_asUCS2
,DATALENGTH(s) as OctetLenOf_s_asUCS2
FROM AUCS2;

SELECT s AS s_asUTF8
,LEN(s) LenOf_s_asUTF8
,DATALENGTH(s) as OctetLenOf_s_asUTF8
FROM AUTF8;


--All good up to here -- now let's try the GBP symbol
INSERT INTO A1252
SELECT NCHAR(0x00A3) as s; --GBP symbol

INSERT INTO AUCS2
SELECT NCHAR(0x00A3) as s; --GBP symbol

INSERT INTO AUTF8
SELECT NCHAR(0x00A3) as s; --GBP symbol

--This will show a row for each USD and GBP symbols, each 1 character, 1 byte long in Windows-1252
SELECT s AS s_asWin1252
,LEN(s) LenOf_s_asWin1252
,DATALENGTH(s) as OctetLenOf_s_asWin1252
FROM A1252;

--This will show a row for each USD and GBP symbols, each 1 character, 2 bytes long in UCS-2
SELECT s AS s_asUCS2
,LEN(s) LenOf_s_asUCS2
,DATALENGTH(s) as OctetLenOf_s_asUCS2
FROM AUCS2;

--This will show a row USD symbol, 1 character and 1 byte long in UTF-8 and a GBP symbol, 1 character and 2 bytes long in UTF-8
SELECT s AS s_asUTF8
,LEN(s) LenOf_s_asUTF8
,DATALENGTH(s) as OctetLenOf_s_asUTF8
FROM AUTF8;


--All good up to here, now let's try moving the data around again
TRUNCATE TABLE A1252;

INSERT INTO A1252
SELECT s
FROM AUCS2;


TRUNCATE TABLE AUTF8;

INSERT INTO AUTF8
SELECT s
FROM AUCS2;


TRUNCATE TABLE AUCS2;

INSERT INTO AUCS2
SELECT s
FROM AUTF8;


--All good to here and confirm character and byte lengths as expected:
SELECT s AS s_asWin1252
,LEN(s) LenOf_s_asWin1252
,DATALENGTH(s) as OctetLenOf_s_asWin1252
FROM A1252;

SELECT s AS s_asUCS2
,LEN(s) LenOf_s_asUCS2
,DATALENGTH(s) as OctetLenOf_s_asUCS2
FROM AUCS2;

SELECT s AS s_asUTF8
,LEN(s) LenOf_s_asUTF8
,DATALENGTH(s) as OctetLenOf_s_asUTF8
FROM AUTF8;


--Though it wasn't needed, this will of course work with the wider maximum column width:
INSERT INTO A1252b
SELECT s
FROM AUTF8;

--Confirming extra column width not actually needed:
SELECT s AS s_asWin1252
,LEN(s) LenOf_s_asWin1252
,DATALENGTH(s) as OctetLenOf_s_Win1252
FROM A1252b;

--Emptying A1252 to avoid ambiguity
TRUNCATE TABLE A1252;

--Confirming one last time that data in AUTF8 only requires 1 character/byte as Win1252
SELECT s AS s_asUTF8,s COLLATE SQL_Latin1_General_CP1_CI_AS as s_asWin1252
,LEN(s) LenOf_s_asUTF8,LEN(s COLLATE SQL_Latin1_General_CP1_CI_AS) as LenOf_s_asWin1252
,DATALENGTH(s) as OctetLenOf_s_asUTF8,DATALENGTH(s COLLATE SQL_Latin1_General_CP1_CI_AS) as OctetLenOf_s_asWin1252
FROM AUTF8;

PRINT 'The next SQL statement will fail claiming truncation but it should have worked as well as when the data came from AUCS2'
SELECT TOP(0) NULL AS 'The next SQL statement will fail claiming truncation but it should have worked as well as when the data came from AUCS2'
--This is failing but it should work the same as it did when coming from AUCS2
INSERT INTO A1252
SELECT s
FROM AUTF8;

SELECT TOP(0) NULL AS 'Contents of AUTF8 that are supposed to be in A1252'
SELECT s AS s_asWin1252
,LEN(s) LenOf_s_asWin1252
,DATALENGTH(s) as OctetLenOf_s_Win1252
FROM AUTF8;


SELECT TOP(0) NULL AS 'Contents of A1252 that should have been the same as AUTF8';
SELECT s AS s_asWin1252
,LEN(s) LenOf_s_asWin1252
,DATALENGTH(s) as OctetLenOf_s_Win1252
FROM A1252;
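
As a cross-check outside SQL Server, the byte counts the script expects can be reproduced with Python's standard codecs. This is only a minimal sketch under assumptions: `cp1252` stands in for the code page behind SQL_Latin1_General_CP1_CI_AS, `utf-16-le` for nvarchar storage, and the helper name `byte_lengths` is made up for this illustration.

```python
# Encodings used as stand-ins for the three column types in the repro script.
ENCODINGS = ("cp1252", "utf-16-le", "utf-8")


def byte_lengths(text):
    """Return {encoding: number of bytes needed to store text}."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


for name, ch in (("USD", "\u0024"), ("GBP", "\u00a3")):
    print(name, byte_lengths(ch))
```

The GBP symbol needs 2 bytes only in UTF-8; in the Windows-1252 code page it is a single byte, which is why a varchar(1) destination should be able to hold it.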

Viorel 125.7K Reputation points

2024-06-11T04:45:14.7833333+00:00

To work around the issue, it is possible to use COLLATE during insertion:

drop table if exists A1252
drop table if exists AUTF8
GO

create table A1252 (s varchar(1) collate SQL_Latin1_General_CP1_CI_AS)
create table AUTF8 (s varchar(2) collate Latin1_General_100_BIN2_UTF8)

insert into A1252 values (N'£')
insert into AUTF8 values (N'£')

select * from A1252 -- outputs "£"
select * from AUTF8 -- outputs "£"

/* Error: "String or binary data would be truncated in table 'A1252'"
insert into A1252
select s from AUTF8
*/

-- Workaround:
insert into A1252
select s collate SQL_Latin1_General_CP1_CI_AS from AUTF8

select * from A1252 -- outputs two "£"

Ben 30 Reputation points

2024-06-11T14:44:02.1333333+00:00

Thanks Viorel, yes that also works. I just wanted to report it as a bug.

Also, that workaround requires modifying every similar insert. And because some of the data I receive comes with lots of columns before I can normalize it, and they are a mix of string, numeric and date types (meaning I can't just add a COLLATE to every column but have to first check the type of each), that's a lot of query tweaking. I have code to generate some of these queries, so I can tweak that code to generate workaround code, but I think the underlying bug should be fixed. Also, it means that autogenerated code needs to know the destination database's default column collation, which may not be known at the time of code generation.

I get how MSSQL got into the situation where *char(x) is in bytes not characters unlike every other database but it seems even MS's programming team gets stuck dealing with the permutations. Plus add that columns default to the collation of the destination database (even if the other database is tempdb via #/##) rather than the default collation of the current context/database and it's getting very complicated...
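
To give a rough idea of the generator tweak described above, here is a hypothetical sketch in Python. The function `build_select_list`, the shape of the column metadata, and the sample columns are all assumptions for illustration; a real generator would read the types from the source schema. It adds COLLATE only to character-typed columns and leaves numeric and date columns untouched.

```python
# Type names that accept a COLLATE clause in T-SQL.
STRING_TYPES = {"char", "varchar", "nchar", "nvarchar", "text", "ntext"}


def build_select_list(columns, collation):
    """Build a SELECT list, adding COLLATE only to string columns.

    columns: iterable of (column_name, sql_type) pairs (hypothetical metadata).
    collation: collation name of the destination columns.
    """
    parts = []
    for name, sql_type in columns:
        # Strip any length/precision suffix, e.g. "varchar(2)" -> "varchar".
        base_type = sql_type.split("(")[0].strip().lower()
        if base_type in STRING_TYPES:
            parts.append(f"{name} COLLATE {collation}")
        else:
            parts.append(name)
    return ", ".join(parts)


# Hypothetical source table with mixed column types.
columns = [("id", "int"), ("s", "varchar(2)"), ("created", "datetime2")]
print("SELECT " + build_select_list(columns, "SQL_Latin1_General_CP1_CI_AS"))
```

The destination collation still has to be passed in, which is exactly the piece of information that may not be known at generation time.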

Answer accepted by question author

Answer 1

Erland Sommarskog 128.9K MVP Volunteer Moderator

I concur with Viorel that COLLATE is the solution.

As for this being a bug, I am not sure that Microsoft will agree. But you can report the issue here: https://feedback.azure.com/d365community/forum/04fe6ee0-3b25-ec11-b6e6-000d3a4f0da0

You should include your business reason for this change, because that is something that Microsoft gives significance to.

Ben 30 Reputation points

2024-06-11T22:01:46.5566667+00:00

Thanks Erland, I am curious why you don't think this is a bug? The data fits in the target column and it works fine if the source column is either nvarchar or already in the destination column's collation. It appears that SQL is calculating the length of the column in bytes before the implicit cast/collation change rather than after.

That approach would seem the same as failing the insertion of an nvarchar(6) string into a varchar(10) column because the source string is 12 bytes, or failing the insertion of a "1" into a varchar(1) column because the source was an int. It seems to me the determination of whether a string will fit in a column should always be based on the string post-implicit conversion to the destination column's type and collation. And in all the examples I can think of it does work that way.

Agreed, Viorel's solution will work around the issue. My interpretation is that it forces the conversion before the length calculation/destination limit check, sidestepping the issue. But what is the upside to the current behavior when there's no explicit collation conversion? It doesn't seem consistent with how any other implicit conversions work.

Erland Sommarskog 128.9K Reputation points MVP Volunteer Moderator

2024-06-12T21:20:52.3233333+00:00
I didn't say that I don't think this is a bug, I only questioned that Microsoft would agree. :-)

I agree that it makes sense to perform the collation conversion before checking the length. But I suspect that someone once made the design decision to do things in the current order, maybe thinking that it didn't matter.

And before the introduction of UTF-8 collations, the situation was probably a bit too much of an edge case. However, the situation can arise also without UTF-8. Here is an example:

CREATE TABLE Greek (a char(3) COLLATE Greek_CI_AS NOT NULL)
CREATE TABLE Japanese (a varchar(10) COLLATE Japanese_CI_AS NOT NULL)

INSERT Japanese (a) VALUES (N'αβγ')
SELECT a, len(a), datalength(a) FROM Japanese

INSERT Greek(a) VALUES (N'αβγ')
SELECT a, len(a), datalength(a) FROM Greek

INSERT Greek(a)
   SELECT a FROM Japanese
go
DROP TABLE Greek, Japanese

The code page for Japanese_CI_AS includes the Greek characters, but they are encoded in double-bytes.
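
The double-byte point can be checked outside SQL Server with Python's codecs. This is a sketch that assumes `cp932` and `cp1253` are the code pages behind Japanese_CI_AS and Greek_CI_AS respectively; the helper name `encoded_size` is made up for the illustration.

```python
def encoded_size(text, code_page):
    """Number of bytes text occupies in the given code page."""
    return len(text.encode(code_page))


text = "αβγ"
print("Greek code page 1253:  ", encoded_size(text, "cp1253"), "bytes")
print("Japanese code page 932:", encoded_size(text, "cp932"), "bytes")
```

The same three characters take 3 bytes in the Greek code page but 6 in the Japanese one, so a length check made against the Japanese source bytes would not fit a char(3) column even though the converted value would.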

By all means submit a bug. I just wanted to lower your expectations.

If this would be a blocking issue for you, you would need to open a support case.
Ben 30 Reputation points

2024-06-13T16:04:29.62+00:00
Ah, thanks, now I understand, and I appreciate the clarification. I didn't know about the above and thought this was just them still fleshing out UTF8 support which, I appreciate, along with UTF16 w/SC, rewrites the concept of *char (I am not even sure a fixed-length char with a variable-length encoding has any useful meaning...).

In any case, thinking through the implications of having the length check before the conversion, I realized the reverse direction might even be worse as that means some truncation errors wouldn't be caught even if they would be normally. Maybe the below will get their attention?

Basically SQL reports success upon inserting the Euro symbol (U+20AC) into a 2-byte UTF8 column even though that symbol requires 3 bytes. It can't do that of course and the result is silently truncated to an empty string. Doing the equivalent either directly (i.e. from NCHAR) or from an nvarchar raises the appropriate truncation error.

SET NOCOUNT ON;

DROP TABLE IF EXISTS A1252;
DROP TABLE IF EXISTS AUCS2;
DROP TABLE IF EXISTS AUTF8;

CREATE TABLE A1252 (s varchar(1) COLLATE SQL_Latin1_General_CP1_CI_AS);
CREATE TABLE AUCS2 (s nvarchar(1) COLLATE Latin1_General_100_BIN2_UTF8);
CREATE TABLE AUTF8 (s varchar(2) COLLATE Latin1_General_100_BIN2_UTF8);

INSERT INTO A1252
SELECT NCHAR(0x20AC) as s;

SELECT s AS s_asWin1252
,LEN(s) LenOf_s_asWin1252
,DATALENGTH(s) as OctetLenOf_s_asWin1252
FROM A1252;

INSERT INTO AUTF8
SELECT s
FROM A1252;

SELECT s AS s_asUTF8
,LEN(s) LenOf_s_asUTF8
,DATALENGTH(s) as OctetLenOf_s_asUTF8
FROM AUTF8;

SELECT Cast (s as varbinary) as s_bin, Cast('' as varbinary) as emptystr_bin
FROM AUTF8;

TRUNCATE TABLE AUTF8;

INSERT INTO AUCS2
SELECT s
FROM A1252;

SELECT s AS s_asUCS2
,LEN(s) LenOf_s_asUCS2
,DATALENGTH(s) as OctetLenOf_s_asUCS2
FROM AUCS2;

INSERT INTO AUTF8
SELECT s
FROM AUCS2;

INSERT INTO AUTF8
SELECT NCHAR(0x20AC) as s;

SELECT s
FROM AUTF8;

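
For what it's worth, the byte arithmetic behind this can be illustrated in Python. This is a sketch only and does not claim to reproduce SQL Server's internals: `cp1252` is assumed as the source code page, and cutting the UTF-8 bytes to the column size is just one plausible reading of how an empty string could result. The helper name `cut_to_capacity` is hypothetical.

```python
def cut_to_capacity(text, encoding, capacity):
    """Encode text, keep at most `capacity` bytes, and decode what survives.

    Incomplete multi-byte sequences left by the cut are dropped.
    """
    kept = text.encode(encoding)[:capacity]
    return kept.decode(encoding, errors="ignore")


euro = "\u20ac"
print("cp1252 bytes:", len(euro.encode("cp1252")))
print("UTF-8 bytes: ", len(euro.encode("utf-8")))
print("after cutting UTF-8 to 2 bytes:", repr(cut_to_capacity(euro, "utf-8", 2)))
```

The Euro symbol is 1 byte in the source code page but 3 bytes in UTF-8, and no complete character survives in a 2-byte column.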
Erland Sommarskog 128.9K Reputation points MVP Volunteer Moderator

2024-06-13T20:55:00.3033333+00:00

Thanks for the example in the other direction! That's not really the correct result, I think.

If you file a bug report, please share the link!
Ben 30 Reputation points

2024-06-14T19:45:22.6566667+00:00

Thanks that would be great:

https://feedback.azure.com/d365community/idea/1d00a590-ba29-ef11-8ee7-000d3ae688a1

So far it hasn't gained any questions/responses/feedback so I just upvoted my own posting over there...not sure that has any value...
Erland Sommarskog 128.9K Reputation points MVP Volunteer Moderator

2024-06-16T11:57:19.33+00:00

I've voted, but the more I have been thinking of it, the less likely I think it is that Microsoft will address this.

For the case where you are coming from UTF-8 there is a simple workaround.

For the case where you are going to UTF-8 and silent truncation, a change here is likely to be considered breaking. That is, there may be customers who rely on the current behaviour, and who would be very upset if they all of a sudden start to get an error. Microsoft could still opt to mend this, but it would be a change that is guarded by the compatibility level.

It may sound funny that they would ignore bug reports, but anyone who has worked with system development, knows that fixing one bug can easily introduce another one. And this is likely to require changes in very central parts of the query processing. Nothing you do lightly.
Ben 30 Reputation points

2024-06-17T14:31:28.0233333+00:00

I know what you mean and was already lowering my expectations by the time I posted that link back here. I was trying to think about anyone who would complain that they are no longer getting silent truncation of their data but I could also see why MS wouldn't want to fix this in a CU after it's been out in the field since 2019. Unfortunately I only recently started playing with UTF8 and fine-tuning my use of various types/collations.

It's kind of ironic but it's probably easier for them to fix a performance issue than bugs like this. Which itself can't be easy these days given the permutations of configurations and the likely number of code paths under the hood. Unfortunately all these issues are starting to make the platform feel very legacy.

Though I was able to retain some optimism for the platform after seeing the upcoming additions of ANSI || and regular expressions:

https://devblogs.microsoft.com/azure-sql/author/abhtiwar/

I appreciate tracking closer to standards and can't count how many times I wanted real regular expression support.
Erland Sommarskog 128.9K Reputation points MVP Volunteer Moderator

2024-06-17T20:56:27.6133333+00:00

"I was trying to think about anyone who would complain that they are no longer getting silent truncation of their data"

Oh, there are a lot of people out there that get very upset when code that "used to work", suddenly produces an error message.
Ben 30 Reputation points

2024-06-18T15:10:06.6466667+00:00

I could see that though if someone was silently losing data maybe they would prefer to know?

I did find these notes in the 2022 fix list, which suggest there is precedent for fixing bugs that erroneously generate error messages?

KB Article KB URL Description

5036432 https://learn.microsoft.com/troubleshoot/sql/releases/sqlserver-2022/cumulativeupdate13 Fixes an issue in which using the INSERT statement with the CAST or CONVERT function from a string representing negative zero to a decimal or numeric datatype succeeds, but you see the following error message on DBCC CHECKDB and DBCC CHECKTABLE: Msg 2570, Level 16, State 3, Line <LineNumber> Page (1:360), slot 0 in object ID <ObjectID>, index ID <IndexID>, partition ID <PartitionID>, alloc unit ID <UnitID> (type "In-row data"). Column "<ColumnName>" value is out of range for data type "decimal". Update column to a legal value.

5033663 https://learn.microsoft.com/troubleshoot/sql/releases/sqlserver-2022/cumulativeupdate12 Fixes an access violation that you encounter when a Plan Cache object type differs from what Cardinality Estimation (CE) feedback expects. In this scenario, CE feedback casts the object type into the expected one and tries to access to a field that doesn't exist in the object type.

Though since this change both makes things work that previously failed and flags things that should not have worked, it may be better handled like one of these fixes:

5022375 https://learn.microsoft.com/troubleshoot/sql/releases/sqlserver-2022/cumulativeupdate1 Improvement: Automatically enables the binary large object (BLOB) trace ring buffer feature when a BLOB assertion failure is detected. This improvement helps to better investigate such issues.

5022375 https://learn.microsoft.com/troubleshoot/sql/releases/sqlserver-2022/cumulativeupdate1 This update removes the requirement for the trace flag (TF) 809 for the hybrid buffer pool with direct write feature. After you apply this update, this feature is enabled by default in SQL Server 2022. This update introduces TF 898 to disable the Direct Write behavior of the hybrid buffer pool for troubleshooting or debugging purposes.

If they added a TF to fix for this version and then made that the default for the next, I wouldn't mind.

Actually at this point I would just be curious to hear MSFT's position on this...
Erland Sommarskog 128.9K Reputation points MVP Volunteer Moderator

2024-06-18T21:02:02.6166667+00:00
"I could see that though if someone was silently losing data maybe they would prefer to know?"

Not if the truncation brings a critical business process to a halt.

And if the truncation is a problem, they should have noticed earlier.

Also, in many cases, altering the column length is not an option (because it is a third-party system), so accepting the truncation is the only option. The question is only whether it should be silent, or explicit by means of convert.

Silent truncation occurs elsewhere in SQL Server. For instance:

DECLARE @short varchar(4)
SELECT @short = 'Too long'

Completes without errors. (Originally, before ANSI was a thing, SQL Server always employed silent truncation.)

5036432 https://learn.microsoft.com/troubleshoot/sql/releases/sqlserver-2022/cumulativeupdate13 Fixes an issue in which using the INSERT statement with the CAST or CONVERT function from a string representing negative zero to a decimal or numeric datatype succeeds, but you see the following error message on DBCC CHECKDB and DBCC CHECKTABLE:

Here the bad insert causes other issues down the line, so it calls for fixing. It is not clear to me that the resolution in this case is to raise an error on INSERT. They may only make sure that they are inserting a legal value.

Answer 2

Ali Varzeshi 80

The root cause of this problem lies in how SQL Server handles character encoding during data conversion between columns with different collations. Specifically, when moving data from a UTF-8 encoded column to a column with a different encoding, SQL Server uses the byte size of the characters in the source UTF-8 encoding to determine if the data will fit in the destination column. This leads to a truncation error because SQL Server does not take into account that the characters may require fewer bytes in the destination encoding, resulting in an incorrect assumption that the data exceeds the column's capacity. This mismatch in byte size calculation prevents the insertion even though the data would fit after conversion.

Ben 30 Reputation points

2024-06-14T19:59:37.46+00:00

Agreed, SQL Server is using the byte length of the string in its source collation (with character sets implicit in that) rather than its to-be byte length in the destination column. Putting aside that I think most SQL programmers/query writers would prefer to work in characters rather than bytes, I think the current behavior is a bug, since it causes things to not work that should and allows silent data truncations when it shouldn't.

In my mind MSSQL should test whether a string exceeds the destination column's capacity after it converts it to the destination type and collation. I don't think that would be a big change nor break anyone else's working code, but I am curious to hear MS's assessment. Given the long history of MSSQL and the many permutations of configuration options under the hood, there could be far more code paths that would have to be touched to fix this properly than I can imagine...
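
The difference between the two orderings can be modelled in a few lines of Python. This is a toy model, not SQL Server's actual code path: both function names are invented for the illustration, and `cp1252` / `utf-8` stand in for the two collations used in this thread.

```python
def fits_checking_source(text, src_encoding, dst_capacity):
    """Order the thread suspects: compare the SOURCE byte length to the column size."""
    return len(text.encode(src_encoding)) <= dst_capacity


def fits_checking_destination(text, dst_encoding, dst_capacity):
    """Order proposed above: convert first, then compare the DESTINATION byte length."""
    return len(text.encode(dst_encoding)) <= dst_capacity


# GBP symbol, UTF-8 source into a varchar(1) Windows-1252 column.
print(fits_checking_source("\u00a3", "utf-8", 1),
      fits_checking_destination("\u00a3", "cp1252", 1))

# Euro symbol, Windows-1252 source into a varchar(2) UTF-8 column.
print(fits_checking_source("\u20ac", "cp1252", 2),
      fits_checking_destination("\u20ac", "utf-8", 2))
```

In this model, checking the source length rejects the GBP insert that would fit and accepts the Euro insert that cannot fit, which matches both symptoms reported in the thread; checking after conversion gives the opposite, expected answers.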
