Many SQL 823, 824 and 3041 errors since VM migrations

Question

Mike Liening 6

Our server team migrated our SQL Server VMs (VMware) from:
HPE hosts using Fibre Channel through a Brocade fabric to our Pure FlashArray

to:
Cisco hosts connected directly to the same Pure FlashArray, but now using iSCSI.

Ever since the VM migrations last fall, we've had 400+ 823 and 824 errors (checksum errors) across 41 VMs, and approximately 250 log corruption errors detected during backups (error 3041). In the several years before this VM migration, we had none of these errors.

The 823 and 824 errors are nearly always associated with tempdb. One interesting finding is that, of the 95 occurrences of error 823, all but one report the same value for the expected checksum and the actual checksum, for example:

Error: 823, Severity: 24, State: 7. - SQL Server encountered: 'incorrect checksum (expected: 0xec15dd20; actual: 0xec15dd20)' resulting from an attempt to read the following: sort run page (4:33709), in file 'F:\DB_MP1\MSSQL12.MSSQLSERVER\MSSQL\DATA\tempdb3.ndf', in database with ID 2. Sort is retrying the read

Nearly all of the 330 occurrences of error 824 have different expected and actual checksum values:

Error: 824, Severity: 24, State: 7. - SQL Server encountered: 'incorrect checksum (expected: 0x72b58865; actual: 0x4149dc7b)' resulting from an attempt to read the following: sort run page (1:33008), in file 'E:\SQLData\MP3_TempDB\Data\tempdb.mdf', in database with ID 2. Sort is retrying the read.
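
For anyone wanting to tally the same-checksum versus different-checksum cases from an exported error log, here is a minimal Python sketch. The regular expression is an assumption based only on the two sample messages above, and `parse_checksum_error` is a hypothetical helper name, not part of any SQL Server tooling.

```python
import re

# Pattern inferred from the two sample messages in this post only;
# other message variants may need a different expression.
CHECKSUM_RE = re.compile(
    r"Error: (?P<error>\d+), Severity: \d+, State: \d+\."
    r".*?expected: (?P<expected>0x[0-9a-fA-F]+); actual: (?P<actual>0x[0-9a-fA-F]+)"
    r".*?page \((?P<file_id>\d+):(?P<page_id>\d+)\), in file '(?P<path>[^']+)'"
)


def parse_checksum_error(line):
    """Return the fields of one 823/824 checksum message, or None if no match."""
    m = CHECKSUM_RE.search(line)
    if m is None:
        return None
    expected = int(m.group("expected"), 16)
    actual = int(m.group("actual"), 16)
    return {
        "error": int(m.group("error")),
        "expected": expected,
        "actual": actual,
        # True for the odd case where both reported values are identical
        "same_checksum": expected == actual,
        "file_id": int(m.group("file_id")),
        "page_id": int(m.group("page_id")),
        "path": m.group("path"),
    }


# The two sample messages quoted above, copied verbatim.
SAMPLE_823 = (
    r"Error: 823, Severity: 24, State: 7. - SQL Server encountered: "
    r"'incorrect checksum (expected: 0xec15dd20; actual: 0xec15dd20)' "
    r"resulting from an attempt to read the following: sort run page (4:33709), "
    r"in file 'F:\DB_MP1\MSSQL12.MSSQLSERVER\MSSQL\DATA\tempdb3.ndf', "
    r"in database with ID 2. Sort is retrying the read"
)
SAMPLE_824 = (
    r"Error: 824, Severity: 24, State: 7. - SQL Server encountered: "
    r"'incorrect checksum (expected: 0x72b58865; actual: 0x4149dc7b)' "
    r"resulting from an attempt to read the following: sort run page (1:33008), "
    r"in file 'E:\SQLData\MP3_TempDB\Data\tempdb.mdf', "
    r"in database with ID 2. Sort is retrying the read."
)

for sample in (SAMPLE_823, SAMPLE_824):
    info = parse_checksum_error(sample)
    print(info["error"], info["same_checksum"], info["path"])
```

Grouping the parsed results by `error` and `same_checksum` would give the kind of breakdown described above.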

The 3041 errors, of which we've had 239 since the migration, all look similar to this (various databases, various VMs):

Error: 3041, Severity: 16, State: 1. - Backup detected log corruption in database cabinet. Context is Bad Middle Sector. LogFile: 2 'E:\CAB_Logs\Logs\cabinet_1.LDF' VLF SeqNo: x1108ab VLFBase: x5b8d9e000 LogBlockOffset: x60761d000 SectorStatus: 2 LogBlock.StartLsn.SeqNo: x1108ab LogBlock.StartLsn.Blk: x2743ec Size: xf000 PrevSize: xf000

We've engaged support from Microsoft, Cisco, VMware, Pure and Clumio. None can find any issues. Microsoft basically stopped offering support once we were able to trigger an 823/824 using SQLIOSim. I get it, the errors are related to something in the I/O path, not within SQL Server, but we were hoping for more from them. They shrugged off the errors with the same checksum values as being 'transient' errors.

Another troubleshooting step we took was to change the schedule of VM backups (VM level backups, not SQL database backups) performed by our backup tool, Clumio, on some of the servers experiencing the SQL errors. The timing of subsequent errors followed the change of the backup window.
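
A rough way to quantify that correlation is to measure what share of error timestamps fall inside the VM backup window, before and after the schedule change. The sketch below is only illustrative: the timestamps and the 23:00 to 01:00 window are hypothetical placeholders, not values from our environment, and `in_window` / `share_in_window` are made-up helper names.

```python
from datetime import datetime, time


def in_window(ts, start, end):
    """True if ts's time of day is in [start, end); supports windows that cross midnight."""
    t = ts.time()
    if start <= end:
        return start <= t < end
    # Window wraps past midnight, e.g. 23:00 to 01:00
    return t >= start or t < end


def share_in_window(timestamps, start, end):
    """Fraction of timestamps that fall inside the window (0.0 for an empty list)."""
    if not timestamps:
        return 0.0
    hits = sum(1 for ts in timestamps if in_window(ts, start, end))
    return hits / len(timestamps)


# Hypothetical error timestamps and backup window, for illustration only.
errors = [
    datetime(2022, 5, 1, 23, 40),
    datetime(2022, 5, 2, 0, 15),
    datetime(2022, 5, 2, 14, 5),
    datetime(2022, 5, 3, 23, 55),
]
window_start, window_end = time(23, 0), time(1, 0)

print(share_in_window(errors, window_start, window_end))  # 0.75
```

If the share inside the window stays high after the window is moved, that supports the link between the VM backups and the errors.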

From a DBA perspective, it sounds like there is a bottleneck somewhere in the I/O path that appears during VM backup windows. Neither our server team nor any of our vendors can see any misconfigurations, extended high latency or other indications of any issue. We see brief spikes of higher-than-normal latency during the VM backups, up to around 40ms for very short periods of time. We had far, far higher latency back when we were using a spinning disk SAN, and even then we had no 823 or 824 errors.

So here we are, experiencing several 823 and 824 errors daily and everyone is doing the "not me" thing. Fortunately, except for a few pages showing up in various servers' suspect pages tables, the errors have been associated with tempdb and we've had no databases actually go into suspect mode.

Any feedback or suggestions would be much appreciated!

Tom Phillips 17,771 Reputation points

2022-05-31T18:04:31.217+00:00

Please post the results of SELECT @@VERSION.

There was a bug in SQL Server (I would have to find it) which incorrectly reported this kind of error. Please make sure you are on the current patch level.

https://learn.microsoft.com/en-US/troubleshoot/sql/general/determine-version-edition-update-level
Mike Liening 6 Reputation points

2022-05-31T18:40:51.547+00:00

These errors occur on the following versions & builds. For the most part, these VMs are on the same SQL build now as they were before the VM migrations when they never experienced these errors.

Version Build
SQL 2008 10.0.5520.0
SQL 2008 R2 10.50.1600.1
SQL 2008 R2 10.50.6000.34
SQL 2012 11.0.6607.3
SQL 2012 11.0.7001.0
SQL 2012 11.0.7507.2
SQL 2014 12.0.5000.0
SQL 2014 12.0.5203.0
SQL 2014 12.0.6024.0
SQL 2016 13.0.4001.0
SQL 2016 13.0.4446.0
SQL 2016 13.0.5026.0
SQL 2016 13.0.5492.2
SQL 2016 13.0.5598.27
SQL 2016 13.0.5698.0
SQL 2017 14.0.3223.3
SQL 2017 14.0.3238.1
SQL 2017 14.0.3335.7
SQL 2019 15.0.4003.23
SQL 2019 15.0.4053.23
SQL 2019 15.0.4138.2
Tom Phillips 17,771 Reputation points

2022-05-31T19:24:46.477+00:00

Just to add, "SQLIOSim" is a standalone tool that does not use the SQL Server engine. It simulates how SQL Server reads from and writes to disk. The fact that the errors also happen when SQLIOSim runs indicates it is not a problem with SQL Server itself, but with something else. That is why MS washed their hands and said talk to someone else.
Mike Liening 6 Reputation points

2022-05-31T20:49:24.043+00:00

Yeah, I get it. I was probably unrealistically hoping that they've seen similar issues in environments like ours.

Answer 1

Tom Phillips 17,771

Did you see this?

https://kb.vmware.com/s/article/88201?lang=en_US

Mike Liening 6 Reputation points

2022-05-31T20:51:40.477+00:00

No, not yet - seems like a recent KB article. Thanks for pointing that out. I'll forward it to our server team for their review. Although a fix isn't available, the workaround seems easy to implement. I'll report back our findings (may take a week or two).
Thanks again.
Mike Liening 6 Reputation points

2022-06-16T15:58:28.697+00:00

It's been almost 2 weeks since our server team implemented the VMware workaround, and we've had no 823 or 824 errors. Thanks for passing along that KB article!
