Did you see this?
Many SQL 823, 824 and 3041 errors since VM migrations
Our server team migrated our SQL Server VMs (VMware) from:
HPE hosts using Fibre Channel through a Brocade fabric to our Pure FlashArray
to:
Cisco hosts connected directly to the same Pure FlashArray, but now using iSCSI.
Ever since the VM migrations last fall, we've had 400+ 823 and 824 errors (checksum errors) across 41 VMs, and approximately 250 log corruption errors detected during backups (error 3041). We had none of these errors in the several years before this VM migration.
The 823 and 824 errors are nearly always associated with tempdb. One interesting finding is that of the 95 occurrences of error 823, all but one report the same value for the expected checksum and the actual checksum, e.g.:
Error: 823, Severity: 24, State: 7. - SQL Server encountered: 'incorrect checksum (expected: 0xec15dd20; actual: 0xec15dd20)' resulting from an attempt to read the following: sort run page (4:33709), in file 'F:\DB_MP1\MSSQL12.MSSQLSERVER\MSSQL\DATA\tempdb3.ndf', in database with ID 2. Sort is retrying the read
Nearly all of the 330 824 errors have different expected and actual checksum values:
Error: 824, Severity: 24, State: 7. - SQL Server encountered: 'incorrect checksum (expected: 0x72b58865; actual: 0x4149dc7b)' resulting from an attempt to read the following: sort run page (1:33008), in file 'E:\SQLData\MP3_TempDB\Data\tempdb.mdf', in database with ID 2. Sort is retrying the read.
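For anyone who wants to reproduce the "same checksum vs. different checksum" split from their own error logs, here is a minimal Python sketch. It assumes the messages look like the two samples above; the regex and the function names (`parse_checksum_error`, `summarize`) are my own illustration, not anything shipped with SQL Server.

```python
import re
from collections import Counter

# Pattern written against the two sample messages in this post; other
# message variants may need adjustments.
CHECKSUM_RE = re.compile(
    r"Error: (?P<error>82[34]),.*?"
    r"expected: (?P<expected>0x[0-9a-fA-F]+); actual: (?P<actual>0x[0-9a-fA-F]+)\)"
    r".*?page \((?P<file_id>\d+):(?P<page_id>\d+)\), in file '(?P<path>[^']+)'"
)


def parse_checksum_error(line):
    """Return a dict describing one 823/824 checksum message, or None if no match."""
    match = CHECKSUM_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["error"] = int(record["error"])
    # Compare numerically so that letter case in the hex digits does not matter.
    record["same_checksum"] = int(record["expected"], 16) == int(record["actual"], 16)
    return record


def summarize(lines):
    """Count messages per (error number, expected == actual) pair."""
    counts = Counter()
    for line in lines:
        record = parse_checksum_error(line)
        if record is not None:
            counts[(record["error"], record["same_checksum"])] += 1
    return counts


# The two sample messages quoted in this post.
SAMPLES = [
    r"Error: 823, Severity: 24, State: 7. - SQL Server encountered: 'incorrect checksum (expected: 0xec15dd20; actual: 0xec15dd20)' resulting from an attempt to read the following: sort run page (4:33709), in file 'F:\DB_MP1\MSSQL12.MSSQLSERVER\MSSQL\DATA\tempdb3.ndf', in database with ID 2. Sort is retrying the read",
    r"Error: 824, Severity: 24, State: 7. - SQL Server encountered: 'incorrect checksum (expected: 0x72b58865; actual: 0x4149dc7b)' resulting from an attempt to read the following: sort run page (1:33008), in file 'E:\SQLData\MP3_TempDB\Data\tempdb.mdf', in database with ID 2. Sort is retrying the read.",
]

if __name__ == "__main__":
    for key, count in sorted(summarize(SAMPLES).items()):
        print(key, count)
```

Feeding it the full set of exported error log lines gives a per-error count of matching vs. mismatching checksums, which makes it easy to check whether the pattern holds on other servers.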
The 3041 errors, of which we've had 239 since the migration, all look similar to this (various databases, various VMs):
Error: 3041, Severity: 16, State: 1. - Backup detected log corruption in database cabinet. Context is Bad Middle Sector. LogFile: 2 'E:\CAB_Logs\Logs\cabinet_1.LDF' VLF SeqNo: x1108ab VLFBase: x5b8d9e000 LogBlockOffset: x60761d000 SectorStatus: 2 LogBlock.StartLsn.SeqNo: x1108ab LogBlock.StartLsn.Blk: x2743ec Size: xf000 PrevSize: xf000
We've engaged support from Microsoft, Cisco, VMware, Pure and Clumio. None can find any issues. Microsoft basically stopped offering support once we were able to trigger an 823/824 using SQLIOSim. I get it, the errors are related to something in the I/O path, not within SQL Server, but we were hoping for more from them. They shrugged off the errors where the expected and actual checksums match as being 'transient' errors.
Another troubleshooting step we took was to change the schedule of VM backups (VM level backups, not SQL database backups) performed by our backup tool, Clumio, on some of the servers experiencing the SQL errors. The timing of subsequent errors followed the change of the backup window.
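To put a number on "the errors follow the backup window", something like the Python sketch below could be run against the error timestamps before and after the schedule change. The function names and the window times used in the usage note are hypothetical placeholders, not our actual Clumio schedule.

```python
from datetime import datetime, time, timedelta


def in_daily_window(ts, start, duration):
    """True if ts falls inside a daily window that begins at `start` and lasts
    `duration`. Windows that cross midnight are handled by also checking the
    window that began on the previous day."""
    for day_offset in (0, -1):
        window_start = datetime.combine(ts.date(), start) + timedelta(days=day_offset)
        if window_start <= ts < window_start + duration:
            return True
    return False


def fraction_in_window(timestamps, start, duration):
    """Share of error timestamps that land inside the daily backup window."""
    if not timestamps:
        return 0.0
    hits = sum(in_daily_window(ts, start, duration) for ts in timestamps)
    return hits / len(timestamps)
```

For example, `fraction_in_window(error_times, time(1, 0), timedelta(hours=2))` returns the share of errors that occurred between 01:00 and 03:00. If that share is high for the old window before the schedule change and high for the new window afterwards, that supports the correlation with the VM backups.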
From a DBA perspective, it sounds like there is a bottleneck somewhere in the I/O path that appears during VM backup windows. Neither our server team nor any of our vendors can see any misconfigurations, extended high latency or other indications of an issue. We see very brief periods of higher-than-normal latency during the VM backups, up to around 40 ms. We had far, far higher latency back when we were using a spinning-disk SAN, and even then we had no 823 or 824 errors.
So here we are, experiencing several 823 and 824 errors daily, and everyone is doing the "not me" thing. Fortunately, except for a few pages showing up in various servers' suspect pages tables, the errors have been associated with tempdb and we've had no databases actually go into suspect mode.
Any feedback or suggestions would be much appreciated!