संपादित करें

SQL Server I/O fundamentals

Applies to: SQL Server Azure SQL Managed Instance SQL Server on Azure Virtual Machines

The primary purpose of a SQL Server database is to store and retrieve data, so intensive disk input/output (I/O) is a core characteristic of the Database Engine. Because disk I/O operations can consume many resources and take a relatively long time to finish, SQL Server focuses on making I/O highly efficient.

Storage subsystems for SQL Server are provided in multiple form factors, including mechanical drives and solid-state storage. This article provides details on how to use drive caching principles to improve Database Engine I/O.

SQL Server requires that systems support guaranteed delivery to stable media as outlined under the SQL Server I/O Reliability Program Requirements. For more information about the input and output requirements for the SQL Server Database Engine, see SQL Server Database Engine Disk Input/Output (I/O) requirements.

Disk I/O

The buffer manager only performs reads and writes to the database. Other file and database operations such as open, close, extend, and shrink are performed by the database manager and file manager components.

Disk I/O operations by the buffer manager have the following characteristics:

  • I/O is typically performed asynchronously, which allows the calling thread to continue processing while the I/O operation takes place in the background. Under some circumstances (for example, misaligned log I/O), synchronous I/Os can occur.

  • All I/Os are issued in the calling threads unless the affinity I/O option is in use. The affinity I/O mask option binds SQL Server disk I/O to a specified subset of CPUs. In high-end SQL Server online transactional processing (OLTP) environments, this extension can enhance the performance of SQL Server threads issuing I/Os.

  • Multiple page I/Os are accomplished with scatter-gather I/O, which allows data to be transferred into or out of noncontiguous areas of memory. This means that SQL Server can quickly fill or flush the buffer cache while avoiding multiple physical I/O requests.

Long I/O requests

The buffer manager reports on any I/O request that is outstanding for at least 15 seconds. This process helps the system administrator distinguish between SQL Server problems and I/O subsystem problems. Error message MSSQLSERVER_833 is reported and appears in the SQL Server error log as follows:

SQL Server has encountered ## occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [##] in database [##] (#). The OS file handle is 0x00000. The offset of the latest long I/O is: 0x00000.

A long I/O can be either a read or a write; the message doesn't currently indicate which. Long-I/O messages are warnings, not errors. They don't indicate problems with SQL Server but with the underlying I/O system. The messages help the system administrator find the cause of poor SQL Server response times more quickly, and distinguish problems that are outside the control of SQL Server. As such, they don't require any action, but the system administrator should investigate why the I/O request took so long, and whether the time is justifiable.

Causes of long I/O requests

A long I/O message can indicate that an I/O is permanently blocked and will never complete (known as lost I/O), or merely that it isn't complete yet. You can't tell from the message which scenario is the case, although a lost I/O often leads to a latch timeout.

Long I/Os often indicate a SQL Server workload that's too intense for the disk subsystem. An inadequate disk subsystem might be indicated when:

  • Multiple long I/O messages appear in the error log during a heavy SQL Server workload.
  • Performance Monitor counters show long disk latencies, long disk queues, or no disk idle time.

A component in the I/O path (for example, a driver, controller, or firmware) can cause long I/Os by continually postponing servicing an old I/O request, in favor of servicing newer requests. This problem can occur in interconnected environments, such as iSCSI and Fibre Channel networks (either due to a misconfiguration or path failure). The Performance Monitor tool can make this problem difficult to confirm because most I/Os are being serviced promptly. Workloads that perform large amounts of sequential I/O, such as backup and restore, table scans, sorting, creating indexes, bulk loads, and zeroing out files, can aggravate long I/O requests.

Isolated long I/Os that don't appear related to any of the previous conditions can be caused by a hardware or driver problem. The system event log might contain a related event that helps to diagnose the problem.

I/O performance issues caused by inefficient queries or filter drivers

Slow I/O can be caused by queries that aren't written efficiently or tuned properly with indexes and statistics. Another common factor in I/O latency is the presence of antivirus or other security programs that scan database files. This scanning software might extend to the network layer, which adds network latency, in turn indirectly affecting database latency. Although the scenario described about 15-second I/O is more common with hardware components, shorter I/O delays are more frequently observed with unoptimized queries or misconfigured antivirus programs.

For detailed information on how to address these issues, see Troubleshoot slow SQL Server performance caused by I/O issues.

For information on how to configure antivirus protection on SQL Server, see Configure antivirus software to work with SQL Server.

Write caching in storage controllers

I/O transfers that don't use a cache can take much longer on mechanical drives because of hard drive spin rates, the mechanical time needed to move the drive heads, and other limiting factors. SQL Server installations target systems that provide caching controllers. These controllers disable the on-disk caches and provide stable media caches to satisfy SQL Server I/O requirements. They avoid performance issues related to storage seek and write times by using the various optimizations of the caching controller.

Note

Some storage vendors use persistent memory (PMEM) as storage rather than a cache, which can improve overall performance. For more information, see Configure persistent memory (PMEM) for SQL Server on Windows and Configure persistent memory (PMEM) for SQL Server on Linux.

Use of a write caching (also called write-back caching) storage controller can improve SQL Server performance. Write caching controllers and storage subsystems are safe for SQL Server, if they're designed for use in a data-critical transactional database management system (DBMS) environment. These design features must preserve cached data if a system failure occurs. Using an external uninterruptible power supply (UPS) to achieve this protection is generally not sufficient, because failure modes that are unrelated to power can occur.

Important

SQL Server depends on guaranteed delivery to stable media for transactional integrity and recovery. Unsafe caching that doesn't ensure preservation of data across failures can corrupt databases, regardless of the consistency of transaction log writes. Always verify that any write-caching mechanism provides full durability guarantees.

Caching controllers and storage subsystems can be safe for use by SQL Server. Most new purpose-built server platforms that incorporate these controllers are safe. However, you should check with your hardware vendor to be sure that the storage subsystem was tested and approved for use in a data-critical transactional relational database management system (RDBMS) environment.

Cache subsystem safety guidelines

Write-back caching controllers can improve performance if they meet specific safety requirements:

  • Include battery-backed cache or non-volatile memory, such as NVDIMM or super-capacitor-backed flash.
  • Be certified by the vendor for data-critical OLTP database environments.
  • Provide protection that covers all failure conditions, not just power loss.

Important

Don't rely on an external UPS alone. Faults unrelated to power, such as firmware bugs or hardware failure, can still lead to cache loss.

Write-ahead logging

SQL Server data modification statements generate logical page writes. You can picture this stream of writes as going to two places: the log and the database itself. For performance reasons, SQL Server defers writes to the database through its own cache buffer system. The system only momentarily defers writes to the log until COMMIT time. It doesn't cache these writes in the same way as data writes. Because log writes for a given page always come before the page's data writes, the log is sometimes referred to as a write-ahead log (WAL).

Write-ahead logging (WAL) protocol

The term protocol is an excellent way to describe WAL. The WAL used by SQL Server is known as ARIES (Algorithm for Recovery and Isolation Exploiting Semantics). For more information, see Manage accelerated database recovery.

It's a specific and defined set of implementation steps necessary to ensure that data is stored and exchanged properly and can be recovered to a known state in the event of a failure. Just as a network contains a defined protocol to exchange data in a consistent and protected manner, so too does the WAL describe the protocol to protect data. All versions of SQL Server open the log and data files using the Win32 CreateFile function. The dwFlagsAndAttributes member includes the FILE_FLAG_WRITE_THROUGH option when opened by SQL Server.

FILE_FLAG_WRITE_THROUGH

SQL Server creates its database files using the FILE_FLAG_WRITE_THROUGH flag. This option instructs the system to write through any intermediate cache and go directly to storage. The system can still cache write operations, but it can't lazily flush them. For more information, see CreateFileA.

The FILE_FLAG_WRITE_THROUGH option ensures that when a write operation returns successful completion, the data is correctly stored in stable storage. This feature aligns with the Write-Ahead Logging (WAL) protocol specification to ensure data integrity. Many storage devices (NVMe, PCIe, SATA, ATA, SCSI, and IDE-based) contain onboard caches of 512 KB, 1 MB, and larger. Storage caches usually rely on a capacitor and not a battery-backed solution. These caching mechanisms can't guarantee writes across a power cycle or similar failure point. They only guarantee the completion of the sector write operations. As the storage devices continue to grow in size, the caches become larger, and they can expose larger amounts of data during a failure.

For more information on FUA support by Linux distribution and its effect on SQL Server, see SQL Server On Linux: Forced Unit Access (FUA) Internals.

Transactional integrity and SQL Server recovery

Transactional integrity is one of the fundamental concepts of a relational database system. Transactions are atomic units of work that are either totally applied or totally rolled back. The SQL Server write-ahead transaction log is a vital component in implementing transactional integrity.

Any relational database system must also deal with a concept closely related to transactional integrity, which is recovery from unplanned system failure. Several non-ideal, real-world effects can cause this failure. On many database management systems, system failure can result in a lengthy human-directed manual recovery process.

In contrast, the SQL Server recovery mechanism is automatic and operates without human intervention. For example, SQL Server could be supporting a mission-critical production application, and experience a system failure due to a momentary power fluctuation. Upon restoration of power, the server hardware restarts, networking software loads and initializes, and SQL Server restarts. As SQL Server initializes, it automatically runs its recovery process based on data in the transaction log. This entire process occurs without human intervention. When client workstations restart, users find all of their data present, up to the last transaction they entered.

Transactional integrity and automatic recovery in SQL Server constitute a powerful time-and-labor saving capability. If a write caching controller isn't properly designed for use in a data-critical transactional DBMS environment, it can compromise the ability of SQL Server to recover, potentially corrupting the database. This problem can occur if the controller intercepts SQL Server transaction log writes and buffers them in a hardware cache on the controller board, but doesn't preserve these written pages during a system failure.

Warning

If cached writes are discarded due to a system reset, database corruption can occur even if a UPS is present. Always ensure write caches are backed by battery or equivalent technology to guarantee data persistence.

On-disk write caching risks

Most storage device caching controllers perform write caching. You can't always disable the write caching function.

Even if the server uses a UPS, the device doesn't guarantee the security of the cached writes. Many types of system failures can occur that a UPS doesn't address. For example, a memory parity error, an operating system (OS) trap, or a hardware glitch that causes a system reset can produce an uncontrolled system interruption. A memory failure in the hardware write cache can also result in the loss of vital log information.

Another possible problem related to a write-caching controller can occur at system shutdown. It's not uncommon to cycle the OS or restart the system during configuration changes. Even if a careful operator follows the OS recommendation to wait until all storage activity ceases before restarting the system, cached writes can still be present in the controller. When the Ctrl+Alt+Del key combination is pressed, or a hardware reset button is pressed, cached writes can be discarded, potentially damaging the database.

It's possible to design a hardware write cache that takes into account all possible causes of discarding dirty cache data, which makes it safe for use by a database server. Some of these design features include intercepting the RST (reset) bus signal to avoid uncontrolled reset of the caching controller, on-board battery backup, and mirrored or error checking and correcting (ECC) memory. Check with your hardware vendor to ensure that the write cache includes these and any other features necessary to avoid data loss.

Use storage caches with SQL Server

A database system is first and foremost responsible for the accurate storage and retrieval of data, even in the event of unexpected system failures.

The system must guarantee the atomicity and durability of transactions, while accounting for current execution, multiple transactions, and various failure points. This property is often referred to as the ACID (Atomicity, Consistency, Isolation, and Durability) properties.

This section addresses the implications of storage caches. For further information on caching and alternate failure mode discussions, see the following articles:

Also review the following archived content:

The concepts in these two articles remain broadly applicable to current versions of SQL Server.

Battery-backed caching solutions

Enhanced caching controller systems disable on-disk cache and provide a functional battery-backed caching solution. These caches can maintain the data in the cache for several days and even allow the caching card to be placed in a second computer. When power is properly restored, the unwritten data is completely flushed before any further data access is allowed. Many of these systems allow you to establish a percentage of read versus write cache for optimal performance. Some systems contain large memory storage areas. Some hardware vendors provide high-end battery-backed drive caching systems with multiple gigabytes of cache. These systems can significantly improve database performance. Battery-backed caching solutions provide the durability and consistency of data that SQL Server expects.

Storage subsystem implementations

Storage subsystems exist in many types. Two common examples are RAID (redundant array of independent disks) and SAN (storage area network). These systems typically use SCSI-based drives. The following section describes high-level storage considerations.

SCSI, SAS, and NVMe

SCSI, SAS, and NVMe storage devices:

  • Are typically designed for heavy-duty use.
  • Are typically targeted at multiuser, server-based implementations.
  • Typically have better mean time to failure rates than other implementations.
  • Contain sophisticated heuristics to help predict imminent failures.

Note

SQL Server supports Internet Small Computer System Interface (iSCSI) technology components that meet the requirements of the Windows Hardware Compatibility Program. Although SQL Server doesn't interact directly with iSCSI, it operates seamlessly because Windows presents iSCSI storage as standard drives. SQL Server can then read from, and write to, remote block-level storage across IP networks. Since iSCSI depends on networks, you can experience delays or bottlenecks. Ensure the server's caching performance is optimal, and latency is minimized. For more information, see iSCSI Target Server Scalability Limits.

Non-SCSI

Other drive implementations, such as IDE, ATA, and SATA:

  • Are typically designed for light to medium-duty use.
  • Are typically targeted at single-user applications.

Non-SCSI, desktop-based controllers require more main processor (CPU) bandwidth and are frequently limited by a single active command. For example, when a non-SCSI drive is adjusting a bad block, the drive requires that the host commands wait. The ATA bus presents another example: the ATA bus supports two devices, but only a single command can be active. This limitation leaves one drive idle while the other drive services the pending command. RAID systems built on desktop technologies can all experience these symptoms and be greatly affected by the slowest responder. Unless these systems use advanced designs, their performance isn't as efficient as the performance of SCSI-based systems.

Storage considerations

A desktop-based drive or array can be a low-cost solution for some situations. For example, if you set up a read-only database for reporting, you don't encounter many of the performance factors of an OLTP database when drive caching is disabled.

Storage device sizes continue to increase. Low-cost, high-capacity drives can be appealing. But when you configure the drive for SQL Server and your business response time needs, carefully consider the following issues:

  • Access path design
  • The requirement to disable the on-disk cache

Mechanical hard drives

Mechanical drives use spinning magnetic platters for storing data. They're available in several capacities and form factors, such as IDE, SATA, SCSI, and Serial Attached SCSI (SAS). Some SATA drives include failure prediction constructs. SCSI drives are designed for heavier duty cycles and decreased failure rates.

IDE and ATA-based systems can postpone host commands when they perform activities such as bad block adjustment, leading to periods of stalled I/O activity.

SAS benefits include advanced queuing up to 256 levels, and head-of-queue and out-of-order queuing. The SAS backplane is designed in a way that enables the use of both SAS and SATA drives within the same system.

Your SQL Server installation depends on the controller's ability to disable the on-disk cache and to provide a stable I/O cache. Writing data out of order to various drives isn't a hindrance to SQL Server as long as the controller provides the correct stable media caching capabilities. The complexity of the controller design increases with advanced data security techniques such as mirroring.

Solid-state storage

Solid-state storage has advantages over mechanical (spinning) hard drives, but it's susceptible to many of the same failure patterns as spinning media. You can connect solid-state storage to your server using various interfaces, including NVM Express (NVMe), PCI Express (PCIe), and SATA. Treat the solid-state media as you would spinning media, and make sure that the appropriate safeguards are in place for power failure, such as a battery-backed caching controller.

Common problems caused by a power fault include:

  • Bit corruption: Records show random bit errors.
  • Flying writes: Well-formed records end up in the wrong place.
  • Shorn writes: Operations are partially done at a level below the expected sector size.
  • Metadata corruption: Metadata in the flash translation layer (FTL) is corrupted.
  • Unresponsive device: Device doesn't work at all, or mostly doesn't work.
  • Unserializability: Final state of storage doesn't result from a serializable operation order.

512e

Most solid-state storage devices report 512-byte sector sizes but use 4-KB pages inside the 1-MB erasure blocks. Using 512-byte aligned sectors for the SQL Server log device can generate more read/modify/write (RMW) activities, which can contribute to slower performance and drive wear.

Recommendation: Make sure the caching controller is aware of the correct page size of the storage device, and can align physical writes with the solid-state storage infrastructure properly.

0xFFFFFFFF

A newly formatted drive usually holds all zeros. An erased block of a solid-state device is all 1s, making a raw read of an erased block all 0xFF characters. However, it's unusual for a user to read an erased block during normal I/O operations.

Pattern stamping

A technique used in the past is to write a known pattern to the entire drive. Then as database activity executes against that same drive, incorrect behavior (stale read, lost write, or read of incorrect offset) can be detected when the pattern unexpectedly appears.

This technique doesn't work well on solid-state storage. The erasure and RMW activities for writes destroy the pattern. The solid-state storage garbage collection (GC) activity, wear leveling, proportional/set-aside list blocks, and other optimizations tend to cause writes to acquire different physical locations, unlike spinning media's sector reuse.

Firmware

The firmware used in solid-state storage tends to be complex when compared to spinning media counterparts. Many drives use multiple processing cores to handle incoming requests and garbage collection activities. Make sure you keep your solid-state device firmware up to date to avoid known problems.

Read data damage and wear leveling

A common garbage collection (GC) approach for solid-state storage helps prevent repeated, read data damage. When reading the same cell repeatedly, it's possible the electron activity can leak and cause damage to neighboring cells. Solid-state storage protects the data with various levels of error correction code (ECC) and other mechanisms.

One such mechanism relates to wear leveling. Solid-state storage keeps track of the read and write activity on the storage device. The garbage collection can determine hot spots or locations wearing faster than other locations. For example, the GC determines that a block is in a read-only state and needs to move. This movement is generally to a block with more wear, so the original block can be used for writes. This process helps balance the wear on the drive, but it moves read-only data to a location that has more wear and mathematically increases the failure chances, even if slightly.

Another side effect of wear leveling can occur with SQL Server. Suppose you execute DBCC CHECKDB, and it reports an error. If you run it a second time, there's a small chance that DBCC CHECKDB reports additional or a different pattern of errors, because the solid-state storage GC activity might make changes between the DBCC CHECKDB executions.

OS error 665 and defragmentation

Spinning media needs to keep blocks near one another to reduce the drive's head movement and increase performance. Solid-state storage doesn't have a physical head, which eliminates seek time. Many solid-state devices are designed to allow parallel operations on different blocks. Defragmentation of solid-state media is therefore unnecessary. Serial activities are the best I/O patterns to maximize I/O throughput on solid-state storage devices.

Note

Solid-state storage benefits from the trim feature, an operating system (OS) level command that erases blocks that are considered no longer in use. In Windows, use the Optimize Drives tool to set a weekly schedule for optimizing drives.

Recommendations:

  • Use an appropriate, battery-backed controller designed to optimize write activities. This choice can improve performance, reduce drive wear, and lower physical fragmentation levels.

  • Consider using the ReFS file system to avoid the NTFS attribute limitations.

  • Make sure the file growth settings are appropriate.

For more information on troubleshooting OS error 665 as it relates to fragmentation, see OS errors 665 and 1450 are reported for SQL Server files.

Compression

As long as the drive maintains the intent of stable media, compression can extend the drive life and might positively affect performance. However, some firmware might already compress data invisibly. Remember to test new storage scenarios before deploying them to your production environment.

Summary

  • Maintain proper backup and disaster recovery procedures and processes.
  • Keep your firmware up to date.
  • Listen closely to your hardware manufacturer's guidance.

Drive cache configuration

To use a drive with SQL Server, disable drive caching. By default, the drive cache is enabled. In Windows Server, use the Disk Properties > Hardware > Policy tab to disable write caching at the OS level.

Don't rely on OS-level settings alone. Some drives ignore Windows settings and require manufacturer-provided utilities or firmware settings to disable write caching. Confirm through vendor tools that write caching is actually disabled.

Note

Even with write caching disabled, drive firmware might introduce internal optimizations that delay flush commands. Always confirm the effective cache state before deployment by using testing tools such as SQLIOSim.

Cache considerations and SQLIOSim

To confirm transactional durability guarantees, validate your I/O subsystem by using SQLIOSim before moving to production. This utility simulates heavy asynchronous read and write activity to a simulated data device and log device. For more information, see Use the SQLIOSim utility to simulate SQL Server activity on a disk subsystem.

Note

Ensure that any alternate caching mechanism can properly handle multiple types of failure.

Many PC manufacturers order the drives with the write cache disabled. However, testing shows that this condition might not always be the case, so you should always test it completely. If you have any questions about the caching status of your storage device, contact the manufacturer and obtain the appropriate utility to disable write caching operations. On older storage media, you might also need jumper settings.

SQL Server requires systems to support guaranteed delivery to stable media, as outlined under the SQL Server I/O Reliability Program Requirements. For more information about the input and output requirements for the SQL Server Database Engine, see SQL Server Database Engine Disk Input/Output (I/O) requirements.

Improper write caching risks

When you enable write caching without proper safeguards, some storage subsystems acknowledge write operations as complete before data is safely written to durable media. If a power loss or system failure occurs, this condition can result in:

  • Data loss, where committed transactions are never persisted.
  • Database corruption due to broken write-order guarantees.

Important

Disable write caching for SQL Server data and log drives unless you confirm through hardware vendor documentation that:

  • The cache is battery-backed or uses persistent flash storage.
  • The drive guarantees durability across power outages and system crashes.

External UPS devices aren't sufficient because they might not protect against all failure modes, such as a controller firmware fault.