Understanding MPIO Features and Components
Applies To: Windows Server 2008 R2, Windows Server 2012
About MPIO
Microsoft Multipath I/O (MPIO) is a Microsoft-provided framework that allows storage providers to develop multipath solutions, in the form of modules that contain the hardware-specific information needed to optimize connectivity with their storage arrays. These modules are called device-specific modules (DSMs). The concepts around DSMs are discussed later in this document.
MPIO is protocol-independent and can be used with Fibre Channel, Internet SCSI (iSCSI), and Serial Attached SCSI (SAS) interfaces in Windows Server® 2008, Windows Server 2008 R2, and Windows Server 2012.
Multipath solutions in Windows Server
When running on Windows Server 2008 R2 or Windows Server 2012, an MPIO solution can be deployed in the following ways:
By using a DSM provided by a storage array manufacturer for Windows Server in a Fibre Channel, iSCSI, or SAS shared storage configuration.
By using the Microsoft DSM, which is a generic DSM provided for Windows Server in a Fibre Channel, iSCSI, or SAS shared storage configuration.
Note
To work with the Microsoft DSM, storage must be SCSI Primary Commands-3 (SPC-3) compliant.
High availability solutions
Keeping mission-critical data continuously available has become a requirement over a wide range of customer segments from small business to datacenter environments. Enterprise environments that use Windows Server require no downtime for key workloads, including file server, database, messaging, and other line of business applications. This level of availability can be difficult and very costly to achieve, and it requires that redundancy be built in at multiple levels: storage redundancy, backups to separate recovery servers, server clustering, and redundancy of the physical path components.
Application availability through Failover Clustering
Clustering is the use of multiple servers, host bus adapters (HBAs), and storage devices that work together to provide users with high application availability. If a server experiences a hardware failure or is temporarily unavailable, end users are still able to transparently access data or applications on a redundant cluster node. In addition to providing redundancy at the server level, clustering can also be used as a tool to minimize the downtime required for patch management and hardware maintenance. Clustering solutions require software that enables transparent failover between systems. Failover Clustering [formerly known as Microsoft Cluster Server (MSCS)] is one such solution that is included with the Windows Server 2008 R2 and Windows Server 2012 operating systems.
High availability through MPIO
MPIO allows Windows® to manage and efficiently use up to 32 paths between storage devices and the Windows host operating system. Although both MPIO and Failover Clustering result in high availability and improved performance, they are not equivalent concepts. Failover Clustering provides high application availability and tolerance of server failure, while MPIO provides fault-tolerant connectivity to storage. By employing MPIO and Failover Clustering together as complementary technologies, users can mitigate the risk of a system outage at both the hardware and application levels.
MPIO provides the logical facility for routing I/O over redundant hardware paths connecting server to storage. These redundant hardware paths are made up of components such as cabling, host bus adapters (HBAs), switches, storage controllers, and possibly even power. MPIO solutions logically manage these redundant connections so that I/O requests can be rerouted if a component along one path fails.
As more and more data is consolidated on storage area networks (SANs), the potential loss of access to storage resources is unacceptable. To mitigate this risk, high availability solutions, such as MPIO, have now become a requirement.
Considerations for using MPIO
Consider the following when using MPIO:
When using the Microsoft DSM, storage that implements an Active/Active storage scheme but does not support ALUA will default to the Round Robin load-balancing policy setting, although a different policy setting may be chosen later. Additionally, you can pre-configure MPIO so that when it detects a certain hardware ID, it defaults to a specific load-balancing policy setting; a brief sketch of this idea follows this list.
For more information about load-balancing policy settings, see Referencing MPCLAIM Examples.
Windows multipathing solutions must use the MPIO framework to be eligible for logo qualification for Windows Server. For additional information about Windows logo requirements, see Windows Quality Online Services (Winqual) (https://go.microsoft.com/fwlink/?LinkId=71551).
This joint solution allows storage partners to design hardware solutions that are integrated with the Windows operating system. Compatibility with both the operating system and other partner-provided storage devices is verified through Windows Logo program tests, which help ensure proper storage device functionality. The result is a highly available multipath solution, based on MPIO, that is supportable across Windows operating system implementations.
To determine which DSM to use with your storage, and its optimal configuration, refer to information from your hardware storage array manufacturer. Multipath solutions are supported as long as a DSM is implemented in line with logo requirements for MPIO. Most multipath solutions for Windows today use the MPIO architecture and a DSM provided by the storage array manufacturer. You can use the Microsoft DSM provided in Windows Server if it is also supported by the storage array manufacturer.
Multipath software suites available from storage array manufacturers may provide additional value beyond the Microsoft DSM because the software typically provides auto-configuration, heuristics for specific storage arrays, statistical analysis, and integrated management. We recommend using the DSM provided by the hardware storage array manufacturer to achieve optimal performance, because the storage array manufacturer can make more advanced, array-specific path decisions in its DSM, which may result in quicker path failover times.
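The following minimal Python sketch illustrates the idea of matching a detected hardware ID to a preconfigured default load-balancing policy setting. The vendor and product names and the policy table are hypothetical, and the 8-character vendor plus 16-character product padding is shown only as a conceptual model; actual configuration is performed with tools such as mpclaim.exe, not with code like this.

```python
# Conceptual sketch only; not the MPIO implementation.
def mpio_hardware_id(vendor: str, product: str) -> str:
    """Combine a SCSI vendor ID (padded to 8 characters) and product ID
    (padded to 16 characters) into a single hardware ID string."""
    return vendor.ljust(8)[:8] + product.ljust(16)[:16]

# Hypothetical table mapping hardware IDs to default load-balancing policies.
default_policies = {
    mpio_hardware_id("VendorA", "Array1000"): "Round Robin",
    mpio_hardware_id("VendorB", "Array2000"): "Least Queue Depth",
}

def default_policy_for(vendor: str, product: str) -> str:
    """Return the preconfigured default policy, or Round Robin if none is set."""
    return default_policies.get(mpio_hardware_id(vendor, product), "Round Robin")

print(default_policy_for("VendorB", "Array2000"))  # Least Queue Depth
print(default_policy_for("VendorC", "Other"))      # Round Robin (no entry)
```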
Making MPIO-based solutions work
The Windows operating system relies on the Plug and Play (PnP) Manager to dynamically detect and configure hardware (such as adapters or disks), including hardware used for high availability/high performance multipath solutions.
Note
You might be prompted to restart the computer after the MPIO feature is first installed.
Device discovery and enumeration
An MPIO multipath driver cannot work effectively or efficiently until it discovers, enumerates, and configures into a logical group the different devices that the operating system sees through redundant adapters. This section briefly outlines how MPIO works with a DSM to discover and configure these devices.
Without a multipath driver, the same device seen through different physical paths would appear as entirely different devices, leaving room for data corruption. Figure 1 depicts this scenario.
Figure 1 Multipathing software and storage unit distinction
Following is the sequence of steps that the device driver stack walks through to discover, enumerate, and group the physical devices and device paths into a logical set; a brief sketch of the grouping step follows the list. (This assumes a scenario where a new device is presented to the server.)
A new device arrives.
The PnP manager detects the device’s arrival.
The MPIO driver stack is notified of the device’s arrival (it takes further action if it is a supported MPIO device).
The MPIO driver stack creates a pseudo device for the physical device.
The MPIO driver walks through all the available DSMs to determine which vendor-specific DSM can claim the device. After a DSM claims a device, it is associated only with the DSM that claimed it.
The MPIO driver, along with the DSM, verifies that the path to the device is connected, active, and ready for I/O.
If a new path for this same device arrives, MPIO works with the DSM to determine whether this device is the same as any other claimed device. MPIO then groups the physical paths for the same device into a logical set, the multipath group, which is presented as a single pseudo-Logical Unit Number (pseudo-LUN).
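The grouping step at the end of this sequence can be pictured with a short sketch. The following Python example is illustrative only and uses hypothetical path names and device identifiers; in Windows the grouping is performed in kernel mode by the MPIO driver stack and the owning DSM.

```python
from collections import defaultdict

# Each tuple is (path, unique device identifier reported on that path).
discovered = [
    ("Path1", "ID-0001"),   # hypothetical identifiers
    ("Path2", "ID-0001"),   # same device seen on a second path
    ("Path1", "ID-0002"),
]

# Paths that report the same identifier are grouped into one pseudo-LUN.
pseudo_luns = defaultdict(list)
for path, device_id in discovered:
    pseudo_luns[device_id].append(path)

for device_id, paths in pseudo_luns.items():
    print(f"pseudo-LUN {device_id}: paths {paths}")
```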
Unique storage device identifier
For dynamic discovery to work correctly, a unique identifier must be obtainable for each logical unit, regardless of the path from the host to the storage device. The MPIO driver package does not use disk signatures placed in the data area of a disk for identification. Instead, the Microsoft-provided generic DSM generates a unique identifier from data that is provided by the storage hardware. MPIO also allows for optionally using a unique hardware identifier assigned by the device manufacturer.
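As a rough illustration of this idea, the following Python sketch derives an identifier from fields reported by the hardware (for example, vendor, product, and serial number from a SCSI inquiry), so that the same device yields the same identifier on every path. The fields and format here are hypothetical and are not the Microsoft DSM's actual algorithm.

```python
def device_identifier(vendor: str, product: str, serial: str) -> str:
    """Build a path-independent identifier from hardware-reported fields."""
    return f"{vendor.strip()}_{product.strip()}_{serial.strip()}"

# The device reports the same fields on every path, so both calls match.
print(device_identifier("VendorA ", "Array1000", "SN0001"))
print(device_identifier("VendorA", " Array1000 ", " SN0001"))
```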
Dynamic load balancing
Load balancing, the redistribution of read/write requests for the purpose of maximizing throughput between server and storage device, is especially important in high workload settings or other settings where consistent service levels are critical. Without MPIO software, a server sending I/O requests down several paths may operate with very heavy workloads on some paths while others are underutilized.
The MPIO software supports the ability to balance I/O workload without administrator intervention. MPIO determines which paths to a device are in an active state and can be used for load balancing. Each vendor’s load-balancing policy setting (which may use any of several algorithms, such as Round Robin, the path with the fewest outstanding commands, or a vendor unique algorithm) is set in the DSM. This policy setting determines how the I/O requests are actually routed.
Note
In addition to the support for load balancing provided by MPIO, the hardware used must support the ability to use multiple paths at the same time, rather than just fault tolerance.
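The following minimal sketch shows the simplest such policy, round robin over the paths currently reported as active. It is illustrative only; the path names are hypothetical, and real path selection is performed by the DSM in kernel mode.

```python
from itertools import cycle

active_paths = ["Path1", "Path2", "Path3"]   # paths currently in an active state
selector = cycle(active_paths)               # round-robin iterator over those paths

for io_number in range(6):                   # six hypothetical I/O requests
    print(f"I/O {io_number} -> {next(selector)}")
```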
Error handling, failover, and recovery
The MPIO driver, in combination with the DSM, supports end-to-end path failover. The process of detecting failed paths and recovering from the failure is automatic, usually fast, and completely transparent to the IT organization. The data ideally remains available at all times.
Not all errors result in failover to a new path. Some errors are temporary and can be recovered by using a recovery routine in the DSM; if recovery is successful, MPIO is notified and path validity is checked to verify that it can be used again to transmit I/O requests.
When a fatal error occurs, the path is invalidated and a new path is selected. The I/O is resubmitted on this new path without requiring the application layer to resubmit the data.
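The following Python sketch models this flow: a fatal error invalidates the current path, a new path is selected, and the I/O is resubmitted without involving the application. The path names and the send_io function are hypothetical stand-ins for the DSM and MPIO behavior described above.

```python
def send_io(path: str, io: str) -> bool:
    """Pretend to issue an I/O; in this sketch, Path1 has failed."""
    return path != "Path1"

def submit_with_failover(paths: list[str], io: str) -> str:
    remaining = list(paths)
    while remaining:
        path = remaining[0]
        if send_io(path, io):
            return f"{io} completed on {path}"
        remaining.pop(0)          # fatal error: invalidate this path
    raise IOError(f"{io} failed: no paths remaining")

print(submit_with_failover(["Path1", "Path2"], "write #1"))  # fails over to Path2
```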
Differences in load-balancing technologies
There are two primary types of load-balancing technologies referred to within Windows. This document discusses only MPIO Load Balancing.
MPIO Load Balancing is a type of load balancing supported by MPIO that uses multiple data paths between server and storage to provide greater throughput of data than could be achieved with only one connection.
Network Load Balancing (NLB) is a separate Windows Server technology that distributes network traffic across multiple servers to provide greater throughput and availability, and is most typically used with Internet Information Services (IIS).
Differences in failover technologies
When addressing data path failover, such as the failover of host bus adapter (HBA) or iSCSI connections to storage, the following main types of failover are available:
MPIO-based fault tolerant failover In this scenario, multiple data paths to the storage are configured, and if one path fails, the HBA or network adapter fails over to another path and resends any outstanding I/O.
For a server that has one or more HBAs or network adapters, MPIO provides the following:
Support for redundant switch fabrics or connections from the switch to the storage array
Protection against the failure of one of the adapters within the server itself
MPIO-based load balancing In this scenario, multiple paths to storage are also defined; however, the DSM is able to balance the data load to maximize throughput. This configuration can also employ fault-tolerant behavior so that if one path fails, all data follows an alternate path.
In some hardware configurations you may have the ability to perform dynamic firmware updates on the storage controller, such that a complete outage is not required for firmware updates. This capability is hardware dependent and requires (at a minimum) that more than one storage controller be present on the storage so that data paths can be moved off of a storage controller for upgrades.
Failover Clustering This type of configuration offers resource failover at the application level from one cluster server node to another. This type of failover is more invasive than storage path failover because it requires client applications to reconnect after failover, and then resend data from the application layer. This method can be combined with MPIO-based fault tolerant failover and MPIO-based load balancing to further mitigate the risk of exposure to different types of hardware failures.
Different behaviors are available depending on the type of failover technology used, and whether it is combined with a different type of failover or redundancy. Consider the following scenarios:
Scenario 1: Using MPIO without Failover Clustering
This scenario provides either a fault-tolerant connection to data or a load-balanced connection to storage. Because this layer of fault-tolerant operation protects only the connectivity between the server and storage, it does not provide protection against server failure.
Scenario 2: Combining the use of MPIO in fault tolerant mode with Failover Clustering
This configuration provides the following advantages:
If a path to the storage fails, MPIO can use an alternate path without requiring client application reconnection.
If an individual server experiences a critical event such as hardware failure, the application managed by Failover Clustering is failed over to another cluster node. While this scenario requires client reconnection, the time to restore the service may be much shorter than that required for replacing the failed hardware.
Scenario 3: Combining the use of MPIO in load-balancing mode with Failover Clustering
This scenario provides the same benefits as listed in Scenario 2, plus the following benefit:
During normal operation, multiple data paths may be employed to provide greater aggregate throughput than one path can provide.
About the Windows storage stack and drivers
For the operating system to correctly perform operations that relate to hardware, such as adding or removing devices or transferring I/O requests from an application to a storage device, the correct device drivers must be associated with the device. All device-related functionality is initiated by the operating system, but under direct control of subroutines contained within each driver. These processes become considerably more complicated when there are multiple paths to a device. The MPIO software prevents data corruption by ensuring correct handling of the driver associated with a single device that is visible to the operating system through multiple paths. Data corruption is likely because when the operating system believes two separate paths lead to two separate storage volumes, it does not serialize access between them or prevent cache conflicts. Consider, for example, what would happen if NTFS tried to initialize its journal log twice on the same volume through what it believes are two different disks.
Storage stack and device drivers
Storage architecture in Windows consists of a series of layered drivers, as shown in Figure 2. (Note that the application and the disk subsystem are not part of the storage layers.) When a device such as a storage disk is first added in, each layer of the hierarchy is responsible for making the disk functional (such as by adding partitions, volumes, and the file system). The stack layers below the broken line are collectively known as the device stack and deal directly with managing storage devices.
Figure 2 Layered drivers in Windows storage architecture
Device drivers
Device drivers manage specific hardware devices, such as disks or tapes, on behalf of the operating system.
Port drivers
Port drivers manage different types of transport, depending on the type of adapter (for example, USB, iSCSI, or Fibre Channel) in use. Historically, one of the most common port drivers in Windows was the SCSIport driver. In conjunction with the class driver, the port driver handles Plug and Play (PnP) and power functionality. Port drivers manage the connection between the device and the bus. Windows Server 2003 introduced a new port driver, Storport, which is better suited to high-performance, high-reliability environments and is more commonly used today than SCSIport.
Miniport drivers
Each storage adapter has an associated device driver, known as a miniport. This driver implements only those routines necessary to interface with the storage adapter’s hardware. A miniport partners with a port driver to implement a complete layer in the storage stack, as shown in Figure 2.
Class drivers
Class drivers manage a specific device type. They are responsible for presenting a unified disk interface to the layers above (for example, to control read/write behavior for a disk). The class driver manages the functionality of the device. Class drivers (like port and miniport drivers) are not a part of the MPIO driver package per se; however, the PnP disk class driver, disk.sys, is used as part of the multipathing solution because the class driver controls the disk add/removal process, and I/O requests pass through this driver to the MPIO bus driver. For more information, see the MPIO drivers sections that follow.
The MPIO driver is implemented in the kernel mode of the operating system. It works in combination with the PnP Manager, the disk class driver, the port driver, the miniport driver, and a device-specific module (DSM) to provide full multipath functionality.
Multipath bus drivers (mpio.sys)
Bus drivers are responsible for managing the connection between the device and the host computer. The multipath bus driver provides a “software bus” (also technically termed a “root bus”), the conceptual analog to an actual bus slot into which a device plugs. It acts as the parent bus for the multipath children (disk PDOs). As a root bus, mpio.sys can create new device objects that are not created by new hardware being added to the configuration. The MPIO bus driver also communicates with the rest of the operating system, manages the PnP connection and power control between the hardware devices and the host computer, and uses WMI classes to allow storage array manufacturers to monitor and manage their storage and associated DSMs. For more information about WMI, see MPIO WMI Classes (https://go.microsoft.com/fwlink/?LinkId=163826).
DSM management
Management and monitoring of the DSM can be done through the Windows Management Instrumentation (WMI) interface or by using the mpclaim.exe tool, which relies on the WMI support included within the MPIO drivers.
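As a rough sketch, the MPIO WMI instrumentation can also be inspected from a script. The example below assumes the third-party Python wmi package and a Windows system with MPIO installed, and it queries the documented MPIO_DISK_INFO class; because property layouts can vary between versions, it simply prints whatever properties each instance exposes. It is not a supported management interface; use mpclaim.exe or WMI directly for real administration.

```python
import wmi   # third-party package (pip install wmi); Windows only

conn = wmi.WMI(namespace="root\\wmi")        # MPIO classes live in root\wmi
for disk in conn.MPIO_DISK_INFO():           # one instance per multipath disk
    for name in disk.properties:             # avoid assuming specific property names
        print(name, "=", getattr(disk, name))
```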
MPIO DSM
As explained previously in this document, a storage array manufacturer’s device-specific module (DSM) incorporates knowledge of the manufacturer’s hardware. A DSM interacts with the MPIO driver. The DSM plays a crucial role in device initialization and I/O request handling, including I/O request error handling. These DSM actions are described further in the following sections.
Device initialization
MPIO allows devices from different storage vendors to coexist and be connected to the same Windows Server-based system. This means that a single server running Windows Server can have multiple DSMs installed on it. When a new eligible device is detected through PnP, MPIO attempts to determine which DSM is appropriate to handle the device. MPIO contacts each DSM one at a time. The first DSM to claim ownership of the device is associated with that device, and the remaining DSMs are not given the opportunity to claim the already claimed device. There is no particular order in which the DSMs are contacted, except that the Microsoft DSM is always contacted last. If the DSM supports the device, it then indicates whether the device is a new installation or the same device previously installed that is now visible through a new path.
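The claiming order can be sketched as follows. This is illustrative Python only: the DSM names and claim rules are hypothetical, and the real negotiation happens between mpio.sys and the installed DSMs in kernel mode.

```python
def vendor_a_dsm(hardware_id: str) -> bool:
    return hardware_id.startswith("VendorA")   # claims only this vendor's arrays

def microsoft_dsm(hardware_id: str) -> bool:
    return True   # generic DSM; assumes the device is SPC-3 compliant

# Vendor DSMs are tried in no particular order; the Microsoft DSM is always last.
dsms = [("VendorA DSM", vendor_a_dsm), ("Microsoft DSM", microsoft_dsm)]

def claim(hardware_id: str) -> str:
    for name, claims in dsms:
        if claims(hardware_id):
            return name        # later DSMs never see an already claimed device
    return "unclaimed"

print(claim("VendorA Array1000"))   # VendorA DSM
print(claim("VendorB Array2000"))   # Microsoft DSM
```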
MPIO device discovery
Figure 3 illustrates how devices and path discovery work with MPIO.
Figure 3 Devices and path discovery with MPIO
Request handling
When an application makes an I/O request to a specific device, the DSM that claimed the device makes a determination, based on its internal load-balancing algorithms, as to which path the request should be sent.
Error handling
If the I/O request fails, the DSM is responsible for analyzing the failure to determine whether to retry the I/O, cause a failover to a new path, or return the error to the requesting application. In the case of a failover, the DSM determines which new path should be used. The actual rebuilding and resubmission of the I/O is done by MPIO and is not the responsibility of the DSM. The details of the DSM/MPIO interaction that make all of this happen are beyond the scope of this document, and are provided in the MPIO Driver Development Kit (DDK) available from Microsoft.
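A simplified model of that decision is sketched below. The error categories, retry limit, and return values are hypothetical; the real logic is specific to each DSM and is defined through the MPIO DDK interfaces.

```python
RETRY, FAILOVER, FAIL = "retry", "failover", "fail"

def classify_error(error: str, retries_used: int, max_retries: int = 3) -> str:
    """Decide whether to retry on the same path, fail over, or surface the error."""
    if error == "transient" and retries_used < max_retries:
        return RETRY       # temporary condition; the DSM's recovery routine retries
    if error in ("transient", "path_down"):
        return FAILOVER    # DSM picks a new path; MPIO rebuilds and resubmits the I/O
    return FAIL            # unrecoverable; return the error to the application

print(classify_error("transient", retries_used=1))    # retry
print(classify_error("path_down", retries_used=0))    # failover
print(classify_error("media_error", retries_used=0))  # fail
```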
Details about the Microsoft DSM in Windows Server
The Microsoft device-specific module (DSM) provided in Windows Server includes support for the following policy settings (a brief path-selection sketch follows the list):
Failover Only Policy setting that does not perform load balancing. This policy setting uses a single active path, and the rest of the paths are standby paths. The active path is used for sending all I/O. If the active path fails, one of the standby paths is used. If the path that failed is later reactivated or reconnected, I/O can optionally fail back to it (returning the substitute path to standby) if failback is turned on. For more information about how to configure MPIO path automatic failback, see the section later in this document titled, “Configure the MPIO Failback Policy.”
Round Robin Load-balancing policy setting that allows the DSM to use all available paths for MPIO in a balanced way. This is the default policy that is chosen when the storage controller follows the active-active model and the management application does not specifically choose a load-balancing policy setting.
Round Robin with Subset Load-balancing policy setting that allows the application to specify a set of paths to be used in a round robin fashion, plus a set of standby paths. The DSM uses the active paths for processing requests as long as at least one of them is available. The DSM uses a standby path only when all of the active paths fail. For example, given four paths: A, B, C, and D, paths A, B, and C are listed as active paths and D is the standby path. The DSM chooses a path from A, B, and C in round robin fashion as long as at least one of them is available. If all three paths fail, the DSM uses D, the standby path. If any of paths A, B, or C becomes available, the DSM stops using path D and switches to the available paths among A, B, and C.
Least Queue Depth Load-balancing policy setting that sends I/O down the path with the fewest currently outstanding I/O requests. For example, consider that one I/O is sent to LUN 1 on Path 1, and another I/O is sent to LUN 2 on Path 1. The cumulative outstanding I/O count on Path 1 is 2, and on Path 2 it is 0. Therefore, the next I/O for either LUN will be processed on Path 2.
Weighted Paths Load-balancing policy setting that assigns a weight to each path. The weight indicates the relative priority of a given path: the larger the number, the lower the priority. The DSM chooses the least-weighted path from among the available paths.
Least Blocks Load-balancing policy setting that sends I/O down the path with the fewest data blocks currently being processed. For example, consider two I/Os: one is 10 bytes and the other is 20 bytes. Both are in process on Path 1, and there are no outstanding I/Os on Path 2. The cumulative outstanding amount of I/O on Path 1 is 30 bytes, and on Path 2 it is 0. Therefore, the next I/O will be processed on Path 2.
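To make two of these policy settings concrete, the following Python sketch selects a path under Least Queue Depth and under Weighted Paths. The path names, queue depths, and weights are hypothetical; the Microsoft DSM's internal implementation is not shown here.

```python
def least_queue_depth(outstanding: dict[str, int]) -> str:
    """Pick the path with the fewest outstanding I/O requests."""
    return min(outstanding, key=outstanding.get)

def weighted_paths(weights: dict[str, int]) -> str:
    """Pick the available path with the smallest weight (highest priority)."""
    return min(weights, key=weights.get)

# Matches the Least Queue Depth example above: two I/Os outstanding on Path 1.
print(least_queue_depth({"Path1": 2, "Path2": 0}))   # Path2

# Lower weight means higher priority, so Path1 is chosen while it is available.
print(weighted_paths({"Path1": 10, "Path2": 50}))    # Path1
```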