Operating Systems and PAE Support
Addressing physical memory above 4 GB requires more than the 32 bits of address offered by the standard operating mode of Intel (32-bit) processors. To this end, Intel introduced the 36-bit physical addressing mode called PAE, starting with the Intel Pentium Pro processor.
This article describes some techniques that Microsoft Windows operating systems and several UNIX operating systems use to provide support to applications using PAE mode addressing. Because processes running in these environments have 32-bit pointers, the operating system must manage and present PAE's 36 bits of address in such a way that the applications can practically use it. The key question is: how does the operating system solve this problem? The performance, functionality, simplicity of programming, and reliability of how these issues are handled will determine the usefulness of the large memory support.
PAE: 32- vs. 64-Bit Systems
PAE is supported only on 32-bit versions of the Windows operating system; 64-bit versions of Windows do not support PAE. For information about device driver and system requirements for 64-bit versions of Windows, see 64-bit System Design. The Address Windowing Extensions (AWE) API is supported on 32-bit systems. It is also supported on x64 systems for both native and Wow64 applications.
Although support for PAE memory is typically associated with support for more than 4 GB of RAM, PAE can be enabled on Windows XP SP2, Windows Server 2003, and later 32-bit versions of Windows to support hardware-enforced Data Execution Prevention (DEP).
The information in this article applies to Windows 2000, Windows XP Professional, Windows Server 2003, and later versions of these operating systems, referred to as "Windows" in this article.
Technical Background
Address Translation in standard 32-bit mode
All IA-32 processors (Intel Pentium, Pentium Pro, Pentium II Xeon, and Pentium III Xeon) support 32 bits of physical address (4 GB), allowing applications to address 4 GB of virtual address when they are running. The system must translate the 32-bit virtual address that the applications and operating system use to the 32-bit physical address used by the hardware. (Pentium Pro was the first processor in the IA-32 family to support PAE, but chipset support is also required for 36-bit physical addresses, which was usually lacking.)
Windows uses two levels of mapping to do the translation, which is facilitated by a set of data structures called page directories and page tables that the memory manager creates and maintains.
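The two-level lookup can be sketched in a few lines. This is a minimal illustration, assuming the conventional IA-32 split of a 32-bit virtual address into a 10-bit page directory index, a 10-bit page table index, and a 12-bit byte offset for 4-KB pages. The function names and the nested-dictionary "page directory" are hypothetical stand-ins, not the memory manager's actual data structures.

```python
PAGE_SHIFT = 12   # 4-KB pages: the low 12 bits are the byte offset
INDEX_BITS = 10   # 1,024 entries per page directory and per page table


def split_virtual_address(va):
    """Split a 32-bit virtual address into (directory index, table index, offset)."""
    if not 0 <= va < (1 << 32):
        raise ValueError("expected a 32-bit virtual address")
    offset = va & ((1 << PAGE_SHIFT) - 1)
    table_index = (va >> PAGE_SHIFT) & ((1 << INDEX_BITS) - 1)
    directory_index = va >> (PAGE_SHIFT + INDEX_BITS)
    return directory_index, table_index, offset


def translate(va, page_directory):
    """Walk a toy two-level structure: page_directory[dir][table] is a 20-bit frame number."""
    directory_index, table_index, offset = split_virtual_address(va)
    frame = page_directory[directory_index][table_index]
    # A 20-bit frame number plus a 12-bit offset gives a 32-bit physical address.
    return (frame << PAGE_SHIFT) | offset
```

For example, `split_virtual_address(0x00403025)` selects directory entry 1, table entry 3, and byte offset 0x25 within the page.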
PSE Mode
IA-32 supports two methods to access memory above 4 GB (32 bits). PSE (Page Size Extension) was the first method, which shipped with the Pentium II. This method offers a compatibility advantage because it kept the PTE (page table entry) size of 4 bytes. However, the only practical implementation of this is through a driver. This approach suffers from significant performance limitations, due to a buffer copy operation necessary for reading and writing above 4 GB. PSE mode is used in the PSE 36 RAM disk usage model.
PSE uses a standard 1K-entry page directory and no page tables to extend the page size to 4 MB (eliminating one level of indirection for that mode). Each page directory entry (PDE) contains 14 bits of address which, combined with the 22-bit byte index, yield the 36 bits of extended physical address. Both 4-KB and 4-MB pages are simultaneously supported below 4 GB, with the 4-KB pages supported in the standard way.
Note that pages located above 4 GB must use PSE mode (with 4-MB page sizes).
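The arithmetic behind the 36-bit PSE address can be sketched as follows. This is only an illustration: it treats the 14 address bits of the PDE as a single contiguous frame number, which is a simplifying assumption (the actual bit layout inside the PDE is not modeled), and the function name is hypothetical.

```python
PSE_PAGE_SHIFT = 22   # 4-MB page: 22-bit byte index taken from the virtual address
PSE_FRAME_BITS = 14   # address bits supplied by the page directory entry


def pse36_physical_address(pde_frame, va):
    """Combine a 14-bit frame number with the 22-bit byte index of a virtual address."""
    if not 0 <= pde_frame < (1 << PSE_FRAME_BITS):
        raise ValueError("frame number must fit in 14 bits")
    byte_index = va & ((1 << PSE_PAGE_SHIFT) - 1)
    # 14 bits + 22 bits = 36 bits of physical address.
    return (pde_frame << PSE_PAGE_SHIFT) | byte_index
```

Frame number 0x400 is the first 4-MB page that starts at the 4-GB line, which is why any page at or above that point needs the extended frame bits.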
PAE Mode
PAE is the second method supported to access memory above 4 GB; this method has been widely implemented. PAE maps up to 64 GB of physical memory into a 32-bit (4 GB) virtual address space using either 4-KB or 2-MB pages. The page directories and the page tables are extended to 8-byte formats, allowing the base addresses of page tables and page frames to be extended to 24 bits (from 20 bits). This is where the extra four bits are introduced to complete the 36-bit physical address.
Windows supports PAE with 4-KB pages. PAE also provides a mode that uses 2-MB pages, and many of the UNIX operating systems rely on it. In 2-MB page mode, address translation is done without the use of page tables (the PDE supplies the page frame address directly).
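The effect of the 8-byte entries on the address split can be sketched as follows. Because each 4-KB table now holds 512 entries instead of 1,024, each index shrinks to 9 bits and a small top-level table indexed by the remaining 2 bits is added. That top level (the page directory pointer table) is not described in this article, so treat the 2/9/9/12 and 2/9/21 splits below as background assumptions. The function names are hypothetical.

```python
def split_pae_4k(va):
    """Split a 32-bit virtual address for PAE with 4-KB pages (2/9/9/12 bits)."""
    offset = va & 0xFFF
    table_index = (va >> 12) & 0x1FF
    directory_index = (va >> 21) & 0x1FF
    pointer_index = va >> 30
    return pointer_index, directory_index, table_index, offset


def split_pae_2m(va):
    """Split for PAE with 2-MB pages (2/9/21 bits); no page table level is used."""
    offset = va & 0x1FFFFF
    directory_index = (va >> 21) & 0x1FF
    pointer_index = va >> 30
    return pointer_index, directory_index, offset


def pae_physical_address(frame, offset):
    """Combine a 24-bit page frame base with a 12-bit offset into a 36-bit address."""
    if not 0 <= frame < (1 << 24):
        raise ValueError("frame number must fit in 24 bits")
    return (frame << 12) | (offset & 0xFFF)
```

The 24-bit frame number is where the four extra address bits appear: 24 bits of frame plus 12 bits of offset give the 36-bit physical address.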
Operating System Implementation and Application Support
The next issue is how the operating system can manage and present PAE's 36 bits of address in such a way that an application (with 32-bit pointers) can practically use the additional memory.
There are five application support models. The first two models (Server Consolidation and Large Cache) are completely handled within the operating system and require no changes to the application. The next two models (Application Windowing and Process Fork) require application changes to support API extensions for large memory. The last model (PSE 36 RAM Disk) requires no changes to the operating system because it is implemented in a driver, but it mandates application changes to support the driver.
1. Server Consolidation
A PAE-enabled operating system should be capable of utilizing all physical memory provided by the system to load multiple applications; for example, App#1, App#2, App #N, each consisting of 4 GB (maximum) of virtual address. In a non-PAE enabled system, the result can be a great deal of paging, since maximum physical memory in the system is limited to 4 GB.
With the additional physical memory supported under PAE mode, an operating system can keep more of these applications in memory without paging. This is valuable in supporting server consolidation configurations, where support of multiple applications in a single server is typically required. Note that no application changes are required to support this capability.
2. Large Cache
Using additional PAE-enabled memory for a data cache is also possible. If the operating system supports this feature, applications need not be recoded to take advantage of it. Windows Advanced Server and Datacenter Server support caching on a PAE platform and can utilize all of the available memory.
3. Application Windowing
A PAE-enabled operating system can introduce an API to allow a properly coded application access to physical memory anywhere in the system, even though it may be above 4 GB. Ideally, the API to allocate "high" physical memory and create or move the window should be quick and simple to code. This is highly advantageous for applications that require fast access to large amounts of data in memory.
Sharing high memory between processes can introduce quite a bit of complexity into the API and the implementation. Windows avoids this kind of sharing.
In addition, the support of paging makes the design and implementation of the operating system much more difficult and makes deterministic performance more difficult to achieve. Windows avoids paging of high memory as well.
4. Process Fork and Shared Memory
This application support model splits the current process into two or more nearly identical copies. A copy is made of the user and system stacks, the allocated data space, and the registers. The major difference is that one copy keeps the process ID (PID) of the parent, while the other receives a new PID. The fork call returns a value that distinguishes the two: zero in the copy that is the child, and the PID of the child in the copy that is the parent.
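The return-value convention described above can be seen in a minimal sketch. It assumes a POSIX system where `os.fork` is available; the helper name and the exit code 7 are arbitrary choices for illustration only.

```python
import os


def fork_and_report():
    """Fork once; in the parent, return (child PID, child exit code)."""
    pid = os.fork()
    if pid == 0:
        # Child copy: fork returned zero. Exit immediately with a
        # recognizable code, without running the parent's cleanup handlers.
        os._exit(7)
    # Parent copy: fork returned the PID of the new child.
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    return pid, exit_code
```

The parent sees a positive PID, and the exit code it collects confirms that the child took the branch where the return value was zero.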
5. PSE36 RAM Disk
Through use of a kernel device driver, much like a RAM disk, it is possible to utilize memory above 4 GB with no change whatsoever to the operating system. Compatibility between the base operating system (running in 32-bit mode) and the driver (running in PAE mode) is maintained since the page tables are kept at 4 bytes wide. The trade-offs for this very low development impact are several:
Performance degrades due to all I/O being forced to perform double buffering.
Application development impact is not appreciably less than that required for current APIs.
It cannot be used as a "consolidation server" because all applications share the same 4 GB physical memory space.
Design Implementation
To be successful, operating system implementations of large memory support must directly address the issues described in the following sections. The design choices made in handling these issues directly affect the simplicity, reliability, and performance of the operating system.
Technical Issues with Large Memory Support in IA32
Memory Sharing and Inter-Process Communications
In all cases where memory remap is being used for allocating memory to processes, which is common to many PAE variants, memory sharing is problematic. The physical memory being remapped is "outside" the process virtual address space. Thus, the physical memory is less connected to the process in the sense of sharing the process's internal access and security controls, as well as those provided by the operating system.
To apply access and security controls, it is necessary to greatly increase the bookkeeping required of the operating system memory manager as well as the API set the application developer must use. This negatively impacts the high performance possible using very fast remap operations. It is also important to remember that IPC/memory sharing may still take place between two processes' virtual address spaces in any case, regardless of the physically mapped memory each may be using.
TLB Shoot-down
Translation Look-aside Buffers (TLBs) are processor registers, or a cache, that provide a direct logical-to-physical mapping of page table entries. Once the TLBs are loaded, the processor has to read the page directories very infrequently (on TLB misses) unless a task switch occurs.
During a remap operation, it is necessary to ensure that all processors have valid logical-to-physical mapping on chip. Therefore, remap operations require a TLB shoot-down, because the logical-to-physical association is invalidated by the remap (where "logical" = the application/process view of memory).
There is a performance impact while the processor (or processors) reload the TLB. All operating systems have this issue, and in the case of PAE memory support, they ameliorate the issue in different ways:
Windows provides the ability for a single application to "batch" the remap operations required so that all happen simultaneously and only cause one TLB shoot-down and one performance dip instead of random remaps, each of which would impact performance. This is quite adequate for large applications, which are typically running on single-purpose systems.
Other operating systems provide "victim" buffers or allow one process to share another process's mappings, but at a cost of more synchronization and API complexity.
Windows XP also provides this "batch" or Scatter/Gather functionality. Additionally, performance of these operations has been improved for Windows Server 2003, Enterprise Edition and Datacenter Edition.
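The benefit of batching can be illustrated with a toy model that does nothing except count shoot-downs. The class and method names are hypothetical and do not correspond to any Windows API; the only assumption carried over from the text is that each individual remap costs one shoot-down, while a batch of remaps costs one in total.

```python
class ToyWindow:
    """Toy model of a remappable window that only counts TLB shoot-downs."""

    def __init__(self, slots):
        self.mapping = [None] * slots   # slot -> physical frame number
        self.shootdowns = 0

    def remap_one(self, slot, frame):
        """Remap a single slot; every call invalidates the TLBs once."""
        self.mapping[slot] = frame
        self.shootdowns += 1

    def remap_batch(self, pairs):
        """Remap many slots, then invalidate the TLBs once for the whole batch."""
        pairs = list(pairs)
        for slot, frame in pairs:
            self.mapping[slot] = frame
        if pairs:
            self.shootdowns += 1
```

Remapping four slots one at a time costs four shoot-downs in this model, while the same four remaps submitted as one batch cost a single shoot-down and leave the window in the same state.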
I/O
At one level or another, all the PAE variants support both 32-bit and 64-bit DMA I/O devices with the attendant drivers. However, there are a number of provisos and conditions.
Kernel and memory organization
Typically, kernel memory space organization is unchanged from the standard kernel for the operating system. In many cases, items such as the memory pool size remain the same. For backward compatibility, PCI base address registers (BARs) remain the same. Larger memory sizes cause some shifting of kernel address space, usually when between 16 GB and 32 GB of memory is physically present in the system.
One difference between operating systems is whether memory allocations are dynamic:
Some operating systems require the administrator to configure the amount of memory used for various purposes (caching, mapping, consolidation, and so on).
Windows does not require the administrator to configure memory allocations, because the usage is dynamic, within the constraints of the APIs used.
Hardware Support
The PCI standard provides a method whereby adapters may physically address more than 4 GB of memory by sending the high 32 bits of address and the low 32 bits of address in two separate sends. This is called Dual Address Cycle (DAC) and is used both for 32-bit adapters that understand 64-bit addresses but have only 32 address lines and for adapters that do have 64 address lines. This is a backward compatibility feature.
Given the method with which PCI addresses memory beyond 32 bits, there is a failure mode that is subtle. Any I/O range that "spans" across two 4-GB regions must be treated specially. If not, the address range will be correctly decoded for only one part of the transfer and the remaining part will be transposed to an incorrect memory location. This will corrupt memory and will crash the system, crash the application, or silently corrupt data at that location. Applications cannot prevent this because they are only presented virtual addresses and have no visibility to the physical level. All operating systems that use PAE face this problem, but some do not explicitly prevent this from occurring and instead depend on the device driver to take the correct actions.
Windows, however, explicitly prevents this problem. When an I/O range spans in this fashion, Windows returns two separate addresses and ranges to the device and driver. The final special case is the first transition from 4 GB to beyond. No DAC is required for the region below 4 GB, but DAC is required for the rest of the transfer. Again, Windows returns two separate addresses and ranges in this case to prevent memory corruption.
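The splitting rule can be sketched as follows. This illustrates the idea only and is not how Windows implements it: the function names are hypothetical, and the sketch simply cuts a physical range wherever it crosses a multiple of 4 GB, then reports which pieces would need DAC.

```python
FOUR_GB = 1 << 32


def split_at_4gb(start, length):
    """Return [(address, length), ...] so that no piece crosses a 4-GB boundary."""
    if start < 0 or length <= 0:
        raise ValueError("start must be non-negative and length positive")
    pieces = []
    end = start + length
    while start < end:
        # Next multiple of 4 GB strictly above the current start address.
        boundary = (start // FOUR_GB + 1) * FOUR_GB
        piece_end = min(end, boundary)
        pieces.append((start, piece_end - start))
        start = piece_end
    return pieces


def needs_dac(address):
    """A piece that starts at or above 4 GB needs more than 32 address bits."""
    return address >= FOUR_GB
```

A two-page transfer that starts one page below the 4-GB line is returned as two pieces: the first needs no DAC, and the second does.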
Obviously, DAC or 64-bit adapters and drivers provide the best performance as no buffering of I/O occurs. This buffering is required, however, whenever the adapter and driver cannot utilize more than 32 bits of address information. All operating systems that utilize PAE mode addressing support this "double buffering" in some fashion, as a backward compatibility feature. This buffering does have a performance penalty that is dependent on several factors:
Adapter hardware performance
Driver performance
Operating system support provided for double buffering
Amount of physical memory installed in the system
As the amount of physical memory increases, the proportion of I/O addresses beyond 32 bits also increases relative to those below 32 bits. In most cases, the operating system transparently provides double buffering, although some UNIX variants do not provide any assistance with this function and require 32-bit devices and drivers to manage their own double buffering routines and allocations.
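The condition that triggers double buffering can be expressed as a simple check. This is a sketch under the assumption that a device can reach exactly the physical addresses that fit in its address width; the function name and parameters are hypothetical.

```python
def needs_double_buffering(phys_addr, length, device_address_bits):
    """True if any byte of the buffer lies beyond what the device can address."""
    if phys_addr < 0 or length <= 0:
        raise ValueError("phys_addr must be non-negative and length positive")
    limit = 1 << device_address_bits   # first address the device cannot reach
    return phys_addr + length > limit
```

A buffer located at 4 GB must be copied through a low-memory buffer for a 32-bit device, while a DAC-capable or 64-bit device can transfer it directly.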
Driver Issues
Typically, device drivers must be modified in a number of small ways. Although the actual code changes may be small, they can be difficult. This is because when not using PAE memory addressing, it is possible for a device driver to assume that physical addresses and 32-bit virtual address limits are identical. PAE memory makes this assumption untrue.
Several assumptions and shortcuts that could previously be used safely no longer apply. In general, these fall into three categories:
Buffer alignment in code that allocates and aligns shared memory buffers must be modified so that it does not ignore the upper 32 bits of the physical address.
Truncation of address information must be avoided in the many locations where it might be kept.
It is necessary to strictly segregate virtual and physical address references so DMA operations do not transfer information to or from random memory locations.
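The first two categories can be illustrated with a small sketch. It models a driver that stores a physical address in a 32-bit field by masking off the upper bits; the helper names and the sample address are hypothetical.

```python
def truncate_to_32_bits(phys_addr):
    """Model the bug: storing a physical address in a 32-bit field."""
    return phys_addr & 0xFFFFFFFF


def align_up(phys_addr, alignment):
    """Round an address up to a power-of-two alignment, keeping all address bits."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return (phys_addr + alignment - 1) & ~(alignment - 1)


# A buffer located above 4 GB, as can happen once PAE is enabled.
buffer_address = 0x123456001
correct = align_up(buffer_address, 0x1000)
wrong = align_up(truncate_to_32_bits(buffer_address), 0x1000)
```

Here `correct` still points above 4 GB, while `wrong` has silently lost the upper bits and refers to an unrelated location below 4 GB, which is the kind of error that sends DMA transfers to the wrong memory.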
PAE mode can be enabled on Windows XP SP2, Windows Server 2003 SP1 and later versions of Windows to support hardware-enforced DEP. However, many device drivers designed for these systems may not have been tested on system configurations with PAE enabled. In order to limit the impact to device driver compatibility, changes to the hardware abstraction layer (HAL) were made to Windows XP SP2 and Windows Server 2003 SP1 Standard Edition to limit physical address space to 4 GB. Driver developers are encouraged to read about DEP.
Paging
Most operating systems supporting PAE support virtual memory paging of some nature for the physical memory beyond 4 GB. This usually occurs with some restrictions such as limiting the boot/system paging file to 4 GB or spreading the paging file (or files) across multiple operating system-organized volumes (not necessarily physical spindles).
Although this allows the obvious benefits of virtual memory, the downside is the performance impact on applications that have one or more of the following characteristics:
Use a large amount of physical memory for their data sets
Do a great deal of I/O
Have large executable working sets
Finally, paging support typically comes at the expense of increasing the API set and slowing development and version migration.
User APIs
All operating systems supporting PAE have APIs that allow processes to use physical memory beyond the virtual address range possible on IA-32 processors. These differ primarily in how much support they provide for the items described earlier: memory sharing, inter-process communications, paging, and so on. Windows provides a simple and straightforward API set, the Address Windowing Extensions (AWE) API set, which consists of only five API calls. By comparison, the most complex API set is four times larger and involves both kernel and user-level calls.
The proliferation of proprietary APIs, some of which are tied directly to the processor architecture (kernel level), makes porting applications from one UNIX variant to another expensive and time-consuming, and a constant struggle to balance costs against performance optimization. Windows provides an API set that is simple, fast, and completely portable between 32-bit and 64-bit hardware platforms, requiring only a recompile in order to function.
Page Size
Almost all operating systems supporting PAE use differing page sizes when providing physical memory beyond 4 GB to an application. The primary exception is Windows, which presents only 4-KB pages to applications on IA-32 platforms (this is different on Itanium-based platforms).
The issue with using varying page sizes for applications is the additional application complexity required to function correctly with differing memory allocation sizes, as well as subtle effects related to the underlying assumptions that almost all applications make about page size. Although research shows that a small class of applications can benefit from larger page sizes (2 MB or 4 MB), because each TLB entry spans a greater address range, the general rule is that applications do not benefit from larger page sizes.
Windows and PAE
Windows Version | Support |
Windows 2000 Professional, Windows XP | AWE API and 4 GB of physical RAM |
Windows XP SP2 and later | AWE API and 4 GB of physical address space |
Windows 2000 Server, Windows Server 2003, Standard Edition | AWE API and 4 GB of RAM |
Windows Server 2003 SP1, Standard Edition | AWE API and 4 GB of physical address space |
Windows Server 2003, Enterprise Edition | 8 processors and 32 GB RAM |
Windows Server 2003 SP1, Enterprise Edition | 8 processors and 64 GB RAM |
Windows 2000 Advanced Server | 8 processors and 8 GB RAM |
Windows 2000 Datacenter Server | 32 processors and 32 GB RAM (support for 64 GB was not offered because of a lack of systems for testing) |
Windows Server 2003, Datacenter Edition | 32 processors and 64 GB RAM |
Windows Server 2003 SP1, Datacenter Edition | 32 processors and 128 GB RAM |
For more information about PAE and Windows, including guidelines for developers, see PAE Memory and Windows.