Overview of ARM32 ABI Conventions

2021-11-11

The application binary interface (ABI) for code compiled for Windows on ARM processors is based on the standard ARM EABI. This article highlights key differences between Windows on ARM and the standard. This document covers the ARM32 ABI. For information about the ARM64 ABI, see Overview of ARM64 ABI conventions. For more information about the standard ARM EABI, see Application Binary Interface (ABI) for the ARM Architecture (external link).

Base Requirements

Windows on ARM always presumes that it's running on an ARMv7 architecture. Floating-point support in the form of VFPv3-D32 or later must be present in hardware. The VFP must support both single-precision and double-precision floating-point in hardware. The Windows runtime doesn't support emulation of floating-point to enable running on non-VFP hardware.

Advanced SIMD Extensions (NEON) support, including both integer and floating-point operations, must also be present in hardware. No run-time support for emulation is provided.

Integer divide support (UDIV/SDIV) is recommended but not required. Platforms that lack integer divide support may incur a performance penalty because these operations have to be trapped and possibly patched.

Endianness

Windows on ARM executes in little-endian mode. Both the MSVC compiler and the Windows runtime always expect little-endian data. The SETEND instruction in the ARM instruction set architecture (ISA) allows even user-mode code to change the current endianness. However, doing so is discouraged because it's dangerous for an application. If an exception is generated in big-endian mode, the behavior is unpredictable. It may lead to an application fault in user mode, or a bugcheck in kernel mode.

Alignment

Although Windows enables the ARM hardware to handle misaligned integer accesses transparently, alignment faults still may be generated in some situations. Follow these rules for alignment:

You don't have to align half-word-sized (16-bit) and word-sized (32-bit) integer loads and stores. The hardware handles them efficiently and transparently.
Floating-point loads and stores should be aligned. The kernel handles unaligned loads and stores transparently, but with significant overhead.
Load or store double (LDRD/STRD) and multiple (LDM/STM) operations should be aligned. The kernel handles most of them transparently, but with significant overhead.
All uncached memory accesses must be aligned, even for integer accesses. Unaligned accesses cause an alignment fault.

Instruction Set

The instruction set for Windows on ARM is strictly limited to Thumb-2. All code executed on this platform is expected to start and always remain in Thumb mode. An attempt to switch into the legacy ARM instruction set may succeed. However, if it does, any exceptions or interrupts that occur may lead to an application fault in user mode, or a bugcheck in kernel mode.

A side-effect of this requirement is that all code pointers must have the low bit set. Then, when they're loaded and branched to via BLX or BX, the processor remains in Thumb mode. It doesn't try to execute the target code as 32-bit ARM instructions.

SDIV/UDIV instructions

The use of integer divide instructions SDIV and UDIV is fully supported, even on platforms without native hardware to handle them. The extra overhead per SDIV or UDIV divide on a Cortex-A9 processor is approximately 80 cycles. That's added to the overall divide time of 20-250 cycles, depending on the inputs.

Integer registers

The ARM processor supports 16 integer registers:

Register	Volatile?	Role
r0	Volatile	Parameter, result, scratch register 1
r1	Volatile	Parameter, result, scratch register 2
r2	Volatile	Parameter, scratch register 3
r3	Volatile	Parameter, scratch register 4
r4	Non-volatile
r5	Non-volatile
r6	Non-volatile
r7	Non-volatile
r8	Non-volatile
r9	Non-volatile
r10	Non-volatile
r11	Non-volatile	Frame pointer
r12	Volatile	Intra-procedure-call scratch register
r13 (SP)	Non-volatile	Stack pointer
r14 (LR)	Non-volatile	Link register
r15 (PC)	Non-volatile	Program counter

For details about how to use the parameter and return value registers, see the Parameter Passing section in this article.

Windows uses r11 for fast-walking of the stack frame. For more information, see the Stack Walking section. Because of this requirement, r11 must always point to the topmost link in the chain. Don't use r11 for general purposes, because your code won't generate correct stack walks during analysis.

VFP registers

Windows only supports ARM variants that have VFPv3-D32 coprocessor support. It means floating-point registers are always present and can be relied on for parameter passing. And, the full set of 32 registers is available for use. The VFP registers and their usage are summarized in this table:

Singles	Doubles	Quads	Volatile?	Role
s0-s3	d0-d1	q0	Volatile	Parameters, result, scratch register
s4-s7	d2-d3	q1	Volatile	Parameters, scratch register
s8-s11	d4-d5	q2	Volatile	Parameters, scratch register
s12-s15	d6-d7	q3	Volatile	Parameters, scratch register
s16-s19	d8-d9	q4	Non-volatile
s20-s23	d10-d11	q5	Non-volatile
s24-s27	d12-d13	q6	Non-volatile
s28-s31	d14-d15	q7	Non-volatile
	d16-d31	q8-q15	Volatile

The next table illustrates the floating-point status and control register (FPSCR) bitfields:

Bits	Meaning	Volatile?	Role
31-28	NZCV	Volatile	Status flags
27	QC	Volatile	Cumulative saturation
26	AHP	Non-volatile	Alternative half-precision control
25	DN	Non-volatile	Default NaN mode control
24	FZ	Non-volatile	Flush-to-zero mode control
23-22	RMode	Non-volatile	Rounding mode control
21-20	Stride	Non-volatile	Vector Stride, must always be 0
18-16	Len	Non-volatile	Vector Length, must always be 0
15, 12-8	IDE, IXE, and so on	Non-volatile	Exception trap enable bits, must always be 0
7, 4-0	IDC, IXC, and so on	Volatile	Cumulative exception flags

Floating-point exceptions

Most ARM hardware doesn't support IEEE floating-point exceptions. On processor variants that do have hardware floating-point exceptions, the Windows kernel silently catches the exceptions and implicitly disables them in the FPSCR register. This action ensures normalized behavior across processor variants. Otherwise, code developed on a platform that doesn't have exception support could receive unexpected exceptions when it's running on a platform that does have exception support.

Parameter passing

The Windows on ARM ABI follows the ARM rules for parameter passing for non-variadic functions. The ABI rules include the VFP and Advanced SIMD extensions. These rules follow the Procedure Call Standard for the ARM Architecture, combined with the VFP extensions. By default, the first four integer arguments and up to eight floating-point or vector arguments are passed in registers. Any further arguments are passed on the stack. Arguments are assigned to registers or the stack by using this procedure:

Stage A: Initialization

Initialization is performed exactly once, before argument processing begins:

The Next Core Register Number (NCRN) is set to r0.
The VFP registers are marked as unallocated.
The Next Stacked Argument Address (NSAA) is set to the current SP.
If a function that returns a result in memory is called, then the address for the result is placed in r0 and the NCRN is set to r1.

Stage B: Pre-padding and extension of arguments

For each argument in the list, the first matching rule from the following list is applied:

If the argument is a composite type whose size cannot be statically determined by both the caller and the callee, the argument is copied to memory and replaced by a pointer to the copy.
If the argument is a byte or 16-bit half-word, then it is zero-extended or sign-extended to a 32-bit full word and treated as a 4-byte argument.
If the argument is a composite type, its size is rounded up to the nearest multiple of 4.

Stage C: Assignment of arguments to registers and stack

For each argument in the list, the following rules are applied in turn until the argument has been allocated:

If the argument is a VFP type and there are enough consecutive unallocated VFP registers of the appropriate type, then the argument is allocated to the lowest-numbered sequence of such registers.
If the argument is a VFP type, all remaining unallocated registers are marked as unavailable. The NSAA is adjusted upwards until it is correctly aligned for the argument type and the argument is copied to the stack at the adjusted NSAA. The NSAA is then incremented by the size of the argument.
If the argument requires 8-byte alignment, the NCRN is rounded up to the next even register number.
If the size of the argument in 32-bit words is not more than r4 minus NCRN, the argument is copied into core registers, starting at the NCRN, with the least significant bits occupying the lower-numbered registers. The NCRN is incremented by the number of registers used.
If the NCRN is less than r4 and the NSAA is equal to the SP, the argument is split between core registers and the stack. The first part of the argument is copied into the core registers, starting at the NCRN, up to and including r3. The rest of the argument is copied onto the stack, starting at the NSAA. The NCRN is set to r4 and the NSAA is incremented by the size of the argument minus the amount passed in registers.
If the argument requires 8-byte alignment, the NSAA is rounded up to the next 8-byte aligned address.
The argument is copied into memory at the NSAA. The NSAA is incremented by the size of the argument.

The VFP registers aren't used for variadic functions, and Stage C rules 1 and 2 are ignored. It means that a variadic function can begin with an optional push {r0-r3} to prepend the register arguments to any additional arguments passed by the caller, and then access the entire argument list directly from the stack.

Integer type values are returned in r0, optionally extended to r1 for 64-bit return values. VFP/NEON floating-point or SIMD type values are returned in s0, d0, or q0, as appropriate.

Stack

The stack must always remain 4-byte aligned, and must be 8-byte aligned at any function boundary. It's required to support the frequent use of interlocked operations on 64-bit stack variables. The ARM EABI states that the stack is 8-byte aligned at any public interface. For consistency, the Windows on ARM ABI considers any function boundary to be a public interface.

Functions that have to use a frame pointer—for example, functions that call alloca or that change the stack pointer dynamically—must set up the frame pointer in r11 in the function prologue and leave it unchanged until the epilogue. Functions that don't require a frame pointer must perform all stack updates in the prologue and leave the stack pointer unchanged until the epilogue.

Functions that allocate 4 KB or more on the stack must ensure that each page prior to the final page is touched in order. This order ensures that no code can "leap over" the guard pages that Windows uses to expand the stack. Typically, the expansion is done by the __chkstk helper, which is passed the total stack allocation in bytes divided by 4 in r4, and which returns the final stack allocation amount in bytes back in r4.

Red zone

The 8-byte area immediately below the current stack pointer is reserved for analysis and dynamic patching. It permits carefully generated code to be inserted, which stores 2 registers at [sp, #-8] and temporarily uses them for arbitrary purposes. The Windows kernel guarantees that those 8 bytes won't be overwritten if an exception or interrupt occurs in both user mode and kernel mode.

Kernel stack

The default kernel-mode stack in Windows is three pages (12 KB). Be careful not to create functions that have large stack buffers in kernel mode. An interrupt could come in with very little stack headroom and cause a stack panic bugcheck.

C/C++ specifics

Enumerations are 32-bit integer types unless at least one value in the enumeration requires 64-bit double-word storage. In that case, the enumeration is promoted to a 64-bit integer type.

wchar_t is defined to be equivalent to unsigned short, to preserve compatibility with other platforms.

Stack walking

Windows code is compiled with frame pointers enabled (/Oy (Frame-Pointer Omission)) to enable fast stack walking. Generally, the r11 register points to the next link in the chain, which is an {r11, lr} pair that specifies the pointer to the previous frame on the stack and the return address. We recommend that your code also enable frame pointers for improved profiling and tracing.

Exception unwinding

Stack unwinding during exception handling is enabled by the use of unwind codes. The unwind codes are a sequence of bytes stored in the .xdata section of the executable image. They describe the operation of the function prologue and epilogue code in an abstract manner, so that the effects of a function's prologue can be undone in preparation for unwinding to the caller's stack frame.

The ARM EABI specifies an exception unwinding model that uses unwind codes. However, this specification isn't sufficient for unwinding in Windows, which must handle cases where the processor is in the middle of the prologue or epilogue of a function. For more information about Windows on ARM exception data and unwinding, see ARM Exception Handling.

We recommend that dynamically generated code be described by using dynamic function tables specified in calls to RtlAddFunctionTable and associated functions, so that the generated code can participate in exception handling.

Cycle counter

ARM processors running Windows are required to support a cycle counter, but using the counter directly may cause problems. To avoid these issues, Windows on ARM uses an undefined opcode to request a normalized 64-bit cycle-counter value. From C or C++, use the __rdpmccntr64 intrinsic to emit the appropriate opcode; from assembly, use the __rdpmccntr64 instruction. Reading the cycle counter takes approximately 60 cycles on a Cortex-A9.

The counter is a true cycle counter, not a clock; therefore, the counting frequency varies with the processor frequency. If you want to measure elapsed clock time, use QueryPerformanceCounter.