Understanding ARM Assembly Part 3

My name is Marion Cole, and I am a Sr. Escalation Engineer in the Microsoft Platforms Serviceability group.  This is Part 3 of my series of articles about ARM assembly.  In Part 1 we talked about the processor that is supported.  In Part 2 we talked about how Windows utilizes that ARM processor.  In this part we will cover Calling Conventions, Prolog/Epilog, and Rebuilding the Stack.


Calling Conventions

On ARM there is only one calling convention, and it is simple.  The first four arguments that are 32 bits or smaller are passed in R0-R3.  The remaining arguments go onto the stack.  If any of the first four arguments is 8 or 16 bits in size, it is zero- or sign-extended to fill the 32-bit register.  If any of the first four arguments is 64 bits in size, it has to be 64-bit aligned.  That means it occupies an even/odd register pair, for example R0/R1 or R2/R3.  Here are three examples:

Example 1: Foo (int I0, int I1, int I2, int I3)

Example 2: Foo (int I0, double D, int I1)

Example 3: Foo (int I0, int I1, double D)

[Diagrams: contents of registers R0-R3 and of the stack for each call]

In the first example the function Foo takes four integer values.  All of these are passed in the registers R0 - R3.  This one is pretty simple.


In the second example the function Foo takes an integer, a double, and another integer.  The first integer is put into R0.  The double has to go into an even/odd register pair, so R1 is skipped and the double is put into R2/R3.  The last integer is pushed onto the stack.  Because this ordering wastes R1, programmers are advised to avoid it and instead order their parameters so that they pack into the registers, as in the third example.  Also in this example the stack has to stay 8-byte aligned, so an additional unused word is pushed and popped in order to keep the alignment.  Note that on ARM a Byte is 8 bits, a Halfword is 16 bits, and a Word is 32 bits.


In the third example the function Foo takes two integers and a double.  The first two arguments are integers, and they go into R0 and R1 respectively.  The last argument, the double, is then already aligned and goes into R2/R3, so no register is wasted.
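The three examples can be reproduced with a small model of the assignment rules. This is only a sketch under simplifying assumptions: it handles 32-bit and 64-bit scalar arguments only, it follows the article in passing the double in core registers, and the helper name `assign_int_args` is made up for illustration.

```python
def assign_int_args(sizes):
    """Map argument sizes in bytes (4 or 8) to r0-r3 or the stack.

    Simplified model: scalar arguments only, no structs.
    """
    placements = []
    next_reg = 0          # next free core register number (r0..r3)
    stack_offset = 0      # next free byte offset in the outgoing stack area
    for size in sizes:
        if size == 8:
            # A 64-bit value must start on an even register number.
            if next_reg < 4 and next_reg % 2 == 1:
                next_reg += 1
            if next_reg + 2 <= 4:
                placements.append(f"r{next_reg}/r{next_reg + 1}")
                next_reg += 2
                continue
            next_reg = 4  # once an argument spills, registers are not reused
            stack_offset = (stack_offset + 7) & ~7  # 8-byte align the slot
        elif next_reg < 4:
            placements.append(f"r{next_reg}")
            next_reg += 1
            continue
        placements.append(f"stack+{stack_offset}")
        stack_offset += size
    return placements


example_1 = assign_int_args([4, 4, 4, 4])   # Foo(int, int, int, int)
example_2 = assign_int_args([4, 8, 4])      # Foo(int, double, int)
example_3 = assign_int_args([4, 4, 8])      # Foo(int, int, double)
```

In this model the second signature skips R1 and spills its last argument to the stack, while the third signature fits entirely in R0-R3.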


The registers R4-R11 are used to hold the values of the local variables of a subroutine.  A subroutine is required to preserve on the stack the contents of the registers R4-R8, R10, R11, and SP.


Return values are always in R0, unless they are 64 bits in size, in which case the R0/R1 pair is used.


The calling convention for floating point arguments is pretty much the same.  A function can take up to 16 single-precision values in S0-S15, or 8 double-precision values in D0-D7, or 4 SIMD vectors in Q0-Q3.  For example, suppose you have a function that takes the following combination:

Float, double, double, float


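They will go into S0, D1, D2, and S1 respectively.  The registers are aggressively back-filled: the first double has to skip S1 to start at D1 (which overlays S2/S3), and the final float is then placed in the S1 slot that was left unused.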
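Back-filling is easier to see when the single-precision slots are tracked explicitly. The following is a rough sketch, not the compiler's actual algorithm: it only handles `float` and `double`, it ignores vectors and stack spills, and the function name `assign_vfp_args` is hypothetical.

```python
def assign_vfp_args(kinds):
    """Assign 'float' and 'double' arguments to S/D registers with back-filling.

    Simplified model: s0-s15 overlay d0-d7 (dN covers s(2N) and s(2N+1)).
    Raises StopIteration if the registers run out (no stack spill modeled).
    """
    free = [True] * 16            # one flag per single-precision slot s0..s15
    placements = []
    for kind in kinds:
        if kind == "float":
            # First free single slot, which may be a gap left by alignment.
            slot = next(i for i in range(16) if free[i])
            free[slot] = False
            placements.append(f"s{slot}")
        elif kind == "double":
            # First even-aligned pair of free single slots.
            slot = next(i for i in range(0, 16, 2) if free[i] and free[i + 1])
            free[slot] = free[slot + 1] = False
            placements.append(f"d{slot // 2}")
        else:
            raise ValueError(f"unsupported kind: {kind}")
    return placements


placements = assign_vfp_args(["float", "double", "double", "float"])
```

For the float, double, double, float combination this model yields S0, D1, D2, S1, matching the assignment described above.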


Floating point return values are in S0/D0/Q0 as appropriate by size.


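The remaining floating point registers are split: S16-S31 (D8-D15, Q4-Q7) are non-volatile and must be preserved by the called function, while D16-D31 (Q8-Q15) are volatile.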


Prolog and Epilog

The Prolog on an ARM processor does the same thing as on the x86 processor: it stores registers on the stack and adjusts the frame pointer.  Let's look at a simple example from hal!KfLowerIrql.



push        {r3,r4,r11,lr}  ; save non-volatiles regs used, r11, lr
addw        r11,sp,#8       ; new frame pointer value in r11...

...                         ; stack used in prolog is multiple of 8


As you can see, the push instruction differs from the one on x86.  On x86 we would need four push instructions to do what ARM does here in one instruction.  The ARM push stores the registers in consecutive memory locations ending just below the address in SP, and updates SP to point to the start of the stored block.  The lowest numbered register is stored at the lowest memory address, through to the highest numbered register at the highest memory address.  We can see that here:


1: kd> r

r0=0000000f  r1=e1070180  r2=00000000  r3=e0eb3675  r4=e1048cc8  r5=e10651fc

r6=00001000  r7=0000006a  r8=c5561d10  r9=0000000f r10=e10acc80 r11=c5561d08

r12=ef890f1c  sp=c5561cc8  lr=e1298a0f  pc=e0eb3678 psr=400001b3 -Z--- Thumb


1: kd> dds c5561cc8 c5561d08

c5561cc8  e0eb3675   <-- r3

c5561ccc  e1048cc8   <-- r4

c5561cd0  c5561d08   <-- r11

c5561cd4  e1298a0f   <-- lr
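The store order can be checked with a small model of the push. This is a sketch only: memory is a plain dictionary, the helper name `push` is made up, and the value of sp before the push is an assumption derived from the dump (the sp shown plus the 16 bytes pushed).

```python
def push(sp, memory, regs, reg_list):
    """Model PUSH {reg_list}: store the registers below sp and return the new sp.

    reg_list must be in ascending register order, as in the disassembly.
    """
    new_sp = sp - 4 * len(reg_list)
    for index, name in enumerate(reg_list):
        # Lowest numbered register lands at the lowest address.
        memory[new_sp + 4 * index] = regs[name]
    return new_sp


# Register values taken from the debugger dump above.
regs = {"r3": 0xE0EB3675, "r4": 0xE1048CC8, "r11": 0xC5561D08, "lr": 0xE1298A0F}
memory = {}
sp_before = 0xC5561CD8            # assumption: c5561cc8 + 16 bytes
sp_after = push(sp_before, memory, regs, ["r3", "r4", "r11", "lr"])
frame_pointer = sp_after + 8      # the value addw r11,sp,#8 would compute
```

With these inputs the model places r3 at c5561cc8 and lr at c5561cd4, as in the dds output. It also shows that sp+8 is the slot holding the saved r11, with the saved lr in the word right above it.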


The addw instruction sets up the new frame pointer.  It adds 8 to the value in sp and stores the result in r11, which is the frame pointer.  Because r3 and r4 were pushed below r11, sp+8 is the address where the old r11 was just saved.  Here is what that looks like in the debugger:


kd> r

r0=0000000f  r1=00000002  r2=00000002  r3=e133b675  r4=77e31f15  r5=02cc9ad5

r6=00000000  r7=e1035580  r8=0000000f  r9=00000000 r10=e22cb710 r11=e22cb5b8

r12=26ebcf96  sp=e22cb5b0  lr=e0f2560b  pc=e133b67c psr=400000b3 -Z--- Thumb



As you can see r11 is now 8 higher than sp.


Now let's look at the Epilog for hal!KfLowerIrql.  It is pretty simple, as it is a single instruction.



pop         {r3,r4,r11,pc}  ; restore non-volatile regs, r11, return


This pops the first three values from the stack back into their original registers.  The last value, however, is what was the link register (lr), and it is popped into the program counter (pc).  This acts as a return, performing a function similar to the RET instruction on x86 but without using a dedicated instruction.  Program flow is controlled by manipulating the pc register.  Here is what this looks like in the debugger.


The registers before the pop instruction runs:

kd> r

r0=0000000f  r1=00000006  r2=00000000  r3=e1035000  r4=0000000f  r5=306f0a07

r6=00000000  r7=e1035580  r8=0000000f  r9=00000000 r10=e22c9260 r11=e22c9108

r12=26ebaae6  sp=e22c9100  lr=e0f2560b  pc=e133b6b4 psr=200000b3 --C-- Thumb


e133b6b4 e8bd8818 pop         {r3,r4,r11,pc}


The registers after the pop instruction runs:

kd> r

r0=0000000f  r1=00000006  r2=00000000  r3=e133b675  r4=51cae4a2  r5=2aede545

r6=00000000  r7=e1035580  r8=0000000f  r9=00000000 r10=e22c8d20 r11=e22c8c10

r12=26eba5a6  sp=e22c8bd0  lr=e0f2560b  pc=e0f2560a psr=200000b3 --C-- Thumb
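Note that lr holds e0f2560b while pc ends up as e0f2560a. When a value is loaded into pc this way, bit 0 selects the Thumb instruction set and is not part of the instruction address. A minimal sketch of that split, with a made-up helper name:

```python
def branch_target(loaded_value):
    """Split a value loaded into pc into (instruction address, thumb_state)."""
    address = loaded_value & ~1          # clear bit 0 to get the address
    thumb_state = bool(loaded_value & 1) # bit 0 set means Thumb code
    return address, thumb_state


address, thumb = branch_target(0xE0F2560B)   # lr value from the dump above
```

Applied to the lr value in the dump, this gives the pc value shown after the pop, with the processor staying in Thumb state.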


Now we are going to complicate this a bit by showing a function that has local variables, NtCreateFile.



push        {r4,r5,r11,lr}  ; save non-volatiles regs used, r11, lr    

addw        r11,sp,#8       ; new frame pointer value in r11
sub         sp,sp,#0x30     ; local variables

...                         ; stack used in prolog is multiple of 8


Notice that this looks the same as the previous prolog, but one line is added.  The sub sp,sp,#0x30 is used to make stack space available for local variables.  This adds one instruction to the Epilog as well.


Epilog :

add         sp,sp,#0x30     ; cleanup local variables
pop         {r4,r5,r11,pc}  ; restore non-volatile regs, r11, return


The add sp,sp,#0x30 is used to clean up the stack of the local variables.


One more prolog/epilog example.  This one is of IopCreateFile.  It saves the arguments that come in to the stack first.


Prolog :

push        {r0-r3}         ; save r0-r3
push        {r4-r11,lr}     ; save non-volatiles r4-r10, r11, lr
addw        r11,sp,#0x1c    ; new frame pointer value in r11
sub         sp,sp,#0x3c     ; local variables

...                           ; stack used in prolog is multiple of 8


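As you can see, this prolog is mostly the same.  There is just one additional instruction that pushes the r0-r3 argument registers to the stack, and the frame pointer offset grows to 0x1c because seven registers (r4-r10) now sit below the saved r11.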
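The "multiple of 8" comments and the frame pointer offsets in the three prologs can be checked with some simple arithmetic. The helper names below are hypothetical, and the register lists are copied from the listings above.

```python
def prolog_stack_bytes(push_lists, locals_bytes=0):
    """Bytes the prolog takes from the stack: 4 per pushed register plus locals."""
    return 4 * sum(len(regs) for regs in push_lists) + locals_bytes


def frame_pointer_offset(last_push):
    """Offset from sp to the saved r11 slot right after the last push."""
    return 4 * last_push.index("r11")


R4_TO_R11_LR = [f"r{n}" for n in range(4, 12)] + ["lr"]

kf_lower_irql = prolog_stack_bytes([["r3", "r4", "r11", "lr"]])
nt_create_file = prolog_stack_bytes([["r4", "r5", "r11", "lr"]], 0x30)
iop_create_file = prolog_stack_bytes([["r0", "r1", "r2", "r3"], R4_TO_R11_LR], 0x3C)
```

The totals come to 0x10, 0x40, and 0x70 bytes, all multiples of 8, and the computed offsets are 8 and 0x1c, the immediates used by the addw instructions.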


The epilog for this one is a little different.



add         sp,sp,#0x4c     ; cleanup local variables from stack
pop         {r4-r11}        ; restore non-volatiles, frame pointer r11
ldr         pc,[sp],#0x14   ; return and cleanup 0x14 bytes (lr,r0-r3)


Notice that the pop is not putting lr into pc for a return.  Instead the last instruction takes care of the pc register.  The ldr uses post-indexed addressing: it first loads pc from the address in sp, which is where lr was saved, and then adds 0x14 to sp.  That one instruction returns and removes lr and the saved r0-r3 (five words, 0x14 bytes) from the stack at the same time.  This ldr instruction is similar to the ret instruction on x86.
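The order of operations in a post-indexed load (read from the address in sp first, adjust sp afterwards) can be sketched as follows. The stack address and stored values are invented for illustration, and `ldr_pc_post_indexed` is a made-up helper name.

```python
def ldr_pc_post_indexed(sp, memory, imm):
    """Model ldr pc,[sp],#imm: load pc from [sp], then add imm to sp."""
    pc = memory[sp]        # the load uses the unmodified sp
    return pc, sp + imm    # the write-back happens after the load


# Hypothetical stack after "pop {r4-r11}": saved lr, then the saved r0-r3.
stack_top = 0x1000
memory = {
    0x1000: 0xE0F2560B,  # saved lr (return address)
    0x1004: 0x0000000A,  # saved r0
    0x1008: 0x0000000B,  # saved r1
    0x100C: 0x0000000C,  # saved r2
    0x1010: 0x0000000D,  # saved r3
}
new_pc, new_sp = ldr_pc_post_indexed(stack_top, memory, 0x14)
```

After the instruction, pc holds the saved return address and sp has moved past all five words.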


The last thing we are going to cover is called a "Leaf function".  A Leaf function executes in the context of the caller.  It does not have a prolog and does not use the stack.  It only uses the volatile registers r0-r3 and r12.  It returns via the "bx lr" instruction.  An example of this is KeGetCurrentIrql.  Here is what it looks like in the debugger.


kd> uf hal!KeGetCurrentIrql

hal!KeGetCurrentIrql:

  211 e132b650 f3ef8300 mrs         r3,cpsr

  216 e132b654 f0130f80 tst         r3,#0x80

  216 e132b658 d103     bne         hal!KeGetCurrentIrql+0x12 (e132b662)


  216 e132b65a b672     cpsid       i

  216 e132b65c 0000     movs        r0,r0

  216 e132b65e 2201     movs        r2,#1

  216 e132b660 e000     b           hal!KeGetCurrentIrql+0x14 (e132b664)


  216 e132b662 2200     movs        r2,#0


  217 e132b664 ee1d3f90 mrc         p15,#0,r3,c13,c0,#4

  217 e132b668 7f18     ldrb        r0,[r3,#0x1C]

  218 e132b66a b10a     cbz         r2,hal!KeGetCurrentIrql+0x20 (e132b670)


  218 e132b66c b662     cpsie       i

  218 e132b66e 0000     movs        r0,r0


  220 e132b670 4770     bx          lr


The stack must remain 4-byte aligned at all times, and must be 8-byte aligned at any function boundary.  This is due to the frequent use of interlocked operations on 64-bit stack variables.


Functions which need to use a frame pointer (for example, if alloca is used), or which dynamically change the stack pointer within their body, must set up the frame pointer in the prolog and leave it unchanged until the epilog.  Functions which do not need a frame pointer must perform all stack updating in the prolog and leave the SP unchanged until the epilog.


Rebuilding the Stack

Here we are going to discuss how to rebuild the stack from the frame pointer.


The frame pointer points to the top of the stack area for the current function, or it is zero if it is not being used.  Because every function stores the caller's frame pointer at the same offset from its own frame pointer, the saved values form a singly linked list of activation records.


The frame pointer register points to the stack backtrace structure for the currently executing function. 


The saved frame pointer value is (zero or) a pointer to the stack backtrace structure created by the function which called the current function. 


The saved frame pointer in this structure is a pointer to the stack backtrace structure for the function that called the function that called the current function; and so on back until the first function. 
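Walking that linked list can be sketched in a few lines. The layout assumed here follows the prologs shown earlier, where r11 points at the saved r11 and the saved lr sits in the next word. Memory is modeled as a dictionary, and all addresses and return values are invented for a Main, Foo, Bar call chain.

```python
def walk_frames(fp, memory, max_frames=64):
    """Follow saved frame pointers from the current frame back to the first.

    Assumed layout: memory[fp] is the caller's frame pointer and
    memory[fp + 4] is the return address into the caller.
    """
    frames = []
    while fp != 0 and len(frames) < max_frames:  # zero ends the chain
        saved_fp = memory[fp]
        return_address = memory[fp + 4]
        frames.append((fp, return_address))
        fp = saved_fp
    return frames


# Hypothetical stack contents while Bar is executing.
memory = {
    0x2000: 0x2040, 0x2004: 0x1111,  # Bar's record: Foo's fp, return into Foo
    0x2040: 0x2080, 0x2044: 0x2221,  # Foo's record: Main's fp, return into Main
    0x2080: 0x0000, 0x2084: 0x3331,  # Main's record: end of chain
}
frames = walk_frames(0x2000, memory)
```

Each entry pairs a frame pointer with the return address stored above it, which is the information needed to rebuild a call stack by hand. The `max_frames` limit is a guard against a corrupted chain that loops.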



In the diagram below, Main calls Foo, which calls Bar.



For more information about ARM Debugging check out this article from T.Roy at Code Machine: