From the July 2001 issue of MSDN Magazine.

IA-64 Registers, Part 2
Matt Pietrek
This month I'll complete the overview of IA-64 (Itanium) parameter passing, stack frames, and return values that I began last month. In the previous column, I reviewed stack frames and how parameters are passed on the x86 CPU. I then examined the essential IA-64 registers that you'll need to know.
      If you're not familiar with the basic set of IA-64 registers (including the dynamic nature of the general-purpose registers), I suggest you read my previous column (see Under the Hood). To quickly review, the IA-64 defines 128 general-purpose integer registers, each 64 bits wide, and named r0 through r127. The last 96 of the registers are dynamic, meaning that the physical CPU register assigned to a dynamic register name changes when you create a new procedure frame. In any given frame r32 is where the action begins. There are also 128 floating-point registers (named f0-f127), and eight branch registers (b0-b7).

Procedure Frames on the IA-64

      When working with the 96 dynamic general-purpose registers, the IA-64 uses a special register, the Current Frame Marker (CFM), to separate the 96 dynamic registers into several logical regions. These regions are the input, local, and output register regions, as shown in Figure 1. To be completely accurate, the processor considers the input and local regions to be the same thing. However, for the purpose of understanding IA-64 frames, it helps to think of the input and local regions as distinct, so I'll do that in the following discussion. Also, if you're brave enough to write IA-64 assembly code, you'll find that the assembler lets you specify both an input and local size.

Figure 1 Register Regions

      As you'll see in more detail later, a function's parameters reside in the input registers. The local region is where items like local variables and compiler temporary variables reside. The output region registers are where parameter values go in preparation for calling another function.
      The compiler decides how many input, local, and output registers are needed for a particular function. In the function's prolog, the compiler emits an alloc instruction that specifies how big the input, local, and output regions are for that function. For the output region size, the compiler uses the maximum number of parameters passed to any of the subfunctions called by the current function. The actual maximum output region will be eight registers or fewer, and you'll see why soon.
      Consider a function that takes four parameters, needs five local variables, and calls a subfunction that takes six parameters. The procedure frame would probably look like this: r32-r35 are the input registers, r36-r40 are the local variable registers, and r41-r46 are the output registers.
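      To make this concrete, the prolog for such a function might contain an alloc instruction along the following lines. This is only a sketch: the destination register that receives the saved previous-frame state (r36 here, explained shortly) and the exact region sizes are my assumptions, and a real compiler would typically reserve additional local registers for housekeeping values.

  alloc   r36 = ar.pfs, 0x4, 0x5, 0x6, 0x0   // 4 input, 5 local, 6 output registers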
      The magic of IA-64 procedure frames occurs when changing from one frame to another frame. The IA-64 br.call instruction is the equivalent of the x86 CALL instruction. After executing the br.call, the 96 dynamic registers are renamed, and the CFM register is updated. The net effect is that the called function starts with no input or local registers, and the same output registers that the calling function set up. However, thanks to register renaming, the output registers have different names than in the calling function. Using the previous example, the r41-r46 output registers before the br.call instruction become r32-r37 after the call. Figure 2 illustrates this example.

Figure 2 Register Renaming

      A natural question here is, "When the subfunction returns, how do the registers get renamed back to their original settings?" The answer is yet another register, ar.pfs (Application Register, Previous Function State). As part of the br.call instruction, the CPU copies the current contents of the CFM register into the ar.pfs register.
      To review, after the br.call instruction executes, the new function has no input/local registers, and has an output area containing the registers that hold the passed parameters. The alloc instruction juggles things around so that the output area of the previous function becomes the input area of the new function. Next, the alloc instruction establishes the new local and output regions. Finally, the alloc instruction saves the contents of the ar.pfs into a general-purpose register.
      An example is helpful. Consider the following real-world alloc instruction:

  alloc   r36 = ar.pfs, 0x7, 0x0, 0x2, 0x0

      This instruction sets up a frame with an input region of seven registers, zero local registers, and two output registers. You can safely ignore the last 0x0 operand for this discussion. In addition, the ar.pfs register is copied into r36. The r36 referred to here is the register name after the new input, local, and output regions have been set up. Put another way, it looks like the instruction is overwriting one of the input registers with the ar.pfs value. What's the deal?
      Remember earlier when I said that the IA-64 considers input and local registers to be the same thing? This is reflected in the encoding of the alloc instruction, where the sizes of the input and local regions are combined into a single value. A disassembler has no way of knowing how many registers are actual input registers. However, from looking at the instruction, and assuming that the intent is not to overwrite an input register, you can guess that there are probably four input registers (r32-r35) and three local registers (r36-r38). In IA-64 assembly language, you could rewrite the instruction as:

  alloc   r36 = ar.pfs, 0x4, 0x3, 0x2, 0x0

      Register renaming, the br.call instruction, and the alloc instruction work together to establish procedure frames for functions. These frames, with their input and output regions, are explicitly designed to allow efficient passing of parameters in registers. This is altogether different from the x86 architecture, which has no formal support on the chip for procedure frames and parameter passing. Yes, the EBP register is typically used for stack frames, but that's just a convention, not a requirement.

Parameter Passing on the IA-64

      Having seen the hardware support for procedure frames and passing values in registers from one function to the next, let's now look at the implementation details. Section 8-5 of the IA-64 Software Conventions and Runtime Architecture Guide (see https://www.intel.com/design/IA-64/Downloads/245256.htm) presents a convention for parameter passing, to which Microsoft compilers adhere. In this convention, parameters may be passed in the general-purpose registers, the floating-point registers, and on a thread's stack.
      Let's see how this parameter passing convention works. For the simple case, assume that all parameters are integers, and that they're eight bytes wide or less. In this case, up to the first eight parameters are passed in the general-purpose registers. Any additional parameters are placed sequentially on the thread stack.
      Let's look at an example. Consider this C++ snippet:

void SomeFunction(void)
{
    int x = 1, y = 2, z = 3;
    FooFunction( x, y, z );
}

void FooFunction( int a, int b, int c )
{
    •••
}

In this case, three parameters are passed to FooFunction. Inside the calling function (SomeFunction), the output region registers might be r40-r42. In preparation for calling FooFunction, the r40 register would be set to the value of x, r41 to the value of y, and r42 to the value of z. Inside FooFunction, after its alloc instruction, parameter a would be in r32, parameter b in r33, and parameter c in r34.
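      In rough, hand-waved assembly (ignoring instruction bundling and how a real compiler would actually materialize and schedule these values), the call site in SomeFunction might look something like this:

  mov     r40 = 1                      // x
  mov     r41 = 2                      // y
  mov     r42 = 3                      // z
  br.call.sptk.many  b0 = FooFunction  // b0 receives the return address (more on that later)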
      Now, let's look at the more formal definition of parameter passing. For starters, there is the notion of a parameter slot. Each parameter slot is eight bytes wide. A parameter that's bigger than eight bytes takes up multiple parameter slots. The contents of up to the first eight parameter slots are passed in either the general-purpose or floating-point registers, depending on the parameter type. The f8-f15 registers are designated for passing floating-point parameters.
      Things get more complicated when a combination of integer and floating-point parameters are passed. There's a one-to-one mapping between parameter slots and the general-purpose output registers. If a floating-point parameter maps to a particular parameter slot, the general-purpose output register mapped to that slot goes unused. Instead, the parameter will be passed in the next available floating-point register, starting with f8. And don't forget, any parameters beyond the first eight slots are passed on the stack, not in registers.
      Let's look at another example to clarify this:

void SomeOtherFunction(void)
{
    int x = 1, y = 2;
    float fpv1 = 1.2, fpv2 = 3.4;
    BarFunction( x, fpv1, fpv2, y );
}

void BarFunction( int a, float b, float c, int d )
{
    •••
}

In this case, let's say the output registers in SomeOtherFunction are r40-r43. Here's how the registers would be set up in preparation for the call:

  r40 = x
  r41 = ?      // unused
  r42 = ?      // unused
  r43 = y
  f8  = fpv1
  f9  = fpv2

      Parameters beyond those that fit into the first eight slots go onto the calling function's stack. The first parameter slot on the stack is at 0x10 bytes above the stack pointer. Each subsequent parameter slot is eight bytes higher on the stack. As a side note, it's worth mentioning that the br.call instruction doesn't change the stack pointer, unlike a CALL in the x86 architecture. Also, note that compilers generally try to avoid putting items on the stack, since doing so requires memory accesses, which are potentially very slow.
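      As a sketch (the scratch registers holding the argument values are hypothetical), a caller passing ten integer arguments would place the first eight in its output registers and could store the last two like this:

  adds    r2 = 0x10, sp          // address of the first stack parameter slot
  st8     [r2] = r14, 8          // argument 9 goes at sp+0x10; advance r2 to sp+0x18
  st8     [r2] = r15             // argument 10 goes at sp+0x18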
      What about functions with variable argument lists? With all these rules, such a function would find it difficult to know where to look for a particular parameter without knowing its type in advance. For these cases, the convention allows a floating-point value to be passed in both the general-purpose register for its slot and the appropriate floating-point register.
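      As a sketch of what that might look like (the register numbers and the callee name are my assumptions, and I'm again ignoring bundling), a caller whose output region starts at r41 and which passes a double in parameter slot 1 to a varargs function could populate both locations:

  mov     f8  = f6               // the double goes in the first floating-point parameter register...
  getf.d  r42 = f6               // ...and its bit pattern also goes in the GR for slot 1
  br.call.sptk.many  b0 = SomeVarArgFunction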
      What about the oddball case in which a parameter is bigger than eight bytes and part of it would fit in the last register slot (slot 7)? The convention says that it's OK for part of the parameter to be passed in a register and the remaining portion to be placed on the stack. Of course, the compiler has to know when this happens and deal with a parameter being split into two locations.
      Believe it or not, what I've described here is the simplified version of the IA-64 parameter passing convention. The Intel documentation is much more complicated and describes many more of the "edge" cases. In it, you'll find rules for parameter alignment within slots, passing floating-point values in general-purpose registers, and other esoteric details. The description I've presented should get you through the vast majority of the cases you'll see.

Returning from an IA-64 Procedure

      So far, you've seen how parameters are passed and calls made on the IA-64. Now let's examine what returning from a function entails. If the function returns any values, they have to be placed into registers. Integer values of eight bytes or less are placed in the r8 register. Put another way, the r8 register on the IA-64 is equivalent to the EAX register on the x86 for the purpose of return values.
      If more than eight bytes are needed for an integer return value, the entire register range of r8-r11 can be used. Floating-point values are returned in f8 if small enough. If more bits are needed, the range of f8-f15 is used.
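      As a minimal sketch, a leaf function that simply adds its two integer parameters, and that performs no alloc of its own, could be as short as this (br.ret and the b0 register are covered next):

  add     r8 = r32, r33          // the sum becomes the return value
  br.ret.sptk.many  b0           // return through the address in b0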
      In the description of call frames, nowhere did I mention return addresses. The return address is an implicit part of an x86 frame, but on the IA-64 it's not part of the frame at all. Instead, when a call is made via the br.call instruction, the return address is stored in a branch register. For example, consider the following instruction:

  br.call.dptk.many  b0 = b6

The .dptk.many part of the instruction isn't important here. This instruction calls the address stored in the b6 branch register. The address of the instruction following the br.call is stored in the b0 branch register. Although the return address could be placed in any of the branch registers (b0-b7), the convention is that the return address for a call goes into the b0 register.
      Inside the called function, you'll usually find code that squirrels away the b0 register into a general-purpose register or the stack. This is because the called function may itself call other functions, thus causing the b0 register to be overwritten with yet another return address. The IA-64 architecture places the burden of maintaining the return address on the compiler, rather than building it into the architecture like the x86 CALL and RET instructions do. Incidentally, because you can't assume that the return address is at a known place on the stack, it's significantly more difficult to walk an IA-64 call stack than an x86 call stack.
      Besides setting the return value registers and transferring control back to the caller, returning from a function also involves removing its procedure frame and restoring the register naming of the calling function. When a function returns (via a br.ret instruction), the contents of the ar.pfs register are implicitly moved back into the CFM register, thus restoring the original register configuration of the calling function.
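      Putting these pieces together, the prolog and epilog of a typical non-leaf function look something like the following sketch (the region sizes, register choices, and branch completers are illustrative, not taken from any particular compiler's output):

  alloc   r34 = ar.pfs, 2, 2, 1, 0   // 2 input, 2 local, 1 output; save previous function state
  mov     r35 = b0                   // squirrel away the return address
  // ... function body, including any br.call instructions ...
  mov     b0 = r35                   // restore the return address
  mov     ar.pfs = r34               // restore the previous function state
  br.ret.sptk.many  b0               // br.ret also restores the caller's CFM from ar.pfs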

The Register Stack Engine

      Until now, I've asked you to take it on faith that the 96 dynamic registers are all available to each function. The big question is, what happens when you're nested deeply enough in the function hierarchy that all 96 registers have already been allocated? This is where the Register Stack Engine (RSE) comes into play. Of all the IA-64 concepts I've encountered, the RSE was by far the most difficult to really understand.
      The RSE is a feature of the IA-64 architecture that operates silently in the background. Normal applications shouldn't ever have to know that it exists or how it works. The RSE is essentially a hardware background thread that reads and writes the dynamic registers out to memory. This memory is known as the backing store, and the operating system allocates the memory for it.
      As registers are allocated by calling deeper and deeper into a function tree, the registers least recently used are written to memory. This frees up those registers for new frames. A pair of registers (ar.bsp and ar.bspstore) keep track of which registers have been spilled into memory. When a function returns, the RSE automatically knows to reload the previously saved registers from memory, if necessary. It's important to note that registers are spilled to memory in a lazy manner. In general, you don't know (or care) if the registers from a preceding frame are still in a register or have been spilled to the backing store.
      The key thing to understand when investigating registers in the IA-64 architecture is that there's a one-to-one correlation between a given register in a particular function frame and where it will be saved in the backing store memory. The image that helps me visualize the RSE is that of a tank tread going up and down a wall. Each tread plate corresponds to a dynamic register, and the wall is the backing store.
      The spots where the tread touches the wall are where the registers are spilled into memory locations. As the tank goes higher up the wall (that is, as you call into a more deeply nested function), new registers come into contact with the wall. The tank tread can go as high up the wall as there is space in the backing store. Of course, in going up the wall, the tank tread will make multiple complete rotations.
      As your code enters and leaves various frames, the RSE registers automatically keep track of where to spill or restore registers from the backing store. Another way to think of this is that the dynamic registers are like a cache for the backing store. The current frame always works with the dynamic registers, but registers from older frames are spilled out to memory in order to make room for new procedure frames.
      This ends the whirlwind tour of IA-64 registers, procedure frames, and parameter passing. This basic knowledge can be very useful if you're stepping through IA-64 code in the debugger, trying to discern what's going on. If you want more details than I've presented here, Intel's documentation is very thorough. However, even if you know the x86 architecture cold, you may want to block out a big chunk of time and keep a bottle of aspirin nearby. It's a whole different world!

Matt Pietrek is one of the founders of Mindreef LLC, an Internet company. Prior to this, he was the lead architect for Compuware/NuMega's BoundsChecker. His Web site, at https://www.wheaty.net, has information on his previous columns and articles.