Library Internals
This topic describes the internal design of the DirectXMath library.
- Calling Conventions
- Graphics Library Type Equivalence
- Global Constants in the DirectXMath Library
- Windows SSE versus SSE2
- Routine Variants
- Platform Inconsistencies
- Platform-specific Extensions
- Related topics
Calling Conventions
To enhance portability and optimize data layout, you need to use the appropriate calling conventions for each platform supported by the DirectXMath Library. Specifically, when you pass XMVECTOR objects (which are defined as aligned on a 16-byte boundary) as parameters, the calling requirements differ depending on the target platform:
For 32-bit Windows
For 32-bit Windows, there are two calling conventions available for efficient passing of __m128 values (the type that implements XMVECTOR on that platform). The standard is __fastcall, which can pass the first three __m128 values (XMVECTOR instances) as arguments to a function in SSE/SSE2 registers. __fastcall passes the remaining arguments via the stack.
Newer Microsoft Visual Studio compilers support another calling convention, __vectorcall, which can pass up to six __m128 values (XMVECTOR instances) as arguments to a function in SSE/SSE2 registers. It can also pass homogeneous vector aggregates (HVAs), such as XMMATRIX, via SSE/SSE2 registers if there is sufficient room.
For 64-bit editions of Windows
For 64-bit Windows, there are two calling conventions available for efficient passing of __m128 values. The standard is __fastcall, which passes all __m128 values on the stack.
Newer Visual Studio compilers support the __vectorcall calling convention, which can pass up to six __m128 values (XMVECTOR instances) as arguments to a function in SSE/SSE2 registers. It can also pass homogeneous vector aggregates (HVAs), such as XMMATRIX, via SSE/SSE2 registers if there is sufficient room.
For Windows on ARM
Windows on ARM and ARM64 support passing the first four __n128 values (XMVECTOR instances) in-register.
DirectXMath solution
The FXMVECTOR, GXMVECTOR, HXMVECTOR, and CXMVECTOR aliases support these conventions:
- Use the FXMVECTOR alias to pass up to the first three instances of XMVECTOR used as arguments to a function.
- Use the GXMVECTOR alias to pass the 4th instance of an XMVECTOR used as an argument to a function.
- Use the HXMVECTOR alias to pass the 5th and 6th instances of an XMVECTOR used as an argument to a function. For info about additional considerations, see the __vectorcall documentation.
- Use the CXMVECTOR alias to pass any further instances of XMVECTOR used as arguments.
Note
For output parameters, always use XMVECTOR* or XMVECTOR&. Output parameters don't count toward the preceding positional rules for input parameters.
Because of limitations with __vectorcall, we recommend that you not use GXMVECTOR or HXMVECTOR for C++ constructors. Just use FXMVECTOR for the first three XMVECTOR values, then use CXMVECTOR for the rest.
The FXMMATRIX and CXMMATRIX aliases let you take advantage of HVA argument passing with __vectorcall.
- Use the FXMMATRIX alias to pass the first XMMATRIX as an argument to the function. This assumes you don't have more than two FXMVECTOR arguments or more than two float, double, or FXMVECTOR arguments to the 'right' of the matrix. For info about additional considerations, see the __vectorcall documentation.
- Use the CXMMATRIX alias otherwise.
Because of limitations with __vectorcall, we recommend that you never use FXMMATRIX for C++ constructors. Just use CXMMATRIX.
In addition to the type aliases, you must also use the XM_CALLCONV annotation to make sure the function uses the appropriate calling convention (__fastcall versus __vectorcall) based on your compiler and architecture. Because of limitations with __vectorcall, we recommend that you not use XM_CALLCONV for C++ constructors.
The following are example declarations that illustrate this convention:
XMMATRIX XM_CALLCONV XMMatrixLookAtLH(FXMVECTOR EyePosition, FXMVECTOR FocusPosition, FXMVECTOR UpDirection);
XMMATRIX XM_CALLCONV XMMatrixTransformation2D(FXMVECTOR ScalingOrigin, float ScalingOrientation, FXMVECTOR Scaling, FXMVECTOR RotationOrigin, float Rotation, GXMVECTOR Translation);
void XM_CALLCONV XMVectorSinCos(XMVECTOR* pSin, XMVECTOR* pCos, FXMVECTOR V);
XMVECTOR XM_CALLCONV XMVectorHermiteV(FXMVECTOR Position0, FXMVECTOR Tangent0, FXMVECTOR Position1, GXMVECTOR Tangent1, HXMVECTOR T);
XMMATRIX(FXMVECTOR R0, FXMVECTOR R1, FXMVECTOR R2, CXMVECTOR R3);
XMVECTOR XM_CALLCONV XMVector2Transform(FXMVECTOR V, FXMMATRIX M);
XMMATRIX XM_CALLCONV XMMatrixMultiplyTranspose(FXMMATRIX M1, CXMMATRIX M2);
To support these calling conventions, these type aliases are defined as follows (parameters must be passed by value for the compiler to consider them for in-register passing):
For 32-bit Windows apps
When you use __fastcall:
typedef const XMVECTOR FXMVECTOR;
typedef const XMVECTOR& GXMVECTOR;
typedef const XMVECTOR& HXMVECTOR;
typedef const XMVECTOR& CXMVECTOR;
typedef const XMMATRIX& FXMMATRIX;
typedef const XMMATRIX& CXMMATRIX;
When you use __vectorcall:
typedef const XMVECTOR FXMVECTOR;
typedef const XMVECTOR GXMVECTOR;
typedef const XMVECTOR HXMVECTOR;
typedef const XMVECTOR& CXMVECTOR;
typedef const XMMATRIX FXMMATRIX;
typedef const XMMATRIX& CXMMATRIX;
For 64-bit native Windows apps
When you use __fastcall:
typedef const XMVECTOR& FXMVECTOR;
typedef const XMVECTOR& GXMVECTOR;
typedef const XMVECTOR& HXMVECTOR;
typedef const XMVECTOR& CXMVECTOR;
typedef const XMMATRIX& FXMMATRIX;
typedef const XMMATRIX& CXMMATRIX;
When you use __vectorcall:
typedef const XMVECTOR FXMVECTOR;
typedef const XMVECTOR GXMVECTOR;
typedef const XMVECTOR HXMVECTOR;
typedef const XMVECTOR& CXMVECTOR;
typedef const XMMATRIX FXMMATRIX;
typedef const XMMATRIX& CXMMATRIX;
Windows on ARM
typedef const XMVECTOR FXMVECTOR;
typedef const XMVECTOR GXMVECTOR;
typedef const XMVECTOR& HXMVECTOR;
typedef const XMVECTOR& CXMVECTOR;
typedef const XMMATRIX& FXMMATRIX;
typedef const XMMATRIX& CXMMATRIX;
Note
All the functions are declared inline, so in many cases the compiler won't need to use a calling convention for them. However, the compiler may decide that it's more efficient not to inline a function, and in those cases we want the best calling convention possible for each platform.
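As a rough sketch of how aliases like these can switch between by-value and by-reference passing, the following uses stand-in names only. Vec4, FVec, GVec, HVec, CVec, SKETCH_VECTORCALL, and SumSeven are all hypothetical and are not part of DirectXMath. The sketch models just the x64 __fastcall and __vectorcall tables above; the real header selects the aliases from compiler and architecture macros instead of a single switch.

```cpp
#include <type_traits>

// Stand-in for XMVECTOR: four floats aligned on a 16-byte boundary.
struct alignas(16) Vec4 { float x, y, z, w; };

// Hypothetical switch for this sketch: 1 models __vectorcall,
// 0 models the x64 __fastcall case where everything goes by reference.
#ifndef SKETCH_VECTORCALL
#define SKETCH_VECTORCALL 0
#endif

#if SKETCH_VECTORCALL
using FVec = const Vec4;   // 1st to 3rd vector arguments: by value
using GVec = const Vec4;   // 4th vector argument: by value
using HVec = const Vec4;   // 5th and 6th vector arguments: by value
#else
using FVec = const Vec4&;  // no register passing available: by reference
using GVec = const Vec4&;
using HVec = const Vec4&;
#endif
using CVec = const Vec4&;  // 7th vector argument onward: always by reference

// Each parameter uses the alias that matches its position among the
// vector arguments, so the signature stays the same on every platform.
inline Vec4 SumSeven(FVec a, FVec b, FVec c, GVec d, HVec e, HVec f, CVec g)
{
    return Vec4{
        a.x + b.x + c.x + d.x + e.x + f.x + g.x,
        a.y + b.y + c.y + d.y + e.y + f.y + g.y,
        a.z + b.z + c.z + d.z + e.z + f.z + g.z,
        a.w + b.w + c.w + d.w + e.w + f.w + g.w
    };
}

// True only when the first three positions are by-value parameters.
constexpr bool kFirstThreeByValue = !std::is_reference<FVec>::value;
```

With SKETCH_VECTORCALL left at 0, every alias is a reference, which matches the x64 __fastcall table. Defining it as 1 makes the first six positions by-value parameters that the compiler may consider for register passing.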
Graphics Library Type Equivalence
To support the use of the DirectXMath Library, many DirectXMath Library types and structures are equivalent to the Windows implementations of the D3DDECLTYPE and D3DFORMAT types, as well as the DXGI_FORMAT types.
DirectXMath | D3DDECLTYPE | D3DFORMAT | DXGI_FORMAT |
---|---|---|---|
XMBYTE2 | | | DXGI_FORMAT_R8G8_SINT |
XMBYTE4 | D3DDECLTYPE_BYTE4 (Xbox Only) | D3DFMT_x8x8x8x8 | DXGI_FORMAT_x8x8x8x8_SINT |
XMBYTEN2 | | D3DFMT_V8U8 | DXGI_FORMAT_R8G8_SNORM |
XMBYTEN4 | D3DDECLTYPE_BYTE4N (Xbox Only) | D3DFMT_x8x8x8x8 | DXGI_FORMAT_x8x8x8x8_SNORM |
XMCOLOR | D3DDECLTYPE_D3DCOLOR | D3DFMT_A8R8G8B8 | DXGI_FORMAT_B8G8R8A8_UNORM (DXGI 1.1+) |
XMDEC4 | D3DDECLTYPE_DEC4 (Xbox Only), D3DDECLTYPE_DEC3 (Xbox Only) | | |
XMDECN4 | D3DDECLTYPE_DEC4N (Xbox Only), D3DDECLTYPE_DEC3N (Xbox Only) | | |
XMFLOAT2 | D3DDECLTYPE_FLOAT2 | D3DFMT_G32R32F | DXGI_FORMAT_R32G32_FLOAT |
XMFLOAT2A | D3DDECLTYPE_FLOAT2 | D3DFMT_G32R32F | DXGI_FORMAT_R32G32_FLOAT |
XMFLOAT3 | D3DDECLTYPE_FLOAT3 | | DXGI_FORMAT_R32G32B32_FLOAT |
XMFLOAT3A | D3DDECLTYPE_FLOAT3 | | DXGI_FORMAT_R32G32B32_FLOAT |
XMFLOAT3PK | | | DXGI_FORMAT_R11G11B10_FLOAT |
XMFLOAT3SE | | | DXGI_FORMAT_R9G9B9E5_SHAREDEXP |
XMFLOAT4 | D3DDECLTYPE_FLOAT4 | D3DFMT_A32B32G32R32F | DXGI_FORMAT_R32G32B32A32_FLOAT |
XMFLOAT4A | D3DDECLTYPE_FLOAT4 | D3DFMT_A32B32G32R32F | DXGI_FORMAT_R32G32B32A32_FLOAT |
XMHALF2 | D3DDECLTYPE_FLOAT16_2 | D3DFMT_G16R16F | DXGI_FORMAT_R16G16_FLOAT |
XMHALF4 | D3DDECLTYPE_FLOAT16_4 | D3DFMT_A16B16G16R16F | DXGI_FORMAT_R16G16B16A16_FLOAT |
XMINT2 | | | DXGI_FORMAT_R32G32_SINT |
XMINT3 | | | DXGI_FORMAT_R32G32B32_SINT |
XMINT4 | | | DXGI_FORMAT_R32G32B32A32_SINT |
XMSHORT2 | D3DDECLTYPE_SHORT2 | D3DFMT_V16U16 | DXGI_FORMAT_R16G16_SINT |
XMSHORTN2 | D3DDECLTYPE_SHORT2N | D3DFMT_V16U16 | DXGI_FORMAT_R16G16_SNORM |
XMSHORT4 | D3DDECLTYPE_SHORT4 | D3DFMT_x16x16x16x16 | DXGI_FORMAT_R16G16B16A16_SINT |
XMSHORTN4 | D3DDECLTYPE_SHORT4N | D3DFMT_x16x16x16x16 | DXGI_FORMAT_R16G16B16A16_SNORM |
XMUBYTE2 | | | DXGI_FORMAT_R8G8_UINT |
XMUBYTEN2 | | D3DFMT_A8P8, D3DFMT_A8L8 | DXGI_FORMAT_R8G8_UNORM |
XMUINT2 | | | DXGI_FORMAT_R32G32_UINT |
XMUINT3 | | | DXGI_FORMAT_R32G32B32_UINT |
XMUINT4 | | | DXGI_FORMAT_R32G32B32A32_UINT |
XMU555 | | D3DFMT_X1R5G5B5, D3DFMT_A1R5G5B5 | DXGI_FORMAT_B5G5R5A1_UNORM |
XMU565 | | D3DFMT_R5G6B5 | DXGI_FORMAT_B5G6R5_UNORM |
XMUBYTE4 | D3DDECLTYPE_UBYTE4 | D3DFMT_x8x8x8x8 | DXGI_FORMAT_x8x8x8x8_UINT |
XMUBYTEN4 | D3DDECLTYPE_UBYTE4N | D3DFMT_x8x8x8x8 | DXGI_FORMAT_x8x8x8x8_UNORM, DXGI_FORMAT_R10G10B10_XR_BIAS_A2_UNORM (Use XMLoadUDecN4_XR and XMStoreUDecN4_XR.) |
XMUDEC4 | D3DDECLTYPE_UDEC4 (Xbox Only), D3DDECLTYPE_UDEC3 (Xbox Only) | D3DFMT_A2R10G10B10, D3DFMT_A2B10G10R10 | DXGI_FORMAT_R10G10B10A2_UINT |
XMUDECN4 | D3DDECLTYPE_UDEC4N (Xbox Only), D3DDECLTYPE_UDEC3N (Xbox Only) | D3DFMT_A2R10G10B10, D3DFMT_A2B10G10R10 | DXGI_FORMAT_R10G10B10A2_UNORM |
XMUNIBBLE4 | | D3DFMT_A4R4G4B4, D3DFMT_X4R4G4B4 | DXGI_FORMAT_B4G4R4A4_UNORM (DXGI 1.2+) |
XMUSHORT2 | D3DDECLTYPE_USHORT2 | D3DFMT_G16R16 | DXGI_FORMAT_R16G16_UINT |
XMUSHORTN2 | D3DDECLTYPE_USHORT2N | D3DFMT_G16R16 | DXGI_FORMAT_R16G16_UNORM |
XMUSHORT4 | D3DDECLTYPE_USHORT4 (Xbox Only) | D3DFMT_x16x16x16x16 | DXGI_FORMAT_R16G16B16A16_UINT |
XMUSHORTN4 | D3DDECLTYPE_USHORT4N | D3DFMT_x16x16x16x16 | DXGI_FORMAT_R16G16B16A16_UNORM |
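The XMCOLOR row can look inconsistent at first, because the D3DFORMAT name starts with A and the DXGI_FORMAT name starts with B. The D3DFMT name lists channels from the most significant byte down, while the DXGI name lists them in memory order. The sketch below packs a color by hand and reads its bytes back in little-endian order to show that both names describe the same layout. PackA8R8G8B8 and LittleEndianByte are hypothetical helpers written for this illustration, not library functions.

```cpp
#include <cstdint>

// Pack 8-bit channels as D3DFMT_A8R8G8B8 names them:
// A in the most significant byte, then R, G, and B in the least significant byte.
inline std::uint32_t PackA8R8G8B8(std::uint8_t r, std::uint8_t g,
                                  std::uint8_t b, std::uint8_t a)
{
    return (static_cast<std::uint32_t>(a) << 24) |
           (static_cast<std::uint32_t>(r) << 16) |
           (static_cast<std::uint32_t>(g) << 8)  |
            static_cast<std::uint32_t>(b);
}

// Return the byte that a little-endian CPU stores at the given
// offset (0 to 3) when it writes the packed 32-bit value to memory.
inline std::uint8_t LittleEndianByte(std::uint32_t packed, unsigned offset)
{
    return static_cast<std::uint8_t>((packed >> (8u * offset)) & 0xFFu);
}
```

Reading offsets 0 through 3 yields the B, G, R, and A channels in that order, which is the order DXGI_FORMAT_B8G8R8A8_UNORM spells out.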
Global Constants in the DirectXMath Library
To reduce the size of the data segment, the DirectXMath Library uses the XMGLOBALCONST macro to declare a number of global internal constants used in its implementation. By convention, such internal global constants are prefixed with g_XM. Typically, they are one of the following types: XMVECTORU32, XMVECTORF32, or XMVECTORI32.
These internal global constants are subject to change in future revisions of the DirectXMath Library. Use public functions that encapsulate the constants when possible rather than direct use of g_XM global values. You can also declare your own global constants using XMGLOBALCONST.
Windows SSE versus SSE2
The SSE instruction set provides support only for single-precision floating-point vectors. DirectXMath must make use of the SSE2 instruction set to provide integer vector support. SSE2 is supported by all Intel processors since the introduction of the Pentium 4, all AMD K8 and later processors, and all x64-capable processors.
Note
Windows 8 for x86 or later requires support for SSE2. All versions of Windows x64 require support for SSE2. Windows on ARM / ARM64 requires ARM_NEON.
Routine Variants
There are several variants of DirectXMath functions that make it easier to do your work:
- Comparison functions that let you build complicated conditional branching from a smaller number of vector comparison operations. The names of these functions end in "R", such as XMVector3InBoundsR. The functions return a comparison record as a UINT return value, or as a UINT out parameter. You can use the XMComparison* functions (for example, XMComparisonAllTrue) to test the value.
- Batch functions for performing batch-style operations on larger vector arrays. The names of these functions end in "Stream", such as XMVector3TransformStream. The functions operate on an array of inputs, and they generate an array of outputs. Typically, they take an input and output stride.
- Estimation functions that implement a faster estimation instead of a slower, more accurate result. The names of these functions end in "Est", such as XMVector3NormalizeEst. The quality and performance impact of using estimation varies from platform to platform, but we recommend that you use estimation variants for performance-sensitive code.
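To illustrate the comparison-record idea behind the "R" variants, here is a minimal sketch that uses stand-in types. One comparison produces a small bit record, and several cheap tests on that record then drive a multi-way branch. The names Vec3, EqualRecord, Classify, Match, and the flag values are hypothetical; the library defines its own masks and exposes the equivalent tests through functions such as XMVector3EqualR and XMComparisonAllTrue.

```cpp
#include <cstdint>

struct Vec3 { float x, y, z; };   // stand-in for a 3D vector

// Hypothetical flag values for this sketch only.
constexpr std::uint32_t kAllTrue  = 0x1;
constexpr std::uint32_t kAllFalse = 0x2;

// Compare once and summarize the per-component results in a record.
inline std::uint32_t EqualRecord(const Vec3& a, const Vec3& b)
{
    const int matches = (a.x == b.x) + (a.y == b.y) + (a.z == b.z);
    std::uint32_t record = 0;
    if (matches == 3) record |= kAllTrue;
    if (matches == 0) record |= kAllFalse;
    return record;
}

// Cheap tests on the record; no further vector comparisons are needed.
inline bool AllTrue(std::uint32_t r)  { return (r & kAllTrue) != 0; }
inline bool AllFalse(std::uint32_t r) { return (r & kAllFalse) != 0; }
inline bool Mixed(std::uint32_t r)    { return (r & (kAllTrue | kAllFalse)) == 0; }

enum class Match { None, Partial, Full };

// A three-way branch driven by a single comparison.
inline Match Classify(const Vec3& a, const Vec3& b)
{
    const std::uint32_t r = EqualRecord(a, b);
    if (AllTrue(r))  return Match::Full;
    if (AllFalse(r)) return Match::None;
    return Match::Partial;
}
```

The design point is that the comparison itself runs once, while the branching logic only inspects integer flags.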
Platform Inconsistencies
The DirectXMath library is intended for use in performance-sensitive graphics applications and games. Therefore, the implementation is designed for optimal speed in normal processing on all supported platforms. Results at boundary conditions, particularly those that generate floating-point specials, are likely to vary from target to target. This behavior will also depend on other run-time settings, such as the x87 control word for the Windows 32-bit no-intrinsics target or the SSE control word for both Windows 32-bit and 64-bit. Furthermore, there will be differences in boundary conditions between various CPU vendors.
Don't use DirectXMath in scientific or other applications where numerical accuracy is paramount. This limitation is also reflected in the lack of support for double or other extended-precision computations.
Note
The _XM_NO_INTRINSICS_ scalar code paths generally are written for compliance, not performance. Their boundary-condition results also will vary.
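Because boundary-condition results can differ between targets, one common approach is to keep such inputs away from the math routine in the first place. The sketch below shows the idea for normalization using stand-in types. Vec3, NormalizeOr, and the default threshold are hypothetical choices for this illustration and are not part of DirectXMath.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };   // stand-in for a 3D vector

// Normalize v, or return fallback when the squared length is too small,
// infinite, or NaN, so the division below never produces a special value.
inline Vec3 NormalizeOr(const Vec3& v, const Vec3& fallback,
                        float minLengthSq = 1e-12f)
{
    const float lengthSq = v.x * v.x + v.y * v.y + v.z * v.z;

    // The negated comparison also rejects NaN, because every
    // ordered comparison with NaN evaluates to false.
    if (!(lengthSq > minLengthSq) || !std::isfinite(lengthSq))
        return fallback;

    const float invLength = 1.0f / std::sqrt(lengthSq);
    return Vec3{ v.x * invLength, v.y * invLength, v.z * invLength };
}
```

With a guard like this, the degenerate cases take a path you define, instead of depending on how a particular target handles a zero-length or non-finite vector.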
Platform-specific Extensions
The DirectXMath library is intended to simplify C++ SIMD programming by providing excellent support for x86, x64, and Windows RT platforms using broadly supported intrinsic instructions (SSE2 and ARM-NEON).
There are times, however, when platform-specific instructions may prove beneficial. Due to the way DirectXMath is implemented, in many cases it is trivial to use DirectXMath types directly in standard compiler-supported intrinsics statements, and to use DirectXMath as the fallback path for platforms that don't support the extended instruction.
Here is a simplified example of using the SSE 4.1 dot-product instruction. Note that you must explicitly guard the code path to avoid generating invalid instruction exceptions at run time. Ensure that the code paths do enough work to justify the additional cost of branching, the complexity of maintaining multiple code paths, and so on.
#include <Windows.h>
#include <stdio.h>
#include <DirectXMath.h>
#include <intrin.h>
#include <smmintrin.h>

using namespace DirectX;

// Set once at startup; true when the CPU reports SSE4.1 support.
bool g_bSSE41 = false;

void DetectCPUFeatures()
{
#ifndef _M_ARM
    // See __cpuid documentation for more information
    int CPUInfo[4] = {-1};
#if defined(__clang__) || defined(__GNUC__)
    __cpuid(0, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
    __cpuid(CPUInfo, 0);
#endif
    // Leaf 0 returns the highest supported standard leaf in EAX.
    if ( CPUInfo[0] >= 1 )
    {
#if defined(__clang__) || defined(__GNUC__)
        __cpuid(1, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
        __cpuid(CPUInfo, 1);
#endif
        // Leaf 1 reports SSE4.1 in bit 19 of ECX.
        if ( CPUInfo[2] & 0x80000 )
            g_bSSE41 = true;
    }
#endif
}

int main()
{
    if ( !XMVerifyCPUSupport() )
        return -1;

    DetectCPUFeatures();
    ...
    XMVECTORF32 v1 = { 1.f, 2.f, 3.f, 4.f };
    XMVECTORF32 v2 = { 5.f, 6.f, 7.f, 8.f };
    XMVECTOR r2, r3, r4;

    if ( g_bSSE41 )
    {
#ifndef _M_ARM
        // SSE4.1 path: the immediate selects the 2D, 3D, or 4D dot product.
        r2 = _mm_dp_ps( v1, v2, 0x3f );
        r3 = _mm_dp_ps( v1, v2, 0x7f );
        r4 = _mm_dp_ps( v1, v2, 0xff );
#endif
    }
    else
    {
        // Fallback path: the portable DirectXMath functions.
        r2 = XMVector2Dot( v1, v2 );
        r3 = XMVector3Dot( v1, v2 );
        r4 = XMVector4Dot( v1, v2 );
    }
    ...
    return 0;
}
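The immediate values 0x3f, 0x7f, and 0xff passed to _mm_dp_ps can look opaque. As a reading aid, the following scalar sketch models how the instruction interprets that byte: bits 4 to 7 choose which component products enter the sum, and bits 0 to 3 choose which output lanes receive it. DotWithMask is a hypothetical helper written for this illustration. It models lane selection only and makes no claim about the hardware's rounding or summation order.

```cpp
#include <array>

// Scalar model of the dot-product immediate:
//   bits 4..7 -> include the product of lane 0..3 in the sum
//   bits 0..3 -> write the sum to output lane 0..3 (other lanes are zero)
inline std::array<float, 4> DotWithMask(const std::array<float, 4>& a,
                                        const std::array<float, 4>& b,
                                        unsigned mask)
{
    float sum = 0.0f;
    for (unsigned lane = 0; lane < 4; ++lane)
    {
        if (mask & (0x10u << lane))
            sum += a[lane] * b[lane];
    }

    std::array<float, 4> out{};   // value-initialized to all zeros
    for (unsigned lane = 0; lane < 4; ++lane)
    {
        if (mask & (0x1u << lane))
            out[lane] = sum;
    }
    return out;
}
```

Under this model, 0x3f sums the x and y products and broadcasts the result to every lane, which lines up with the XMVector2Dot fallback, while 0x7f and 0xff extend the sum to three and four components.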
For more info about platform-specific extensions, see:
DirectXMath: SSE, SSE2, and ARM-NEON
DirectXMath: SSE3 and SSSE3
DirectXMath: SSE4.1 and SSE4.2
DirectXMath: AVX
DirectXMath: F16C and FMA
DirectXMath: AVX2
DirectXMath: ARM64