Data Conversion Rules
The following sections describe how Direct3D handles conversions between data types.
 Data Type Terminology
 Floating Point Conversion
 Integer Conversion
 Fixed Point Integer Conversion
Data Type Terminology
The following terms are used throughout this topic to characterize the various format conversions.
Term  Definition 

SNORM  Signed normalized integer, meaning that for an n-bit 2's complement number, the maximum value means 1.0f (e.g. the 5-bit value 01111 maps to 1.0f), and the minimum value means -1.0f (e.g. the 5-bit value 10000 maps to -1.0f). In addition, the second-minimum number maps to -1.0f (e.g. the 5-bit value 10001 maps to -1.0f). There are thus two integer representations for -1.0f. There is a single representation for 0.0f, and a single representation for 1.0f. This results in a set of integer representations for evenly spaced floating point values in the range (-1.0f...0.0f), and also a complementary set of representations for numbers in the range (0.0f...1.0f). 
UNORM  Unsigned normalized integer, meaning that for an n-bit number, all 0's means 0.0f, and all 1's means 1.0f. A sequence of evenly spaced floating point values from 0.0f to 1.0f is represented. e.g. a 2-bit UNORM represents 0.0f, 1/3, 2/3, and 1.0f. 
SINT  Signed integer. 2's complement integer. e.g. a 3-bit SINT represents the integral values -4, -3, -2, -1, 0, 1, 2, 3. 
UINT  Unsigned integer. e.g. a 3-bit UINT represents the integral values 0, 1, 2, 3, 4, 5, 6, 7. 
FLOAT  A floating-point value in any of the representations defined by Direct3D. 
SRGB  Similar to UNORM, in that for an n-bit number, all 0's means 0.0f and all 1's means 1.0f. However, unlike UNORM, with SRGB the sequence of unsigned integer encodings from all 0's to all 1's represents a non-linear progression in the floating point interpretation of the numbers, between 0.0f and 1.0f. Roughly, if this non-linear progression, SRGB, is displayed as a sequence of colors, it would appear as a linear ramp of luminosity levels to an "average" observer, under "average" viewing conditions, on an "average" display. For complete detail, refer to the SRGB color standard, IEC 61966-2-1, at IEC (International Electrotechnical Commission). 
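As a rough illustration of the SNORM, UNORM and SRGB definitions above, the following C++ sketch decodes n-bit codes to float. The function names are hypothetical and are not part of any Direct3D API. The sketch assumes 2 <= bits <= 16, and the SRGB curve uses the constants of the IEC 61966-2-1 transfer function, which may differ slightly from the constants a given Direct3D version specifies.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: decode an n-bit UNORM code.
// All 0's -> 0.0f, all 1's -> 1.0f, evenly spaced in between.
float unorm_to_float(uint32_t code, int bits) {
    const float max_code = static_cast<float>((1u << bits) - 1u);
    return static_cast<float>(code) / max_code;
}

// Hypothetical helper: decode an n-bit SNORM code held in the low `bits` bits.
// Both the minimum and the second-minimum code decode to -1.0f.
float snorm_to_float(uint32_t code, int bits) {
    const int32_t sign_bit = 1 << (bits - 1);
    int32_t value = static_cast<int32_t>(code & ((1u << bits) - 1u));
    if (value & sign_bit) {
        value -= (1 << bits);                 // sign-extend the n-bit value
    }
    const int32_t max_code = sign_bit - 1;    // e.g. 15 for 5 bits
    if (value < -max_code) {
        value = -max_code;                    // most negative code also means -1.0f
    }
    return static_cast<float>(value) / static_cast<float>(max_code);
}

// Assumption: IEC 61966-2-1 piecewise curve applied to an already
// normalized SRGB value c in [0, 1].
float srgb_to_linear(float c) {
    return (c <= 0.04045f) ? c / 12.92f
                           : std::pow((c + 0.055f) / 1.055f, 2.4f);
}
```

For the 5-bit examples in the table, snorm_to_float(0x10, 5) and snorm_to_float(0x11, 5) both yield -1.0f, while snorm_to_float(0x0F, 5) yields 1.0f.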
Floating Point Conversion
Whenever a floating point conversion between different representations occurs, including to or from non-floating point representations, the following rules apply.
Converting from a higher range representation to a lower range representation
 Round-to-zero is used during conversion to another float format. If the target is an integer or fixed point format, round-to-nearest-even is used, unless the conversion is explicitly documented as using another rounding behavior, such as round-to-nearest for FLOAT to SNORM, FLOAT to UNORM or FLOAT to SRGB. Other exceptions are the ftoi and ftou shader instructions, which use round-to-zero. Finally, the float-to-fixed conversions used by the texture sampler and rasterizer have a specified tolerance measured in Unit-Last-Place from an infinitely precise ideal.
 For source values greater than the dynamic range of a lower range target format (e.g. a large 32-bit float value is written into a 16-bit float RenderTarget), the maximum representable (appropriately signed) value results, NOT including signed infinity (due to the round-to-zero described above).
 NaN in a higher range format will be converted to NaN representation in the lower range format if the NaN representation exists in the lower range format. If the lower format does not have a NaN representation, the result will be 0.
 INF in a higher range format will be converted to INF in the lower range format if available. If the lower format does not have an INF representation, it will be converted to the maximum value representable. The sign will be preserved if available in the target format.
 Denorm in a higher range format will be converted to the Denorm representation in the lower range format if available in the lower range format and the conversion is possible, otherwise the result is 0. The sign bit will be preserved if available in the target format.
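The rules in this list can be sketched for one concrete pair of formats. The following C++ example narrows a 32-bit float to a 16-bit float using round-to-zero. It assumes an IEEE 754 binary32 source and a 16-bit layout of 1 sign bit, 5 exponent bits and 10 mantissa bits; the function name and the choice of 0x7E00 as the NaN pattern are illustrative assumptions, not something this topic specifies.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: 32-bit float -> 16-bit float bit pattern, round-to-zero.
uint16_t float_to_half_rtz(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const uint32_t sign = (bits >> 16) & 0x8000u;   // sign is preserved

    if (std::isnan(value)) {
        return static_cast<uint16_t>(sign | 0x7E00u);   // NaN stays NaN
    }
    if (std::isinf(value)) {
        return static_cast<uint16_t>(sign | 0x7C00u);   // INF stays INF
    }

    // Re-bias the exponent from 127 (32-bit) to 15 (16-bit).
    const int exponent = static_cast<int>((bits >> 23) & 0xFFu) - 127 + 15;
    uint32_t mantissa = bits & 0x7FFFFFu;

    if (exponent >= 31) {
        // Finite but out of range: maximum representable value, not INF.
        return static_cast<uint16_t>(sign | 0x7BFFu);
    }
    if (exponent <= 0) {
        if (exponent < -10) {
            return static_cast<uint16_t>(sign);     // too small: signed zero
        }
        // Representable only as a 16-bit denorm: shift in the implicit 1.
        mantissa |= 0x800000u;
        return static_cast<uint16_t>(sign | (mantissa >> (14 - exponent)));
    }
    // Normal case: dropping the low 13 mantissa bits rounds toward zero.
    return static_cast<uint16_t>(sign |
                                 (static_cast<uint32_t>(exponent) << 10) |
                                 (mantissa >> 13));
}
```

Because the low mantissa bits are truncated rather than rounded, an input such as 65535.0f becomes the largest finite 16-bit float (0x7BFF) instead of overflowing to INF.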
Converting from a lower range representation to a higher range representation
 NaN in a lower range format will be converted to the NaN representation in the higher range format if available in the higher range format. If the higher range format does not have a NaN representation, it will be converted to 0.
 INF in a lower range format will be converted to the INF representation in the higher range format if available in the higher range format. If the higher format does not have an INF representation, it will be converted to the maximum value representable (MAX_FLOAT in that format). The sign will be preserved if available in the target format.
 Denorm in a lower range format will be converted to a normalized representation in the higher range format if possible, or else to a Denorm representation in the higher range format if the Denorm representation exists. Failing those, if the higher range format does not have a Denorm representation, it will be converted to 0. The sign will be preserved if available in the target format. Note that 32-bit float numbers count as a format without a Denorm representation (because Denorms in operations on 32-bit floats flush to sign-preserved 0).
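For the widening direction, a minimal C++ sketch of a 16-bit float to 32-bit float conversion is shown below. It assumes the same 1/5/10-bit layout for the 16-bit format and an IEEE 754 binary32 target, and the function name is hypothetical. Every 16-bit Denorm fits into the normalized range of a 32-bit float, so the "normalized representation if possible" case always applies here.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: 16-bit float bit pattern -> 32-bit float.
float half_to_float(uint16_t half) {
    const uint32_t sign = static_cast<uint32_t>(half & 0x8000u) << 16;
    uint32_t exponent = (half >> 10) & 0x1Fu;
    uint32_t mantissa = half & 0x3FFu;
    uint32_t bits;

    if (exponent == 0x1Fu) {
        // INF (mantissa == 0) or NaN (mantissa != 0) map to INF or NaN.
        bits = sign | 0x7F800000u | (mantissa << 13);
    } else if (exponent != 0) {
        // Normal number: re-bias the exponent from 15 to 127.
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    } else if (mantissa == 0) {
        bits = sign;                                  // signed zero
    } else {
        // Denorm: shift until the implicit leading 1 appears, adjusting
        // the exponent, so the result is a normalized 32-bit float.
        exponent = 113u;
        while ((mantissa & 0x400u) == 0) {
            mantissa <<= 1;
            --exponent;
        }
        bits = sign | (exponent << 23) | ((mantissa & 0x3FFu) << 13);
    }

    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

For example, the smallest positive 16-bit Denorm (0x0001, equal to 2^-24) comes back as an ordinary normalized 32-bit float of the same value.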
Integer Conversion
The following table describes conversions from various representations described above to other representations. Only conversions that actually occur in Direct3D are shown.
Source Data Type  Destination Data Type  Conversion Rule 

SNORM  FLOAT  Given an n-bit integer value representing the signed range [-1.0f to 1.0f], conversion to floating-point is as follows.

FLOAT  SNORM  Given a floating-point number, conversion to an n-bit integer value representing the signed range [-1.0f to 1.0f] is as follows.

UNORM  FLOAT  The starting n-bit value is converted to float (0.0f, 1.0f, 2.0f, etc.) and then divided by (2ⁿ - 1). 
FLOAT  UNORM  Let c represent the starting value.

SRGB  FLOAT  The following is the ideal SRGB to FLOAT conversion.

FLOAT  SRGB  The following is the ideal FLOAT to SRGB conversion. Assuming the target SRGB color component has n bits:

SINT  SINT With More Bits  To convert from SINT to an SINT with more bits, the most significant bit (MSB) of the starting number is "sign-extended" to the additional bits available in the target format. 
UINT  SINT With More Bits  To convert from UINT to an SINT with more bits, the number is copied to the target format's least significant bits (LSBs) and additional MSBs are padded with 0. 
SINT  UINT With More Bits  To convert from SINT to UINT with more bits: If negative, the value is clamped to 0. Otherwise the number is copied to the target format's LSBs and additional MSBs are padded with 0. 
UINT  UINT With More Bits  To convert from UINT to UINT with more bits, the number is copied to the target format's LSBs and additional MSBs are padded with 0. 
SINT or UINT  SINT or UINT With Fewer or Equal Bits  To convert from an SINT or UINT to SINT or UINT with fewer or equal bits (and/or change in signedness), the starting value is simply clamped to the range of the target format. 
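Several rows of the table can be sketched in C++ as follows. The step lists for the FLOAT to UNORM and FLOAT to SNORM rows are not reproduced above, so this example relies on assumptions drawn from the rest of this topic: NaN becomes 0 (the target has no NaN representation), out-of-range values including INF are clamped, and round-to-nearest is used with ties rounded away from zero. All function names are hypothetical, and bits is assumed to be between 2 and 16.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Assumed FLOAT -> n-bit UNORM: NaN -> 0, clamp to [0, 1], scale, round.
uint32_t float_to_unorm(float c, int bits) {
    if (std::isnan(c)) return 0u;
    c = std::min(std::max(c, 0.0f), 1.0f);          // also clamps +/-INF
    const float scale = static_cast<float>((1u << bits) - 1u);
    return static_cast<uint32_t>(c * scale + 0.5f); // round-to-nearest
}

// Assumed FLOAT -> n-bit SNORM: NaN -> 0, clamp to [-1, 1], scale, round.
int32_t float_to_snorm(float c, int bits) {
    if (std::isnan(c)) return 0;
    c = std::min(std::max(c, -1.0f), 1.0f);
    const float scale = static_cast<float>((1 << (bits - 1)) - 1);
    return static_cast<int32_t>(std::round(c * scale));
}

// SINT -> UINT with more bits: negative values clamp to 0,
// otherwise the value is copied and the upper bits are 0.
uint32_t sint16_to_uint32(int16_t v) {
    return v < 0 ? 0u : static_cast<uint32_t>(v);
}

// SINT -> SINT with fewer bits: clamp to the target range.
int8_t sint32_to_sint8(int32_t v) {
    return static_cast<int8_t>(
        std::min<int32_t>(std::max<int32_t>(v, -128), 127));
}

// SINT -> UINT with fewer bits: clamp to the target range.
uint8_t sint32_to_uint8(int32_t v) {
    return static_cast<uint8_t>(
        std::min<int32_t>(std::max<int32_t>(v, 0), 255));
}
```

Note that the narrowing conversions clamp rather than wrap, so sint32_to_sint8(1000) gives 127 where a plain C++ cast would typically give -24.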
Fixed Point Integer Conversion
Fixed point integers are simply integers of some bit size that have an implicit decimal point at a fixed location.
The ubiquitous "integer" data type is a special case of a fixed point integer with the decimal at the end of the number.
Fixed point number representations are characterized as: i.f, where i is the number of integer bits and f is the number of fractional bits. e.g. 16.8 means 16 bits integer followed by 8 bits of fraction. The integer part is stored in 2's complement, at least as defined here (though it can be defined equally for unsigned integers as well). The fractional part is stored in unsigned form. The fractional part always represents the positive fraction between the two nearest integral values, starting from the most negative.
Addition and subtraction operations on fixed point numbers are performed using standard integer arithmetic, without any consideration for where the implied decimal lies. Adding 1 to a 16.8 fixed point number just means adding 256, since the decimal is 8 places in from the least significant end of the number. Other operations, such as multiplication, can also be performed using integer arithmetic, provided the effect on the fixed decimal is accounted for. For example, multiplying two 16.8 integers using an integer multiply produces a 32.16 result.
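The arithmetic described above can be illustrated with a small C++ sketch. The type alias and function names are made up for this example; a 16.8 value is assumed to be stored in a 32-bit signed integer, and right shifts of negative values are assumed to be arithmetic (guaranteed from C++20, and the usual behavior on earlier compilers).

```cpp
#include <cstdint>

using Fixed16_8 = int32_t;                        // 16.8 value in an int32_t
constexpr int kFracBits = 8;
constexpr Fixed16_8 kOne = 1 << kFracBits;        // 1.0 is stored as 256

// Addition ignores the position of the implied decimal.
Fixed16_8 add_fixed(Fixed16_8 a, Fixed16_8 b) { return a + b; }

// An integer multiply of two 16.8 values yields a 32.16 result.
int64_t mul_fixed_wide(Fixed16_8 a, Fixed16_8 b) {
    return static_cast<int64_t>(a) * b;
}

// Dropping 8 fraction bits brings the product back to 8 fraction bits.
Fixed16_8 mul_fixed(Fixed16_8 a, Fixed16_8 b) {
    return static_cast<Fixed16_8>(mul_fixed_wide(a, b) >> kFracBits);
}

// Integer part in 2's complement; fraction is the unsigned, positive
// distance above that integer.
int32_t integer_part(Fixed16_8 v) { return v >> kFracBits; }
uint32_t fraction_bits(Fixed16_8 v) {
    return static_cast<uint32_t>(v) & static_cast<uint32_t>(kOne - 1);
}
```

With these definitions, 1.5 is stored as 384, and add_fixed(384, kOne) gives 640 (2.5). The value -1.25 is stored as -320, which splits into an integer part of -2 and a fraction of 192/256 = 0.75, matching the rule that the fraction is always positive.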
Fixed point integer representations are used in two ways in Direct3D.
 Post-clipped vertex positions in the rasterizer are snapped to fixed point, to uniformly distribute precision across the RenderTarget area. Many rasterizer operations, including face culling as one example, occur on fixed point snapped positions, while other operations, such as attribute interpolator setup, use positions that have been converted back to floating point from the fixed point snapped positions.
 Texture coordinates for sampling operations are snapped to fixed point (after being scaled by texture size), to uniformly distribute precision across texture space, in choosing filter tap locations/weights. Weight values are converted back to floating point before actual filtering arithmetic is performed.
Source Data Type  Destination Data Type  Conversion Rule 

FLOAT  Fixed Point Integer  The following is the general procedure for converting a floating point number n to a fixed point integer i.f, where i is the number of (signed) integer bits and f is the number of fractional bits.

Fixed Point Integer  FLOAT  Assume that the specific fixed point representation being converted to float does not contain more than a total of 24 bits of information, no more than 23 bits of which are in the fractional component. Suppose a given fixed point number, fxp, is in i.f form (i bits integer, f bits fraction). The conversion to float is akin to the following pseudocode.
float result = (float)(fxp >> f) +                              // extract integer
               ((float)(fxp & ((1 << f) - 1)) / (float)(1 << f)); // extract fraction
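Both directions of the table above can be sketched in C++. The step list for FLOAT to fixed point is not reproduced in this topic, so float_to_fixed below is only an assumption-based sketch: NaN becomes 0, out-of-range values including INF are clamped to the representable range, and round-to-nearest-even is used, as stated earlier for integer and fixed point targets. The function names are hypothetical, and i + f is assumed to be at most 24.

```cpp
#include <cmath>
#include <cstdint>

// Assumed FLOAT -> i.f fixed point conversion (result held in an int32_t).
int32_t float_to_fixed(float n, int int_bits, int frac_bits) {
    if (std::isnan(n)) return 0;
    const int32_t fixed_max = (1 << (int_bits + frac_bits - 1)) - 1;
    const int32_t fixed_min = -fixed_max - 1;
    // Scale by 2^f so the implied decimal sits at the integer boundary.
    const double scaled = static_cast<double>(n) * std::ldexp(1.0, frac_bits);
    if (scaled >= fixed_max) return fixed_max;    // includes +INF
    if (scaled <= fixed_min) return fixed_min;    // includes -INF
    // nearbyint rounds to nearest even under the default rounding mode.
    return static_cast<int32_t>(std::nearbyint(scaled));
}

// i.f fixed point -> FLOAT: integer part plus the positive fraction.
float fixed_to_float(int32_t fxp, int frac_bits) {
    const int32_t frac_mask = (1 << frac_bits) - 1;
    return static_cast<float>(fxp >> frac_bits) +
           static_cast<float>(fxp & frac_mask) /
               static_cast<float>(1 << frac_bits);
}
```

As a round trip, float_to_fixed(-1.25f, 16, 8) gives -320, and fixed_to_float(-320, 8) recovers -1.25f as -2.0f + 0.75f.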
