Basic data types—floating-point types
In the IAR C/C++ Compiler for RX, floating-point values are represented in standard IEC 60559 format. The sizes for the different floating-point types are:
| Type | Size | Range (+/-) | Decimals | Exponent | Mantissa | Alignment |
|---|---|---|---|---|---|---|
| __fp16 | 16 bits | ±2E-14 to 65504 | 3 | 5 bits | 11 bits | 2 |
| float | 32 bits | ±1.18E-38 to ±3.39E+38 | 7 | 8 bits | 23 bits | 4 |
| double (--double=32) | 32 bits | ±1.18E-38 to ±3.39E+38 | 7 | 8 bits | 23 bits | 4 |
| double (--double=64) | 64 bits | ±2.23E-308 to ±1.79E+308 | 15 | 11 bits | 52 bits | 4 |
| long double (--double=32) | 32 bits | ±1.18E-38 to ±3.39E+38 | 7 | 8 bits | 23 bits | 4 |
| long double (--double=64) | 64 bits | ±2.23E-308 to ±1.79E+308 | 15 | 11 bits | 52 bits | 4 |
Note
The size of double and long double depends on the --double={32|64} option; see --double. The type long double uses the same precision as double.
The __fp16 floating-point type is only a storage type. All numerical operations will operate on values promoted to float.
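To illustrate what promoting a stored 16-bit value to float involves, here is a minimal portable sketch. It assumes that __fp16 uses the IEC 60559 half-precision layout (1 sign bit, 5 exponent bits with a bias of 15, and 10 stored mantissa bits plus an implicit leading 1), which matches the exponent and mantissa sizes listed above. The function name half_bits_to_float is hypothetical; the compiler performs this conversion itself whenever an __fp16 value is used in an expression.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen a 16-bit half-precision bit pattern to float.
   Assumed layout: 1 sign bit, 5 exponent bits (bias 15), 10 stored mantissa bits. */
float half_bits_to_float(uint16_t half)
{
    uint32_t sign = (uint32_t)(half >> 15) & 1u;
    int exponent = (half >> 10) & 0x1F;
    uint32_t mantissa = half & 0x3FFu;
    float magnitude;

    if (exponent == 0x1F) {
        /* Infinity or NaN: build the matching 32-bit pattern directly. */
        uint32_t bits = (sign << 31) | 0x7F800000u | (mantissa << 13);
        assert(sizeof bits == sizeof magnitude);
        memcpy(&magnitude, &bits, sizeof bits);
        return magnitude;
    }

    /* Normal values have an implicit leading 1; subnormal values
       (exponent field 0) do not and are scaled as if the exponent was 1. */
    int scale = (exponent ? exponent : 1) - 15 - 10;
    magnitude = (float)(mantissa | (exponent ? 0x400u : 0u));
    for (; scale > 0; scale--) magnitude *= 2.0f;   /* multiply by 2^scale */
    for (; scale < 0; scale++) magnitude *= 0.5f;
    return sign ? -magnitude : magnitude;
}
```

With this sketch, the pattern 0x3C00 widens to 1.0 and 0x7BFF widens to 65504, the upper end of the range listed for the 16-bit type.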
Floating-point environment
Exception flags are not supported. The feraiseexcept function does not raise any exceptions.
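Because exception flags cannot be queried, code that needs to detect a division by zero, an overflow, or an invalid operation can inspect the result value instead. The following is a minimal sketch, not a recommendation from the compiler documentation: checked_divide is a hypothetical helper, and the sketch assumes that such operations produce an infinity or a NaN as described by IEC 60559.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: divide, store the quotient, and report whether
   the quotient is a finite number (neither an infinity nor a NaN). */
bool checked_divide(float numerator, float denominator, float *result)
{
    assert(result != NULL);
    *result = numerator / denominator;
    return isfinite(*result);
}
```

A caller would test the return value rather than calling fetestexcept after the division.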
32-bit floating-point format
The representation of a 32-bit floating-point number as an integer is:
Bit 31 holds the sign (S), bits 30 to 23 hold the exponent, and bits 22 to 0 hold the mantissa.
The exponent is 8 bits, and the mantissa is 23 bits.
The value of the number is:
(-1)^S * 2^(Exponent-127) * 1.Mantissa
The range of the number is at least:
±1.18E-38 to ±3.39E+38
The precision of the float operators (+, -, *, and /) is approximately 7 decimal digits.
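The field layout and the value formula above can be checked with a short sketch. The type f32_fields and the functions f32_split and f32_normal_value are hypothetical names introduced for illustration; the code assumes that float is the 32-bit format described here and it only reconstructs normal numbers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a 32-bit floating-point number. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t mantissa;  /* 23 bits, without the implicit leading 1 */
} f32_fields;

/* Split a float into its sign, exponent, and mantissa fields. */
f32_fields f32_split(float value)
{
    uint32_t bits;
    f32_fields fields;

    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);   /* reinterpret without aliasing issues */
    fields.sign = bits >> 31;
    fields.exponent = (bits >> 23) & 0xFFu;
    fields.mantissa = bits & 0x7FFFFFu;
    return fields;
}

/* Rebuild a normal number as (-1)^S * 2^(Exponent-127) * 1.Mantissa.
   Not valid for exponent fields 0 (zero, subnormal) or 255 (infinity, NaN). */
float f32_normal_value(f32_fields fields)
{
    float magnitude = 1.0f + (float)fields.mantissa / 8388608.0f; /* 2^23 */
    int scale = (int)fields.exponent - 127;

    for (; scale > 0; scale--) magnitude *= 2.0f;
    for (; scale < 0; scale++) magnitude *= 0.5f;
    return fields.sign ? -magnitude : magnitude;
}
```

For example, -6.5 is 1.625 * 2^2 with a negative sign, so f32_split yields S = 1, Exponent = 129, and Mantissa = 0x500000.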
Representation of special floating-point numbers
This list describes the representation of special floating-point numbers:
Zero is represented by zero mantissa and exponent. The sign bit signifies positive or negative zero.
Infinity is represented by setting the exponent to the highest value and the mantissa to zero. The sign bit signifies positive or negative infinity.
For the float type, Not a number (NaN) is represented by setting the exponent to the highest positive value and the mantissa to a non-zero value. The value of the sign bit is ignored.
For the double type, Not a number (NaN) is represented by setting the exponent to 7FF and at least one of the highest twenty bits in the mantissa to non-zero. The lower thirty-two bits of the mantissa are ignored. The value of the sign bit is also ignored.
Subnormal numbers are used for representing values smaller than what can be represented by normal values. The drawback is that the precision will decrease with smaller values. The exponent is set to 0 to signify that the number is subnormal, even though the number is treated as if the exponent was 1. Unlike normal numbers, subnormal numbers do not have an implicit 1 as the most significant bit (the MSB) of the mantissa. The value of a subnormal number is:
(-1)^S * 2^(1-BIAS) * 0.Mantissa
where BIAS is 127.
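The rules in this list can be expressed as a small classifier for 32-bit values. This is an illustrative sketch only: f32_class, f32_from_bits, and f32_classify are hypothetical names, and the code assumes that float is the 32-bit format described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    F32_ZERO, F32_SUBNORMAL, F32_NORMAL, F32_INFINITY, F32_NAN
} f32_class;

/* Build a float from a raw 32-bit pattern (handy for experiments). */
float f32_from_bits(uint32_t bits)
{
    float value;
    assert(sizeof bits == sizeof value);
    memcpy(&value, &bits, sizeof bits);
    return value;
}

/* Classify a float by its exponent and mantissa fields; the sign bit is ignored. */
f32_class f32_classify(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);

    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;

    if (exponent == 0xFFu)               /* highest exponent value */
        return mantissa ? F32_NAN : F32_INFINITY;
    if (exponent == 0u)                  /* exponent field of zero */
        return mantissa ? F32_SUBNORMAL : F32_ZERO;
    return F32_NORMAL;
}
```

The classifier uses only integer operations on the bit pattern, so it does not itself perform floating-point arithmetic on the value being inspected.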
By default, subnormal numbers are only supported for 64-bit floating-point numbers. However, the RX600 libraries can use the unimplemented processing exception of the CPU to support 32-bit floating-point subnormal numbers.
Note
If the 64-bit FPU is used (--fpu=64), subnormal numbers are not supported for either 32-bit or 64-bit floating-point numbers.
To enable the subnormal number exception handler, use the linker option --redirect as follows:
--redirect __float_placeholder=__unimpl_processing_handler
Supporting subnormal numbers for 32-bit floating-point numbers this way carries a large overhead in both size and speed compared to a normal FPU instruction, which requires very few CPU cycles. The subnormal number exception handler will use approximately 900 bytes of code space, and about 50–200 cycles per exception, depending on the operation and the operands. For that reason, if execution speed is important, try to use floating-point algorithms that do not require subnormal number capabilities for 32-bit floating-point numbers.
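One possible way to keep subnormal values out of 32-bit floating-point code is to replace them with zero before they are used, for example when reading stored or received data. The sketch below assumes that flushing to zero is acceptable for the algorithm; f32_flush_subnormal is a hypothetical helper, not a library function. It inspects the bit pattern with integer operations only, and it does not prevent an arithmetic operation from producing a subnormal result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return value unchanged unless it is subnormal,
   in which case return a zero with the same sign. */
float f32_flush_subnormal(float value)
{
    uint32_t bits;

    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);
    if ((bits & 0x7F800000u) == 0u) {    /* exponent field is 0 */
        bits &= 0x80000000u;             /* keep only the sign bit */
    }
    memcpy(&value, &bits, sizeof bits);
    return value;
}
```

Values that are already zero pass through unchanged, because clearing the mantissa of a zero leaves the same bit pattern.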
To remove subnormal number handling for 32-bit floating-point numbers, use this linker command:
--redirect __float_placeholder=__floating_point_handler