Skip to main content

IAR Embedded Workbench for Arm 9.70.x

Basic data types—floating-point types

In this section:

In the IAR C/C++ Compiler for Arm, floating-point values are represented in standard IEC 60559 format. The sizes for the different floating-point types are:

Type

Size

Range (+/-)

Decimals

Exponent

Mantissa

Alignment

__fp16

16 bits

±2E-14 to 65504

3

5 bits

11 bits

2

float

32 bits

±1.18E-38 to ±3.40E+38

7

8 bits

23 bits

4

double

64 bits

±2.23E-308 to ±1.79E+308

15

11 bits

52 bits

8

long double

64 bits

±2.23E-308 to ±1.79E+308

15

11 bits

52 bits

8

Table 98. Floating-point types 


For Cortex-M0 and Cortex-M1, the compiler does not support subnormal numbers. All operations that should produce subnormal numbers will instead generate zero. For information about the representation of subnormal numbers for other cores, see Representation of special floating-point numbers.

The __fp16 floating-point type is only a storage type. All numerical operations will operate on values promoted to float. There is also a standard type _Float16, which is layout-compatible with __fp16. Some cores support numerical operations directly on _Float16 values. For other cores, it is a storage-only type.

Note

The C/C++ standard library does not support the _Float16 type. If you want to use any of the standard library functions on the _Float16 type, you must cast the _Float16 value to single-precision or double-precision, and then use the appropriate library function. Because there is no string format specifier for the _Float16 type, an explicit cast is required.

Floating-point environment

Exception flags for floating-point values are supported for devices with a VFP unit, and they are defined in the fenv.h file. For devices without a VFP unit, the functions defined in the fenv.h file exist but have no functionality.

The feraiseexcept function does not raise an inexact floating-point exception when called with FE_OVERFLOW or FE_UNDERFLOW.

32-bit floating-point format

The representation of a 32-bit floating-point number as an integer is:

32bitFloatFormat_1.png

The exponent is 8 bits, and the mantissa is 23 bits.

The value of the number is:

(-1)S * 2(Exponent-127) * 1.Mantissa

The range of the number is at least:

±1.18E-38 to ±3.39E+38

The precision of the float operators (+, -, *, and /) is approximately 7 decimal digits.

64-bit floating-point format

The representation of a 64-bit floating-point number as an integer is:

64bitFloatFormat_1.png

The exponent is 11 bits, and the mantissa is 52 bits.

The value of the number is:

(-1)S * 2(Exponent-1023) * 1.Mantissa

The range of the number is at least:

±2.23E-308 to ±1.79E+308

The precision of the float operators (+, -, *, and /) is approximately 15 decimal digits.

Representation of special floating-point numbers

This list describes the representation of special floating-point numbers:

  • Zero is represented by zero mantissa and exponent. The sign bit signifies positive or negative zero.

  • Infinity is represented by setting the exponent to the highest value and the mantissa to zero. The sign bit signifies positive or negative infinity.

  • Not a number (NaN) is represented by setting the exponent to the highest positive value and the most significant bit in the mantissa to 1. The value of the sign bit is ignored.

  • Subnormal numbers are used for representing values smaller than what can be represented by normal values. The drawback is that the precision will decrease with smaller values. The exponent is set to 0 to signify that the number is subnormal, even though the number is treated as if the exponent was 1. Unlike normal numbers, subnormal numbers do not have an implicit 1 as the most significant bit (the MSB) of the mantissa. The value of a subnormal number is:

    (-1)S * 2(1-BIAS) * 0.Mantissa

    where BIAS is 127 and 1023 for 32-bit and 64-bit floating-point values, respectively.