IEEE 754 Floating-Point Standard: The Essentials

1 Overview

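IEEE 754 is the IEEE specification for representing floating-point numbers. Its purpose is to unify the encoding of floating-point numbers and so improve the portability of programs that do floating-point arithmetic.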

The three IEEE 754 floating-point formats discussed here are single precision, double precision, and double-extended precision.

Each format consists of three fields: the sign bit (s), the exponent (e), and the mantissa (m).

single-precision:  | 31 | 30:23 | 22:0 |   (Ns=1, Ne=8,  Nm=23)
double-precision:  | 63 | 62:52 | 51:0 |   (Ns=1, Ne=11, Nm=52)
double-extended:   | 79 | 78:64 | 63:0 |   (x86 80-bit format as an example; bit 63 is an explicit integer bit)

Depending on the exponent field, the value represented is either a normalized value or a non-normalized value.

The IEEE 754-2008 standard introduced a half-precision floating-point (Half-Precision Float) type, float16, which the standard calls binary16.

GCC supports this type as __fp16 on ARM and AArch64 (the 64-bit execution state of the ARMv8 ISA). On ARM, compile with -mfp16-format=ieee; AArch64 does not need this option.

On ARM, include the header <arm_fp16.h> and compile with -mfpu=neon-fp16 -mfloat-abi=softfp.

-mfp16-format=ieee selects the IEEE 754-2008 format. Normalized values range from 2^{-14} to 65504. There are 11 bits of significand precision, approximately 3 decimal digits.
-mfp16-format=alternative selects the ARM alternative format. Normalized values range from 2^{-14} to 131008. It is similar to the IEEE format, but does not support infinities or NaNs.

2 Normalized values

When e != 0 && e != ~0 (e is neither all zeros nor all ones), the value represented is a normalized value:

 V = (-1)^s * 2^E * (M+1)

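where E = e - Bias, Bias = 2^(Ne-1)-1, and M is the mantissa field read as a binary fraction, M = m / 2^Nm (so 0 <= M < 1).

For single precision, for example, Bias = 127, so V = (-1)^s * 2^(e-127) * (M+1)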

3 Non-normalized values

When e == 0 || e == ~0, the value represented is a non-normalized value.

1. e == 0

m == 0, s == 0  ---> +0.0
m == 0, s == 1  ---> -0.0
m != 0          ---> V = (-1)^s * 2^E * M, where E = 1 - Bias, Bias = 2^(Ne-1)-1

For single precision, e == 0 and m != 0 gives E = 1 - 127 = -126

2. e == ~0

m == 0, s == 0  ---> +INFINITY
m == 0, s == 1  ---> -INFINITY
m != 0          ---> NaN (Not a Number)

Example 1: converting a binary single-precision floating-point number to decimal

0x80480000 1000 0000 0100 1000 0000 0000 0000 0000

1 00000000 10010000000000000000000

s = 1

e = 0, E = 1 - 127 = -126

Since e == 0, the mantissa part M is taken as-is (no 1 is added):

0.10010000000000000000000 (binary) = 0.5625

The decimal value of this floating-point number is:

(-1)^1 * 2^(-126) * 0.5625 = -6.612156e-39

The following C program can be used to verify this:

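#include <stdio.h>

/* Overlay the four bytes of a float. This assumes a 4-byte IEEE 754 float
   on a little-endian machine; the anonymous struct needs C11. */
union FI
{
    float f;
    struct
    {
        unsigned char b0;   /* least significant byte */
        unsigned char b1;
        unsigned char b2;
        unsigned char b3;   /* most significant byte: sign and high exponent bits */
    };
} u;

int main(void)
{
    /* 0x80480000, stored byte by byte */
    u.b3 = 0x80;
    u.b2 = 0x48;
    u.b1 = 0x00;
    u.b0 = 0x00;

    printf("x = %e\n", u.f);
    return 0;
}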

More concisely:

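#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t x = 0x80480000u;   /* does not fit in a 32-bit signed int */
    float y;

    /* memcpy copies the bit pattern; *(float *)&x would break strict aliasing */
    memcpy(&y, &x, sizeof y);

    printf("x = %e\n", y);
    return 0;
}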

4 Reference

http://en.wikipedia.org/wiki/IEEE_754-2008