Essentials of the IEEE 754 Floating-Point Standard

From Jack's Lab

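1 Overview

The IEEE 754 standard is the IEEE specification for representing floating-point numbers. Its purpose is to unify floating-point encodings and make floating-point programs more portable.

The IEEE 754 formats covered in this overview are single precision, double precision, and double-extended precision.

Each format consists of three fields: a sign bit (s), an exponent (e), and a fraction, or mantissa (m).

single-precision:  | 31 | 30:23 | 22:0 |   (Ns=1, Ne=8, Nm=23)
double-precision:  | 63 | 62:52 | 51:0 |   (Ns=1, Ne=11, Nm=52)
double-extended:   | 79 | 78:64 | 63:0 |   (using the x86 80-bit format as an example)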
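
As a minimal sketch of the single-precision layout above, the following C helpers split a 32-bit pattern into its three fields. The names `f32_fields`, `f32_split` and `f32_bits` are illustrative and do not come from any library, and the code assumes that `float` is the 32-bit IEEE 754 single-precision format.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative names; not part of any standard library. */
typedef struct {
    uint32_t s;  /* sign bit, bit 31 */
    uint32_t e;  /* biased exponent, bits 30:23 */
    uint32_t m;  /* fraction field, bits 22:0 */
} f32_fields;

/* Split a raw 32-bit pattern into sign, exponent and fraction fields. */
static f32_fields f32_split(uint32_t bits)
{
    f32_fields f;
    f.s = bits >> 31;
    f.e = (bits >> 23) & 0xFFu;
    f.m = bits & 0x7FFFFFu;
    return f;
}

/* Copy the bytes of a float into an integer
   (assumes float is a 32-bit IEEE 754 value). */
static uint32_t f32_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

For example, `f32_split(f32_bits(1.0f))` yields s = 0, e = 127, m = 0.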


Depending on the exponent field, the represented value is either a normalized value or a non-normalized value.



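2 Half-Precision Floating Point

The IEEE 754-2008 standard introduced the half-precision floating-point type (Half-Precision Float, float16).


The standard specifies the following layout for binary16:

[Figure: Float16.png, bit layout of the binary16 format]


  • Sign bit: 1 bit
  • Exponent width: 5 bits
  • Significand precision: 11 bits (10 stored explicitly, 1 implicit)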


The format is assumed to have an implicit lead bit with value 1 unless the exponent field is stored with all zeros. Thus only 10 bits of the significand appear in the memory format but the total precision is 11 bits. In IEEE 754 parlance, there are 10 bits of significand, but there are 11 bits of significand precision (log10(2^11) ≈ 3.311 decimal digits, or 4 digits ± slightly less than 5 units in the last place).


The exponent field is 5 bits wide:

The half-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 15; also known as exponent bias in the IEEE 754 standard.

  • Emin = 00001₂ − 01111₂ = −14
  • Emax = 11110₂ − 01111₂ = 15
  • Exponent bias = 01111₂ = 15

Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 15 has to be subtracted from the stored exponent.

The stored exponents 00000₂ and 11111₂ are interpreted specially.

Exponent             | Significand = zero | Significand ≠ zero      | Equation
00000₂               | zero, −0           | subnormal numbers       | (−1)^signbit × 2^−14 × 0.significantbits₂
00001₂, ..., 11110₂  | normalized value   | normalized value        | (−1)^signbit × 2^(exponent−15) × 1.significantbits₂
11111₂               | ±infinity          | NaN (quiet, signalling) |
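
The interpretation rules above can be sketched as a small decoder. This is a minimal illustration rather than a reference implementation; `half_to_float` is a hypothetical name, and the result is returned as a `float`, which can hold every binary16 value exactly.

```c
#include <math.h>
#include <stdint.h>

/* Decode a binary16 bit pattern into a float, following the rules above.
   half_to_float is an illustrative name, not a library function. */
static float half_to_float(uint16_t h)
{
    int sign = (h >> 15) & 0x1;
    int exponent = (h >> 10) & 0x1F;   /* stored (biased) exponent */
    int fraction = h & 0x3FF;          /* 10 stored significand bits */
    float value;

    if (exponent == 0) {
        /* zero or subnormal: 2^-14 * 0.fraction */
        value = ldexpf((float)fraction, -14 - 10);
    } else if (exponent == 0x1F) {
        /* all ones: infinity if the fraction is zero, NaN otherwise */
        value = (fraction == 0) ? INFINITY : NAN;
    } else {
        /* normalized: 2^(exponent-15) * 1.fraction */
        value = ldexpf((float)(fraction + 1024), exponent - 15 - 10);
    }
    return sign ? -value : value;
}
```

Adding 1024 to the fraction restores the implicit leading 1, and the extra −10 in the exponent accounts for treating the 10-bit fraction as an integer.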


2.1 Examples

0 00000 0000000001 = 0x0001 = 2^-14 x (0 + 1/1024)  ≈ 0.000000059605 (smallest positive subnormal number)
0 00000 1111111111 = 0x03ff = 2^-14 x (0 + 1023/1024) ≈ 0.000060976 (largest subnormal number)
0 00001 0000000000 = 0x0400 = 2^-14 x (1 + 0/1024) ≈ 0.000061035  (smallest positive normal number)
0 11110 1111111111 = 0x7bff = 2^15 x (1 + 1023/1024) = 65504 (largest normal number)
0 01110 1111111111 = 0x3bff = 2^-1 x (1 + 1023/1024) ≈ 0.99951 (largest number less than one)
0 01111 0000000000 = 0x3c00 = 2^0 x (1 + 0/1024) = 1 

0 01111 0000000001 = 0x3c01 = 2^0 x (1 + 1/1024) ≈ 1.001

0 01101 0101010101 = 0x3555 = 2^-2 x (1 + 341/1024) = 0.333251953125 ≈ 1/3

1 10000 0000000000 = 0xc000 = −2

0 00000 0000000000 = 0x0000 = 0
1 00000 0000000000 = 0x8000 = −0

0 11111 0000000000 = 0x7c00 = infinity
1 11111 0000000000 = 0xfc00 = -infinity


2.2 Precision Limits

Precision limitations on decimal values in [0, 1]:

  • Decimals between 2^−24 (minimum positive subnormal) and 2^−14 (maximum subnormal): fixed interval 2^−24
  • Decimals between 2^−14 (minimum positive normal) and 2^−13: fixed interval 2^−24
  • Decimals between 2^−13 and 2^−12: fixed interval 2^−23
  • Decimals between 2^−12 and 2^−11: fixed interval 2^−22
  • Decimals between 2^−11 and 2^−10: fixed interval 2^−21
  • Decimals between 2^−10 and 2^−9: fixed interval 2^−20
  • Decimals between 2^−9 and 2^−8: fixed interval 2^−19
  • Decimals between 2^−8 and 2^−7: fixed interval 2^−18
  • Decimals between 2^−7 and 2^−6: fixed interval 2^−17
  • Decimals between 2^−6 and 2^−5: fixed interval 2^−16
  • Decimals between 2^−5 and 2^−4: fixed interval 2^−15
  • Decimals between 2^−4 and 2^−3: fixed interval 2^−14
  • Decimals between 2^−3 and 2^−2: fixed interval 2^−13
  • Decimals between 2^−2 and 2^−1: fixed interval 2^−12
  • Decimals between 2^−1 and 2^0: fixed interval 2^−11


Precision limitations on decimal values in [1, 2048]:

  • Decimals between 1 and 2: fixed interval 2^−10 (1+2^−10 is the next largest float after 1) ≈ 0.001
  • Decimals between 2 and 4: fixed interval 2^−9 ≈ 0.002
  • Decimals between 4 and 8: fixed interval 2^−8 ≈ 0.004
  • Decimals between 8 and 16: fixed interval 2^−7 ≈ 0.008
  • Decimals between 16 and 32: fixed interval 2^−6 ≈ 0.016
  • Decimals between 32 and 64: fixed interval 2^−5 ≈ 0.031
  • Decimals between 64 and 128: fixed interval 2^−4 = 0.0625
  • Decimals between 128 and 256: fixed interval 2^−3 = 0.125
  • Decimals between 256 and 512: fixed interval 2^−2 = 0.25
  • Decimals between 512 and 1024: fixed interval 2^−1 = 0.5
  • Decimals between 1024 and 2048: fixed interval 2^0 = 1
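
The intervals listed above follow one rule: within [2^E, 2^(E+1)) the spacing is 2^(E−10), and everything below 2^−14 shares the spacing 2^−24. A minimal sketch of that rule, using the hypothetical helper name `half_spacing` and assuming a finite input no larger than 65504 in magnitude:

```c
#include <math.h>

/* Spacing between adjacent binary16 values near x, for finite |x| <= 65504.
   half_spacing is an illustrative name, not a library function. */
static double half_spacing(double x)
{
    int exp2;                      /* frexp gives |x| = f * 2^exp2, f in [0.5, 1) */
    x = fabs(x);
    if (x == 0.0)
        return ldexp(1.0, -24);    /* smallest subnormal step */
    frexp(x, &exp2);
    int E = exp2 - 1;              /* so that |x| lies in [2^E, 2^(E+1)) */
    if (E < -14)
        E = -14;                   /* subnormals share the lowest binade's spacing */
    return ldexp(1.0, E - 10);     /* 10 stored fraction bits */
}
```

For instance, `half_spacing(100.0)` gives 0.0625, matching the entry for the range 64 to 128.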


Precision limitations on integer values

  • Integers between 0 and 2048 can be exactly represented (and also between −2048 and 0)
  • Integers between 2048 and 4096 round to a multiple of 2 (even number)
  • Integers between 4096 and 8192 round to a multiple of 4
  • Integers between 8192 and 16384 round to a multiple of 8
  • Integers between 16384 and 32768 round to a multiple of 16
  • Integers between 32768 and 65519 round to a multiple of 32
  • Integers above 65519 are rounded to "infinity" if using round-to-even, or above 65535 if using round-to-zero, or above 65504 if using round-to-infinity.


2.3 Compiler Support

GCC supports this type as __fp16 on ARM and AArch64 (the 64-bit execution state of the ARMv8 ISA). On ARM, compile with -mfp16-format=ieee; AArch64 does not need this option.

On ARM, include the header <arm_fp16.h> and compile with -mfpu=neon-fp16 -mfloat-abi=softfp

  • -mfp16-format=ieee, selects the IEEE 754-2008 format. Normalized values in the range of 2^{-14} to 65504. There are 11 bits of significand precision, approximately 3 decimal digits
  • -mfp16-format=alternative, selects the ARM alternative format. Normalized values in the range of 2^{-14} to 131008. Similar to the IEEE format, but does not support infinities or NaNs


2.4 General-Purpose Conversion Functions

Portable, cross-platform functions for converting to and from float / double:

Add the file "ieeehalfprecision.c" to your project and use it like this:

#include  "ieeehalfprecision.c"

// Round-trip one value through fp16 using the ieeehalfprecision.c helpers
void test_fp16()
{
	float myFloat = 13.245f;

	uint16_t myfp16 = 0;

	float2halfp(&myfp16, &myFloat, 1);		// convert 1 float to fp16

	Serial.println("Testing the general function of float16... ");

	Serial.print("Convert fp32=13.245 to fp16, HEX: ");
	Serial.println(myfp16, HEX);

	float myfp32 = 0;
	halfp2float(&myfp32, &myfp16, 1);		// recover from 1 fp16 to float

	Serial.print("fp16 to float: ");
	Serial.println((double)myfp32, 3);
}

void setup()
{
	Serial.setRouteLoc(1);
	Serial.begin(115200);
} 

void loop()
{
	__fp16 f1 = 0.232;
	
	Serial.println("Testing the float16, supported by gcc... ");

	Serial.print("sizeof(fp16): ");
	Serial.println(sizeof(f1));

	Serial.print("f1 = ");
	Serial.println((double)f1, 4);

	// test the function
	test_fp16();

	delay(5000);
}

Output (the log also contains range tests that are not part of the sketch above):

---------------------------------------------------
Testing the float16, supported by gcc... 
sizeof(fp16): 2
f1 = 0.2321
---------------------------------------------------
Testing the general function of float16... 
Convert fp32=13.245 to fp16, HEX: 4A9F
fp16 to float: 13.242
---------------------------------------------------
----- Test the [0, 1] -----
0.0000000000 = 0x0
0x2 = 0.0000001192
0x3 = 0.0000001788
0x4 = 0.0000002384
0x80 = 0.0000076294
0x800 = 0.0001220703
0x2000 = 0.0078125000
0x2001 = 0.0078201294
0x2002 = 0.0078277588
0x3000 = 0.1250000000
0x3BFD = 0.9985351563
0x3BFE = 0.9990234375
0x3BFF = 0.9995117188
1.0000000000 = 0x3C00
----- Test the [-1, 0] -----
0x8002 = -0.0000001192
0x8003 = -0.0000001788
0x8004 = -0.0000002384
0x8080 = -0.0000076294
0x8800 = -0.0001220703
0xA000 = -0.0078125000
0xA001 = -0.0078201294
0xA002 = -0.0078277588
0xB000 = -0.1250000000
0xBBFD = -0.9985351563
0xBBFE = -0.9990234375
0xBBFF = -0.9995117188
----- Test the [1, 2048] -----
0x3C01 = 1.0009765625
0x3C02 = 1.0019531250
0x3CFD = 1.2470703125
0x3CFE = 1.2480468750
0x3CFF = 1.2490234375
3.6760001183 = 0x435A
3.6770000458 = 0x435B
3.6779999733 = 0x435B
3.5439999104 = 0x4317
3.5450000763 = 0x4317
3.5460000038 = 0x4318
0x3D00 = 1.2500000000
0x3E00 = 1.5000000000
0x4000 = 2.0000000000
0x4001 = 2.0019531250
0x4002 = 2.0039062500
0x5000 = 32.0000000000
0x5001 = 32.0312500000
0x5002 = 32.0625000000
32.1234016418 = 0x5004
32.2234001160 = 0x5007
32.2253990173 = 0x5007
0x6000 = 512.0000000000
0x6001 = 512.5000000000
0x6002 = 513.0000000000
0x67FE = 2046.0000000000
0x67FF = 2047.0000000000
2048.0000000000 = 0x6800
----- Test the [2048, NaN] -----
4096.0000000000 = 0x6C00
8192.0000000000 = 0x7000
16384.0000000000 = 0x7400
32768.0000000000 = 0x7800
65504.0000000000 = 0x7BFF
0x7BFD = 65440.0000000000
0x7BFE = 65472.0000000000
0x7BFF = 65504.0000000000
0x7C00 = inf
0x7C01 = nan
0x7D00 = nan
0x7E00 = nan
0x7F00 = nan
0x8000 = 0.0000000000

Further reference: https://www.mathworks.com/matlabcentral/fileexchange/23173-ieee-754r-half-precision-floating-point-converter


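3 Normalized Values

When e != 0 && e != ~0 (the exponent field is neither all zeros nor all ones), the value is normalized:

 V = (-1)^s * 2^E * (M+1)


where E = e - Bias, Bias = 2^(Ne-1)-1, and M = m / 2^Nm is the fraction field read as a binary fraction in [0, 1).

For example, single precision has Bias = 127, so V = (-1)^s * 2^(e-127) * (M+1)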
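
As a minimal sketch, the normalized-value formula can be evaluated directly for a single-precision bit pattern. The function name `f32_normalized_value` is illustrative, and the caller is assumed to pass a pattern whose exponent field is neither all zeros nor all ones.

```c
#include <math.h>
#include <stdint.h>

/* Evaluate V = (-1)^s * 2^(e - Bias) * (M + 1) for a normalized
   single-precision pattern (Ne = 8, Nm = 23, Bias = 127).
   f32_normalized_value is an illustrative name. */
static double f32_normalized_value(uint32_t bits)
{
    uint32_t s = bits >> 31;
    uint32_t e = (bits >> 23) & 0xFFu;
    uint32_t m = bits & 0x7FFFFFu;
    int bias = (1 << (8 - 1)) - 1;              /* 2^(Ne-1) - 1 = 127 */
    double M = (double)m / (double)(1u << 23);  /* fraction as a value in [0, 1) */
    double v = ldexp(1.0 + M, (int)e - bias);
    return s ? -v : v;
}
```

For example, the pattern 0x41200000 has e = 130 and M = 0.25, so V = 2^3 * 1.25 = 10.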



4 Non-Normalized Values

When e == 0 || e == ~0, the value is not normalized.


1. e == 0
m == 0, s == 0  ---> +0.0
m == 0, s == 1  ---> -0.0
m != 0: V = (-1)^s * 2^E * M, where E = 1 - Bias, Bias = 2^(Ne-1)-1

For single precision with e == 0 and m != 0, E = 1 - 127 = -126


2. e == ~0
m == 0, s == 0  ---> +INFINITY
m == 0, s == 1  ---> -INFINITY
m != 0          ---> NaN (Not a Number)
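
The case analysis for the exponent field can be sketched as a classifier for single-precision bit patterns. The enum and function names below are illustrative and not taken from any library.

```c
#include <stdint.h>

/* Illustrative class names for a single-precision bit pattern. */
typedef enum {
    F32_ZERO, F32_SUBNORMAL, F32_NORMAL, F32_INFINITY, F32_NAN
} f32_class;

/* Classify a raw 32-bit pattern by its exponent and fraction fields. */
static f32_class f32_classify(uint32_t bits)
{
    uint32_t e = (bits >> 23) & 0xFFu;   /* 8-bit exponent field */
    uint32_t m = bits & 0x7FFFFFu;       /* 23-bit fraction field */

    if (e == 0)                          /* all zeros: zero or subnormal */
        return (m == 0) ? F32_ZERO : F32_SUBNORMAL;
    if (e == 0xFFu)                      /* all ones: infinity or NaN */
        return (m == 0) ? F32_INFINITY : F32_NAN;
    return F32_NORMAL;
}
```

The sign bit does not affect the class, so +0.0 and -0.0 both map to `F32_ZERO`, and both infinities map to `F32_INFINITY`.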


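Example 1: converting a binary single-precision number to decimal

0x80480000 = 1000 0000 0100 1000 0000 0000 0000 0000

1 00000000 10010000000000000000000


s = 1

e = 0, E = 1 - 127 = -126

Because e == 0, the significand M is used without adding 1:

0.10010000000000000000000 (binary) = 0.5625

The decimal value of this number is:

(-1)^1 * 2^(-126) * 0.5625 = -6.612156e-39


The following C program can be used to verify this: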

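#include <stdio.h>

/* Overlay the four bytes of a float; b0 is at the lowest address.
   The byte order used below assumes a little-endian machine. */
union FI
{
    float f;
    struct
    {
        unsigned char b0;
        unsigned char b1;
        unsigned char b2;
        unsigned char b3;
    };
}u;


int main()
{
    /* 0x80480000, most significant byte in b3 */
    u.b3 = 0x80;
    u.b2 = 0x48;
    u.b1 = 0x00;
    u.b0 = 0x00;

    printf ("x = %e\n", u.f);
    return 0;
}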


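More concisely:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main()
{
    uint32_t x = 0x80480000u;
    float y;

    /* memcpy reinterprets the bits without the aliasing problems of a pointer cast */
    memcpy(&y, &x, sizeof y);

    printf ("x = %e\n", y);
    return 0;
}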

















