Commit 81c7dbc

design doc for float16
1 parent ee11f00 commit 81c7dbc

File tree

1 file changed (+46, -0)

doc/design/float16.md

Lines changed: 46 additions & 0 deletions
# Design Doc: float16

## Why float16
Half precision (float16) is a binary floating-point format that occupies 16 bits (2 bytes) in memory. float16 is half the size of the traditional 32-bit single-precision format (float) and has lower precision and a smaller range.

When high-precision computation is not required, using the float16 data type could potentially:

- reduce storage space, memory bandwidth, and power usage;
- increase the chance of data fitting into smaller, lower-latency caches;
- provide arithmetic speedup when supported by hardware.

A brief survey of float16 support on different hardware can be found [here](https://github.com/PaddlePaddle/Paddle/issues/4853). A brief survey of existing float16 implementations can be found [here](https://github.com/Xreki/Xreki.github.io/blob/master/multi_data_types_in_dl_framework/ppt/float16_and_quantized_type.md).

There are various natively supported float16 implementations on different hardware and linear algebra libraries, including `half` on CUDA, `__fp16`/`float16_t` on ARM processors, and `Eigen::half` in Eigen.

The goal of the float16 class is to serve as a key with which the executor can find and run the operator kernel compute method specialized for float16. It should be compatible with `half` on CUDA, `__fp16` on ARM, and `Eigen::half` in Eigen to make writing customized float16 kernels easier.

## Implementation
The float16 class internally holds a single 2-byte `uint16_t` member.
```
struct float16 {
  uint16_t x;
};
```

float16 supports the following features (an interface sketch is given after this list):
- constructors / assignment operators that take input from primitive data types including bool, integers of various lengths, float, and double.
- constructors / assignment operators that take input from `half` on CUDA, `__fp16` on ARM, and `Eigen::half` in Eigen.
- conversion operators to primitive data types and to the half-precision types on CUDA, ARM, and Eigen.
- overloaded arithmetic operators (e.g., +, -, *, /) for CUDA, ARM, and non-ARM CPU, respectively. These operators take advantage of the CUDA and ARM intrinsics on the corresponding hardware.
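
To make the features above concrete, here is a minimal declaration sketch for such a class; the member functions and operator set are illustrative only, and the CUDA/ARM/Eigen-specific constructors and conversions would sit behind preprocessor guards in the real header.

```
#include <cstdint>

// Illustrative declaration sketch only, not the final API.
struct float16 {
  uint16_t x;

  // constructors / assignment from primitive data types
  float16() : x(0) {}
  explicit float16(bool b);
  explicit float16(int i);
  explicit float16(float f);
  explicit float16(double d);
  float16& operator=(float f);

  // conversion operators back to primitive data types
  explicit operator float() const;
  explicit operator double() const;
};

// overloaded arithmetic; implemented with CUDA/ARM intrinsics where the
// hardware supports them, and with a float32 fallback otherwise
float16 operator+(const float16& a, const float16& b);
float16 operator*(const float16& a, const float16& b);
```
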
To support the above features, two fundamental conversion functions are provided:
```
float16 float_to_half_rn(float f); // convert to half precision in round-to-nearest-even mode
float half_to_float(float16 h);
```
These two functions provide a one-to-one conversion between float32 and float16. They use different conversion routines depending on the current hardware: CUDA/ARM intrinsics are used when the corresponding hardware is available, and on a non-ARM CPU the conversion falls back to software emulation.
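
As one possible realization of the software-emulation branch, the conversion could be delegated to Eigen, whose `Eigen::half` type already performs round-to-nearest-even conversion from float. This is only a sketch, assuming Eigen is available and that the public `x` bit field of `Eigen::half` can be copied directly; the actual emulation routine may be written differently.

```
#include <cstdint>

#include <Eigen/Core>  // provides Eigen::half

struct float16 { uint16_t x; };  // as defined above

// Non-ARM CPU fallback sketch: reuse Eigen::half, whose float constructor
// performs round-to-nearest-even conversion.
float16 float_to_half_rn(float f) {
  Eigen::half h(f);
  float16 out;
  out.x = h.x;  // copy the raw 16 bits out of Eigen::half
  return out;
}

float half_to_float(float16 h) {
  Eigen::half tmp;
  tmp.x = h.x;
  return static_cast<float>(tmp);
}
```

Note that a round trip `half_to_float(float_to_half_rn(f))` only preserves roughly 3 decimal digits of precision, since float16 has a 10-bit mantissa.
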
## To do
After the float16 class is available, some future work items are listed below:

- Update `pybind/tensor_py.h` to bind the C++ float16 type with NumPy's float16.

- Modify the `IndicateDataType()` method in `framework/operator.h` to make it compatible with float16.

- Create a type-casting operator that can convert the data type in a tensor between float16 and other types (a generic sketch of the core cast is given after this list).
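
As an illustration of the last item, the core of such a casting operator could be a simple element-wise loop built on float16's constructors and conversion operators. `CastBuffer` below is a hypothetical helper for illustration, not an existing Paddle API.

```
#include <cstddef>

// Hypothetical element-wise cast between buffers of different element types;
// given float16's constructors and conversion operators, either InT or OutT
// can be float16.
template <typename InT, typename OutT>
void CastBuffer(const InT* in, OutT* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<OutT>(in[i]);
  }
}
```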
