---
title: "Floating-Point Representation"
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Review of floating-point numbers

{{% notice Learning tip %}}
If you are new to floating-point numbers and would like more background, see the Learning Path [Learn about integer and floating-point conversions](/learning-paths/cross-platform/integer-vs-floats/introduction-integer-float-types/). It covers data types and conversions.
{{% /notice %}}

Floating-point numbers represent real numbers using limited precision, enabling efficient storage and computation of decimal values. In C/C++, floating-point variables are created with keywords such as `float` or `double`. The IEEE 754 standard, established in 1985, defines the most widely used format for floating-point arithmetic, ensuring consistency across hardware and software.
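
As a quick illustration (a minimal sketch, not part of the standard's text), the following C program stores the same fraction in a `float` and a `double`; the printed digits diverge once each type runs out of precision:

```c
#include <stdio.h>

int main(void) {
    float  f = 1.0f / 3.0f;  /* single precision: ~7 significant decimal digits */
    double d = 1.0 / 3.0;    /* double precision: ~15-16 significant decimal digits */

    printf("float : %.20f\n", f);  /* digits past the precision limit are noise */
    printf("double: %.20f\n", d);
    return 0;
}
```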

IEEE 754 specifies two primary formats: single-precision (32-bit) and double-precision (64-bit).

Each floating-point number consists of three components, which the sketch after this list extracts from a real `float` value:
- **Sign bit**: Determines the sign (positive or negative).
- **Exponent**: Sets the scale or magnitude.
- **Significand**: Holds the significant digits in binary.
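
To make these components concrete, here is a minimal C sketch (illustrative only; the example value is arbitrary) that reinterprets the bits of a 32-bit `float` and splits out the three fields:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float value = -6.25f;
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);  /* reinterpret the float's raw bit pattern */

    unsigned sign        = bits >> 31;           /* 1 sign bit */
    unsigned exponent    = (bits >> 23) & 0xFFu; /* 8 exponent bits, biased by 127 */
    unsigned significand = bits & 0x7FFFFFu;     /* 23 fraction bits (implicit leading 1) */

    printf("sign=%u  exponent=%u (unbiased %d)  significand=0x%06X\n",
           sign, exponent, (int)exponent - 127, significand);
    return 0;
}
```

For -6.25, which is -1.5625 x 2^2, this prints a sign of 1, a biased exponent of 129, and a significand field of 0x480000.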

The standard uses a biased exponent to handle both large and small numbers efficiently, and it incorporates special values such as NaN (Not a Number), infinity, and subnormal numbers. It supports rounding modes and exception handling, which help ensure predictable results. However, floating-point arithmetic is inherently imprecise, leading to small rounding errors.
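
These special values are easy to produce in C. The short sketch below assumes an IEEE 754 target, which covers both x86 and Arm:

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    float inf  = INFINITY;
    float qnan = NAN;
    float sub  = FLT_MIN / 2.0f;  /* halving the smallest normal float gives a subnormal */

    printf("infinity: %f  (isinf=%d)\n", inf, isinf(inf));
    printf("NaN: %f  (isnan=%d, NaN==NaN is %d)\n", qnan, isnan(qnan), qnan == qnan);
    printf("subnormal: %g  (isnormal=%d)\n", sub, isnormal(sub));
    return 0;
}
```

Note that NaN compares unequal even to itself, which is why the `isnan` classification macro exists.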

The graphic below shows various forms of floating-point representation supported by Arm, each with a different number of bits assigned to the exponent and significand.

![Floating point formats alt-text#center](floating-point-formats.png "Arm-supported floating-point formats")

## Rounding errors

Because computers use a finite number of bits to store a continuous range of numbers, rounding errors are introduced. The unit in last place (ULP) is the smallest difference between two consecutive floating-point numbers. It quantifies the rounding error, which arises because not all real values can be exactly represented.

Operations round results to the nearest representable value, introducing small discrepancies. This rounding error, often measured in ULPs, reflects how far the computed value may deviate from the exact mathematical result.
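
One way to observe ULP spacing directly is with `nextafterf` from `<math.h>`, which returns the adjacent representable `float`. This sketch (illustrative; link with `-lm` on Linux) prints the ULP at two magnitudes:

```c
#include <stdio.h>
#include <math.h>

/* The ULP at x: the gap between x and the next representable float above it. */
static float ulp_at(float x) {
    return nextafterf(x, INFINITY) - x;
}

int main(void) {
    printf("ULP near 1.0f       : %g\n", ulp_at(1.0f));       /* 2^-23, about 1.19e-07 */
    printf("ULP near 1000000.0f : %g\n", ulp_at(1000000.0f)); /* 2^-4, i.e. 0.0625 */
    return 0;
}
```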

For example, with 3 bits for the significand and an exponent range of -1 to 2, only a limited set of values can be represented. The diagram below illustrates these values.

![Floating point graphic alt-text#center](floating-point.png "Representable values for a simple floating-point schema")
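
You can enumerate a toy format like this yourself. The sketch below is a hypothetical illustration that interprets the 3 bits as the fractional part of a normalized significand 1.bbb; it prints every representable positive value for each exponent, making the widening gaps visible:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Toy format: significand 1.bbb (3 fraction bits), exponent from -1 to 2. */
    for (int e = -1; e <= 2; e++) {
        for (int m = 0; m < 8; m++) {
            double sig = 1.0 + m / 8.0;      /* 1.000, 1.125, ..., 1.875 */
            printf("%7.4f", ldexp(sig, e));  /* sig * 2^e */
        }
        printf("   <- exponent %d\n", e);
    }
    return 0;
}
```

Within one exponent the values are evenly spaced, but each time the exponent increases, the spacing (the ULP) doubles.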

Key takeaways:

- ULP size increases with magnitude.
- Larger numbers have wider spacing between values (larger ULPs).
- Smaller numbers have tighter spacing (smaller ULPs), reducing quantization error.
- ULP behavior impacts numerical stability and precision, as the short demo after this list shows.
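
A classic consequence (a minimal demo, not from the original text): 0.1, 0.2, and 0.3 have no exact binary representation, so a computed sum can differ from the literal you expect by a small rounding error:

```c
#include <stdio.h>

int main(void) {
    double sum = 0.1 + 0.2;

    printf("0.1 + 0.2 = %.17g\n", sum);                     /* prints 0.30000000000000004 */
    printf("equals 0.3? %s\n", sum == 0.3 ? "yes" : "no");  /* prints no */
    return 0;
}
```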

{{% notice Learning tip %}}
Keep in mind that rounding and representation issues aren't bugs; they're a consequence of how floating-point math works at the hardware level. Understanding these fundamentals is essential when porting numerical code across architectures like x86 and Arm.
{{% /notice %}}

In the next section, you'll explore how x86 and Arm implement and optimize floating-point operations differently, and why this matters for writing portable, accurate software.