Commit 4475600

Ravenwater and claude authored

V3.99: supporting cross-platform BlockType == uint64_t (#511)
* Incrementing SEMVER to v3.99

* Fix blockbinary operator[] vs test() misuse in posit components

  blockbinary::operator[] is a block/limb index accessor, not a bit index accessor. Four locations used it with bit indices, causing a stack-buffer-overflow for posit configurations where fbits > nrBlocks (e.g., posit<16,1,uint8_t> with fbits=12 and only 2 blocks). Fixed positFraction::operator<<, get_fixed_point(), denormalize(), and the posit reciprocal sign extraction to use _block.test(bitIndex).

* Add comprehensive documentation for all number systems

  Why/what/how markdown guides for each of the 29 number systems with regression tests, covering integer, fixed-point, rational, configurable floats, the posit family, logarithmic, multi-component extended precision, block-scaled AI formats, interval arithmetic, and compressed floating-point. Includes a README.md index with category tables and a selection guide.

* Add details about uint32_t for limb carry arithmetic

* Add uint64_t limb support

* Fix headers in SORN tests

* Fix nibble() UB in all block types for uint64_t limbs

  0x0Fu is a 32-bit unsigned int, so shifting it by nibbleIndexInWord*4 when nibbleIndexInWord >= 8 (i.e., shift >= 32) is undefined behavior. On MSVC this caused corrupt to_hex() output and cascading test failures in bb_uint64_limbs. Fix by casting to bt before shifting, ensuring the shift operates on the block type width.

* Fix MSVC intrinsic output via reference-derived pointers in carry.hpp

  _umul128, _addcarry_u64, and _subborrow_u64 write results through pointer parameters. When these pointers were derived from reference parameters via reinterpret_cast, the MSVC optimizer could lose the writes after inlining, causing mul128 to always return hi=0. This produced systematically wrong multiplication results: block[1] contained only the addcarry carry-out bit, not the mul128 high product. Fix: use local variables for all intrinsic output pointers, then assign to the reference parameters after the intrinsic returns.

* Fix blockbinary mul with uint64_t limbs passing a multi-bit carry to addcarry

  The multiplication loop accumulated carries as full 64-bit values (hi + c1) and then passed them as carry_in to addcarry(). On MSVC, _addcarry_u64 truncates carry_in to unsigned char, silently losing the upper bits. Split into two separate addcarry calls, each with carry_in=0, so the multi-bit carry is added as a regular operand.

* Fix MSVC build failures for long double ambiguity and undeclared M_PI

  Posit: add a long double constructor/operator= in the #else branch of LONG_DOUBLE_SUPPORT so MSVC (where long double is a distinct type with the same precision as double) no longer hits ambiguous overload resolution. Directives: define _USE_MATH_DEFINES for MSVC so M_PI is available from <cmath> without per-file defines.

* Fix MSVC posit long double constructor ambiguity and zfpblock shift UB

  Remove the redundant long double constructor/assignment in the posit #else branch that caused an ambiguous overload on MSVC where long double == double. Replace a ternary with an if-constexpr helper in zfp_codec encode/decode to avoid the C4293 shift warning when N == 64.

* Restore long double overloads for MSVC, where long double != double for overload resolution

  MSVC treats long double and double as distinct types for overload resolution despite their identical representation. Without explicit long double overloads, assignment from long double is ambiguous among the float/double/integer candidates.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 2e6e678 commit 4475600

52 files changed: +4593, -162 lines

.github/workflows/cmake.yml

Lines changed: 1 addition & 1 deletion

@@ -2,7 +2,7 @@ name: CMake

on:
  push:
-    branches: [ v3.98, main ]
+    branches: [ v3.99, main ]
  pull_request:
    branches: [ main ]

CHANGELOG.md

Lines changed: 7 additions & 0 deletions

@@ -28,6 +28,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- **MinGW GCC IPA ICF bug**: function splitting + Identical Code Folding incorrectly merges `lns<4>::setbit.part.0` with `lns<8>::setbit.part.0`, causing all negative LNS values to lose their sign bit when multiple `lns<nbits>` instantiations exist in the same translation unit. Fix: `-fno-ipa-icf`
- **MinGW software `std::fma()` precision bug**: off by 1-2 ULPs for some inputs, breaking error-free transformations (`two_prod`) in `floatcascade`. Fix: `-mfma` to use hardware FMA3 instructions

+### Fixed
+
+#### 2026-02-13 - Fix blockbinary operator[] vs test() misuse in posit components
+
+- **`positFraction.hpp` stack-buffer-overflow** (ASan CI failure): `blockbinary::operator[]` is a **block/limb** index accessor, but was used with **bit** indices in three locations — `operator<<`, `get_fixed_point()`, and `denormalize()`. For `posit<16,1,uint8_t>` with `fbits=12`, accessing `_block[11]` tried to read block 11 of a 2-block array. Fixed all three to use `_block.test(i)` for proper bit-level access.
+- **`posit_impl.hpp` reciprocal sign extraction**: `_block[nbits-1]` used a block index instead of a bit index to read the sign bit. For `posit<16,1,uint8_t>`, `_block[15]` accessed block 15 of a 2-block array. Fixed to `_block.test(nbits-1)`.
+
- **All 390 CI_LITE tests pass** on MinGW+Wine after fixes

#### 2026-02-13 - Rewrite Atomic Fused Operators to blocktriple and Extract Quire from posit.hpp

CMakeLists.txt

Lines changed: 1 addition & 1 deletion

@@ -20,7 +20,7 @@ if(NOT DEFINED UNIVERSAL_VERSION_MAJOR)

	set(UNIVERSAL_VERSION_MAJOR 3)
endif()
if(NOT DEFINED UNIVERSAL_VERSION_MINOR)
-	set(UNIVERSAL_VERSION_MINOR 98)
+	set(UNIVERSAL_VERSION_MINOR 99)
endif()
if(NOT DEFINED UNIVERSAL_VERSION_PATCH)
	set(UNIVERSAL_VERSION_PATCH 1)

docs/number-systems/README.md

Lines changed: 133 additions & 0 deletions

# Universal Number Systems Guide

This directory contains comprehensive documentation for each number system in the Universal library. Each document explains **why** the number system exists, **what** it does, and **how** to use it to solve specific problems.

## Number Systems by Category

### Integer and Fixed-Point

| Type | Bits | Description | Best For |
|------|------|-------------|----------|
| [integer](integer.md) | N | Arbitrary-width signed integer | Cryptography, combinatorics, wide counters |
| [fixpnt](fixpnt.md) | N | Binary fixed-point with configurable radix | DSP, control systems, embedded (no FPU) |
| [rational](rational.md) | 2N | Exact numerator/denominator fraction | Symbolic math, exact geometry, financial |

### Configurable Floating-Point

| Type | Bits | Description | Best For |
|------|------|-------------|----------|
| [cfloat](cfloat.md) | 4-256 | Fully parameterized IEEE-compatible float | Mixed-precision research, custom HW design |
| [bfloat16](bfloat16.md) | 16 | Google Brain Float (8-bit exponent, 7-bit fraction) | Neural network training, TPU workloads |
| [areal](areal.md) | N | Faithful float with uncertainty bit | Verified computing, uncertainty tracking |
| [dfloat](dfloat.md) | N | Decimal floating-point (base-10) | Financial systems, regulatory compliance |

### Micro-Precision and Block-Scaled (AI Quantization)

| Type | Bits | Description | Best For |
|------|------|-------------|----------|
| [microfloat](microfloat.md) | 4-8 | OCP MX element types (e2m1, e4m3, e5m2) | AI model elements, quantization validation |
| [e8m0](e8m0.md) | 8 | Exponent-only power-of-two scale | Block scale factor for MX formats |
| [mxfloat](mxfloat.md) | Block | OCP Microscaling block format | AI inference, model compression (OCP) |
| [nvblock](nvblock.md) | Block | NVIDIA NVFP4 block format | GPU inference, NVIDIA accelerators |

### Posit Family (UNUM Type III)

| Type | Bits | Description | Best For |
|------|------|-------------|----------|
| [posit](posit.md) | N | Tapered-precision floating-point (current v2) | General numerics, more precision than IEEE |
| [posit1](posit1.md) | N | Original posit implementation (legacy v1) | Quire/FDP support, backward compatibility |
| [posito](posito.md) | N | Experimental posit variant | Differential testing, research |
| [quire](quire.md) | Wide | Super-accumulator for exact dot products | Reproducible linear algebra, BLAS |
| [takum](takum.md) | N | Bounded-range tapered float | General computing, predictable range |

### Interval and Uncertainty Arithmetic

| Type | Bits | Description | Best For |
|------|------|-------------|----------|
| [valid](valid.md) | 2N | Interval arithmetic with posit-encoded bounds | Verified computing with posit precision |
| [interval](interval.md) | 2N | Generic interval over any scalar type | Tolerance analysis, uncertainty propagation |
| [sorn](sorn.md) | N | Set of operand range numbers | Rigorous uncertainty, safety-critical bounds |
| [unum2](unum2.md) | N | Configurable exact-value lattice | Research, custom value distributions |

### Logarithmic Number Systems

| Type | Bits | Description | Best For |
|------|------|-------------|----------|
| [lns](lns.md) | N | Single-base logarithmic (base 2) | DSP, multiply-heavy workloads, low-power HW |
| [dbns](dbns.md) | N | Double-base logarithmic (base 0.5 and 3) | Research, mixed-radix applications |

### Extended Precision (Multi-Component)

| Type | Bits | Decimal Digits | Description | Best For |
|------|------|----------------|-------------|----------|
| [dd](dd.md) | 128 | ~31 | Double-double (2 doubles) | Extended precision, ill-conditioned systems |
| [qd](qd.md) | 256 | ~64 | Quad-double (4 doubles) | Ultra-high precision, constant computation |
| [dd_cascade](dd_cascade.md) | 128 | ~31 | DD via unified cascade framework | Consistent API across precision tiers |
| [td_cascade](td_cascade.md) | 192 | ~48 | Triple-double (3 doubles) | Intermediate precision tier |
| [qd_cascade](qd_cascade.md) | 256 | ~64 | QD via unified cascade framework | Consistent API across precision tiers |

### Compressed Floating-Point

| Type | Description | Best For |
|------|-------------|----------|
| [zfpblock](zfpblock.md) | ZFP block-based float compression (1D/2D/3D) | Scientific data storage, simulation checkpoints |

### Complex Number Support

| Type | Description | Best For |
|------|-------------|----------|
| [complex](complex.md) | Complex arithmetic for any Universal scalar | FFT, signal processing, quantum computing |

## Choosing a Number System

### By Application Domain

| Domain | Recommended Types |
|--------|-------------------|
| **Deep Learning Inference** | microfloat, mxfloat, nvblock, bfloat16, cfloat(fp8) |
| **Deep Learning Training** | bfloat16, cfloat(fp16/fp32), posit |
| **DSP / Signal Processing** | fixpnt, lns, complex |
| **Financial / Accounting** | dfloat, rational, fixpnt |
| **Embedded (no FPU)** | fixpnt, integer |
| **Scientific HPC** | dd, qd, posit, cfloat |
| **Verified / Validated Computing** | interval, valid, areal, sorn |
| **Reproducible Linear Algebra** | posit + quire |
| **Cryptography / Big Numbers** | integer |
| **Data Compression** | zfpblock |
| **Custom Hardware Design** | cfloat, posit, takum, lns |

### By Precision Need

| Precision | Type | Decimal Digits |
|-----------|------|----------------|
| 2 digits | bfloat16 | ~2 |
| 3 digits | cfloat(fp8), microfloat | ~2-3 |
| 7 digits | cfloat(fp32), posit<32,2> | ~7-8 |
| 16 digits | cfloat(fp64), double | ~16 |
| 31 digits | dd, dd_cascade | ~31 |
| 48 digits | td_cascade | ~48 |
| 64 digits | qd, qd_cascade | ~64 |
| Exact | rational, integer, quire | Unlimited (within nbits) |

## Quick Start

Every number system is header-only. Include the type and start computing:

```cpp
#include <universal/number/posit/posit.hpp> // or any type
using namespace sw::universal;

// Plug-in replacement pattern
template<typename Real>
Real my_algorithm(Real a, Real b) {
    return (a + b) * (a - b);
}

// Use with any Universal type
auto r1 = my_algorithm(posit<32,2>(3.0), posit<32,2>(4.0));
auto r2 = my_algorithm(cfloat<16,5,uint16_t,true,false,false>(3.0),
                       cfloat<16,5,uint16_t,true,false,false>(4.0));
auto r3 = my_algorithm(dd(3.0), dd(4.0));
```

For detailed usage patterns, see the `api/api.cpp` test file in each number system's regression test directory under `static/`.

docs/number-systems/areal.md

Lines changed: 112 additions & 0 deletions
# Areal: Faithful Floating-Point with Uncertainty Bit

## Why

IEEE-754 floating-point silently rounds every result to the nearest representable value. After a chain of operations, you have no idea how much rounding error has accumulated -- the final answer looks just as precise as every intermediate result, even if it's completely wrong. The only way to discover the error is to re-run the computation in higher precision, which is expensive and often impractical.

The `areal` type solves this with a single-bit innovation: the **uncertainty bit (ubit)**. The least significant bit of every areal value indicates whether the value is exact or approximate. When an operation produces a result that falls between two representable values, the ubit is set to 1, meaning "the true value lies between this encoding and the next." You get faithful floating-point arithmetic where every result honestly reports whether it was rounded.

## What

`areal<nbits, es, bt>` is a faithful floating-point type with an uncertainty bit:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `nbits` | `unsigned` | -- | Total bits (minimum: es + 3) |
| `es` | `unsigned` | -- | Exponent bits |
| `bt` | typename | `uint8_t` | Storage block type |

### Encoding

```
[sign : 1 bit] [exponent : es bits] [fraction : fbits] [ubit : 1 bit]
```

Where `fbits = nbits - 2 - es` (the uncertainty bit takes one bit from what would be fraction in a standard float).
### The Uncertainty Bit

- **ubit = 0**: The value is *exactly* the represented floating-point value
- **ubit = 1**: The true value lies *strictly between* this encoding and the next representable value

This provides a faithful bound: the true result is always within one ULP of the stored value, and you *know* when it's not exact.

### Key Properties

- **Faithful rounding**: every result is within 1 ULP of the true value, with the ubit indicating exactness
- **Gradual underflow**: subnormal numbers for a smooth transition to zero
- **Gradual overflow**: values beyond maxpos are mapped with ubit=1
- **No rounding modes**: the ubit replaces the complexity of IEEE rounding modes
- **Configurable precision**: any combination of nbits and es

## How It Works

When an arithmetic operation produces a result that is exactly representable, the result is stored with ubit=0. When the result falls between two consecutive representable values, the lower value is stored with ubit=1, indicating "the true value is between here and the next encoding." This is simpler than IEEE rounding modes and provides strictly more information: you always know whether the result was exact.

The overflow behavior is also graceful: instead of jumping to infinity, an areal beyond maxpos is stored as maxpos with ubit=1, meaning "the true value is somewhere above maxpos." Similarly, underflow toward zero sets the ubit to indicate imprecision near the bottom of the range.
## How to Use It

### Include

```cpp
#include <universal/number/areal/areal.hpp>
using namespace sw::universal;
```

### Basic Usage

```cpp
areal<8, 2> a(1.0f); // Exact: ubit = 0
areal<8, 2> b(0.1f); // Not exactly representable: ubit = 1

auto c = a + b;
std::cout << to_binary(c) << " = " << c << std::endl;
// The ubit tells you whether this result is exact
```

### Verified Computation

```cpp
template<typename Real>
bool is_result_exact(Real a, Real b) {
    Real result = a * b;
    // Check the uncertainty bit to verify exactness
    return !result.test(0); // ubit is bit 0
}

areal<16, 5> x(2.0f), y(3.0f);
// 2.0 * 3.0 = 6.0, which is exactly representable
assert(is_result_exact(x, y));

areal<16, 5> p(1.0f), q(3.0f);
// 1.0 / 3.0 is not exactly representable
// The result will have ubit = 1
```

### Tracking Precision Loss

```cpp
// Count how many operations in a chain produce inexact results
template<typename Real>
size_t count_roundings(const std::vector<Real>& values) {
    size_t inexact_count = 0;
    Real sum(0);
    for (const auto& v : values) {
        sum += v;
        if (sum.test(0)) ++inexact_count; // ubit set means rounding occurred
    }
    return inexact_count;
}
```

## Problems It Solves

| Problem | How areal Solves It |
|---------|---------------------|
| No way to know if a floating-point result was rounded | Uncertainty bit explicitly marks inexact results |
| IEEE rounding modes are complex and rarely used correctly | A single ubit replaces all rounding-mode logic |
| Overflow jumps to infinity, destroying information | Gradual overflow with ubit=1 preserves "above maxpos" |
| Underflow flushes to zero prematurely | Gradual underflow with subnormals + ubit |
| Validated numerics requires expensive interval arithmetic | A single extra bit provides faithful bounds |
| Reproducibility debates about rounding mode choices | The ubit is deterministic: no mode selection needed |