| 1 | +# Implementation Plan: dfloat (Decimal FP) and hfloat (Hexadecimal FP) |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +Universal is an educational C++ template library for custom arithmetic types. It currently has binary floating-point (`cfloat`) but lacks decimal and hexadecimal floating-point systems. IBM mainframes historically provided all three in hardware: binary (IEEE 754), hexadecimal (System/360, 1964 -- hardware benefits from reduced alignment shifts), and decimal (financial industry, exact base-10 representation). Adding `dfloat` (IEEE 754-2008 decimal) and `hfloat` (IBM HFP) completes the floating-point radix family and serves as an educational resource for comparing encoding tradeoffs. |
| 6 | + |
| 7 | +A `dfloat` skeleton already exists but is entirely stubbed. `hfloat` is greenfield. |
| 8 | + |
| 9 | +## Design Decisions (User-Confirmed) |
| 10 | + |
| 11 | +- **dfloat encoding**: Template parameter `DecimalEncoding` enum `{BID, DPD}` -- both encodings in one type for educational comparison |
| 12 | +- **dfloat template**: `dfloat<ndigits, es, Encoding, bt>` (keep existing param names, add Encoding) |
| 13 | +- **hfloat behavior**: Classic IBM System/360 -- no NaN, no infinity, no subnormals, truncation rounding, overflow saturates |
| 14 | +- **hfloat template**: `hfloat<ndigits, es, bt>` where ndigits = hex fraction digits, es = exponent bits (7 for standard) |
| 15 | +- **Scope**: Full implementation including math library |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## Phase 1: dfloat Core Infrastructure (BID) |
| 20 | + |
| 21 | +Rewrite the dfloat skeleton with correct IEEE 754-2008 decimal storage layout and BID encoding. |
| 22 | + |
| 23 | +### Storage Layout |
| 24 | + |
| 25 | +IEEE 754-2008 format: `[sign(1)] [combination(5)] [exponent_continuation(w)] [trailing_significand(t)]` |
| 26 | + |
| 27 | +| Alias | Config | Bits | |
| 28 | +|-------|--------|------| |
| 29 | +| decimal32 | `dfloat<7, 6, BID>` | 1+5+6+20 = 32 | |
| 30 | +| decimal64 | `dfloat<16, 8, BID>` | 1+5+8+50 = 64 | |
| 31 | +| decimal128 | `dfloat<34, 12, BID>` | 1+5+12+110 = 128 | |
| 32 | + |
| 33 | +Key static constexprs: |
| 34 | +```cpp |
| 35 | +static constexpr unsigned p = ndigits; // precision digits |
| 36 | +static constexpr unsigned w = es; // exponent continuation bits |
| 37 | +static constexpr unsigned t = nbits - 1 - 5 - w; // trailing significand bits |
| 38 | +static constexpr int bias = (3 << (w - 1)) + p - 2; |
| 39 | +``` |
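
The three table rows can be checked against these formulas at compile time. The sketch below is illustrative only: `trailing_bits`, `total_bits`, and `exponent_bias` are hypothetical helper names, not existing Universal API, and deriving `t` from `p` assumes `p = 3k + 1` (true for 7, 16, and 34 digits).

```cpp
#include <cassert>

// Hypothetical helpers (not part of Universal). p = precision digits,
// w = exponent continuation bits, matching the constexprs above.

// Trailing significand width: 10 bits for every three digits that follow
// the leading digit. Assumes p = 3k + 1.
constexpr unsigned trailing_bits(unsigned p) {
    assert(p % 3 == 1);
    return 10 * ((p - 1) / 3);
}

// sign(1) + combination(5) + exponent continuation(w) + trailing significand(t)
constexpr unsigned total_bits(unsigned p, unsigned w) {
    return 1 + 5 + w + trailing_bits(p);
}

// bias = emax + p - 2, with emax = 3 * 2^(w-1)
constexpr int exponent_bias(unsigned p, unsigned w) {
    assert(w >= 1);
    return (3 << (w - 1)) + static_cast<int>(p) - 2;
}

static_assert(total_bits(7, 6) == 32 && exponent_bias(7, 6) == 101, "decimal32");
static_assert(total_bits(16, 8) == 64 && exponent_bias(16, 8) == 398, "decimal64");
static_assert(total_bits(34, 12) == 128 && exponent_bias(34, 12) == 6176, "decimal128");
```

The biases 101, 398, and 6176 are the values IEEE 754-2008 lists for decimal32, decimal64, and decimal128.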
| 40 | + |
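| 41 | +Combination field (5 bits `abcde`), shown in the digit-oriented form that DPD uses; in BID the exponent bits are contiguous and the bits that follow them (or an implied `100` prefix) are the leading bits of the binary significand rather than a decimal digit: |
| 42 | +- `ab != 11`: exp MSBs = `ab`, MSD = `0cde` (digit 0-7) |
| 43 | +- `ab == 11, cd != 11`: exp MSBs = `cd`, MSD = `100e` (digit 8-9) |
| 44 | +- `11110`: infinity; `11111`: NaN |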
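
A minimal sketch of the 5-bit decode in its digit-oriented form, following the IEEE 754-2008 rule that the large-digit case covers every `11cde` pattern with `cd != 11`. The names `Kind`, `Combination`, and `decode_combination` are hypothetical and only illustrate the case analysis.

```cpp
#include <cassert>
#include <cstdint>

enum class Kind { Finite, Infinity, NaN };

struct Combination {
    Kind kind;
    unsigned exp_msbs;  // two most significant exponent bits (0..2)
    unsigned msd;       // most significant decimal digit (0..9)
};

// Decode the 5-bit field `abcde` (a is bit 4, e is bit 0).
constexpr Combination decode_combination(std::uint8_t g) {
    assert(g < 32);
    const unsigned ab = (g >> 3) & 0x3u;
    const unsigned cd = (g >> 1) & 0x3u;
    if (ab != 0x3u) return {Kind::Finite, ab, g & 0x7u};          // MSD = 0cde
    if (cd != 0x3u) return {Kind::Finite, cd, 8u + (g & 0x1u)};   // MSD = 100e
    return {(g & 0x1u) ? Kind::NaN : Kind::Infinity, 0u, 0u};     // 1111e
}
```

Exponent MSBs of `11` never occur for finite values, which is what frees the `1111x` patterns for infinity and NaN.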
| 45 | + |
| 46 | +### Files to Modify |
| 47 | + |
| 48 | +| File | Changes | |
| 49 | +|------|---------| |
| 50 | +| `include/sw/universal/number/dfloat/dfloat_fwd.hpp` | Add `DecimalEncoding` enum, 4-param forward decl | |
| 51 | +| `include/sw/universal/number/dfloat/dfloat_impl.hpp` | Complete rewrite: fix storage calc, combination field encode/decode, BID significand pack/unpack, `clear/setzero/setinf/setnan/setsign`, `iszero/isinf/isnan/sign/scale`, `maxpos/minpos/zero/minneg/maxneg`, `convert_ieee754` (double->dfloat), `convert_to_ieee754` (dfloat->double), `convert_signed/convert_unsigned`, comparison operators | |
| 52 | +| `include/sw/universal/number/dfloat/dfloat.hpp` | Add Encoding template param to aliases, uncomment traits/numeric_limits includes | |
| 53 | +| `include/sw/universal/number/dfloat/manipulators.hpp` | Implement `to_binary()` showing field boundaries, `type_tag()` with encoding name | |
| 54 | +| `include/sw/universal/number/dfloat/attributes.hpp` | Fix template params, implement `dynamic_range()` | |
| 55 | +| `static/float/dfloat/api/api.cpp` | Rewrite for 4-param template, test BID `dfloat<7,6>` and `dfloat<16,8>` | |
| 56 | + |
| 57 | +### Files to Create |
| 58 | + |
| 59 | +| File | Purpose | |
| 60 | +|------|---------| |
| 61 | +| `include/sw/universal/traits/dfloat_traits.hpp` | `is_dfloat_trait`, `is_dfloat`, `enable_if_dfloat` (pattern: `traits/cfloat_traits.hpp`) | |
| 62 | +| `include/sw/universal/number/dfloat/numeric_limits.hpp` | `std::numeric_limits<dfloat>` specialization (radix=10) | |
| 63 | + |
| 64 | +### Reference Files |
| 65 | +- `include/sw/universal/number/cfloat/cfloat_impl.hpp` -- class structure pattern |
| 66 | +- `include/sw/universal/traits/cfloat_traits.hpp` -- traits pattern |
| 67 | + |
| 68 | +--- |
| 69 | + |
| 70 | +## Phase 2: dfloat Arithmetic (BID) |
| 71 | + |
| 72 | +Implement all four arithmetic operations for BID encoding. |
| 73 | + |
| 74 | +### Algorithm Outlines |
| 75 | + |
| 76 | +**Addition**: Unpack both operands to `(sign, exponent, significand_integer)`. Align by dividing smaller-exponent significand by `10^shift`. Add/subtract based on signs. Normalize result to p digits. Round per IEEE 754-2008. |
| 77 | + |
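| 78 | +**Multiplication**: `result_sig = sig_a * sig_b` (needs a 2p-digit intermediate: `__uint128_t`, a GCC/Clang extension, covers up to decimal64 because 10^32 < 2^128; decimal128 needs 68 digits and therefore a custom wide int). `result_exp = exp_a + exp_b`. Normalize to p digits and round. |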
| 79 | + |
| 80 | +**Division**: `result_sig = (sig_a * 10^p) / sig_b`. `result_exp = exp_a - exp_b - p`, which compensates for the `10^p` scaling of the dividend. Remainder determines rounding. |
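
The multiplication outline can be sketched for `p = 7` with plain 64-bit integers, since a 14-digit product fits in `uint64_t`. This is an illustration under stated assumptions, not the planned `dfloat` code: `Unpacked`, `round_to_precision`, and `multiply` are hypothetical names, rounding is the IEEE 754-2008 default (round half to even), and exponent range checks are left out.

```cpp
#include <cassert>
#include <cstdint>

constexpr int P = 7;                          // precision digits (decimal32)
constexpr std::uint64_t LIMIT = 10000000ULL;  // 10^P

struct Unpacked {        // value = (-1)^sign * sig * 10^exp
    bool sign;
    int exp;             // unbiased exponent
    std::uint64_t sig;   // integer significand, sig < 10^P
};

// Drop low-order digits until sig < 10^P, rounding half to even.
inline Unpacked round_to_precision(bool sign, int exp, std::uint64_t sig) {
    unsigned round_digit = 0;  // most recently discarded digit
    bool sticky = false;       // true if any earlier discarded digit was nonzero
    while (sig >= LIMIT) {
        sticky = sticky || (round_digit != 0);
        round_digit = static_cast<unsigned>(sig % 10);
        sig /= 10;
        ++exp;
    }
    const bool above_half = round_digit > 5 || (round_digit == 5 && sticky);
    const bool tie = (round_digit == 5) && !sticky;
    if (above_half || (tie && (sig & 1u) != 0)) {
        if (++sig == LIMIT) {  // carry out of the top digit, e.g. 9999999 -> 10000000
            sig /= 10;
            ++exp;
        }
    }
    return {sign, exp, sig};
}

// Exponent overflow and underflow handling is intentionally omitted.
inline Unpacked multiply(const Unpacked& a, const Unpacked& b) {
    assert(a.sig < LIMIT && b.sig < LIMIT);
    return round_to_precision(a.sign != b.sign, a.exp + b.exp, a.sig * b.sig);
}
```

When the exact product already fits in `p` digits, no digits are dropped and the exponent stays at `exp_a + exp_b`, so `1.5 * 2.5` comes out as `375 * 10^-2`.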
| 81 | + |
| 82 | +### Files to Modify |
| 83 | +- `dfloat_impl.hpp`: Implement `operator+=`, `-=`, `*=`, `/=`, `operator-()`, `operator++/--` |
| 84 | + |
| 85 | +### Files to Create |
| 86 | +- `static/float/dfloat/conversion/assignment.cpp` -- native type round-trip tests |
| 87 | +- `static/float/dfloat/conversion/decimal_conversion.cpp` -- string conversion tests (verify 0.1 exact) |
| 88 | +- `static/float/dfloat/logic/logic.cpp` -- comparison tests including NaN semantics |
| 89 | +- `static/float/dfloat/arithmetic/addition.cpp` |
| 90 | +- `static/float/dfloat/arithmetic/subtraction.cpp` |
| 91 | +- `static/float/dfloat/arithmetic/multiplication.cpp` |
| 92 | +- `static/float/dfloat/arithmetic/division.cpp` |
| 93 | + |
| 94 | +--- |
| 95 | + |
| 96 | +## Phase 3: dfloat DPD Encoding |
| 97 | + |
| 98 | +Add DPD (Densely Packed Decimal) as alternate encoding, branching via `if constexpr`. |
| 99 | + |
| 100 | +DPD maps 3 BCD digits to a 10-bit declet. Each declet classifies digits as "small" (0-7) or "large" (8-9), giving 8 encoding patterns. Encode/decode via constexpr lookup tables (1000-entry encode, 1024-entry decode). |
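
The declet mapping can be sketched as a pair of `constexpr` functions, from which the 1000-entry and 1024-entry tables could be filled. The function names `dpd_encode` and `dpd_decode` are hypothetical; the bit assignments follow the standard DPD table with declet bits `pqr stu v wxy` (`p` is bit 9).

```cpp
#include <cassert>
#include <cstdint>

// Encode a value 0..999 (three decimal digits) into a 10-bit DPD declet.
constexpr std::uint16_t dpd_encode(unsigned value) {
    assert(value < 1000);
    const unsigned d1 = value / 100, d2 = (value / 10) % 10, d3 = value % 10;
    const unsigned l1 = d1 & 7, l2 = d2 & 7, l3 = d3 & 7;  // low three bits
    const unsigned b1 = d1 & 1, b2 = d2 & 1, b3 = d3 & 1;  // low bit only
    // Selector: bit 2 = d1 is large (8-9), bit 1 = d2 large, bit 0 = d3 large.
    const unsigned sel = ((d1 >> 3) << 2) | ((d2 >> 3) << 1) | (d3 >> 3);
    unsigned pqr = l1, stu = l2, wxy = l3, v = 1;
    switch (sel) {
        case 0: v = 0; break;                                         // all small
        case 1: wxy = b3; break;                                      // d3 large
        case 2: stu = (l3 & 6) | b2; wxy = 2 | b3; break;             // d2 large
        case 4: pqr = (l3 & 6) | b1; wxy = 4 | b3; break;             // d1 large
        case 6: pqr = (l3 & 6) | b1; stu = b2; wxy = 6 | b3; break;   // d1, d2 large
        case 5: pqr = (l2 & 6) | b1; stu = 2 | b2; wxy = 6 | b3; break; // d1, d3 large
        case 3: stu = 4 | b2; wxy = 6 | b3; break;                    // d2, d3 large
        default: pqr = b1; stu = 6 | b2; wxy = 6 | b3; break;         // all large
    }
    return static_cast<std::uint16_t>((pqr << 7) | (stu << 4) | (v << 3) | wxy);
}

// Decode any 10-bit declet (including the 24 non-canonical ones) to 0..999.
constexpr unsigned dpd_decode(std::uint16_t declet) {
    assert(declet < 1024);
    const unsigned pqr = (declet >> 7) & 7, stu = (declet >> 4) & 7;
    const unsigned v = (declet >> 3) & 1, wxy = declet & 7;
    const unsigned r = pqr & 1, u = stu & 1, y = wxy & 1;
    const unsigned pq = pqr & 6, st = stu & 6;  // kept in their bit positions
    unsigned d1 = pqr, d2 = stu, d3 = wxy;      // v == 0: three small digits
    if (v) {
        switch (wxy >> 1) {
            case 0: d3 = 8 | y; break;
            case 1: d2 = 8 | u; d3 = st | y; break;
            case 2: d1 = 8 | r; d3 = pq | y; break;
            default:
                switch (stu >> 1) {
                    case 0: d1 = 8 | r; d2 = 8 | u; d3 = pq | y; break;
                    case 1: d1 = 8 | r; d2 = pq | u; d3 = 8 | y; break;
                    case 2: d2 = 8 | u; d3 = 8 | y; break;
                    default: d1 = 8 | r; d2 = 8 | u; d3 = 8 | y; break;
                }
        }
    }
    return 100 * d1 + 10 * d2 + d3;
}
```

Two properties make convenient spot checks: values 0 through 79 encode to the same bits as their BCD form, and 999 encodes to `0x0FF`.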
| 101 | + |
| 102 | +### Files to Create |
| 103 | +- `include/sw/universal/number/dfloat/dpd_codec.hpp` -- encode/decode tables + functions |
| 104 | +- `static/float/dfloat/standard/dpd_codec.cpp` -- exhaustive verification of all 1000 encodings |
| 105 | + |
| 106 | +### Files to Modify |
| 107 | +- `dfloat_impl.hpp`: Add `if constexpr (Encoding == DPD)` branches in significand pack/unpack |
| 108 | + |
| 109 | +--- |
| 110 | + |
| 111 | +## Phase 4: dfloat Standard Aliases and Polish |
| 112 | + |
| 113 | +### Files to Modify |
| 114 | +- `dfloat.hpp`: Add aliases: |
| 115 | + ```cpp |
| 116 | + using decimal32 = dfloat<7, 6, DecimalEncoding::BID, uint32_t>; |
| 117 | + using decimal64 = dfloat<16, 8, DecimalEncoding::BID, uint32_t>; |
| 118 | + using decimal128 = dfloat<34, 12, DecimalEncoding::BID, uint32_t>; |
| 119 | + using decimal32_dpd = dfloat<7, 6, DecimalEncoding::DPD, uint32_t>; |
| 120 | + // etc. |
| 121 | + ``` |
| 122 | +- `manipulators.hpp`: Add `color_print`, `pretty_print`, `components` |
| 123 | + |
| 124 | +### Files to Create |
| 125 | +- `static/float/dfloat/standard/decimal32.cpp` -- field width verification, known bit patterns |
| 126 | +- `static/float/dfloat/standard/decimal64.cpp` |
| 127 | +- `static/float/dfloat/standard/decimal128.cpp` |
| 128 | + |
| 129 | +--- |
| 130 | + |
| 131 | +## Phase 5: hfloat Core Infrastructure |
| 132 | + |
| 133 | +Create IBM System/360 hexadecimal floating-point from scratch. |
| 134 | + |
| 135 | +### Storage Layout |
| 136 | + |
| 137 | +Format: `[sign(1)] [exponent(7)] [hex_fraction(ndigits*4)]` |
| 138 | + |
| 139 | +Value: `(-1)^sign * 16^(exponent - 64) * 0.f1f2...fn` |
| 140 | + |
| 141 | +| Alias | Config | Bits | |
| 142 | +|-------|--------|------| |
| 143 | +| hfloat_short | `hfloat<6, 7>` | 1+7+24 = 32 | |
| 144 | +| hfloat_long | `hfloat<14, 7>` | 1+7+56 = 64 | |
| 145 | +| hfloat_extended | `hfloat<28, 7>` | 1+7+112 = 120 (stored in 128) | |
| 146 | + |
| 147 | +Key behaviors: |
| 148 | +- No hidden bit, no NaN, no infinity, no subnormals |
| 149 | +- Truncation rounding only |
| 150 | +- Overflow saturates to maxpos/maxneg |
| 151 | +- Wobbling precision: 0-3 leading zero bits in MSB hex digit |
| 152 | +- Zero: fraction all zeros |
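
The value formula and the wobbling-precision point can be illustrated by decoding a 32-bit short-format pattern into a `double`. This is a sketch with hypothetical function names, independent of the planned `hfloat` class; `std::ldexp` is used only because the result is exactly representable in binary64.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode an IBM HFP short pattern: sign(1) | excess-64 exponent(7) | fraction(24).
inline double hfp_short_to_double(std::uint32_t bits) {
    const bool negative = (bits >> 31) != 0;
    const int exponent = static_cast<int>((bits >> 24) & 0x7Fu);
    const std::uint32_t fraction = bits & 0x00FFFFFFu;
    if (fraction == 0) return 0.0;  // any all-zero fraction is zero
    // 0.f1..f6 * 16^(e-64) == fraction * 2^(4*(e-64) - 24)
    const double magnitude =
        std::ldexp(static_cast<double>(fraction), 4 * (exponent - 64) - 24);
    return negative ? -magnitude : magnitude;
}

// Leading zero bits inside the most significant hex digit of a normalized
// fraction: the source of the 0-3 bit precision wobble.
inline int leading_zero_bits(std::uint32_t bits) {
    const unsigned top = (bits >> 20) & 0xFu;
    assert(top != 0 && "fraction must be normalized");
    int zeros = 0;
    for (unsigned mask = 0x8u; (top & mask) == 0; mask >>= 1) ++zeros;
    return zeros;
}
```

For example, `0x42640000` decodes to 100.0 (fraction `0.64` hex times `16^2`) and loses one bit to the wobble, while `0x41100000` decodes to 1.0 and loses three.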
| 153 | + |
| 154 | +### Files to Create (all new) |
| 155 | + |
| 156 | +**Headers** (`include/sw/universal/number/hfloat/`): |
| 157 | +- `hfloat.hpp` -- umbrella header |
| 158 | +- `hfloat_fwd.hpp` -- forward declarations + aliases |
| 159 | +- `exceptions.hpp` -- exception hierarchy (no NaN exceptions; overflow/underflow) |
| 160 | +- `hfloat_impl.hpp` -- main class: template `<ndigits, es, bt>`, storage, constructors, operators, conversions, comparisons. `setinf()` maps to `maxpos()`. SpecificValue `qnan/snan` maps to zero. |
| 161 | +- `numeric_limits.hpp` -- `radix=16`, `has_infinity=false`, `has_quiet_NaN=false` |
| 162 | +- `manipulators.hpp` -- `type_tag`, `to_binary` (show hex digit boundaries), `to_hex` |
| 163 | +- `attributes.hpp` -- `dynamic_range`, `sign`, `scale` (returns `4*(exp-64)`) |
| 164 | + |
| 165 | +**Traits**: `include/sw/universal/traits/hfloat_traits.hpp` |
| 166 | + |
| 167 | +**Tests** (`static/float/hfloat/`): |
| 168 | +- `CMakeLists.txt` |
| 169 | +- `api/api.cpp` |
| 170 | +- `conversion/assignment.cpp` |
| 171 | +- `conversion/hex_conversion.cpp` |
| 172 | +- `logic/logic.cpp` |
| 173 | +- `standard/short.cpp`, `standard/long.cpp`, `standard/extended.cpp` |
| 174 | + |
| 175 | +### CMake Wiring (root `CMakeLists.txt`) |
| 176 | + |
| 177 | +3 insertion points: |
| 178 | +1. ~line 167: `option(UNIVERSAL_BUILD_NUMBER_HFLOATS "Set to ON to build static hfloat tests" OFF)` |
| 179 | +2. ~line 802 in `STATICS` cascade: `set(UNIVERSAL_BUILD_NUMBER_HFLOATS ON)` |
| 180 | +3. ~line 1027 after dfloat: `if(UNIVERSAL_BUILD_NUMBER_HFLOATS) add_subdirectory("static/float/hfloat") endif()` |
| 181 | + |
| 182 | +--- |
| 183 | + |
| 184 | +## Phase 6: hfloat Arithmetic |
| 185 | + |
| 186 | +Implement arithmetic with hex-digit alignment and truncation rounding. |
| 187 | + |
| 188 | +**Addition**: Align by shifting fraction right by `4*(exp_large - exp_small)` bits. Add/subtract. Normalize by shifting hex digits until leading hex digit != 0. Truncate. |
| 189 | + |
| 190 | +**Multiplication**: `result_frac = frac_a * frac_b` (wide multiply). `result_exp = exp_a + exp_b - 64`. Normalize to ndigits hex digits. Truncate. |
| 191 | + |
| 192 | +**Division**: `result_frac = (frac_a << ndigits*4) / frac_b`. `result_exp = exp_a - exp_b + 64`. Normalize. Truncate. |
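
The multiplication outline can be sketched directly on 32-bit short-format patterns. The function name `hfp_short_mul` is hypothetical, operands are assumed to be normalized, and the underflow-to-zero branch is an assumption of this sketch rather than something the plan specifies.

```cpp
#include <cassert>
#include <cstdint>

// Multiply two normalized HFP short values given as raw bit patterns.
inline std::uint32_t hfp_short_mul(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t sign = (a ^ b) & 0x80000000u;
    const std::uint64_t fa = a & 0x00FFFFFFu;
    const std::uint64_t fb = b & 0x00FFFFFFu;
    if (fa == 0 || fb == 0) return 0;  // zero operand gives true zero
    assert((fa >> 20) != 0 && (fb >> 20) != 0 && "operands must be normalized");

    int exp = static_cast<int>((a >> 24) & 0x7Fu)
            + static_cast<int>((b >> 24) & 0x7Fu) - 64;
    std::uint64_t product = fa * fb;  // 48-bit result, 12 hex digits

    // Normalized inputs leave at most one leading zero hex digit.
    if ((product >> 44) == 0) {
        product <<= 4;
        --exp;
    }
    // Keep the top 6 hex digits; the low 24 bits are dropped (truncation).
    const std::uint32_t frac = static_cast<std::uint32_t>(product >> 24);

    if (exp > 127) return sign | 0x7FFFFFFFu;  // overflow saturates to maxpos/maxneg
    if (exp < 0) return 0;                     // assumption: underflow flushes to zero
    return sign | (static_cast<std::uint32_t>(exp) << 24) | frac;
}
```

The product of `0x41100001` and `0x41180000` discards exactly half a unit in the last place; truncation yields `0x41180001`, where round-to-nearest-even would have produced `0x41180002`.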
| 193 | + |
| 194 | +### Files to Create |
| 195 | +- `static/float/hfloat/arithmetic/addition.cpp` |
| 196 | +- `static/float/hfloat/arithmetic/subtraction.cpp` |
| 197 | +- `static/float/hfloat/arithmetic/multiplication.cpp` |
| 198 | +- `static/float/hfloat/arithmetic/division.cpp` |
| 199 | +- `static/float/hfloat/performance/perf.cpp` |
| 200 | + |
| 201 | +--- |
| 202 | + |
| 203 | +## Phase 7: Math Libraries (Both Types) |
| 204 | + |
| 205 | +Initial implementation delegates through `double` for all math functions: |
| 206 | +```cpp |
| 207 | +template<...> TypeName func(TypeName x) { |
| 208 | + return TypeName(std::func(double(x))); |
| 209 | +} |
| 210 | +``` |
| 211 | + |
| 212 | +Functions: `exp/exp2/exp10/expm1`, `log/log2/log10/log1p`, `pow/sqrt/cbrt/hypot`, `sin/cos/tan/asin/acos/atan/atan2`, `sinh/cosh/tanh/asinh/acosh/atanh`, `trunc/floor/ceil/round`, `fmod/remainder`, `copysign/nextafter/fabs/abs`, `fmin/fmax/fdim`, `erf/erfc/tgamma/lgamma`, `fma` |
| 213 | + |
| 214 | +For hfloat: `isnan()` always returns false, `isinf()` always returns false, overflow saturates. |
| 215 | + |
| 216 | +### Files to Create (per type) |
| 217 | +- `math/classify.hpp`, `math/exponent.hpp`, `math/logarithm.hpp`, `math/trigonometry.hpp`, `math/hyperbolic.hpp`, `math/sqrt.hpp`, `math/pow.hpp`, `math/minmax.hpp`, `math/next.hpp`, `math/truncate.hpp`, `math/fractional.hpp`, `math/hypot.hpp`, `math/error_and_gamma.hpp` |
| 218 | +- `mathlib.hpp` umbrella |
| 219 | +- Test files: `static/float/{dfloat,hfloat}/math/*.cpp` |
| 220 | + |
| 221 | +--- |
| 222 | + |
| 223 | +## Phase 8: Integration and Testing |
| 224 | + |
| 225 | +1. Build and test with gcc (`make -j4`) |
| 226 | +2. Build and test with clang (critical -- clang is stricter on UB, uninitialized vars) |
| 227 | +3. Verify `ReportTrivialityOfType` passes for all configurations |
| 228 | +4. Verify `constexpr` correctness (no `std::frexp/ldexp` in constexpr paths) |
| 229 | +5. Key test: `dfloat` represents 0.1 exactly (unlike binary float) |
| 230 | +6. Key test: `hfloat` truncation never increases a result's magnitude (it always rounds toward zero) |
| 231 | + |
| 232 | +### Build Commands |
| 233 | +```bash |
| 234 | +mkdir build_dfloat && cd build_dfloat |
| 235 | +cmake -DUNIVERSAL_BUILD_NUMBER_DFLOATS=ON -DUNIVERSAL_BUILD_NUMBER_HFLOATS=ON .. |
| 236 | +make -j4 |
| 237 | +ctest |
| 238 | +``` |
| 239 | + |
| 240 | +### Safety Reminders |
| 241 | +- ONE build at a time, `make -j4` max (never `-j$(nproc)`) |
| 242 | +- Test with BOTH gcc and clang before considering done |
| 243 | +- Never claim tests pass without running them |