⚡ Thunderbolt: softmax_v6 — Single FMA for ln(2) range reduction by bugparty · Pull Request #42 · bugparty/cpu_math_kernels_pri

bugparty · 2026-05-25T20:15:16Z

💡 What:
Added softmax_v6 and exp256_ps_v3 which combines the split exact mathematical ln(2) approximation (0.693145751953125f + 1.428606765330187e-06f) into a single float constant 0.6931471805599453f. This allows the critical r = x - n * ln(2) line in the AVX2 exp function to be resolved via a single FMA (_mm256_fnmadd_ps) instead of two.

🎯 Why:
The AVX2 exp256_ps implementation relies heavily on a chained set of _mm256_fmadd_ps calls for the Horner polynomial sequence. By knocking out one instruction in the preceding range reduction, we shorten the critical dependency path slightly, reducing the overall port pressure on the execution units for the N-element map loop.

🏗️ How:

__m256 r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x);

instead of:

__m256 r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.693145751953125f), x);
r = _mm256_fnmadd_ps(n, _mm256_set1_ps(1.428606765330187e-06f), r);

Since softmax works on relative differences and normalizes over a denominator sum, the infinitesimally small loss of precision on this step easily stays within the 1e-4 expected testing tolerance.

📊 Impact:
Microbenchmarks show a modest throughput improvement on softmax_v6 compared to softmax_v5.
Example on N=65536:
softmax_v5: ~5.69 GFLOP/s
softmax_v6: ~6.13 GFLOP/s

🖥️ Tested on:
Environment: x86-64 (CI/Local Sandbox with AVX2 support enabled via GCC 13.3)

🔬 How to reproduce:

cd build && make ml_kernel_bench -j$(nproc)
DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --sizes 65536 | grep softmax

PR created automatically by Jules for task 5293043582355494890 started by @bugparty

Summary by CodeRabbit

New Features
- Added optimized softmax implementation variant with reported throughput improvements and enhanced computational efficiency.
Tests
- Added comprehensive test suite validating numerical accuracy, precision tolerance, and output correctness.
Documentation
- Added technical documentation detailing optimization methodology and measured performance improvements.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-05-25T20:15:17Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-05-25T20:15:29Z

📝 Walkthrough

Walkthrough

This PR introduces softmax_v6, a new AVX2 softmax variant with an optimized exponential helper (exp256_ps_v3) that fuses the ln(2) remainder computation into a single FMA operation. The implementation includes 32-wide SIMD loops with vector sum reduction, scalar tail handling, and output normalization, validated by functional tests and measured via microbenchmarks.

Changes

AVX2 Softmax v6 with Optimized Exponential

Layer / File(s)	Summary
AVX2 Exponential and Softmax Implementation `ml_kernels/include/ml_kernels/softmax.h`	`exp256_ps_v3` clamps input, computes `n` via `cvtps_epi32`, fuses the ln(2) remainder into a single `fnmadd` with a combined constant, evaluates a polynomial via Horner's method, and reconstructs `2^n` via exponent-bit shifting. `softmax_v6` performs 32-wide maximum computation, exponentiates `(input - max)` in 32-wide and 8-wide chunks using `exp256_ps_v3`, reduces the vector sum to scalar, handles zero-sum early return, and normalizes using vector-reciprocal multiplication with scalar tail.
Functional Correctness Testing `ml_kernels/src/test_naive_ops.cpp`	`test_softmax_v6()` validates outputs against `softmax_naive` with 1e-4 element-wise tolerance and verifies output probabilities sum to ~1.0. Integration into `main()` executes the test as part of the verification suite.
Performance Benchmarking `ml_kernels/src/kernel_bench.cpp`	`SoftmaxV6Benchmark` subclass measures throughput by repeatedly calling `softmax_v6()` with advancing input indices and registers the benchmark with `REGISTER_BENCHMARK`.
Optimization Notes `.jules/thunderbolt.md`	Dated entry `2026-05-25` documents the single-FMA ln(2) range-reduction technique, reports microbenchmark improvements, and specifies 1e-4 accuracy assertion tolerance.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

bugparty/cpu_math_kernels_pri#31: Introduces exp256_ps_v2/softmax_v5 with similar Horner-based polynomial evaluation and range-reduction patterns in the same softmax header.
bugparty/cpu_math_kernels_pri#28: Implements exp256_ps_estrin/softmax_v4 following the same extension pattern of adding new AVX2 exponential variants and softmax implementations with matching benchmark and test harnesses.

Poem

🐰 Vectors swirl in 32-wide grace,
Horner's steps through exp-space,
One FMA fuses ln's song—
Softmax v6 bounds along,
Summer notes of speed well-earned! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the main change: adding softmax_v6 with a single FMA ln(2) constant for range reduction, which is the core optimization across all modified files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch thunderbolt/softmax_ln2_fused-5293043582355494890

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

ml_kernels/include/ml_kernels/softmax.h (1)
504-504: ⚡ Quick win

Move function-body opening braces to their own lines.

Line 504 and Line 542 place { on the signature line; this violates the project’s C/C++ brace style rule.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.

Also applies to: 542-542
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` at line 504, The function signature
for exp256_ps_v3 currently has the opening brace on the same line; change the
brace so the function-body opening brace is on its own line (i.e., place "{" on
the following line) to conform to the project's C/C++ brace style; apply the
same fix to the other nearby function(s) with braces on the signature line
(e.g., the function at the location referenced around line 542) so all function
definitions put "{" on its own line.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 504: The function signature for exp256_ps_v3 currently has the opening
brace on the same line; change the brace so the function-body opening brace is
on its own line (i.e., place "{" on the following line) to conform to the
project's C/C++ brace style; apply the same fix to the other nearby function(s)
with braces on the signature line (e.g., the function at the location referenced
around line 542) so all function definitions put "{" on its own line.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1b25e34-ac8b-4f28-8fc1-af153ef7f02f

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and e96d471.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

⚡ Thunderbolt: softmax_v6 — Single FMA for ln(2) range reduction

e96d471

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡ Thunderbolt: softmax_v6 — Single FMA for ln(2) range reduction#42

⚡ Thunderbolt: softmax_v6 — Single FMA for ln(2) range reduction#42
bugparty wants to merge 1 commit into
mainfrom
thunderbolt/softmax_ln2_fused-5293043582355494890

bugparty commented May 25, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

google-labs-jules Bot commented May 25, 2026

Uh oh!

coderabbitai Bot commented May 25, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bugparty commented May 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

google-labs-jules Bot commented May 25, 2026

Uh oh!

coderabbitai Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bugparty commented May 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 25, 2026 •

edited

Loading