⚡ Thunderbolt: softmax_v6 — Single FMA for ln(2) range reduction#42

Open: bugparty wants to merge 1 commit into main from thunderbolt/softmax_ln2_fused-5293043582355494890

@bugparty bugparty commented May 25, 2026

💡 What:
Added softmax_v6 and exp256_ps_v3. The new exp helper replaces the two-part (high + low) split of ln(2), 0.693145751953125f + 1.428606765330187e-06f, with the single float constant 0.6931471805599453f. The range-reduction step r = x - n * ln(2) in the AVX2 exp function then needs one FMA (_mm256_fnmadd_ps) instead of two.

🎯 Why:
The AVX2 exp256_ps implementation feeds the range-reduced value into a chain of dependent _mm256_fmadd_ps calls for the Horner polynomial. Removing one FMA from the preceding range reduction shortens that dependency chain by one FMA and issues one fewer FMA per vector, which slightly reduces pressure on the FMA execution ports in the N-element map loop.

🏗️ How:

__m256 r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x);

instead of:

__m256 r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.693145751953125f), x);
r = _mm256_fnmadd_ps(n, _mm256_set1_ps(1.428606765330187e-06f), r);

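The trade-off is accuracy in r: the literal rounds to the nearest float, which is about 1.9e-9 above ln(2), so the single-FMA result is off by roughly |n| * 1.9e-9 (about 2e-7 for |n| near 100). Softmax only uses differences from the row maximum and normalizes by the sum, so this loss stays well within the 1e-4 tolerance used by the tests.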
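To make the size of that precision loss concrete, here is a minimal scalar sketch that models one lane of each reduction with `std::fma`. The helper names (`fnmadd`, `reduce_split`, `reduce_single`, `reduction_gap`) are hypothetical and not part of the PR, and choosing n as round(x * log2(e)) is an assumption about how the kernel picks n.

```cpp
#include <cassert>
#include <cmath>

// Scalar stand-in for one lane of _mm256_fnmadd_ps(a, b, c) = -(a * b) + c.
inline float fnmadd(float a, float b, float c)
{
    return std::fma(-a, b, c);
}

// Two-step reduction: subtract n * ln2_hi, then the low-order correction.
inline float reduce_split(float x, float n)
{
    float r = fnmadd(n, 0.693145751953125f, x);
    return fnmadd(n, 1.428606765330187e-06f, r);
}

// One-step reduction with the float nearest to ln(2).
inline float reduce_single(float x, float n)
{
    return fnmadd(n, 0.6931471805599453f, x);
}

// Absolute gap between the two reductions for a given x.
// Assumption: n = round(x * log2(e)), as in a typical exp kernel.
inline float reduction_gap(float x)
{
    assert(std::isfinite(x));
    float n = std::nearbyint(x * 1.4426950408889634f);
    return std::fabs(reduce_single(x, n) - reduce_split(x, n));
}
```

Calling `reduction_gap` over the range of inputs the kernel actually sees (for softmax, values of `input - max`, which are never positive) gives a quick estimate of how far the single-constant result drifts from the split one as |n| grows.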

📊 Impact:
Microbenchmarks show a modest throughput improvement for softmax_v6 over softmax_v5, roughly 8% at N=65536:
softmax_v5: ~5.69 GFLOP/s
softmax_v6: ~6.13 GFLOP/s

🖥️ Tested on:
Environment: x86-64 (CI/Local Sandbox with AVX2 support enabled via GCC 13.3)

🔬 How to reproduce:

cd build && make ml_kernel_bench -j$(nproc)
DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --sizes 65536 | grep softmax

PR created automatically by Jules for task 5293043582355494890 started by @bugparty

Summary by CodeRabbit

  • New Features

    • Added optimized softmax implementation variant with reported throughput improvements and enhanced computational efficiency.
  • Tests

    • Added comprehensive test suite validating numerical accuracy, precision tolerance, and output correctness.
  • Documentation

    • Added technical documentation detailing optimization methodology and measured performance improvements.


Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai

coderabbitai Bot commented May 25, 2026

📝 Walkthrough

This PR introduces softmax_v6, a new AVX2 softmax variant with an optimized exponential helper (exp256_ps_v3) that fuses the ln(2) remainder computation into a single FMA operation. The implementation includes 32-wide SIMD loops with vector sum reduction, scalar tail handling, and output normalization, validated by functional tests and measured via microbenchmarks.
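As a rough illustration of the exponential helper described here, the following scalar sketch walks one lane through the same stages: clamp, round to n, single-FMA range reduction, Horner polynomial, and 2^n reconstruction through the exponent bits. It is not the PR's code. The function name `exp_model`, the clamp bounds of ±87, and the degree-6 Taylor coefficients are assumptions for illustration, and `std::nearbyint` stands in for the rounding done by cvtps_epi32.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar model of one lane of an AVX2-style exp kernel.
// Clamp bounds and polynomial coefficients are illustrative assumptions.
inline float exp_model(float x)
{
    // Clamp so that 2^n below stays a normal float.
    x = std::fmin(std::fmax(x, -87.0f), 87.0f);

    // n = round(x / ln 2), computed as round(x * log2(e)).
    float n = std::nearbyint(x * 1.4426950408889634f);
    assert(n >= -126.0f && n <= 126.0f); // guaranteed by the clamp

    // Single-FMA range reduction: r = x - n * ln(2), with |r| around 0.35 at most.
    float r = std::fma(-n, 0.6931471805599453f, x);

    // Degree-6 Taylor polynomial for e^r, evaluated with Horner's method.
    float p = 1.0f / 720.0f;
    p = std::fma(p, r, 1.0f / 120.0f);
    p = std::fma(p, r, 1.0f / 24.0f);
    p = std::fma(p, r, 1.0f / 6.0f);
    p = std::fma(p, r, 0.5f);
    p = std::fma(p, r, 1.0f);
    p = std::fma(p, r, 1.0f);

    // Build 2^n by writing the biased exponent (n + 127) into bits 23..30.
    std::int32_t bits = (static_cast<std::int32_t>(n) + 127) << 23;
    float two_n;
    std::memcpy(&two_n, &bits, sizeof two_n);

    return p * two_n;
}
```

Each Horner step depends on the previous one, and the first step depends on r, which is why the range reduction sits on the same dependency chain as the polynomial.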

Changes

AVX2 Softmax v6 with Optimized Exponential

| Layer / File(s) | Summary |
|---|---|
| AVX2 Exponential and Softmax Implementation (ml_kernels/include/ml_kernels/softmax.h) | exp256_ps_v3 clamps input, computes n via cvtps_epi32, fuses the ln(2) remainder into a single fnmadd with a combined constant, evaluates a polynomial via Horner's method, and reconstructs 2^n via exponent-bit shifting. softmax_v6 performs 32-wide maximum computation, exponentiates (input - max) in 32-wide and 8-wide chunks using exp256_ps_v3, reduces the vector sum to scalar, handles zero-sum early return, and normalizes using vector-reciprocal multiplication with scalar tail. |
| Functional Correctness Testing (ml_kernels/src/test_naive_ops.cpp) | test_softmax_v6() validates outputs against softmax_naive with 1e-4 element-wise tolerance and verifies output probabilities sum to ~1.0. Integration into main() executes the test as part of the verification suite. |
| Performance Benchmarking (ml_kernels/src/kernel_bench.cpp) | SoftmaxV6Benchmark subclass measures throughput by repeatedly calling softmax_v6() with advancing input indices and registers the benchmark with REGISTER_BENCHMARK. |
| Optimization Notes (.jules/thunderbolt.md) | Dated entry 2026-05-25 documents the single-FMA ln(2) range-reduction technique, reports microbenchmark improvements, and specifies 1e-4 accuracy assertion tolerance. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • bugparty/cpu_math_kernels_pri#31: Introduces exp256_ps_v2/softmax_v5 with similar Horner-based polynomial evaluation and range-reduction patterns in the same softmax header.
  • bugparty/cpu_math_kernels_pri#28: Implements exp256_ps_estrin/softmax_v4 following the same extension pattern of adding new AVX2 exponential variants and softmax implementations with matching benchmark and test harnesses.

Poem

🐰 Vectors swirl in 32-wide grace,
Horner's steps through exp-space,
One FMA fuses ln's song—
Softmax v6 bounds along,
Summer notes of speed well-earned! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main change: adding softmax_v6 with a single FMA ln(2) constant for range reduction, which is the core optimization across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
ml_kernels/include/ml_kernels/softmax.h (1)

504-504: ⚡ Quick win

Move function-body opening braces to their own lines.

Line 504 and Line 542 place { on the signature line; this violates the project’s C/C++ brace style rule.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.

Also applies to: 542-542
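For reference, the requested brace placement looks like the sketch below. The function `scale` is a hypothetical stand-in used only to show the layout; it does not appear in softmax.h.

```cpp
// Hypothetical helper, used only to illustrate brace placement.
//
// Flagged style (opening brace on the signature line):
//     constexpr float scale(float x, float s) {
//         return x * s;
//     }
//
// Requested style: the function body's opening brace sits on its own line.
constexpr float scale(float x, float s)
{
    return x * s;
}
```

The change is purely cosmetic, so applying it to exp256_ps_v3 and the function near line 542 should not affect the generated code.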

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` at line 504, The function signature
for exp256_ps_v3 currently has the opening brace on the same line; change the
brace so the function-body opening brace is on its own line (i.e., place "{" on
the following line) to conform to the project's C/C++ brace style; apply the
same fix to the other nearby function(s) with braces on the signature line
(e.g., the function at the location referenced around line 542) so all function
definitions put "{" on its own line.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1b25e34-ac8b-4f28-8fc1-af153ef7f02f

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and e96d471.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp
