⚡ Thunderbolt: softmax_v6 — Single-FMA shift-invariant exp range reduction#41

Open
bugparty wants to merge 1 commit into main from thunderbolt/softmax_v6_optimization-13142993824301933214

@bugparty bugparty commented May 23, 2026

💡 What: Adds a new softmax_v6 kernel and a helper, exp256_ps_v3, that computes the reduced argument r = x - n * ln(2) with a single FMA, and combines it with software prefetching into L1 cache.
🎯 Why: Softmax is shift-invariant, so the kernel evaluates exp() on x - max_x, which is always non-positive and yields values in (0, 1]. In that range the single-constant reduction does not cause critical floating-point precision loss, so the two-step precision-retaining ln(2) split can be dropped, removing instructions from the exp loop that bottlenecks the kernel.
🏗️ How: Used _mm256_fnmadd_ps for the single-step range reduction r, with Horner's method for the polynomial, and combined it with an explicit 4-way independent execution chain, an 8-way normalizer, and _mm_prefetch(ptr + 64, _MM_HINT_T0).
📊 Impact: Increases throughput on Haswell+ configurations for large arrays (N=1048576) from ~3.85 to ~4.00 GFLOP/s, a gain of roughly 4%. The maximum difference from the reference output stays below 3.5e-06.
🖥️ Tested on: Ubuntu 22, GCC 13.3.0 (-O3 -mavx2 -mfma).
🔬 How to reproduce: Build and test: cd build && DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --sizes 1048576 --iters 100 --warmup 10 --filter 'softmax_.*'
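
For readers unfamiliar with the trick, a scalar sketch of the range reduction described above may help. It is not the PR's exp256_ps_v3: the function names, the split constants (the classic Cephes values), and the degree-6 Taylor polynomial are all assumptions chosen for illustration, and std::fma stands in for _mm256_fnmadd_ps.

```cpp
#include <cmath>

// Hypothetical scalar helpers; names and constants are illustrative only.

// Two-step (split ln 2) reduction: r = (x - n * ln2_hi) - n * ln2_lo.
inline float reduce_split_sketch(float x, float n)
{
    const float ln2_hi = 0.693359375f;    // few mantissa bits, so n * ln2_hi is exact for small n
    const float ln2_lo = -2.12194440e-4f; // correction term: ln2_hi + ln2_lo ~= ln(2)
    float r = std::fma(-n, ln2_hi, x);
    return std::fma(-n, ln2_lo, r);
}

// Single-step reduction: one fused multiply-add, r = -(n * ln2) + x.
// This mirrors the operand order of _mm256_fnmadd_ps(n, ln2, x).
inline float reduce_single_sketch(float x, float n)
{
    const float ln2 = 0.693147180559945309f;
    return std::fma(-n, ln2, x);
}

// exp(x) for x <= 0 using the single-step reduction.
inline float exp_single_fma_sketch(float x)
{
    const float log2e = 1.44269504088896341f;
    float n = std::nearbyint(x * log2e); // n = round(x / ln 2)
    float r = reduce_single_sketch(x, n); // |r| is at most about ln(2) / 2

    // Degree-6 Taylor polynomial for exp(r) in Horner form (assumed coefficients).
    float p = 1.0f / 720.0f;
    p = std::fma(p, r, 1.0f / 120.0f);
    p = std::fma(p, r, 1.0f / 24.0f);
    p = std::fma(p, r, 1.0f / 6.0f);
    p = std::fma(p, r, 0.5f);
    p = std::fma(p, r, 1.0f);
    p = std::fma(p, r, 1.0f);

    // Reconstruct exp(x) = 2^n * exp(r).
    return std::ldexp(p, static_cast<int>(n));
}
```

The rounding error of the single float ln(2) constant is multiplied by |n|, so it grows as x becomes more negative, but those are exactly the inputs whose exp() values are closest to zero and contribute least to the normalized softmax output.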


PR created automatically by Jules for task 13142993824301933214 started by @bugparty

Summary by CodeRabbit

  • New Features

    • Introduced Softmax v6 with optimized single-FMA range-reduction exponential approximation and cache prefetching for enhanced computational efficiency.
  • Documentation

    • Added technical guidance on FMA-based range reduction for exponential operations.
  • Tests

    • Added numerical validation tests for Softmax v6 with accuracy assertions.
  • Chores

    • Added performance benchmark variant for the new Softmax implementation.

Review Change Stack

Optimizes the AVX2 vectorized Softmax implementation (`softmax_v6`)
by exploiting its shift-invariant mathematical properties to safely
collapse the split ln(2) range reduction in `exp256_ps` down to a
single FMA instruction. This technique trims instruction overhead
and, when combined with `_mm_prefetch`, yields a measurably higher
GFLOP/s throughput without breaking existing error bounds.

Also registers the benchmark driver and a correctness test.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai

coderabbitai Bot commented May 23, 2026

📝 Walkthrough

This PR introduces softmax_v6, a new AVX2 softmax kernel variant that optimizes the exponential computation using a single-FMA range-reduction step (exp256_ps_v3), applies L1 prefetching during the main exp/sum loop, and includes benchmarking and test coverage alongside technical documentation of the optimization strategy.

Changes

Softmax v6 Kernel Implementation

| Layer / File(s) | Summary |
|---|---|
| Exp optimization contract and implementation: .jules/thunderbolt.md, ml_kernels/include/ml_kernels/softmax.h | New documentation entry describing the single-FMA range-reduction optimization; exp256_ps_v3 helper collapses split ln(2) constants into one fused operation before polynomial evaluation and exponent reconstruction. |
| Softmax v6 kernel using optimized exp: ml_kernels/include/ml_kernels/softmax.h | softmax_v6 implements the three-phase softmax structure (32-wide unrolled max reduction, exp+sum with L1 prefetching using exp256_ps_v3, then vector normalization); handles zero-sum edge case and scalar tail. |
| Benchmarking and functional tests: ml_kernels/src/kernel_bench.cpp, ml_kernels/src/test_naive_ops.cpp | SoftmaxV6Benchmark registers the new kernel variant for performance measurement; test_softmax_v6() validates correctness against the naive implementation with 1e-4f tolerance and verifies output probabilities sum near 1.0; main() updated to run the new test. |
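
The three-phase structure named in the summary can be sketched in scalar form as follows. This is a minimal illustration, not the PR's softmax_v6: the function name is hypothetical, std::exp replaces exp256_ps_v3, there is no unrolling or prefetching, and the zero-sum behaviour (writing zeros) is an assumption, since the summary only says the edge case is handled.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical scalar reference for the three phases; not the PR's kernel.
inline void softmax_three_phase_sketch(const float *input, float *output, std::size_t n)
{
    if (n == 0)
    {
        return;
    }

    // Phase 1: max reduction, so every exp() argument below is <= 0.
    float max_x = input[0];
    for (std::size_t i = 1; i < n; ++i)
    {
        max_x = std::max(max_x, input[i]);
    }

    // Phase 2: exponentiate the shifted inputs and accumulate their sum.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
    {
        output[i] = std::exp(input[i] - max_x);
        sum += output[i];
    }

    // Zero-sum guard (assumed behaviour): avoid dividing by zero or NaN.
    if (!(sum > 0.0f))
    {
        std::fill(output, output + n, 0.0f);
        return;
    }

    // Phase 3: normalize with one reciprocal and n multiplies.
    const float inv_sum = 1.0f / sum;
    for (std::size_t i = 0; i < n; ++i)
    {
        output[i] *= inv_sum;
    }
}
```

Because phase 2 only ever sees x - max_x, adding a constant to every input leaves the output unchanged, which is the shift invariance the optimization relies on.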

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • bugparty/cpu_math_kernels_pri#31: Adds the preceding softmax_v5 kernel with exp256_ps_v2 optimization, establishing the pattern of incremental exp function refinements and corresponding softmax variants.

Poem

🐰 A kernel of code, optimized with care,
FMA fuses the math through the air,
From exp to softmax, the speed flows just right,
One rabbit's proud whiskers now shine oh so bright! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly describes the main change: a new softmax_v6 kernel with single-FMA range reduction for exp, which is the primary focus of all modifications across documentation, header, benchmark, and test files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
ml_kernels/include/ml_kernels/softmax.h (1)

506-507: ⚡ Quick win

Move function-body braces onto their own lines.

The new function definitions at Line 506 and Line 539 keep { on the signature line, which violates the project C/C++ brace style rule.

🎯 Suggested style-only fix
-inline __m256 exp256_ps_v3(__m256 x) {
+inline __m256 exp256_ps_v3(__m256 x)
+{
...
-inline void softmax_v6(const float *input, float *output, std::size_t n) {
+inline void softmax_v6(const float *input, float *output, std::size_t n)
+{

As per coding guidelines, "Keep braces on their own lines for function bodies".

Also applies to: 539-540

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` around lines 506 - 507, The opening
brace for the function bodies should be moved onto its own line to follow the
project C/C++ brace style: update the definitions of exp256_ps_v3 (and the other
new function defined around lines 539–540) so the `{` is placed on the next line
after the signature rather than on the same line; ensure both function
signatures are followed by a newline and then the opening brace, keeping
indentation consistent with surrounding code.
ml_kernels/src/kernel_bench.cpp (1)

337-342: ⚡ Quick win

Align new benchmark methods with brace style rule.

Line 337 and Line 339 place function-body braces on the same line as the signature.

🎯 Suggested style-only fix
-    const char *name() const override { return "softmax_v6"; }
+    const char *name() const override
+    {
+        return "softmax_v6";
+    }

-    void run() override {
+    void run() override
+    {
         ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
         current_idx_ = (current_idx_ + 1) % pool_size_;
     }

As per coding guidelines, "Keep braces on their own lines for function bodies".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 342, The function bodies
for name() and run() violate the brace-style rule by placing the opening brace
on the same line as the signature; update the definitions of name() (returning
"softmax_v6") and run() (calling ml_kernels::softmax_v6 with
inputs_[current_idx_], outputs_[current_idx_], inputs_[0].size() and updating
current_idx_ with pool_size_) so that each opening brace is moved to its own
line directly below the function signature, preserving all existing logic and
identifiers (name, run, ml_kernels::softmax_v6, inputs_, outputs_, current_idx_,
pool_size_).
ml_kernels/src/test_naive_ops.cpp (2)

184-184: ⚡ Quick win

Use brace-on-next-line style for the new test function.

Line 184 places the opening brace on the same line as the function signature.

As per coding guidelines, "Keep braces on their own lines for function bodies".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` at line 184, The function definition for
test_softmax_v6 has its opening brace on the same line as the signature; move
the brace to its own line to follow the brace-on-next-line style: change "void
test_softmax_v6() {" to have the "{" on the next line (i.e., place the opening
brace on a new line directly below the function signature) so the function body
starts with the brace on its own line.

186-208: ⚡ Quick win

Add at least one non-multiple-of-8 input to cover tail paths.

Current test_softmax_v6 uses exactly 32 elements, so the i + 7 < n loop and scalar tail path are untested. A 33/37-length case would validate the new kernel’s boundary handling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 186 - 208, The test currently
uses an input vector of length 32 (multiple of 8) so the vectorized branch in
ml_kernels::softmax_v6 never hits its scalar tail; extend the test input (the
variable input used with ml_kernels::softmax_naive and ml_kernels::softmax_v6)
by adding at least one extra element (e.g., push one or a few floats to make
length 33 or 37) so the i + 7 < n loop and scalar tail path are exercised,
leaving the rest of the test (output_naive/output_v6 comparisons and the sum
assertion) unchanged.
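
A test along these lines could cover the tail paths. The sketch below is self-contained and does not call the project's kernels: softmax_reference_sketch and softmax_blocked8_sketch are hypothetical stand-ins for ml_kernels::softmax_naive and ml_kernels::softmax_v6, and the input pattern is arbitrary. In the real test the two function pointers would simply be swapped for the project's functions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using SoftmaxFn = void (*)(const float *, float *, std::size_t);

// Hypothetical plain reference; requires n > 0.
inline void softmax_reference_sketch(const float *in, float *out, std::size_t n)
{
    const float max_x = *std::max_element(in, in + n);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = std::exp(in[i] - max_x);
        sum += out[i];
    }
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] /= sum;
    }
}

// Hypothetical stand-in with an 8-wide main loop plus a scalar tail; requires n > 0.
inline void softmax_blocked8_sketch(const float *in, float *out, std::size_t n)
{
    const float max_x = *std::max_element(in, in + n);
    float lanes[8] = {};
    std::size_t i = 0;
    for (; i + 7 < n; i += 8) // vector-style body, one partial sum per lane
    {
        for (std::size_t j = 0; j < 8; ++j)
        {
            out[i + j] = std::exp(in[i + j] - max_x);
            lanes[j] += out[i + j];
        }
    }
    float sum = 0.0f;
    for (float lane : lanes)
    {
        sum += lane;
    }
    for (; i < n; ++i) // scalar tail, only reached when n is not a multiple of 8
    {
        out[i] = std::exp(in[i] - max_x);
        sum += out[i];
    }
    for (std::size_t k = 0; k < n; ++k)
    {
        out[k] /= sum;
    }
}

// Largest element-wise difference between two implementations for length n.
inline float max_abs_diff_sketch(SoftmaxFn ref, SoftmaxFn candidate, std::size_t n)
{
    std::vector<float> in(n), a(n), b(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        in[i] = 0.25f * static_cast<float>(i % 13) - 1.5f; // arbitrary test pattern
    }
    ref(in.data(), a.data(), n);
    candidate(in.data(), b.data(), n);
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
    {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

Running the comparison at lengths 32, 33 and 37 exercises the full-block path, a one-element tail, and a case where both the 8-wide loop and the scalar tail execute.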
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03499fd0-0928-44ac-8999-c30ee5e3613d

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and eb3cc56.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp

__m256 sum3 = _mm256_setzero_ps();

for (; i + 31 < n; i += 32) {
_mm_prefetch((const char*)(input + i + 64), _MM_HINT_T0);
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the prefetch operand and surrounding bounds checks.
rg -n -C4 '_mm_prefetch\(' ml_kernels/include/ml_kernels/softmax.h
rg -n -C3 'for \(\; i \+ 31 < n; i \+= 32\)' ml_kernels/include/ml_kernels/softmax.h

Repository: bugparty/cpu_math_kernels_pri

Length of output: 637


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="ml_kernels/include/ml_kernels/softmax.h"

# 1) Show the full loop context around the prefetch line.
nl -ba "$FILE" | sed -n '540,610p'

# 2) Re-run a simpler text search for the exact loop header to avoid regex escaping issues.
rg -n 'for \(\s*;\s*i\s*\+\s*31\s*<\s*n\s*;\s*i\s*\+=\s*32\s*\)' "$FILE" || true
rg -n 'i \+ 31 < n' "$FILE" || true

# 3) Inspect the function formatting around the originally mentioned ranges to validate brace placement.
nl -ba "$FILE" | sed -n '480,560p'
nl -ba "$FILE" | sed -n '560,680p'

Repository: bugparty/cpu_math_kernels_pri

Length of output: 116


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="ml_kernels/include/ml_kernels/softmax.h"

echo "== prefetch context (lines 560-590) ==" 
awk 'NR>=560 && NR<=590 {printf "%d:%s\n", NR, $0}' "$FILE"

echo "== function formatting context (lines 480-560) =="
awk 'NR>=480 && NR<=560 {printf "%d:%s\n", NR, $0}' "$FILE"

echo "== function formatting context (lines 560-680) =="
awk 'NR>=560 && NR<=680 {printf "%d:%s\n", NR, $0}' "$FILE"

echo "== all prefetch occurrences =="
rg -n '_mm_prefetch\(' "$FILE" || true

echo "== signatures near exp256_ps_v3 and softmax_v6 =="
rg -n 'exp256_ps_v3|softmax_v6' "$FILE" || true

Repository: bugparty/cpu_math_kernels_pri

Length of output: 8496


Guard prefetch address computation to avoid UB (and fix brace placement).

  • In ml_kernels/include/ml_kernels/softmax.h, softmax_v6 computes input + i + 64 unconditionally inside for (; i + 31 < n; i += 32), so the last iteration can form a pointer beyond one-past-end (UB in C++).
  • exp256_ps_v3 and softmax_v6 use { on the same line as the function signature, violating the brace-on-own-line guideline.
💡 Suggested fix
 for (; i + 31 < n; i += 32) {
-    _mm_prefetch((const char*)(input + i + 64), _MM_HINT_T0);
+    if (i + 64 < n) {
+        _mm_prefetch(reinterpret_cast<const char*>(input + i + 64), _MM_HINT_T0);
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` at line 572, The prefetch in
softmax_v6 uses input + i + 64 unconditionally, which can form a pointer past
one-past-end; change the prefetch to only run when (i + 64) <= n (or equivalent
size check) so the computed address is valid, and keep the loop logic intact
(e.g., wrap the _mm_prefetch call in an if ((i + 64) <= n) guard). Also fix
brace placement for the function definitions exp256_ps_v3 and softmax_v6 by
moving the opening brace to its own line (put the "{" on the next line after the
signature) to follow the brace-on-own-line guideline.
