⚡ Thunderbolt: softmax_v6 — Single-FMA shift-invariant exp range reduction#41
⚡ Thunderbolt: softmax_v6 — Single-FMA shift-invariant exp range reduction#41bugparty wants to merge 1 commit into
Conversation
Optimizes the AVX2 vectorized Softmax implementation (`softmax_v6`) by exploiting its shift-invariant mathematical properties to safely collapse the split ln(2) range reduction in `exp256_ps` down to a single FMA instruction. This technique trims instruction overhead and, when combined with `_mm_prefetch`, yields a measurably higher GFLOP/s throughput without breaking existing error bounds. Also registers the benchmark driver and a correctness test. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughWalkthroughThis PR introduces ChangesSoftmax v6 Kernel Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
ml_kernels/include/ml_kernels/softmax.h (1)
506-507: ⚡ Quick winMove function-body braces onto their own lines.
The new function definitions at Line 506 and Line 539 keep
{on the signature line, which violates the project C/C++ brace style rule.🎯 Suggested style-only fix
-inline __m256 exp256_ps_v3(__m256 x) { +inline __m256 exp256_ps_v3(__m256 x) +{ ... -inline void softmax_v6(const float *input, float *output, std::size_t n) { +inline void softmax_v6(const float *input, float *output, std::size_t n) +{As per coding guidelines, "Keep braces on their own lines for function bodies".
Also applies to: 539-540
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/include/ml_kernels/softmax.h` around lines 506 - 507, The opening brace for the function bodies should be moved onto its own line to follow the project C/C++ brace style: update the definitions of exp256_ps_v3 (and the other new function defined around lines 539–540) so the `{` is placed on the next line after the signature rather than on the same line; ensure both function signatures are followed by a newline and then the opening brace, keeping indentation consistent with surrounding code.ml_kernels/src/kernel_bench.cpp (1)
337-342: ⚡ Quick winAlign new benchmark methods with brace style rule.
Line 337 and Line 339 place function-body braces on the same line as the signature.
🎯 Suggested style-only fix
- const char *name() const override { return "softmax_v6"; } + const char *name() const override + { + return "softmax_v6"; + } - void run() override { + void run() override + { ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size()); current_idx_ = (current_idx_ + 1) % pool_size_; }As per coding guidelines, "Keep braces on their own lines for function bodies".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 342, The function bodies for name() and run() violate the brace-style rule by placing the opening brace on the same line as the signature; update the definitions of name() (returning "softmax_v6") and run() (calling ml_kernels::softmax_v6 with inputs_[current_idx_], outputs_[current_idx_], inputs_[0].size() and updating current_idx_ with pool_size_) so that each opening brace is moved to its own line directly below the function signature, preserving all existing logic and identifiers (name, run, ml_kernels::softmax_v6, inputs_, outputs_, current_idx_, pool_size_).ml_kernels/src/test_naive_ops.cpp (2)
184-184: ⚡ Quick winUse brace-on-next-line style for the new test function.
Line 184 places the opening brace on the same line as the function signature.
As per coding guidelines, "Keep braces on their own lines for function bodies".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` at line 184, The function definition for test_softmax_v6 has its opening brace on the same line as the signature; move the brace to its own line to follow the brace-on-next-line style: change "void test_softmax_v6() {" to have the "{" on the next line (i.e., place the opening brace on a new line directly below the function signature) so the function body starts with the brace on its own line.
186-208: ⚡ Quick winAdd at least one non-multiple-of-8 input to cover tail paths.
Current
test_softmax_v6uses exactly 32 elements, so thei + 7 < nloop and scalar tail path are untested. A 33/37-length case would validate the new kernel’s boundary handling.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` around lines 186 - 208, The test currently uses an input vector of length 32 (multiple of 8) so the vectorized branch in ml_kernels::softmax_v6 never hits its scalar tail; extend the test input (the variable input used with ml_kernels::softmax_naive and ml_kernels::softmax_v6) by adding at least one extra element (e.g., push one or a few floats to make length 33 or 37) so the i + 7 < n loop and scalar tail path are exercised, leaving the rest of the test (output_naive/output_v6 comparisons and the sum assertion) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 572: The prefetch in softmax_v6 uses input + i + 64 unconditionally,
which can form a pointer past one-past-end; change the prefetch to only run when
(i + 64) <= n (or equivalent size check) so the computed address is valid, and
keep the loop logic intact (e.g., wrap the _mm_prefetch call in an if ((i + 64)
<= n) guard). Also fix brace placement for the function definitions exp256_ps_v3
and softmax_v6 by moving the opening brace to its own line (put the "{" on the
next line after the signature) to follow the brace-on-own-line guideline.
---
Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 506-507: The opening brace for the function bodies should be moved
onto its own line to follow the project C/C++ brace style: update the
definitions of exp256_ps_v3 (and the other new function defined around lines
539–540) so the `{` is placed on the next line after the signature rather than
on the same line; ensure both function signatures are followed by a newline and
then the opening brace, keeping indentation consistent with surrounding code.
In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 337-342: The function bodies for name() and run() violate the
brace-style rule by placing the opening brace on the same line as the signature;
update the definitions of name() (returning "softmax_v6") and run() (calling
ml_kernels::softmax_v6 with inputs_[current_idx_], outputs_[current_idx_],
inputs_[0].size() and updating current_idx_ with pool_size_) so that each
opening brace is moved to its own line directly below the function signature,
preserving all existing logic and identifiers (name, run,
ml_kernels::softmax_v6, inputs_, outputs_, current_idx_, pool_size_).
In `@ml_kernels/src/test_naive_ops.cpp`:
- Line 184: The function definition for test_softmax_v6 has its opening brace on
the same line as the signature; move the brace to its own line to follow the
brace-on-next-line style: change "void test_softmax_v6() {" to have the "{" on
the next line (i.e., place the opening brace on a new line directly below the
function signature) so the function body starts with the brace on its own line.
- Around line 186-208: The test currently uses an input vector of length 32
(multiple of 8) so the vectorized branch in ml_kernels::softmax_v6 never hits
its scalar tail; extend the test input (the variable input used with
ml_kernels::softmax_naive and ml_kernels::softmax_v6) by adding at least one
extra element (e.g., push one or a few floats to make length 33 or 37) so the i
+ 7 < n loop and scalar tail path are exercised, leaving the rest of the test
(output_naive/output_v6 comparisons and the sum assertion) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 03499fd0-0928-44ac-8999-c30ee5e3613d
📒 Files selected for processing (4)
.jules/thunderbolt.mdml_kernels/include/ml_kernels/softmax.hml_kernels/src/kernel_bench.cppml_kernels/src/test_naive_ops.cpp
| __m256 sum3 = _mm256_setzero_ps(); | ||
|
|
||
| for (; i + 31 < n; i += 32) { | ||
| _mm_prefetch((const char*)(input + i + 64), _MM_HINT_T0); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the prefetch operand and surrounding bounds checks.
rg -n -C4 '_mm_prefetch\(' ml_kernels/include/ml_kernels/softmax.h
rg -n -C3 'for \(\; i \+ 31 < n; i \+= 32\)' ml_kernels/include/ml_kernels/softmax.hRepository: bugparty/cpu_math_kernels_pri
Length of output: 637
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="ml_kernels/include/ml_kernels/softmax.h"
# 1) Show the full loop context around the prefetch line.
nl -ba "$FILE" | sed -n '540,610p'
# 2) Re-run a simpler text search for the exact loop header to avoid regex escaping issues.
rg -n 'for \(\s*;\s*i\s*\+\s*31\s*<\s*n\s*;\s*i\s*\+=\s*32\s*\)' "$FILE" || true
rg -n 'i \+ 31 < n' "$FILE" || true
# 3) Inspect the function formatting around the originally mentioned ranges to validate brace placement.
nl -ba "$FILE" | sed -n '480,560p'
nl -ba "$FILE" | sed -n '560,680p'Repository: bugparty/cpu_math_kernels_pri
Length of output: 116
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="ml_kernels/include/ml_kernels/softmax.h"
echo "== prefetch context (lines 560-590) =="
awk 'NR>=560 && NR<=590 {printf "%d:%s\n", NR, $0}' "$FILE"
echo "== function formatting context (lines 480-560) =="
awk 'NR>=480 && NR<=560 {printf "%d:%s\n", NR, $0}' "$FILE"
echo "== function formatting context (lines 560-680) =="
awk 'NR>=560 && NR<=680 {printf "%d:%s\n", NR, $0}' "$FILE"
echo "== all prefetch occurrences =="
rg -n '_mm_prefetch\(' "$FILE" || true
echo "== signatures near exp256_ps_v3 and softmax_v6 =="
rg -n 'exp256_ps_v3|softmax_v6' "$FILE" || trueRepository: bugparty/cpu_math_kernels_pri
Length of output: 8496
Guard prefetch address computation to avoid UB (and fix brace placement).
- In
ml_kernels/include/ml_kernels/softmax.h,softmax_v6computesinput + i + 64unconditionally insidefor (; i + 31 < n; i += 32), so the last iteration can form a pointer beyond one-past-end (UB in C++). exp256_ps_v3andsoftmax_v6use{on the same line as the function signature, violating the brace-on-own-line guideline.
💡 Suggested fix
for (; i + 31 < n; i += 32) {
- _mm_prefetch((const char*)(input + i + 64), _MM_HINT_T0);
+ if (i + 64 < n) {
+ _mm_prefetch(reinterpret_cast<const char*>(input + i + 64), _MM_HINT_T0);
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ml_kernels/include/ml_kernels/softmax.h` at line 572, The prefetch in
softmax_v6 uses input + i + 64 unconditionally, which can form a pointer past
one-past-end; change the prefetch to only run when (i + 64) <= n (or equivalent
size check) so the computed address is valid, and keep the loop logic intact
(e.g., wrap the _mm_prefetch call in an if ((i + 64) <= n) guard). Also fix
brace placement for the function definitions exp256_ps_v3 and softmax_v6 by
moving the opening brace to its own line (put the "{" on the next line after the
signature) to follow the brace-on-own-line guideline.
💡 What: The optimization implemented a new
softmax_v6kernel and helperexp256_ps_v3that computesr = x - n * ln(2)using a single FMA combined with data prefetching on L1 cache bounds.🎯 Why: Because Softmax's probabilities scale uniformly inside
x - max_x, inputs to theexp()function are consistently heavily negative, preventing critical floating-point precision loss. This lets us ditch the two-step precision-retainingln(2)approach, eliminating instructions in the bottlenecking exp function loop.🏗️ How: Used
_mm256_fnmadd_psto do single range reductionrand combined the operations with an explicit 4-way independent execution chain and a 8-way normalizer via Horner's method plus_mm_prefetch(ptr + 64, _MM_HINT_T0).📊 Impact: Increases throughput on Haswell+ configurations on massive arrays (N=1048576) from ~3.85 to ~4.00 GFLOP/s, a measurable 4-5% gain. Maximum relative absolute diff to true output bounded effectively below
< 3.5e-06.🖥️ Tested on: Ubuntu 22, GCC 13.3.0 (
-O3 -mavx2 -mfma).🔬 How to reproduce: Build and test:
cd build && DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --sizes 1048576 --iters 100 --warmup 10 --filter 'softmax_.*'PR created automatically by Jules for task 13142993824301933214 started by @bugparty
Summary by CodeRabbit
New Features
Documentation
Tests
Chores