⚡ Thunderbolt: softmax_v6 — Single FMA for ln(2) range reduction#42
⚡ Thunderbolt: softmax_v6 — Single FMA for ln(2) range reduction#42bugparty wants to merge 1 commit into
Conversation
Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughWalkthroughThis PR introduces ChangesAVX2 Softmax v6 with Optimized Exponential
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
ml_kernels/include/ml_kernels/softmax.h (1)
504-504: ⚡ Quick winMove function-body opening braces to their own lines.
Line 504 and Line 542 place
{on the signature line; this violates the project’s C/C++ brace style rule.As per coding guidelines,
**/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.Also applies to: 542-542
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/include/ml_kernels/softmax.h` at line 504, The function signature for exp256_ps_v3 currently has the opening brace on the same line; change the brace so the function-body opening brace is on its own line (i.e., place "{" on the following line) to conform to the project's C/C++ brace style; apply the same fix to the other nearby function(s) with braces on the signature line (e.g., the function at the location referenced around line 542) so all function definitions put "{" on its own line.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 504: The function signature for exp256_ps_v3 currently has the opening
brace on the same line; change the brace so the function-body opening brace is
on its own line (i.e., place "{" on the following line) to conform to the
project's C/C++ brace style; apply the same fix to the other nearby function(s)
with braces on the signature line (e.g., the function at the location referenced
around line 542) so all function definitions put "{" on its own line.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b1b25e34-ac8b-4f28-8fc1-af153ef7f02f
📒 Files selected for processing (4)
.jules/thunderbolt.mdml_kernels/include/ml_kernels/softmax.hml_kernels/src/kernel_bench.cppml_kernels/src/test_naive_ops.cpp
💡 What:
Added
softmax_v6andexp256_ps_v3which combines the split exact mathematicalln(2)approximation (0.693145751953125f + 1.428606765330187e-06f) into a single float constant0.6931471805599453f. This allows the criticalr = x - n * ln(2)line in the AVX2expfunction to be resolved via a single FMA (_mm256_fnmadd_ps) instead of two.🎯 Why:
The AVX2
exp256_psimplementation relies heavily on a chained set of_mm256_fmadd_pscalls for the Horner polynomial sequence. By knocking out one instruction in the preceding range reduction, we shorten the critical dependency path slightly, reducing the overall port pressure on the execution units for the N-element map loop.🏗️ How:
instead of:
Since softmax works on relative differences and normalizes over a denominator sum, the infinitesimally small loss of precision on this step easily stays within the
1e-4expected testing tolerance.📊 Impact:
Microbenchmarks show a modest throughput improvement on
softmax_v6compared tosoftmax_v5.Example on N=65536:
softmax_v5: ~5.69 GFLOP/ssoftmax_v6: ~6.13 GFLOP/s🖥️ Tested on:
Environment: x86-64 (CI/Local Sandbox with AVX2 support enabled via GCC 13.3)
🔬 How to reproduce:
PR created automatically by Jules for task 5293043582355494890 started by @bugparty
Summary by CodeRabbit
New Features
Tests
Documentation