Optimize loss calculation with in-place gradients calculation ~40% memory save by yubofredwang · Pull Request #185 · sgl-project/SpecForge

yubofredwang · 2025-08-27T05:58:59Z

Motivation

The loss calculation over the logits in the TTT steps takes up a huge chunk of memory. According to memory profiling, this is due to the intermediate tensors and gradients that torch autograd creates during the backward pass.

??:0:torch::autograd::generated::EmbeddingBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&)

We would like to write the gradients into the logits tensor in place instead of allocating a separate tensor.

Modifications

Added a Triton implementation of the log-softmax calculation that modifies the input logits in place.
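The idea of reusing the logits buffer for the gradient can be illustrated with a minimal NumPy sketch. This is not the Triton kernel from this PR. The function name `inplace_soft_ce`, the soft-target cross-entropy loss, and the mean reduction are assumptions made for illustration only. Each row is processed independently, roughly as one kernel program per row would, so temporaries are at most one row of length V rather than a full (N, V) tensor.

```python
import numpy as np


def inplace_soft_ce(logits, target_probs):
    """Return the mean soft-target cross-entropy over rows.

    Side effect: `logits` (float array, shape (N, V)) is overwritten with
    d(loss)/d(logits), so no separate gradient tensor is allocated.
    `target_probs` has shape (N, V) and each row is assumed to sum to 1.
    """
    n_rows = logits.shape[0]
    total = 0.0
    for i in range(n_rows):
        row = logits[i]                      # a view into logits, not a copy
        row -= row.max()                     # shift for numerical stability
        lse = np.log(np.exp(row).sum())      # log-sum-exp; temporary of size V
        row -= lse                           # row now holds log_softmax
        total += -(target_probs[i] * row).sum()
        np.exp(row, out=row)                 # row now holds softmax
        row -= target_probs[i]               # softmax - target
        row /= n_rows                        # gradient of the mean loss
    return total / n_rows


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(2, 5))
    p = rng.random((2, 5))
    p /= p.sum(axis=1, keepdims=True)
    loss = inplace_soft_ce(z, p)
    print("loss:", loss)
    print("gradient now stored in the logits buffer:")
    print(z)
```

After the call, the original logits are gone. A backward pass built this way would return the overwritten buffer (scaled by the upstream gradient) and must not need the logits again afterwards.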

Related Issues

#112

Accuracy Test

Unit test added

Benchmark & Profiling

Config (B,T,V)  PyTorch (ms)    Triton (ms)     Speedup    PyTorch Mem (GB)   Triton Mem (GB) Memory Save 
-------------------------------------------------------------------------------------------------------------------
(1,1024,32000)  449.08          435.22          1.03x      1.85               0.98            46.7%       
(1,1024,64000)  167.10          467.80          0.36x      3.68               2.81            23.4%       
(1,4096,32000)  127.67          7.03            18.15x     7.32               5.62            23.3%       
(1,4096,64000)  20.78           24.35           0.85x      14.65              11.23           23.3%       
(1,8192,32000)  20.48           13.56           1.51x      21.48              14.65           31.8%       
(1,8192,64000)  41.14           48.11           0.86x      29.30              22.46           23.3%       
(1,16384,32000) 41.11           26.95           1.53x      42.97              29.30           31.8%

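Also about 50% faster in the (1,8192,32000) and (1,16384,32000) configurations (1.51x and 1.53x); the V=64000 configurations show no speedup.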
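For readers who want to re-derive the Speedup and Memory Save columns, the sketch below recomputes them from the timing and memory figures in the table. The formulas (PyTorch time divided by Triton time, and relative reduction in memory) are assumptions. They agree with the reported columns to within rounding of the printed inputs. The helper name `derived_columns` is hypothetical.

```python
# (config, pytorch_ms, triton_ms, pytorch_gb, triton_gb), copied from the table
ROWS = [
    ("(1,1024,32000)", 449.08, 435.22, 1.85, 0.98),
    ("(1,1024,64000)", 167.10, 467.80, 3.68, 2.81),
    ("(1,4096,32000)", 127.67, 7.03, 7.32, 5.62),
    ("(1,4096,64000)", 20.78, 24.35, 14.65, 11.23),
    ("(1,8192,32000)", 20.48, 13.56, 21.48, 14.65),
    ("(1,8192,64000)", 41.14, 48.11, 29.30, 22.46),
    ("(1,16384,32000)", 41.11, 26.95, 42.97, 29.30),
]


def derived_columns(pytorch_ms, triton_ms, pytorch_gb, triton_gb):
    """Return (speedup, memory_save_percent) under the assumed formulas."""
    speedup = pytorch_ms / triton_ms
    memory_save = (pytorch_gb - triton_gb) / pytorch_gb * 100.0
    return speedup, memory_save


for config, *values in ROWS:
    speedup, memory_save = derived_columns(*values)
    print(f"{config:<16} {speedup:6.2f}x {memory_save:5.1f}%")
```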

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://sgl-fru7574.slack.com/archives/C09784E3EN6 to discuss your PR.

gemini-code-assist · 2025-08-27T05:59:03Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

yubofredwang added 2 commits August 27, 2025 05:50

in place loss calculation

c835943

revert off

dd305d6

yubofredwang requested review from FlamingoPg, FrankLeeeee, shuaills and sleepcoo as code owners August 27, 2025 05:59

yubofredwang requested a review from zyksir August 27, 2025 06:01

remove comments

5a6ae98

sleepcoo merged commit d852345 into sgl-project:main Aug 27, 2025
2 checks passed

sleepcoo merged 3 commits into sgl-project:main from yubofredwang:optimize-loss-calc
