Conversation

@gitlost-murali (Contributor) commented Nov 30, 2025

Summary

When tensor parallelism is enabled, the reference model's logits are sharded across GPUs on the vocabulary dimension. Previously, we called full_tensor() to gather the complete vocab on each GPU before computing log probabilities.

This PR adds compute_logprobs_parallel(), which computes log probabilities in a distributed fashion using the log-sum-exp trick across the vocab shards.

Memory savings (measured)

| Scenario | Memory per GPU |
| --- | --- |
| Old (full_tensor + compute_logprobs) | 58 GB |
| New (parallel logprobs) | 34 GB |
| Saved | 24 GB (~41%) |

Old state usage:

[WandB memory chart: "current"]

New state (parallel logprobs) usage:

[WandB memory chart: "optimized"]

Tested with batch=4, seq_len=9k (1024 prompt tokens + 8192 response tokens), vocab=150k, TP=2

Changes

  • New: src/forge/util/parallel_logprobs.py - distributed log-prob computation for vocab-sharded DTensors
  • New: tests/unit_tests/util/test_parallel_logprobs.py - correctness tests against sequential implementation
  • Modified: src/forge/actors/reference_model.py - uses parallel version when TP is enabled

Implementation

Uses a distributed log-softmax without gathering the full vocabulary (a sketch follows the steps below):

  1. All-reduce MAX for numerical stability
  2. All-reduce SUM of local exp(x - max)
  3. Each rank gathers the target-token logits only for tokens whose ids fall in its vocab shard
  4. All-reduce SUM to combine (only the owning rank contributes a non-zero value)
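A minimal sketch of these four steps, assuming the local logits are a plain [batch, seq, vocab_shard] tensor (e.g. the DTensor's local shard via to_local()) and `group` is the tensor-parallel process group. Names like `sharded_logprobs` and `vocab_start` are illustrative, not the PR's actual API:

```python
import torch
import torch.distributed as dist


def sharded_logprobs(
    local_logits: torch.Tensor,   # [B, T, V_shard] — this rank's vocab shard
    targets: torch.Tensor,        # [B, T] — global token ids
    vocab_start: int,             # first vocab index owned by this rank
    group: dist.ProcessGroup | None = None,
    temperature: float = 1.0,
) -> torch.Tensor:
    local_logits = local_logits.float() / temperature
    vocab_shard = local_logits.size(-1)

    # 1. All-reduce MAX for numerical stability.
    global_max = local_logits.max(dim=-1, keepdim=True).values
    dist.all_reduce(global_max, op=dist.ReduceOp.MAX, group=group)

    # 2. All-reduce SUM of local exp(x - max) to get the global normalizer.
    sum_exp = (local_logits - global_max).exp().sum(dim=-1, keepdim=True)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=group)
    logsumexp = global_max + sum_exp.log()            # [B, T, 1]

    # 3. Gather the target-token logit, but only for tokens this shard owns.
    local_ids = targets - vocab_start
    owned = (local_ids >= 0) & (local_ids < vocab_shard)
    gathered = torch.gather(
        local_logits, -1, local_ids.clamp(0, vocab_shard - 1).unsqueeze(-1)
    ).squeeze(-1)
    target_logits = torch.where(owned, gathered, torch.zeros_like(gathered))

    # 4. All-reduce SUM so every rank ends up with the owning rank's value.
    dist.all_reduce(target_logits, op=dist.ReduceOp.SUM, group=group)

    return target_logits - logsumexp.squeeze(-1)
```

The actual compute_logprobs_parallel() in src/forge/util/parallel_logprobs.py may differ in signature and in how it handles alignment and temperature.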

Testing

  • Verified results match compute_logprobs() within a 1e-5 tolerance (see the sketch after this list)
  • Tested temperature scaling, alignment modes, numerical stability with extreme values
  • Tested 2-way vocab sharded config
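For reference, a single-process sketch of the core correctness check: emulate two vocab shards and verify that the shard-wise log-sum-exp combine matches a reference torch.log_softmax. This mirrors the idea of tests/unit_tests/util/test_parallel_logprobs.py but is not the actual test file:

```python
import torch


def reference_logprobs(logits, targets):
    return torch.log_softmax(logits.float(), dim=-1).gather(
        -1, targets.unsqueeze(-1)
    ).squeeze(-1)


def emulated_sharded_logprobs(logits, targets, num_shards=2):
    shards = logits.float().chunk(num_shards, dim=-1)
    # Emulate the all-reduce MAX / SUM over the vocab shards.
    global_max = torch.stack([s.max(dim=-1).values for s in shards]).max(dim=0).values
    sum_exp = sum((s - global_max.unsqueeze(-1)).exp().sum(dim=-1) for s in shards)
    logsumexp = global_max + sum_exp.log()
    target_logits = logits.float().gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return target_logits - logsumexp


torch.manual_seed(0)
logits = torch.randn(2, 5, 64) * 10      # large values to probe numerical stability
targets = torch.randint(0, 64, (2, 5))
assert torch.allclose(
    reference_logprobs(logits, targets),
    emulated_sharded_logprobs(logits, targets),
    atol=1e-5,
)
```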

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 30, 2025
@gitlost-murali gitlost-murali changed the title feat: Distributed log-prob computation for vocab-sharded reference model feat: Optimize reference model GPU usage by distributed log-prob computation on vocab-sharded logits Nov 30, 2025
@gitlost-murali gitlost-murali changed the title feat: Optimize reference model GPU usage by distributed log-prob computation on vocab-sharded logits feat: Reduce reference model memory usage with distributed log-probs comp Nov 30, 2025
@gitlost-murali gitlost-murali changed the title feat: Reduce reference model memory usage with distributed log-probs comp feat: Reduce reference model memory with parallel logprob computation Nov 30, 2025
@joecummings (Member) left a comment
I like this idea!

Could I ask for a few things?

  1. WandB logs that show the memory saved. This is always helpful as part of verifying correctness.
  2. Combine the parallel_logprobs and regular logprobs in the same file. No need to split that out just yet.
  3. Look for ways that this code could be simplified and/or factored out. Claude can be very verbose :)

Looking forward to getting this landed!

@gitlost-murali (Contributor, Author)

Thanks for the review @joecummings !

I refactored the code as per feedback. Less Claude footprint now :). Let me know if the code needs to be further simplified/refactored.

I attached the WandB chart images in the description. Also attaching them here:

Old state usage:

[WandB memory chart: "current"]

New state (parallel logprobs) usage:

[WandB memory chart: "optimized"]

@gitlost-murali gitlost-murali force-pushed the optimize-ref-model-usage branch from 53ddb5b to 20f59bf on December 3, 2025 20:36
@gitlost-murali (Contributor, Author)

Hi @felipemello1,

The unit tests were failing because pytz was missing from the CI env. I rebased on main now. It looks like #618 (easy - remove pytz) takes care of this.

Ran the tests locally. All pass. Can you trigger the tests again please?

Thanks!

@felipemello1 (Contributor) commented Dec 3, 2025

@gitlost-murali, thanks for opening the PR. Great results!

Can you try running the non-sharded version but compiling F.cross_entropy? e.g.

```python
@torch.compile()
def compute_logprobs(...):
    ...
```

I think that simply compiling it greatly reduces the memory, since it never materializes the intermediate activations. This might be something to do in addition to your work rather than in place of it. I am skeptical about using the log-sum-exp directly instead of F.cross_entropy, since the compiled version is highly optimized.
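For illustration, a hedged sketch of that suggestion: a cross-entropy based logprob computation wrapped in torch.compile so the intermediates can be fused. The function name and signature below are illustrative, not Forge's actual API:

```python
import torch
import torch.nn.functional as F


@torch.compile()
def compute_logprobs_compiled(
    logits: torch.Tensor,    # [B, T, V] full (non-sharded) logits
    targets: torch.Tensor,   # [B, T] token ids
    temperature: float = 1.0,
) -> torch.Tensor:
    # F.cross_entropy returns -log p(target); negate to get log-probs.
    logits = logits.float() / temperature
    return -F.cross_entropy(
        logits.flatten(0, 1),        # [B*T, V]
        targets.flatten(),           # [B*T]
        reduction="none",
    ).view_as(targets)
```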

Also, you might be interested in checking Nathan's old PRs in torchtune: meta-pytorch/torchtune#2782
