
fix(finetune): all_reduce val_loss for initial/final evaluation in multi-GPU training#2224

Open
NIK-TIGER-BILL wants to merge 2 commits into Lightning-AI:main from NIK-TIGER-BILL:fix/finetune-all-reduce-initial-final-eval

Conversation

@NIK-TIGER-BILL

What does this PR do?

Fixes #2116

Both `lora.py` and `full.py` have three validation checkpoints:

| Checkpoint | Reduces across devices? |
| --- | --- |
| Periodic (every `eval.interval` steps) | Yes: `fabric.all_reduce(…, reduce_op="mean")` |
| Initial (before training starts) | No: single-rank value logged/printed |
| Final (after training ends) | No: single-rank value logged/printed |

In a multi-GPU run this means the initial and final validation metrics only reflect rank 0's local data slice, which is both incorrect and non-reproducible.

Fix: Apply the same `detach / clone / all_reduce` pattern used in the periodic loop to both the initial and final validation blocks in `lora.py` and `full.py`.
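To make the effect of the missing reduction concrete, here is an illustrative sketch (not the repo's actual code): a hypothetical `StubFabric` class stands in for `lightning.fabric.Fabric` and averages the per-rank values handed to it, mimicking what `all_reduce(…, reduce_op="mean")` does across GPUs.

```python
# Illustrative sketch, assuming a 2-GPU run where each rank computed a
# different local validation loss. StubFabric is a hypothetical stand-in
# for lightning.fabric.Fabric; its all_reduce overwrites the tensor with
# the mean over all ranks, like the real collective does.
import torch


class StubFabric:
    def __init__(self, rank_values):
        self.rank_values = rank_values  # loss each rank would hold locally
        self.device = torch.device("cpu")

    def all_reduce(self, tensor, reduce_op="mean"):
        assert reduce_op == "mean"
        tensor.copy_(torch.stack(self.rank_values).mean())
        return tensor


# Per-rank validation losses: without the reduction, rank 0 would log
# its local 0.8 even though the mean over both ranks is higher.
rank_losses = [torch.tensor(0.8), torch.tensor(1.2)]
fabric = StubFabric(rank_losses)

val_loss = rank_losses[0]  # what rank 0 computed locally

# The same detach / clone / all_reduce pattern the periodic loop uses:
val_loss_tensor = val_loss.detach().clone().to(fabric.device)
fabric.all_reduce(val_loss_tensor, reduce_op="mean")
print(val_loss_tensor.item())  # the cross-rank mean, not rank 0's slice
```

The `detach().clone()` matters: reducing the loss tensor in place without cloning would mutate a tensor that may still be referenced elsewhere, and detaching keeps the collective out of the autograd graph.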


…al evaluation

The periodic validation loop inside `fit()` in both `lora.py` and `full.py`
correctly reduces the validation loss across all Fabric devices:

    val_loss_tensor = val_loss.detach().clone().to(fabric.device)
    fabric.all_reduce(val_loss_tensor, reduce_op="mean")

However, the **initial validation** (run once before training) and the
**final validation** (run once after training) were missing this step.
In a multi-GPU run the logged and printed loss values would only reflect
rank 0's local data slice, producing incorrect (and non-reproducible)
metrics.

Fix: apply the same `detach / clone / all_reduce` pattern to both the
initial and final validation blocks in `lora.py` and `full.py`. The
`adapter.py` and `adapter_v2.py` variants do not have an
`initial_validation` / `final_validation` guard, so they are unaffected.

Fixes Lightning-AI#2116

Signed-off-by: NIK-TIGER-BILL <nik.tiger.bill@github.com>


Development

Successfully merging this pull request may close these issues.

Initial and final evaluation in finetune scripts do not accumulate over devices
