
fix(finetune): all_reduce val_loss for initial/final evaluation in multi-GPU training#2224

Open
NIK-TIGER-BILL wants to merge 2 commits into Lightning-AI:main from NIK-TIGER-BILL:fix/finetune-all-reduce-initial-final-eval

Conversation

@NIK-TIGER-BILL

What does this PR do?

Fixes #2116

Both `lora.py` and `full.py` have three validation checkpoints:

| Checkpoint | Reduces across devices? |
| --- | --- |
| Periodic (every `eval.interval` steps) | Yes: `fabric.all_reduce(…, reduce_op="mean")` |
| Initial (before training starts) | No: single-rank value logged/printed |
| Final (after training ends) | No: single-rank value logged/printed |

In a multi-GPU run this means the initial and final validation metrics only reflect rank 0's local data slice, which is both incorrect and non-reproducible.

Fix: Apply the same `detach / clone / all_reduce` pattern used in the periodic loop to both the initial and final validation blocks in `lora.py` and `full.py`.
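To make the effect of the missing reduction concrete, here is an illustrative sketch (not the repo's actual code): a hypothetical `StubFabric` class stands in for `lightning.fabric.Fabric` and averages the per-rank values handed to it, mimicking what `all_reduce(…, reduce_op="mean")` does across GPUs.

```python
# Illustrative sketch, assuming a 2-GPU run where each rank computed a
# different local validation loss. StubFabric is a hypothetical stand-in
# for lightning.fabric.Fabric; its all_reduce overwrites the tensor with
# the mean over all ranks, like the real collective does.
import torch


class StubFabric:
    def __init__(self, rank_values):
        self.rank_values = rank_values  # loss each rank would hold locally
        self.device = torch.device("cpu")

    def all_reduce(self, tensor, reduce_op="mean"):
        assert reduce_op == "mean"
        tensor.copy_(torch.stack(self.rank_values).mean())
        return tensor


# Per-rank validation losses: without the reduction, rank 0 would log
# its local 0.8 even though the mean over both ranks is higher.
rank_losses = [torch.tensor(0.8), torch.tensor(1.2)]
fabric = StubFabric(rank_losses)

val_loss = rank_losses[0]  # what rank 0 computed locally

# The same detach / clone / all_reduce pattern the periodic loop uses:
val_loss_tensor = val_loss.detach().clone().to(fabric.device)
fabric.all_reduce(val_loss_tensor, reduce_op="mean")
print(val_loss_tensor.item())  # the cross-rank mean, not rank 0's slice
```

The `detach().clone()` matters: reducing the loss tensor in place without cloning would mutate a tensor that may still be referenced elsewhere, and detaching keeps the collective out of the autograd graph.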


…al evaluation

The periodic validation loop inside `fit()` in both `lora.py` and `full.py`
correctly reduces the validation loss across all Fabric devices:

    val_loss_tensor = val_loss.detach().clone().to(fabric.device)
    fabric.all_reduce(val_loss_tensor, reduce_op="mean")

However, the **initial validation** (run once before training) and the
**final validation** (run once after training) were missing this step.
In a multi-GPU run the logged and printed loss values would only reflect
rank 0's local data slice, producing incorrect (and non-reproducible)
metrics.

Fix: apply the same `detach / clone / all_reduce` pattern to both the
initial and final validation blocks in `lora.py` and `full.py`. The
`adapter.py` and `adapter_v2.py` variants do not have an
`initial_validation` / `final_validation` guard, so they are unaffected.

Fixes Lightning-AI#2116

Signed-off-by: NIK-TIGER-BILL <nik.tiger.bill@github.com>


Development

Successfully merging this pull request may close these issues.

Initial and final evaluation in finetune scripts do not accumulate over devices
