[Fix] Reasonable loss for non-distributed training #242

joecummings · 2025-09-26T19:31:10Z

Context

This PR fixes an issue where the loss at the first training step was astronomically high (>500). The issue was traced to a very high KL divergence value, which measures the difference in logprobs between the reference model and the training model. On the very first step this definitely shouldn't happen b/c both are the same model at that point, so the KL term should be close to zero.
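
A minimal sketch of the sanity check described above, assuming per-token logprobs are available as plain lists of floats. The estimator used here (the "k3" form, exp(d) - d - 1 with d = ref - policy) is an assumption picked for illustration; this PR does not say which KL estimator the loss uses, and `kl_per_token` is a hypothetical helper, not a function from forge or TorchTitan.

```python
import math


def kl_per_token(policy_logprobs, ref_logprobs):
    """Estimate KL(policy || ref) per token with the k3 estimator.

    Assumption: both inputs hold logprobs of the same sampled tokens.
    """
    kls = []
    for lp, lr in zip(policy_logprobs, ref_logprobs):
        d = lr - lp  # log(ref / policy) for the sampled token
        kls.append(math.exp(d) - d - 1.0)  # never negative, 0 when d == 0
    return kls


# Identical weights give identical logprobs, so every KL term is exactly 0.
same = kl_per_token([-1.2, -0.3, -2.5], [-1.2, -0.3, -2.5])

# Mismatched weights can give very different logprobs and a large KL term.
# These numbers are made up for illustration.
different = kl_per_token([-0.1], [-9.0])
```

If the first-step KL from a check like this is far from zero, the two models are not producing the same logprobs, which points at the weights rather than at the loss code.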

That suggested the weights on the reference model and the training model were not actually the same. Comparing against a forward pass of the Hugging Face model confirmed this hypothesis.

Fix

Load in the reference model weights correctly through the TorchTitan APIs.

Before

After

To-dos

Confirm that loss is still reasonable under the distributed setting. This likely needs "Weight loading working correctly with tp: use vllm builtin load_weights()" (#184) to land first.
The test script attached to this PR shows that the outputs from the Hugging Face model impl and the TorchTitan model impl are similar but not exactly the same. Some difference is expected b/c of the RoPE implementation chosen; however, it's worth investigating whether this difference is too large.
Formulate a better plan for how to expose Titan "APIs" - this debugging experience was a nightmare b/c there is no way to understand what's going on in Titan without going to that code base and then in addition going to the experiments/forge folder. This is untenable.
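
For the second to-do above, one way to put a number on "similar but not exactly the same" is to report the maximum absolute difference alongside a tolerance check. This is a minimal sketch in plain Python, not the test script attached to this PR: `compare_outputs` is a hypothetical helper, the tolerances are arbitrary defaults, and the logit values are made up.

```python
def compare_outputs(a, b, atol=1e-3, rtol=1e-3):
    """Return (max_abs_diff, all_close) for two equal-length lists of floats.

    The closeness rule is the usual |a - b| <= atol + rtol * |b| check.
    """
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    max_abs_diff = max(abs(x - y) for x, y in zip(a, b))
    all_close = all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))
    return max_abs_diff, all_close


# Made-up values standing in for logits from the two implementations.
hf_logits = [0.5, -1.25, 3.0]
titan_logits = [0.5005, -1.2496, 3.002]
diff, close = compare_outputs(titan_logits, hf_logits)
```

Tracking the maximum difference per layer, rather than only at the final output, would help tell a RoPE-sized discrepancy from a weight-loading one.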

JenniferWang · 2025-09-26T19:52:29Z

src/forge/actors/reference_model.py

    async def setup(self):
        engine_config = {f.name: getattr(self, f.name) for f in fields(self)}
        self.engine = ForgeEngine(ForgeJobConfig(**engine_config))
+        self.engine.checkpointer.load()


I was actually staring at this on the trainer side... It's unclear at a glance how checkpointer.load is associated with loading the HF model weights.

Yeah it's not very clear without digging into the TorchTitan checkpointing code here: https://github.com/pytorch/torchtitan/blob/5b5d46856b400c8550989415bee91473aab4f921/torchtitan/components/checkpoint.py#L523

All the information is taken from the config and instantiated into the CheckpointManager. Then the load call only takes a "step", which in our case isn't needed b/c it should be a static model every time.
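
The pattern described above (everything comes from the config at construction time, and the load call only takes a step) can be sketched with a toy class. This is not TorchTitan's CheckpointManager: the class names, field names, and path format below are all hypothetical and only illustrate why a bare load call can still pick up the initial model weights.

```python
from dataclasses import dataclass


@dataclass
class ToyCheckpointConfig:
    # Hypothetical fields; not TorchTitan's actual config schema.
    base_weights_path: str
    folder: str = "checkpoints"


class ToyCheckpointManager:
    """Toy stand-in: configured once, then asked to load by step."""

    def __init__(self, config):
        self.config = config

    def resolve_path(self, step=-1):
        # With no step given, fall back to the base weights named in the config.
        if step < 0:
            return self.config.base_weights_path
        return f"{self.config.folder}/step-{step}"


manager = ToyCheckpointManager(ToyCheckpointConfig(base_weights_path="hf://some-model"))
```

In this toy, `manager.resolve_path()` returns the base weights path, which mirrors how the reference model never needs a step.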

Jack-Khuu · 2025-09-26T20:06:36Z

src/forge/actors/reference_model.py

    async def setup(self):
        engine_config = {f.name: getattr(self, f.name) for f in fields(self)}
        self.engine = ForgeEngine(ForgeJobConfig(**engine_config))
+        self.engine.checkpointer.load()


meta-cla bot added the CLA Signed label Sep 26, 2025

joecummings force-pushed the compare-against-hf-trainer branch from 1148c66 to 1846b85 September 26, 2025 19:34

joecummings added 2 commits September 26, 2025 12:35

Correctly load weights on Titan reference model

45b8b14

Update config

c0c0a35

joecummings force-pushed the compare-against-hf-trainer branch from 1846b85 to c0c0a35 September 26, 2025 19:35

joecummings marked this pull request as ready for review September 26, 2025 19:50

JenniferWang reviewed Sep 26, 2025

joecummings requested a review from allenwang28 September 26, 2025 20:05

Jack-Khuu reviewed Sep 26, 2025

allenwang28 approved these changes Sep 26, 2025

joecummings merged commit afdee53 into meta-pytorch:main Sep 26, 2025
5 checks passed
