Fix loss masking #445

oleksost · 2026-01-09T19:13:42Z

✨ Description

Addresses #442

Loss masks should also cover padding and image placeholder tokens, so that these positions are excluded from the loss.
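
As a minimal sketch of the intent, assuming padding and image placeholder positions carry negative label ids (which is what the `labels >= 0` check in this PR suggests), a mask can be derived from the labels and used to average the per-token loss over valid positions only. The helper names and the `-100` sentinel are illustrative and not taken from Fast-LLM; plain Python lists stand in for tensors.

```python
import math

IGNORE_ID = -100  # hypothetical sentinel for padding / image placeholders


def build_loss_mask(labels):
    """True where a label is a real token id, False for negative sentinels."""
    return [label >= 0 for label in labels]


def masked_cross_entropy(logits, labels, loss_mask):
    """Mean cross-entropy over positions where loss_mask is True."""
    total, count = 0.0, 0
    for row, label, keep in zip(logits, labels, loss_mask):
        if not keep:
            continue  # masked positions contribute neither loss nor count
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count if count else 0.0


labels = [2, IGNORE_ID, 0, IGNORE_ID]
logits = [[0.0, 0.0, 0.0]] * 4  # uniform predictions over 3 classes
loss_mask = build_loss_mask(labels)
loss = masked_cross_entropy(logits, labels, loss_mask)
```

Dividing by the number of unmasked positions rather than the sequence length keeps the loss scale independent of how much padding a batch contains.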

TODO:

not sure if loss masking spans and image placeholders should also be ignored in the layer-wise losses

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

Change A
Change B

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

📜 I have read and followed the contributing guidelines.
🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
🎉 The functionality is complete, and I have tested the changes.
📝 I have updated the documentation if needed.
⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

🐋 I have updated the Docker configuration or dependencies, if applicable.
🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

🧪 I have added or updated tests to cover my changes.
✔️ New and existing tests pass locally with my changes.
🚦 I have tested these changes on GPUs and verified training stability.
🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

📊 I have run benchmarks where applicable to evaluate the performance impact.
✅ The benchmarks show no performance regression.
🚀 The benchmarks indicate a potential performance improvement.
⚠️ The benchmarks indicate a potential performance degradation.
📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:

🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.


jlamypoirier · 2026-01-09T22:46:54Z

fast_llm/models/gpt/model.py


                labels = batch.tokens.crop(labels_begin, labels_end).tokens
-
+                loss_mask = labels >= 0


Is this really what we want? We can't train the model to produce these labels, but it might make sense to compute other losses?

Can we skip this when not needed?

jlamypoirier · 2026-01-09T22:48:56Z

fast_llm/models/gpt/model.py


+                if (
+                    self._config.head.distillation_model is not None
+                    or self._config.decoder.block.distillation_model is not None


Activation distillation ignores loss_mask; it uses activation_mask instead.
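
One possible reading of the two review comments above is that the mask only needs to be built when some loss actually consumes it. The sketch below uses hypothetical stand-in config classes (not the real Fast-LLM configuration, and flatter than the `decoder.block` nesting in the diff). It assumes, per the comment, that block-level activation distillation relies on `activation_mask`, so only head-level distillation triggers the mask.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HeadConfig:  # hypothetical stand-in, not the Fast-LLM class
    distillation_model: Optional[str] = None


@dataclass
class BlockConfig:  # hypothetical stand-in
    distillation_model: Optional[str] = None


@dataclass
class ModelConfig:  # hypothetical stand-in
    head: HeadConfig = field(default_factory=HeadConfig)
    block: BlockConfig = field(default_factory=BlockConfig)


def needs_loss_mask(config: ModelConfig) -> bool:
    # Assumption from the review: only head-level distillation reads loss_mask;
    # activation (block-level) distillation uses activation_mask instead.
    return config.head.distillation_model is not None


def maybe_build_loss_mask(config: ModelConfig, labels):
    # Skip the extra work entirely when nothing consumes the mask.
    if not needs_loss_mask(config):
        return None
    return [label >= 0 for label in labels]
```

Whether other losses in the real model also need the mask is not settled in this thread, so the condition in `needs_loss_mask` is only a placeholder for that decision.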

jlamypoirier · 2026-01-09T23:26:24Z

TODO:

not sure if loss masking spans and image placeholders should also be ignored in the layer-wise losses

Does that even make sense? These refer to token prediction, which isn't really a thing at the activation stage. I guess we could take the next token, but that raises several concerns (especially with MTP).

Actually, I think we shouldn't mask those. They may not be used for next-token prediction, but the keys and values resulting from these activations are used further down in the sequence, which means we do train these activations.
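
The argument above can be illustrated with a toy example. This is a sketch under simplifying assumptions (scalar activations, a single causal attention head, fixed queries and keys), not Fast-LLM code. It estimates by finite differences the gradient of a masked loss with respect to the value at a masked position and shows that it is nonzero, because later unmasked positions attend to it.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head causal attention over scalar activations."""
    outputs = []
    for t, q in enumerate(queries):
        scores = [q * keys[s] for s in range(t + 1)]  # attend to positions <= t
        max_score = max(scores)
        weights = [math.exp(x - max_score) for x in scores]
        norm = sum(weights)
        outputs.append(sum(w * values[s] for s, w in enumerate(weights)) / norm)
    return outputs


def masked_loss(values, loss_mask):
    """Sum of squared outputs at unmasked positions only."""
    queries = [1.0, 1.0, 1.0]
    keys = [0.0, 0.0, 0.0]  # equal keys give uniform attention weights
    outputs = causal_attention(queries, keys, values)
    return sum(o * o for o, keep in zip(outputs, loss_mask) if keep)


loss_mask = [False, True, True]  # position 0 is a masked placeholder
values = [1.0, 2.0, 3.0]
eps = 1e-6
bumped = [values[0] + eps] + values[1:]
# Finite-difference gradient of the loss with respect to the masked position's value.
grad_masked_position = (masked_loss(bumped, loss_mask) - masked_loss(values, loss_mask)) / eps
```

Position 0 contributes no loss term of its own, yet changing its value changes the loss at positions 1 and 2, which is the sense in which those activations are still trained.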

add padding and image placeholder into loss mask

e6560db

oleksost requested a review from jlamypoirier January 9, 2026 19:13

oleksost added 4 commits January 9, 2026 19:27

Merge remote-tracking branch 'origin/main' into loss_masking_fixes

1e228d8

Merge branch 'main' into loss_masking_fixes

d9a96c6

test

0139497

Merge branch 'loss_masking_fixes' of https://github.com/ServiceNow/Fa…

37d1f03

…st-LLM into loss_masking_fixes

jlamypoirier reviewed Jan 9, 2026


3 participants

