
Conversation

@joecummings joecummings commented Dec 2, 2025

Context

Our current SFT recipe uses packing. However, in order to use packing you need to pass a block_causal mask in the forward pass. We construct this mask, but it is ignored in titan because titan only allows an additional mask to be passed in if the model definition specifies it and denotes that the model is using flex attention. Since we are unable to control the exact model definitions ourselves, this is a temporary fix to ensure that our training is correct.
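
For reference, a minimal illustration (not code from this PR; the helper name is made up) of why packing needs a block-causal mask: each packed sequence gets its own causal block, so tokens cannot attend across sequence boundaries the way a plain causal mask would silently allow when the mask is ignored.

import torch

def block_causal_mask(seq_lens):
    # Boolean attention mask over the packed sample: True = "may attend".
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seq_lens:
        # Lower-triangular (causal) block confined to this sequence's span.
        mask[start:start + n, start:start + n] = torch.tril(torch.ones(n, n, dtype=torch.bool))
        start += n
    return mask

# Two sequences of lengths 2 and 3 packed into one sample of length 5.
print(block_causal_mask([2, 3]).int())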

The TRUE fix(es) would be:

  1. Move everything to nightlies and push qwen3_flex and llama3_flex versions to titan, then default to using those
  2. Work with the Titan team to allow us to override model definitions for cases like this, where we might want to try out both flex and normal attention

Changes

SFT main.py

  • Remove references to packing code
  • Import and use new padding function
  • Add validation that raises an error if training.compile=True, since compile currently requires flex attention

Collate.py

  • Add a new function that pads to the longest sequence in the batch (see the sketch below)
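
A rough sketch of what the padding collate might look like, assuming samples are dicts of Python lists keyed by "tokens" and "labels" (the actual signature and non-tensor field handling in collate.py may differ):

import torch

CROSS_ENTROPY_IGNORE_IDX = -100

def collate_padded(batch):
    # Pad every sample to the longest sequence in the batch:
    # tokens with 0, labels with CROSS_ENTROPY_IGNORE_IDX (-100).
    max_len = max(len(sample["tokens"]) for sample in batch)
    tokens, labels = [], []
    for sample in batch:
        pad = max_len - len(sample["tokens"])
        tokens.append(torch.tensor(sample["tokens"] + [0] * pad))
        labels.append(torch.tensor(sample["labels"] + [CROSS_ENTROPY_IGNORE_IDX] * pad))
    return {"tokens": torch.stack(tokens), "labels": torch.stack(labels)}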

Configs:

  • Increased batch size from 1 to 8, since we can no longer fit multiple sequences in a single sample

Testing

  1. Works with default configs as confirmed by training loss and eval loss decreasing

Wandb logs: https://wandb.ai/jcummings/sft-training

  2. Validation works correctly

See the output below with compile=True:

[ForgeSFTRecipe-2/8] 2025-12-03 09:47:27 CRITICAL Unhandled exception in actor endpoint
Traceback (most recent call last):
  File "/home/jrcummings/.conda/envs/forge-uv/lib/python3.12/site-packages/monarch/_src/actor/actor_mesh.py", line 935, in handle
    result = await the_method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jrcummings/projects/joe-forge/apps/sft/main.py", line 100, in setup
    raise ValueError(
ValueError: training.compile=True is not currently supported. Compile is only supported with flex attention enabled, which requires PyTorch nightly. Please set training.compile=false in your config.

Open questions

  • Why is the number of samples processed different between Llama3 8B and Qwen3 8B? Their sequence lengths should be the same.

@meta-cla bot added the CLA Signed label Dec 2, 2025
@joecummings marked this pull request as ready for review December 3, 2025 18:17
if self.job_config.training.compile:
    raise ValueError(
        "training.compile=True is not currently supported. "
        "Compile is only supported with flex attention enabled, which requires PyTorch nightly. "
        "Please set training.compile=false in your config."
    )
Contributor

Any objection to starting a main issue tracking the nightly build?

Member Author

Sounds good! But can we first nail down the different subtasks via the Google Doc I just shared? Then we can translate to a GI.

Comment on lines +64 to +66
# Flatten if all are lists
if all(isinstance(item, list) for item in result[key]):
    result[key] = [item for sublist in result[key] for item in sublist]
Contributor

Is this a common practice? Feels like unnecessary operation / tribal knowledge

Comment on lines +19 to +21
Pads 'tokens' with 0 and 'labels' with CROSS_ENTROPY_IGNORE_IDX (-100).
Non-tensor fields (like metrics) are collected into lists and flattened
if all items are lists.
Contributor

Is it common practice to assume tokens and labels are the keys for collate_padded?
