Adding eval to the SFT #404

HosseinKaviani-H · 2025-10-14T18:09:02Z

Add periodic evaluation during training with epoch-aware synchronization

Added evaluation functionality to the SFT training recipe with proper multi-rank synchronization and epoch completion detection.

Changes:

Core Evaluation Features

Configurable evaluation interval: Added eval_interval and eval_steps parameters to control when and how much to evaluate
- eval_interval: Number of training steps between evaluations (defaults to float('inf') to disable eval when not configured)
- eval_steps: Number of validation batches to evaluate per evaluation run (defaults to 0 for unlimited - runs one full epoch)
Validation dataloader: Set up separate validation dataloader using the last 10% of the train split
Forward-only pass: Implemented forward_only() method for evaluation without gradient computation, supporting both pipeline parallel and non-PP configurations

Epoch-Aware Evaluation with Multi-Rank Synchronization

Epoch completion detection: Evaluates for exactly one complete epoch by monitoring batch["metrics"] for epoch increments
- Extracts num_epochs from batch metadata to detect when validation dataset completes one full pass
- Prevents evaluation from running forever on infinite streaming dataloaders
Non-blocking all_reduce pattern: Synchronizes epoch completion across all ranks without blocking computation
- Prefetch + async pattern: Fetches next batch while processing current batch
- Overlapped communication: Starts async_op=True all_reduce on next batch's epoch while GPU computes current batch's loss
- Deferred checking: Checks all_reduce result in next iteration (should be complete by then)
- Early stopping: All ranks stop when ANY rank completes an epoch (via MAX reduction)

Integration

Training integration: Evaluation runs automatically every eval_interval steps during training
Graceful degradation: If eval_steps > 0, it acts as a cap (useful for quick validation checks or when epoch metadata is unavailable)

Usage:

Configure in your YAML config file:

training:
  eval_interval: 100  # Evaluate every 100 training steps
  eval_steps: 0       # Run one complete epoch (recommended)
  # eval_steps: 50    # Alternative: cap at 50 batches for faster checks

If eval_intervaland eval_steps are not set, evaluation is automatically disabled.

Testing:
Comprehensive test suite (test_evaluate.py) validates:

✅ Epoch extraction from batch metadata
✅ Single epoch completion detection
✅ eval_steps cap enforcement
✅ Empty/single batch edge cases
✅ Async all_reduce pattern behavior
✅ Multi-rank synchronization logic
✅ Prefetch pattern correctness
All 14 tests pass successfully.

Algorithm Details:
The non-blocking evaluation loop follows this pattern:

Iteration N:

Check if previous all_reduce says we should stop
Process current batch (forward pass, compute loss)
Prefetch next batch
Extract epoch from next batch
Start async all_reduce on epoch_increment (returns immediately, doesn't block)

Iteration N+1:

Wait for all_reduce from iteration N (should be done, or very fast)
Check result: if any rank has epoch_increment > 0, stop
Process batch N+1...
This overlaps network communication with GPU computation for better performance, while ensuring all ranks stop at the same point.

This updated description captures:

The epoch detection mechanism
The async all_reduce implementation details
The performance benefits of the overlapped communication pattern
Test coverage
Clear usage examples
Algorithm explanation for reviewers

felipemello1 · 2025-10-14T19:37:11Z

hey @HosseinKaviani-H , thanks for opening the PR. Its a bit tricky to run the validation, because the dataset is infinite. So it doesnt know when to stop. You can retrieve the epoch number for each dataset from batch["metrics'], but we haven't looked into that. On top of that, if you have multiple datasets, they will epoch at different paces. I think that there are a few ways on handling this:

changing the dataset definition for validation / adding some flag / some logic that goes in the dataset
Trying to orchestrate it in the recipe, i.e. checking if the dataset has epoched and stopping then.

It seems that you defined "eval_steps" as a clever solution to not deal with any of that. But i wonder about correctness here, i.e. not going through the entire eval, or going 1.5x times, for example. Any thoughts?

apps/sft/llama3_8b.yaml

HosseinKaviani-H · 2025-10-14T21:57:34Z

hey @HosseinKaviani-H , thanks for opening the PR. Its a bit tricky to run the validation, because the dataset is infinite. So it doesnt know when to stop. You can retrieve the epoch number for each dataset from batch["metrics'], but we haven't looked into that. On top of that, if you have multiple datasets, they will epoch at different paces. I think that there are a few ways on handling this:

changing the dataset definition for validation / adding some flag / some logic that goes in the dataset

Trying to orchestrate it in the recipe, i.e. checking if the dataset has epoched and stopping then.

It seems that you defined "eval_steps" as a clever solution to not deal with any of that. But i wonder about correctness here, i.e. not going through the entire eval, or going 1.5x times, for example. Any thoughts?

Hi @felipemello1 ,

Thanks for your comment. Yeah I think one good solution as you mentioned is to retrieve the epoch number in the training loop and once it hits 0 to 1, it breaks. I'll try to give it some thoughts and implement it.

And yes, counting batches is arbitrary here as if eval_steps is too low it could lead to incomplete evaluation or too high it might cause duplicate evaluation. Hence, checking epoch number sounds a better solution here.

ebsmothers · 2025-10-14T22:10:49Z

Leaving this comment here before a full review since I think it's relevant to the point raised by @felipemello1: previously @DNXie observed challenges with iterable dataset hanging when there are uneven numbers of samples across the ranks. In general this is a pretty hard problem to solve cleanly. So actually I would recommend going with the approach of using a fixed number of steps. You can see the full context in this torchtitan issue: pytorch/torchtitan#1618

felipemello1 · 2025-10-14T22:13:05Z

dataset hanging when there are uneven numbers of samples across the ranks.

@ebsmothers this should never happen to us, since we have inifinite datasets. Thats one of the main args for infinite iterable: you dont have to worry about hanging issues. It just restarts the iter and keeps providing new samples.

ebsmothers · 2025-10-14T22:25:47Z

@felipemello1 sorry maybe I don't fully understand your suggestions then. What is the termination condition for the validation loop? If it is epoch-based in any way I think we will run into this issue, right?

felipemello1 · 2025-10-14T22:31:16Z

@felipemello1 sorry maybe I don't fully understand your suggestions then. What is the termination condition for the validation loop? If it is epoch-based in any way I think we will run into this issue, right?

we can identify the change in epoch and drop last using all_gather + barrier. Dummy example for single dataset:

for batch in dataloader:
      epoch = batch["metrics"]["epoch"]
      is_new_epoch = epoch - prev_epoch

	  # if any rank has a new epoch, we stop, i.e. drop last batch
      is_new_epoch = all_gather(is_new_epoch)
      torch.distributed.barrier()
      if is_new_epoch>0:
            break

In the example above, for bsz=4, maybe rank_0 would have 2 samples from epoch 0 and 2 from epoch 1. But the batch size would always be 4. It would never hang.

Maybe this could be done elegantly inside of the dataset and hide the logic from the recipe? but i dont think that there is a pretty way.

Also not sure how to handle the multidataset situation. Perhaps:

# iterate over one dataset at a time
for dataloader in dataloaders:
      for batch in dataloader:
            ...

does it make sense @ebsmothers ?

ebsmothers · 2025-10-14T22:41:20Z

@felipemello1 that's an interesting idea. In addition to your point about it not being super pretty, I am also wary of the torch.distributed.barrier() usage. I understand why it's necessary here but blocking for all ranks on every single batch is not ideal imo

felipemello1 · 2025-10-14T22:44:01Z

@felipemello1 that's an interesting idea. In addition to your point about it not being super pretty, I am also wary of the torch.distributed.barrier() usage. I understand why it's necessary here but blocking for all ranks on every single batch is not ideal imo

We could add to the ugliness and prefetch + check epoch change on a different stream one epoch in advance, so it would be non blocking. This can be an utility and removed from the recipe. It would also only happen for validation (training is safe).

HosseinKaviani-H · 2025-10-16T21:48:20Z

Add periodic evaluation during training with epoch-aware synchronization

Added evaluation functionality to the SFT training recipe with proper multi-rank synchronization and epoch completion detection.

Changes:

Core Evaluation Features

Configurable evaluation interval: Added eval_interval and eval_steps parameters to control when and how much to evaluate

eval_interval: Number of training steps between evaluations (defaults to float('inf') to disable eval when not configured)

eval_steps: Number of validation batches to evaluate per evaluation run (defaults to 0 for unlimited - runs one full epoch)

Validation dataloader: Set up separate validation dataloader using the last 10% of the train split

Forward-only pass: Implemented forward_only() method for evaluation without gradient computation, supporting both pipeline parallel and non-PP configurations

Epoch-Aware Evaluation with Multi-Rank Synchronization

Epoch completion detection: Evaluates for exactly one complete epoch by monitoring batch["metrics"] for epoch increments

Extracts num_epochs from batch metadata to detect when validation dataset completes one full pass

Prevents evaluation from running forever on infinite streaming dataloaders

Non-blocking all_reduce pattern: Synchronizes epoch completion across all ranks without blocking computation

Prefetch + async pattern: Fetches next batch while processing current batch

Overlapped communication: Starts async_op=True all_reduce on next batch's epoch while GPU computes current batch's loss

Deferred checking: Checks all_reduce result in next iteration (should be complete by then)

Early stopping: All ranks stop when ANY rank completes an epoch (via MAX reduction)

Integration

Training integration: Evaluation runs automatically every eval_interval steps during training

Graceful degradation: If eval_steps > 0, it acts as a cap (useful for quick validation checks or when epoch metadata is unavailable)

Usage:

Configure in your YAML config file:
training:
  eval_interval: 100  # Evaluate every 100 training steps
  eval_steps: 0       # Run one complete epoch (recommended)
  # eval_steps: 50    # Alternative: cap at 50 batches for faster checks
If eval_intervaland eval_steps are not set, evaluation is automatically disabled.

Testing: Comprehensive test suite (test_evaluate.py) validates:

✅ Epoch extraction from batch metadata ✅ Single epoch completion detection ✅ eval_steps cap enforcement ✅ Empty/single batch edge cases ✅ Async all_reduce pattern behavior ✅ Multi-rank synchronization logic ✅ Prefetch pattern correctness All 14 tests pass successfully.

Algorithm Details: The non-blocking evaluation loop follows this pattern:

Iteration N:

Check if previous all_reduce says we should stop

Process current batch (forward pass, compute loss)

Prefetch next batch

Extract epoch from next batch

Start async all_reduce on epoch_increment (returns immediately, doesn't block)

Iteration N+1:

Wait for all_reduce from iteration N (should be done, or very fast)

Check result: if any rank has epoch_increment > 0, stop

Process batch N+1...
This overlaps network communication with GPU computation for better performance, while ensuring all ranks stop at the same point.

This updated description captures:

The epoch detection mechanism

The async all_reduce implementation details

The performance benefits of the overlapped communication pattern

Test coverage

Clear usage examples

Algorithm explanation for reviewers

@ebsmothers @felipemello1 Given our discussion and per Felipe's idea, I have implemented an epoch-based eval with non-blocking all-reduce. I have updated the description and added a test_evaluate script to cover different scenarios.

codecov-commenter · 2025-10-16T22:00:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@399b20d). Learn more about missing BASE report.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #404   +/-   ##
=======================================
  Coverage        ?   73.43%           
=======================================
  Files           ?       81           
  Lines           ?     7829           
  Branches        ?        0           
=======================================
  Hits            ?     5749           
  Misses          ?     2080           
  Partials        ?        0

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

felipemello1 · 2025-10-17T01:16:06Z

hey Hossein, thanks! I think that the tests are just mocking distributed and not testing it. @ebsmothers , do we have a decorator for distributed tests in forge? Regarding the implementation, i dont think we need >100 lines to do the sampling + epoch checking. Probably we can shrink it a bit

apps/sft/main.py

ebsmothers · 2025-10-16T19:35:39Z

apps/sft/main.py


        dataset = sft_iterable_dataset(
            model_transform=tokenizer,
            message_transform=AlpacaToMessages(),


Ideally we shouldn't hardcode this either (but it's a bit more work without instantiate)

I agree. We can have this implemented as soon as we fix the main eval functionality

apps/sft/llama3_8b.yaml

apps/sft/main.py

HosseinKaviani-H · 2025-10-17T21:18:23Z

hey Hossein, thanks! I think that the tests are just mocking distributed and not testing it. @ebsmothers , do we have a decorator for distributed tests in forge? Regarding the implementation, i dont think we need >100 lines to do the sampling + epoch checking. Probably we can shrink it a bit

@felipemello1 I have shortened the code a bit. Let me know if the distributed testing so I can have that implemented as well

felipemello1 · 2025-10-20T18:10:39Z

apps/sft/main.py

+        # Prefetch first batch
+        try:
+            next_batch = next(val_dataloader)
+        except StopIteration:


I think we can remove the defensive checks and assume that the dataset is infinite. We have a class for it. I think you can just do an assertion that its TuneIterableDataset (we need to update the name and remove "tune", but dont worry about this on this PR:

torchforge/src/forge/data/datasets/dataset.py

Line 117 in d464193

class InfiniteTuneIterableDataset(TuneIterableDataset):

Then we know its infinite, and we can remove try/except here and later in the loop. wdyt?

felipemello1 · 2025-10-20T18:11:08Z

apps/sft/main.py

+        return None
+
+    async def evaluate(self) -> dict[str, float]:
+        """Run evaluation with async all_reduce for cross-rank epoch synchronization."""


it might be worth enhancing this docstring a bit. Maybe add a small numerical example.

felipemello1 · 2025-10-20T18:14:18Z

apps/sft/main.py

+                if pending_work is not None:
+                    pending_work.wait()
+                    should_break = (
+                        epoch_tensor.item() > 0 if epoch_tensor is not None else False


as a rule of thumb, its not good to use item(), because it requires cpu-gpu synchonization

felipemello1 · 2025-10-20T18:18:36Z

apps/sft/main.py

+                    next_batch = next(val_dataloader)
+                    next_epoch = self._extract_epoch_from_batch(next_batch)
+
+                    if next_epoch is not None and starting_epoch is not None:


i am tempted to say lets remove the if/else check for None. Do you see a strong argument for keeping them? It would only be a problem if someone replaced our dataset abstraction, otherwise we always have the metric

felipemello1 · 2025-10-20T18:27:59Z

apps/sft/main.py

+        with torch.no_grad():
+            while True:
+                # Wait for previous async all_reduce to complete
+                if pending_work is not None:


I am thinking we could abstract most of it into some utility and have this (feel free to change var names)

epoch_incremented, next_max_epoch = False, None with torch.no_grad(): while True: # check if epoch incremented before getting new batch. # If so, stop iterating on the dataset epoch_incremented: bool = check_if_epoch_incremented(batch, next_max_epoch) if epoch_incremented: logger.info("bla bla bla") break # get next batch batch = next_batch next_batch = next(val_dataloader) # start non-blocking all_reduce for next batches epoch next_max_epoch: futures = get_distributed_max_epoch(next_batch)

not 100% sure this works. I think that get_distributed_max_epoch may need to return a tensor and futures?

felipemello1 · 2025-10-20T18:33:01Z

apps/sft/main.py

+        for model_part in self.model_parts:
+            model_part.train()
+
+        avg_loss = total_loss / max(num_batches, 1)


This might be incorrect. If we have one sample with 10 tokens and another sample with 100 tokens, we have to /110, not by 2. @ebsmothers can you confirm?

This kinda depends. One interpretation is that the val loss we report is the average over all batches in the val dataloader. In that case this would be the correct implementation. A second interpretation would be that the val loss we report is actually the aggregate loss over all tokens in the val dataset. This is the more "correct" one, as @felipemello1 points out. It especially matters for training.. btw this is why I did not enable gradient accumulation initially, as it needs some special handling. See e.g. the "Long digression" section in the summary of meta-pytorch/torchtune#1917. Since then titan has made some strides in this direction, see this function and this issue.

So I think the ideal state is that we use token-normalized loss (rather than batch-normalized as has been done here) for both training and validation. But for training it will likely require a little more work to fully support in titan. For validation it's more straightforward, you can see this snippet in torchtune's validation logic.

felipemello1 · 2025-10-20T18:34:19Z

apps/sft/main.py

+        for model_part in self.model_parts:
+            model_part.eval()
+
+        val_dataloader = iter(self.val_dataloader)


did you have a chance to consider the case for multidataset? what happens in this case?

felipemello1

Sorry for the delay. Left some comments/suggestions. We would need to test it in some distributed capacity. Were you able to run it for >1 node and confirm that it stopped right after 1 epoch?

ebsmothers · 2025-10-20T20:41:17Z

do we have a decorator for distributed tests in forge?

@felipemello1 Sorry I missed this before. We do have this utility but not sure if that's sufficient here. Another commonly-used class is FSDPTest, which handles a lot of the setup and teardown logic for a distributed test.

ebsmothers · 2025-10-20T21:41:21Z

apps/sft/main.py

+        dataset_val_config = self.job_config.get("dataset_val", {})
+        self.val_dataloader = self.setup_data(
+            dataset_path=dataset_val_config.get("path", dataset_config.get("path")),
+            dataset_split=dataset_val_config.get("split", dataset_config.get("split")),


nit but I don't like these nested .get calls. It also seems strange that we would fallback to validation on the training set. Personally I would just recommend checking if validation is enabled, and if it's not, don't even set up the validation dataloader at all.

ebsmothers · 2025-10-20T22:29:44Z

apps/sft/main.py

+                    return metric.value
+        return None
+
+    async def evaluate(self) -> dict[str, float]:


In addition to @felipemello1's more detailed comments, one higher-level point: this eval implementation is adding a lot of code to the main.py file, which as the entry point is something that everyone will have to read. (Specifically this PR alone has increased the total LoC by more than 50%, and the evaluate method alone is more than 100 lines due to boundary checking, edge case handling, etc.) I would like to see if we can find a more minimal way to introduce eval that doesn't expose the user to so much code complexity.

I think Felipe's suggestions of offloading to the dataset class, utilities, etc. are valuable. But would also like to re-raise the option of simplifying by only allowing eval for a fixed number of steps (at least for a first pass). Not gonna block on this, if we can do the cross-epoch accounting in a bit more of a clean, minimal way I am all for it.

Hossein Kavianihamedani and others added 4 commits October 8, 2025 10:56

Submitting an interactive notebook to run SFT

3653453

Submitting an interactive notebook to run SFT

baeb35b

Merge branch 'meta-pytorch:main' into main

7550664

Adding eval loop to the sft

a0f62e7

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 14, 2025

joecummings requested a review from daniellepintz October 14, 2025 18:45

HamidShojanazeri reviewed Oct 14, 2025

View reviewed changes

apps/sft/llama3_8b.yaml Outdated Show resolved Hide resolved

Implement Epoch-Based Evaluation with Non-Blocking All-Reduce

53371c6

ebsmothers reviewed Oct 17, 2025

View reviewed changes

Hossein Kavianihamedani added 2 commits October 17, 2025 14:08

Add configurable datasets and validation and shortening the code

4793948

Add configurable datasets and validation and shortening the code

676db88

Removed llama test eval

250c0cd

felipemello1 reviewed Oct 20, 2025

View reviewed changes

ebsmothers reviewed Oct 20, 2025

View reviewed changes

Adding eval to the SFT #404

Are you sure you want to change the base?

Adding eval to the SFT #404

Conversation

HosseinKaviani-H commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add periodic evaluation during training with epoch-aware synchronization

Changes:

Core Evaluation Features

Epoch-Aware Evaluation with Multi-Rank Synchronization

Integration

Usage:

Uh oh!

felipemello1 commented Oct 14, 2025

Uh oh!

Uh oh!

HosseinKaviani-H commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ebsmothers commented Oct 14, 2025

Uh oh!

felipemello1 commented Oct 14, 2025

Uh oh!

ebsmothers commented Oct 14, 2025

Uh oh!

felipemello1 commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ebsmothers commented Oct 14, 2025

Uh oh!

felipemello1 commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HosseinKaviani-H commented Oct 16, 2025

Add periodic evaluation during training with epoch-aware synchronization

Changes:

Core Evaluation Features

Epoch-Aware Evaluation with Multi-Rank Synchronization

Integration

Usage:

Uh oh!

codecov-commenter commented Oct 16, 2025

Codecov Report

Uh oh!

felipemello1 commented Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

HosseinKaviani-H commented Oct 17, 2025

Uh oh!

felipemello1 Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

felipemello1 Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

felipemello1 Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

felipemello1 Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HosseinKaviani-H commented Oct 14, 2025 •

edited

Loading

HosseinKaviani-H commented Oct 14, 2025 •

edited

Loading

felipemello1 commented Oct 14, 2025 •

edited

Loading

felipemello1 commented Oct 14, 2025 •

edited

Loading

felipemello1 commented Oct 17, 2025 •

edited

Loading

felipemello1 Oct 20, 2025 •

edited

Loading

felipemello1 Oct 20, 2025 •

edited

Loading

felipemello1 Oct 20, 2025 •

edited

Loading

felipemello1 Oct 20, 2025 •

edited

Loading

felipemello1 left a comment •

edited

Loading