[feat] Allow middle checkpoint evaluation in background using lmms-eval http server #127
Conversation
Add utilities for merging sharded FSDP2 checkpoints into single consolidated checkpoints for evaluation and inference. Includes base class and FSDP2 implementation with support for both regular and EMA checkpoints.
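For context, a minimal standalone sketch of what consolidating FSDP2 shards into a single state dict can look like. The function name `merge_fsdp2_shards`, the `.pt` shard-file layout, and the dim-0 concatenation are assumptions for illustration and may not match the PR's actual base class or API.

```python
# Illustrative sketch only -- the merger classes in this PR may be organized differently.
import glob
import torch


def merge_fsdp2_shards(shard_dir: str, output_path: str) -> None:
    """Concatenate per-rank FSDP2 shard files into one consolidated state dict."""
    shard_files = sorted(glob.glob(f"{shard_dir}/*.pt"))
    shard_state_dicts = [torch.load(f, map_location="cpu") for f in shard_files]

    merged = {}
    for key in list(shard_state_dicts[0].keys()):
        pieces = []
        for shard in shard_state_dicts:
            tensor = shard.pop(key)
            # DTensor shards expose their local chunk via ._local_tensor;
            # plain tensors are already full copies.
            pieces.append(getattr(tensor, "_local_tensor", tensor))
        # FSDP2 shards parameters along dim 0 by default, so concatenation
        # along dim 0 restores the full tensor.
        merged[key] = torch.cat(pieces, dim=0) if len(pieces) > 1 else pieces[0]

    torch.save(merged, output_path)
```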
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cdb5738e4d
```python
logger.info("Waiting for pending evaluation jobs to complete...")
while len(self.eval_backend.pending_evals) > 0:
    for eval_step, metrics in self.eval_backend.check_and_get_completed():
        if rank == 0:
```
Drain last eval results after pending jobs finish
When `wait_until_complete=True`, the loop condition only checks `pending_evals`. If the final job completes between the last `check_and_get_completed()` call and the next `while` condition check, the worker thread moves the job into `results_queue` and removes it from `pending_evals`, so the loop exits without ever logging that last result. This drops the final metrics at end-of-training. It can be fixed by draining the queue once more after the loop, or by looping until both `pending_evals` and the results queue are empty.
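A sketch of the drain-after-loop variant of that fix. The `_log_eval_metrics` helper, `poll_interval`, and the `time.sleep` call are placeholders assumed for illustration; they are not part of the PR's code.

```python
import time

# Poll until no jobs are pending, logging whatever has completed so far.
while len(self.eval_backend.pending_evals) > 0:
    for eval_step, metrics in self.eval_backend.check_and_get_completed():
        if rank == 0:
            self._log_eval_metrics(eval_step, metrics)  # hypothetical logging helper
    time.sleep(poll_interval)  # avoid busy-waiting between polls

# Final drain: pick up any result that landed between the last poll
# and the moment pending_evals became empty.
for eval_step, metrics in self.eval_backend.check_and_get_completed():
    if rank == 0:
        self._log_eval_metrics(eval_step, metrics)
```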
```python
for model_state_shard in shard_state_dicts:
    tensor = model_state_shard.pop(key)
    state_dict[key].append(tensor._local_tensor.bfloat16())
```
Preserve original dtype when consolidating shards
The merger unconditionally converts each shard tensor to `bfloat16` before concatenation. That silently downcasts checkpoints trained in fp32 or fp16, which can degrade accuracy or break downstream assumptions about dtype. Since this is a merge utility, it should preserve the shards' original dtype rather than forcing `bfloat16`.
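A sketch of the suggested change, dropping the cast so the consolidated checkpoint keeps whatever dtype the shards were saved in:

```python
for model_state_shard in shard_state_dicts:
    tensor = model_state_shard.pop(key)
    # No .bfloat16() cast -- the merged tensor keeps the original dtype
    # (fp32/fp16/bf16) that the checkpoint was trained and saved in.
    state_dict[key].append(tensor._local_tensor)
```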
…al http server (#127)

* rfc ema utils so that the attribute is being retrieved after the first init
* [feat] Add FSDP2 checkpoint merger module: utilities for merging sharded FSDP2 checkpoints into single consolidated checkpoints for evaluation and inference, with a base class and an FSDP2 implementation supporting both regular and EMA checkpoints
* [feat] Add eval server backend for asynchronous checkpoint evaluation
* [feat] Integrate eval server backend into FSDP2 trainer
* [feat] Add eval optional dependency with httpx
* [feat] Add lmms_engine_kwargs support for checkpoint merging
* [feat] Pass checkpoint_type to eval backend in validation_step
* [feat] Update version and config for eval/EMA features
* [fix] Fix EvalClient import and add eval_output_dir parameter
* [refactor] Remove output_dir and check_interval from EvalConfig
* [feat] Add eval_strategy check and wait for eval completion
* [feat] Define global_step as step_metric for eval metrics in wandb
* [feat] Use global_step in metrics for eval results logging
* [docs] Add async eval guide and update merge FSDP documentation
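As a side note on the wandb-related commits above: tying asynchronous eval results to the checkpoint's training step is typically done with `wandb.define_metric`. The sketch below shows the general pattern; the project name, the `eval/*` metric namespace, and the example values are assumptions, not taken from this PR.

```python
import wandb

# Hypothetical project name, shown only to illustrate the pattern described by
# "[feat] Define global_step as step_metric for eval metrics in wandb".
run = wandb.init(project="my-training-run")

wandb.define_metric("global_step")
wandb.define_metric("eval/*", step_metric="global_step")

# When a background evaluation of an older checkpoint finishes, logging that
# checkpoint's global_step alongside its metrics places the point at the right
# position on the x-axis even though training has already advanced further.
eval_step = 5000  # step of the checkpoint that was evaluated (example value)
wandb.log({"eval/accuracy": 0.71, "global_step": eval_step})
```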

Motivation
Modifications
Commit Message Convention
Please follow our standardized commit message format:
- `[feat]` - New features or functionality
- `[fix]` - Bug fixes
- `[docs]` - Documentation changes only
- `[style]` - Code style changes (formatting, missing semicolons, etc.)
- `[refactor]` - Code refactoring without changing functionality
- `[perf]` - Performance improvements
- `[test]` - Adding or updating tests
- `[chore]` - Maintenance tasks, dependency updates, etc.
- `[ci]` - CI/CD configuration changes

Examples:
- [feat] add qwen omni iterable dataset support
- [fix] resolve bagel model configuration error
- [docs] update training guide with YAML examples

See CONTRIBUTING.md for more details.
CI/CD Checks
Your PR will automatically run the following checks:
- Code formatting with `black` (line-length=120) and import sorting with `isort`
- Run `pre-commit run --all-files` locally to verify before pushing

Checklist
- Run `pre-commit run --all-files` and ensure all checks pass
- Format code with `black` (line-length=120) and `isort`