
Commit 3349d7b

[feat] Allow middle checkpoint evaluation in background using lmms-eval http server (#127)

* rfc: EMA utils so that the attribute is retrieved after the first init
* [feat] Add FSDP2 checkpoint merger module: utilities for merging sharded FSDP2 checkpoints into single consolidated checkpoints for evaluation and inference, including a base class and an FSDP2 implementation with support for both regular and EMA checkpoints
* [feat] Add eval server backend for asynchronous checkpoint evaluation
* [feat] Integrate eval server backend into FSDP2 trainer
* [feat] Add `eval` optional dependency with httpx
* [feat] Add `lmms_engine_kwargs` support for checkpoint merging
* [feat] Pass `checkpoint_type` to eval backend in `validation_step`
* [feat] Update version and config for eval/EMA features
* [fix] Fix `EvalClient` import and add `eval_output_dir` parameter
* [refactor] Remove `output_dir` and `check_interval` from `EvalConfig`
* [feat] Add `eval_strategy` check and wait for eval completion
* [feat] Define `global_step` as `step_metric` for eval metrics in wandb
* [feat] Use `global_step` in metrics for eval results logging
* [docs] Add async eval guide and update merge FSDP documentation

1 parent 0712fef

File tree

19 files changed: +1116 −123 lines

docs/index.rst (1 addition, 0 deletions)

@@ -19,6 +19,7 @@
    user_guide/peak_perf
    user_guide/merge_fsdp
    user_guide/fsdp2_reduce_dtype
+   user_guide/async_eval

 .. toctree::
    :maxdepth: 2

docs/user_guide/async_eval.md (new file, 254 additions, 0 deletions)

# Asynchronous Checkpoint Evaluation During Training

LMMs Engine supports asynchronous evaluation of model checkpoints during training. This lets you evaluate your model without interrupting the training process by submitting evaluation jobs to a separate LMMS-Eval server.

## Overview

When enabled, the training system:

1. Submits evaluation jobs to an LMMS-Eval server when checkpoints are saved
2. Continues training while evaluations run in the background
3. Periodically polls for evaluation results
4. Logs evaluation metrics when they become available

## Prerequisites

### Start the LMMS-Eval Server

Run the LMMS-Eval server before starting training. The server handles evaluation requests and returns results.

```bash
# Start the LMMS-Eval server on your evaluation machine
python -m lmms_eval.entrypoints.server --port 8000
```

The server listens for evaluation requests and performs evaluations asynchronously.

## Configuration

Enable asynchronous evaluation in your training configuration YAML:

```yaml
trainer_args:
  # Enable evaluation at specific intervals
  eval_strategy: "steps"  # Options: "steps", "epoch", "no"
  eval_steps: 500         # Evaluate every N steps (when eval_strategy="steps")

  # Evaluation configuration
  eval_config:
    # Server configuration
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0  # Poll server every 10 seconds

    # Model configuration
    model: "qwen_vl"         # Model name recognized by LMMS-Eval
    checkpoint_key: "model"  # Key to use in model_args for checkpoint path

    # Tasks to evaluate
    tasks:
      - "mmmu_val"
      - "textvqa_val"
      - "docvqa_val"

    # Model arguments passed to LMMS-Eval
    model_args:
      num_gpus: 8
      batch_size: 256
      max_length: 2048
      # Additional model-specific arguments
```

### Configuration Parameters

#### `eval_strategy`

- `"steps"`: Evaluate every `eval_steps` training steps
- `"epoch"`: Evaluate at the end of each epoch
- `"no"`: Disable evaluation (default)

#### `eval_config` Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `server_url` | string | URL of the LMMS-Eval server (e.g., `"http://localhost:8000"`) |
| `poll_interval` | float | Interval in seconds between polls for evaluation results (default: `10.0`) |
| `model` | string | Model name recognized by LMMS-Eval (e.g., `"qwen_vl"`) |
| `tasks` | list | List of evaluation tasks (e.g., `["mmmu_val", "textvqa_val"]`) |
| `checkpoint_key` | string | Key used in `model_args` to specify the checkpoint path |
| `model_args` | dict | Additional arguments passed to the model (e.g., `num_gpus`, `batch_size`) |
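
The schema above can be mirrored as a plain dataclass; this is an illustrative sketch only, since the actual `EvalConfig` class inside LMMs Engine may use different field names or defaults.

```python
from dataclasses import dataclass, field
from typing import Any

# Illustrative mirror of the eval_config table above; the real EvalConfig
# in LMMs Engine may differ (field names and defaults are assumptions).
@dataclass
class EvalConfig:
    server_url: str                     # URL of the LMMS-Eval server
    model: str                          # model name recognized by LMMS-Eval
    tasks: list[str] = field(default_factory=list)
    checkpoint_key: str = "model"       # key in model_args for the checkpoint path
    poll_interval: float = 10.0         # seconds between result polls
    model_args: dict[str, Any] = field(default_factory=dict)

cfg = EvalConfig(
    server_url="http://localhost:8000",
    model="qwen_vl",
    tasks=["mmmu_val", "textvqa_val"],
    model_args={"num_gpus": 8, "batch_size": 256},
)
```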

## How It Works

### 1. Checkpoint Saving

When a checkpoint is saved (according to `save_steps`), the trainer:

- Determines the checkpoint path (e.g., `./output/checkpoint-500`)
- Creates an evaluation output directory (e.g., `./output/checkpoint-500/eval`)
- Submits an evaluation job to the LMMS-Eval server
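
The job assembled at this step might look like the sketch below. `build_eval_job` and every payload field name are hypothetical illustrations of the steps listed above, not the server's actual request schema.

```python
import os

def build_eval_job(checkpoint_dir: str, cfg: dict) -> dict:
    """Assemble a hypothetical evaluation-job payload for one checkpoint.

    Field names are illustrative; the real request schema of the
    LMMS-Eval server may differ.
    """
    # Evaluation results go under the checkpoint directory, e.g.
    # ./output/checkpoint-500/eval
    eval_output_dir = os.path.join(checkpoint_dir, "eval")
    model_args = dict(cfg.get("model_args", {}))
    # The checkpoint path is injected under checkpoint_key ("model" by default).
    model_args[cfg.get("checkpoint_key", "model")] = checkpoint_dir
    return {
        "model": cfg["model"],
        "tasks": cfg["tasks"],
        "model_args": model_args,
        "output_path": eval_output_dir,
    }

job = build_eval_job(
    "./output/checkpoint-500",
    {"model": "qwen_vl", "tasks": ["mmmu_val"], "model_args": {"num_gpus": 8}},
)
```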

### 2. Background Polling

A background thread:

- Polls the LMMS-Eval server every `poll_interval` seconds
- Checks whether evaluation jobs have completed
- Retrieves results when available
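
A minimal sketch of such a polling thread, with the HTTP call replaced by a `get_status` callable (a stand-in; the real backend talks to the server over HTTP via httpx):

```python
import threading
import time

def poll_until_stopped(get_status, on_result, poll_interval=10.0):
    """Poll for finished jobs in a background thread.

    get_status() returns a dict of finished {job_id: metrics} and stands in
    for an HTTP request to the LMMS-Eval server; on_result receives each
    newly finished job exactly once.
    """
    seen = set()
    stop = threading.Event()

    def worker():
        while not stop.is_set():
            for job_id, metrics in get_status().items():
                if job_id not in seen:
                    seen.add(job_id)
                    on_result(job_id, metrics)
            time.sleep(poll_interval)

    threading.Thread(target=worker, daemon=True).start()
    return stop  # set() this event to stop polling, e.g. at end of training
```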

### 3. Metric Logging

When evaluation results are available:

- Metrics are logged to your tracking system (e.g., W&B, TensorBoard)
- Metrics include `global_step` to associate results with the training step
- Example logged metrics: `eval/mmmu_val/accuracy`, `eval/textvqa_val/accuracy`
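
Flattening per-task results into those metric names can be sketched as follows; the exact naming scheme is an assumption based on the examples above, and the resulting dict would be handed to your tracker (e.g., `wandb.log`) with `global_step` defined as the step metric.

```python
def flatten_eval_results(results: dict, global_step: int) -> dict:
    """Flatten {task: {metric: value}} into flat tracker keys.

    Produces keys like "eval/mmmu_val/accuracy" and attaches global_step so
    results line up with the training step; the naming scheme here is an
    assumption based on the documented examples.
    """
    metrics = {"global_step": global_step}
    for task, task_metrics in results.items():
        for name, value in task_metrics.items():
            metrics[f"eval/{task}/{name}"] = value
    return metrics

logged = flatten_eval_results({"mmmu_val": {"accuracy": 0.51}}, global_step=500)
```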

### 4. Training Completion

At the end of training:

- The trainer waits for all pending evaluation jobs to complete
- All remaining evaluation results are logged
- Training exits only after all evaluations are finished

## Example Configuration

Here is a complete example with asynchronous evaluation enabled:

```yaml
trainer_type: fsdp2_trainer

dataset_config:
  dataset_type: vision
  dataset_format: yaml
  datasets:
    - path: data/your_dataset
      data_folder: ""
      data_type: arrow

processor_config:
  processor_name: "Qwen/Qwen3-VL-8B-Instruct"
  processor_type: "qwen3_vl"

packing: true
packing_strategy: first_fit
packing_length: 16384

model_config:
  load_from_pretrained_path: "Qwen/Qwen3-VL-8B-Instruct"
  attn_implementation: "flash_attention_2"

trainer_args:
  per_device_train_batch_size: 1
  learning_rate: 1.0e-06
  num_train_epochs: 1
  save_steps: 500
  eval_steps: 500  # Must equal save_steps for consistent evaluation
  eval_strategy: "steps"
  save_total_limit: 2

  # Evaluation configuration
  eval_config:
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0
    checkpoint_key: "model"
    model: "qwen_vl"
    tasks:
      - "mmmu_val"
      - "textvqa_val"
    model_args:
      num_gpus: 8
      batch_size: 256

  report_to: "wandb"
  output_dir: "./output/qwen3_vl"
  bf16: true
  gradient_checkpointing: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen3VLDecoderLayer"]
    reshard_after_forward: false
```

## EMA Checkpoint Evaluation

If EMA (Exponential Moving Average) is enabled, the system automatically evaluates both regular and EMA checkpoints:

```yaml
trainer_args:
  ema_enabled: true
  ema_decay: 0.9999
  ema_update_every: 1

  eval_config:
    server_url: "http://192.168.8.249:8000"
    # ... other config
```

The trainer will:

- Evaluate regular checkpoints with `checkpoint_type: "regular"`
- Evaluate EMA checkpoints with `checkpoint_type: "ema"`
- Log both sets of metrics separately
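
One simple way to keep the two metric sets separate is to namespace them by `checkpoint_type`; the `eval_ema/` prefix below is purely illustrative, so check your tracker dashboard to see how LMMs Engine actually names the two sets.

```python
def namespace_by_checkpoint_type(metrics: dict, checkpoint_type: str) -> dict:
    """Prefix metric names so regular and EMA results stay separate.

    The "eval"/"eval_ema" prefixes are an illustrative assumption, not
    necessarily the naming LMMs Engine uses in practice.
    """
    prefix = "eval" if checkpoint_type == "regular" else f"eval_{checkpoint_type}"
    return {f"{prefix}/{key}": value for key, value in metrics.items()}

regular = namespace_by_checkpoint_type({"mmmu_val/accuracy": 0.51}, "regular")
ema = namespace_by_checkpoint_type({"mmmu_val/accuracy": 0.52}, "ema")
```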

## Distributed Training

In distributed training (e.g., with `torchrun`), only rank 0:

- Submits evaluation jobs
- Polls for results
- Logs evaluation metrics

This avoids duplicate submissions and redundant logging.
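
The rank-0 gate can be sketched as below, assuming torchrun's `RANK` environment variable; the trainer itself may equally derive the rank from `torch.distributed.get_rank()`.

```python
import os

def is_rank_zero() -> bool:
    """True on the process that should submit, poll, and log evaluations.

    Reads the RANK environment variable set by torchrun; a missing RANK is
    treated as single-process training.
    """
    return int(os.environ.get("RANK", "0")) == 0

# Only rank 0 talks to the eval server; other ranks skip straight past.
if is_rank_zero():
    pass  # submit jobs / poll results / log metrics here
```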

## Monitoring Evaluation Progress

### Check W&B/TensorBoard

Evaluation metrics appear in your tracking dashboard:

- `eval/mmmu_val/accuracy`
- `eval/textvqa_val/accuracy`
- `eval/textvqa_val/anls`
- etc.

Each metric is associated with its training step via `global_step`.

### Check Evaluation Server Logs

The LMMS-Eval server logs:

- Received evaluation requests
- Evaluation progress
- Completed evaluations

### Check Training Logs

The training process logs:

- When evaluation jobs are submitted
- When results are received
- Any errors during polling or logging

## Troubleshooting

### Evaluations Not Starting

1. Verify the LMMS-Eval server is running at `server_url`
2. Check network connectivity from the training machine to the evaluation server
3. Verify the checkpoint path exists and contains valid weights

### Evaluation Results Not Appearing

1. Check `poll_interval`; increase it if the network is slow
2. Check the LMMS-Eval server logs for errors
3. Verify task names are correct and supported by LMMS-Eval

### Duplicate Evaluations

Ensure `eval_steps` matches `save_steps`, or adjust the evaluation frequency to match the checkpoint-saving frequency.
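
One way to sanity-check the schedule before launching is a small alignment check; `eval_schedule_is_aligned` is a hypothetical helper for illustration, not part of LMMs Engine.

```python
def eval_schedule_is_aligned(save_steps: int, eval_steps: int) -> bool:
    """Check that every evaluation step coincides with a saved checkpoint.

    Evaluation needs a checkpoint on disk, so eval_steps should be a
    multiple of save_steps; eval_steps == save_steps is the simplest case.
    """
    return eval_steps % save_steps == 0
```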

## Best Practices

1. **Network Bandwidth**: Use a dedicated evaluation machine if network bandwidth is limited
2. **Resource Allocation**: Allocate sufficient GPUs for evaluation via `model_args.num_gpus`
3. **Checkpoint Frequency**: Balance `save_steps` against evaluation cost
4. **Task Selection**: Choose representative tasks that do not take too long to run
5. **Poll Interval**: Adjust `poll_interval` to your network and evaluation speed
6. **Output Management**: Use `save_total_limit` to keep checkpoint disk usage in check

## Additional Resources

- [LMMS-Eval Repository](https://github.com/EvolvingLMMs-Lab/lmms-eval)
- [Merge FSDP Checkpoints](merge_fsdp.md)
- [Training Guide](../getting_started/train.md)

docs/user_guide/index.rst (1 addition, 0 deletions)

@@ -11,3 +11,4 @@
    peak_perf
    merge_fsdp
    fsdp2_reduce_dtype
+   async_eval
