# Asynchronous Checkpoint Evaluation During Training

LMMs Engine supports asynchronous evaluation of model checkpoints during training. Instead of pausing training to run evaluations, the trainer submits evaluation jobs to a separate LMMS-Eval server and keeps training while results arrive in the background.

## Overview

When enabled, the training system:
1. Submits evaluation jobs to an LMMS-Eval server when checkpoints are saved
2. Continues training while evaluations run in the background
3. Polls for evaluation results periodically
4. Logs evaluation metrics when they become available

## Prerequisites

### Start the LMMS-Eval Server

Start the LMMS-Eval server before launching training. The server handles evaluation requests and returns results as they complete.

```bash
# Start the LMMS-Eval server on your evaluation machine
python -m lmms_eval.entrypoints.server --port 8000
```

The server listens for evaluation requests and runs evaluations asynchronously.
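
Before launching training, it helps to confirm the server is reachable from the training machine. A minimal check in Python, assuming nothing about the server beyond its host and port:

```python
import socket

def server_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the evaluation server succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Host and port should match the server started above; adjust to your setup.
print(server_reachable("192.168.8.249", 8000))
```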

## Configuration

Enable asynchronous evaluation in your training configuration YAML:

```yaml
trainer_args:
  # Enable evaluation at specific intervals
  eval_strategy: "steps" # Options: "steps", "epoch", "no"
  eval_steps: 500 # Evaluate every N steps (when eval_strategy="steps")

  # Evaluation configuration
  eval_config:
    # Server configuration
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0 # Poll server every 10 seconds

    # Model configuration
    model: "qwen_vl" # Model name recognized by LMMS-Eval
    checkpoint_key: "model" # Key to use in model_args for checkpoint path

    # Tasks to evaluate
    tasks:
      - "mmmu_val"
      - "textvqa_val"
      - "docvqa_val"

    # Model arguments passed to LMMS-Eval
    model_args:
      num_gpus: 8
      batch_size: 256
      max_length: 2048
      # Additional model-specific arguments
```

### Configuration Parameters

#### `eval_strategy`

- `"steps"`: Evaluate every `eval_steps` training steps
- `"epoch"`: Evaluate at the end of each epoch
- `"no"`: Disable evaluation (default)

#### `eval_config` Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `server_url` | string | URL of the LMMS-Eval server (e.g., `"http://localhost:8000"`) |
| `poll_interval` | float | Interval in seconds between polls for evaluation results (default: `10.0`) |
| `model` | string | Model name recognized by LMMS-Eval (e.g., `"qwen_vl"`) |
| `tasks` | list | List of evaluation tasks (e.g., `["mmmu_val", "textvqa_val"]`) |
| `checkpoint_key` | string | Key used in `model_args` to pass the checkpoint path |
| `model_args` | dict | Additional arguments passed to the model (e.g., `num_gpus`, `batch_size`) |
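
The role of `checkpoint_key` is easiest to see in code. A minimal sketch of how a trainer might inject the checkpoint path into `model_args` before submitting a job (`build_model_args` is illustrative, not the engine's actual function):

```python
def build_model_args(eval_config: dict, checkpoint_path: str) -> dict:
    """Illustrative: place the checkpoint path under `checkpoint_key`."""
    model_args = dict(eval_config.get("model_args", {}))
    # With checkpoint_key="model", this sets model_args["model"] to the
    # checkpoint path, so LMMS-Eval loads the just-saved weights.
    model_args[eval_config["checkpoint_key"]] = checkpoint_path
    return model_args

cfg = {"checkpoint_key": "model", "model_args": {"num_gpus": 8, "batch_size": 256}}
print(build_model_args(cfg, "./output/checkpoint-500"))
# -> {'num_gpus': 8, 'batch_size': 256, 'model': './output/checkpoint-500'}
```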

## How It Works

### 1. Checkpoint Saving

When a checkpoint is saved (according to `save_steps`), the trainer:
- Determines the checkpoint path (e.g., `./output/checkpoint-500`)
- Creates an evaluation output directory (e.g., `./output/checkpoint-500/eval`)
- Submits an evaluation job to the LMMS-Eval server, as sketched below
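
A hedged sketch of the submission step. The `/evaluate` route, the payload shape, and the `job_id` response field are placeholders for illustration; the real API is whatever the LMMS-Eval server exposes:

```python
import os
import requests  # third-party: pip install requests

def submit_eval_job(server_url: str, checkpoint_dir: str, eval_config: dict) -> str:
    """Illustrative: create the eval output dir and submit one job."""
    eval_dir = os.path.join(checkpoint_dir, "eval")
    os.makedirs(eval_dir, exist_ok=True)  # e.g. ./output/checkpoint-500/eval
    model_args = {**eval_config["model_args"],
                  eval_config["checkpoint_key"]: checkpoint_dir}
    payload = {"model": eval_config["model"], "tasks": eval_config["tasks"],
               "model_args": model_args, "output_path": eval_dir}
    # "/evaluate" and "job_id" are assumptions, not the confirmed server API.
    resp = requests.post(f"{server_url}/evaluate", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["job_id"]
```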

### 2. Background Polling

A background thread:
- Polls the LMMS-Eval server every `poll_interval` seconds
- Checks whether evaluation jobs have completed
- Retrieves results when available (see the sketch below)
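
The polling side is plain Python threading. A self-contained sketch of a daemon poller; `fetch_finished` stands in for whatever call actually retrieves completed results from the server:

```python
import threading
import time

def start_poller(fetch_finished, handle_results, poll_interval: float = 10.0):
    """Illustrative: poll for finished evaluations without blocking training."""
    stop_event = threading.Event()

    def loop():
        while not stop_event.is_set():
            results = fetch_finished()   # completed results, or None
            if results:
                handle_results(results)  # e.g. log metrics to W&B
            time.sleep(poll_interval)

    thread = threading.Thread(target=loop, daemon=True, name="eval-poller")
    thread.start()
    return thread, stop_event  # set stop_event to shut the poller down
```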

### 3. Metric Logging

When evaluation results are available:
- Metrics are logged to your tracking system (e.g., W&B, TensorBoard), as in the example below
- Metrics include `global_step` to associate results with the training step
- Example logged metrics: `eval/mmmu_val/accuracy`, `eval/textvqa_val/accuracy`
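
Logging at the originating step keeps evaluation curves aligned with the training loss. With W&B, for example (the shape of `results` here is a placeholder, not the engine's actual result format):

```python
import wandb  # assumes wandb.init(...) was called when training started

def log_eval_metrics(results: dict, global_step: int) -> None:
    """Log metrics at the training step the evaluated checkpoint came from."""
    flat = {f"eval/{task}/{metric}": value
            for task, task_metrics in results.items()
            for metric, value in task_metrics.items()}
    wandb.log(flat, step=global_step)

log_eval_metrics({"mmmu_val": {"accuracy": 0.512}}, global_step=500)
```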

### 4. Training Completion

At the end of training:
- The trainer waits for all pending evaluation jobs to complete (a drain loop like the sketch below)
- All remaining evaluation results are logged
- Training exits only after all evaluations are finished
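
The final drain is simply a blocking version of the polling loop above. A sketch, with `pending` and `fetch_finished` as illustrative names:

```python
import time

def wait_for_pending(pending: set, fetch_finished, handle_results,
                     poll_interval: float = 10.0) -> None:
    """Illustrative: block until every outstanding eval job has reported."""
    while pending:
        for job_id, results in fetch_finished():  # (job_id, results) pairs
            handle_results(results)
            pending.discard(job_id)
        if pending:
            time.sleep(poll_interval)
```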

## Example Configuration

Here's a complete example with asynchronous evaluation enabled:

```yaml
trainer_type: fsdp2_trainer

dataset_config:
  dataset_type: vision
  dataset_format: yaml
  datasets:
    - path: data/your_dataset
      data_folder: ""
      data_type: arrow

  processor_config:
    processor_name: "Qwen/Qwen3-VL-8B-Instruct"
    processor_type: "qwen3_vl"

  packing: true
  packing_strategy: first_fit
  packing_length: 16384

model_config:
  load_from_pretrained_path: "Qwen/Qwen3-VL-8B-Instruct"
  attn_implementation: "flash_attention_2"

trainer_args:
  per_device_train_batch_size: 1
  learning_rate: 1.0e-06
  num_train_epochs: 1
  save_steps: 500
  eval_steps: 500 # Must equal save_steps for consistent evaluation
  eval_strategy: "steps"
  save_total_limit: 2

  # Evaluation configuration
  eval_config:
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0
    checkpoint_key: "model"
    model: "qwen_vl"
    tasks:
      - "mmmu_val"
      - "textvqa_val"
    model_args:
      num_gpus: 8
      batch_size: 256

  report_to: "wandb"
  output_dir: "./output/qwen3_vl"
  bf16: true
  gradient_checkpointing: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen3VLDecoderLayer"]
    reshard_after_forward: false
```

## EMA Checkpoint Evaluation

If you have EMA (Exponential Moving Average) enabled, the system will automatically evaluate both regular and EMA checkpoints:

```yaml
trainer_args:
  ema_enabled: true
  ema_decay: 0.9999
  ema_update_every: 1

  eval_config:
    server_url: "http://192.168.8.249:8000"
    # ... other config
```

The trainer will:
- Evaluate regular checkpoints with `checkpoint_type: "regular"`
- Evaluate EMA checkpoints with `checkpoint_type: "ema"`
- Log both sets of metrics separately

## Distributed Training

In distributed training (e.g., with `torchrun`), only rank 0:
- Submits evaluation jobs
- Polls for results
- Logs evaluation metrics

This avoids duplicate submissions and redundant logging.
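
The guard itself is the standard `torch.distributed` pattern (a generic sketch, not the engine's exact code):

```python
import torch.distributed as dist

def is_main_process() -> bool:
    """True on rank 0, and also when torch.distributed is not in use."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True

if is_main_process():
    pass  # submit evaluation jobs, poll, and log metrics here
```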

## Monitoring Evaluation Progress

### Check W&B/TensorBoard

Evaluation metrics appear in your tracking dashboard:
- `eval/mmmu_val/accuracy`
- `eval/textvqa_val/accuracy`
- `eval/textvqa_val/anls`
- etc.

Each metric is associated with the training step via `global_step`.

### Check Evaluation Server Logs

The LMMS-Eval server logs:
- Received evaluation requests
- Evaluation progress
- Completed evaluations

### Check Training Logs

The training process logs:
- When evaluation jobs are submitted
- When results are received
- Any errors during polling or logging

## Troubleshooting

### Evaluations Not Starting

1. Verify the LMMS-Eval server is running at `server_url` (the reachability check under Prerequisites is a quick test)
2. Check network connectivity from the training machine to the evaluation server
3. Verify the checkpoint path exists and contains valid weights

### Evaluation Results Not Appearing

1. Check `poll_interval`; increase it if the network is slow
2. Check the LMMS-Eval server logs for errors
3. Verify task names are correct and supported by LMMS-Eval

### Duplicate Evaluations

Evaluations are tied to checkpoint saves, so keep `eval_steps` equal to `save_steps`, or otherwise align evaluation frequency with checkpoint saving frequency.

## Best Practices

1. **Network Bandwidth**: Use a dedicated evaluation machine if network bandwidth is limited
2. **Resource Allocation**: Allocate sufficient GPUs for evaluation via `model_args.num_gpus`
3. **Checkpoint Frequency**: Balance `save_steps` against evaluation turnaround so jobs don't pile up on the server
4. **Task Selection**: Choose representative tasks that complete quickly
5. **Poll Interval**: Adjust `poll_interval` based on your network and evaluation speed
6. **Output Management**: Use `save_total_limit` to manage disk space for checkpoints

## Additional Resources

- [LMMS-Eval Repository](https://github.com/EvolvingLMMs-Lab/lmms-eval)
- [Merge FSDP Checkpoints](merge_fsdp.md)
- [Training Guide](../getting_started/train.md)