- Thresholds can be adjusted via flags:
- `--near-zero-threshold` (default: `1e-10`)
- `--identical-threshold` (default: `1e-8`)

If any near-zero or identical rows are reported, the model may suffer numerical instability (e.g., `inf` grad norms) during post-training whenever any of these problematic tokens are encountered. We have observed this when special tokens are reserved in the tokenizer and embedding table but none of them appear during pre-training. It may help to initialize these embeddings the same way they were initialized during pre-training.
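A rough sketch of what this check looks for, assuming a NumPy embedding matrix (the function name is illustrative and the thresholds mirror the flags above; this is not the tool's actual implementation, and the pairwise scan shown here is O(n²), so a real vocabulary-sized check would be vectorized):

```python
import numpy as np

def find_suspicious_rows(embedding, near_zero_threshold=1e-10, identical_threshold=1e-8):
    """Flag embedding rows that are near zero or duplicates of an earlier row."""
    # Near-zero rows: every element's magnitude is below the threshold.
    near_zero_rows = [i for i, row in enumerate(embedding)
                      if np.abs(row).max() < near_zero_threshold]
    # Identical rows: max element-wise difference is below the threshold.
    identical_pairs = []
    n = len(embedding)
    for i in range(n):
        for j in range(i + 1, n):
            if np.abs(embedding[i] - embedding[j]).max() < identical_threshold:
                identical_pairs.append((i, j))
    return near_zero_rows, identical_pairs

# Toy embedding table: row 2 is all zeros, rows 0 and 3 are identical.
emb = np.array([[0.1, -0.2, 0.3],
                [0.5, 0.5, -0.1],
                [0.0, 0.0, 0.0],
                [0.1, -0.2, 0.3]])
zero_rows, dup_pairs = find_suspicious_rows(emb)
print(zero_rows)   # [2]
print(dup_pairs)   # [(0, 3)]
```

Rows like these typically correspond to reserved special tokens whose embeddings never received gradient updates during pre-training.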
### 4. vLLM precision compilation test

Tests vLLM precision compilation by comparing log probabilities across different compilation modes and configurations. This script helps diagnose numerical precision issues that commonly arise when using different vLLM compilation settings. **Note that this is not a strict pass/fail test** - it's designed to help you understand and investigate numerical discrepancies.

```sh
# Example run
uv run --extra vllm tools/model_diagnostics/4.vllm_precision_compilation_test.py --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

# Typical output shows mixed results:
# Eager and cuda graph mode lps: FAILED - Arrays are different
...
# Eager and cuda graph mode lps with torch inductor precision flag: FAILED - Arrays are different
...
# Eager and cuda graph mode lps with use_inductor disabled: PASSED - Arrays are close within tolerance (atol=0.001, rtol=0.001)
```
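The `PASSED`/`FAILED` lines reflect an element-wise tolerance comparison of the log probabilities. A minimal sketch of that kind of check (the function name and sample values are illustrative, not the script's actual code):

```python
import numpy as np

def compare_logprobs(ref, test, atol=1e-3, rtol=1e-3):
    # Element-wise tolerance check, as reported in the output above.
    if np.allclose(ref, test, atol=atol, rtol=rtol):
        return f"PASSED - Arrays are close within tolerance (atol={atol}, rtol={rtol})"
    return "FAILED - Arrays are different"

# Illustrative log probabilities: small drift passes, larger drift fails.
eager_lps = np.array([-1.2345, -0.5678, -2.3456])
graph_lps = eager_lps + 5e-4       # tiny numerical drift from compilation
inductor_lps = eager_lps + 5e-2    # larger drift

print(compare_logprobs(eager_lps, graph_lps))     # PASSED - Arrays are close within tolerance ...
print(compare_logprobs(eager_lps, inductor_lps))  # FAILED - Arrays are different
```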
See the example for model `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`.

- **✅ Usually works well**: This configuration often produces results very close to eager mode.
- **Note**: `use_inductor=False` disables Inductor compilation but keeps CUDA graph capture active for compatible operations.

**Performance vs Accuracy Trade-offs:**

The different compilation modes offer distinct trade-offs between accuracy and performance:

- **Eager Mode** (`enforce_eager=True`): Highest accuracy (ground truth) but slowest execution.
- **CUDA Graph Mode with Inductor Disabled** (`enforce_eager=False` and `compilation_config={"use_inductor": False}`): Near-eager accuracy with a significant speedup from CUDA graph optimization.
- **CUDA Graph Mode with Inductor Enabled** (`enforce_eager=False` and `compilation_config={"use_inductor": True}`): Potentially the fastest execution, using custom Triton kernels (Triton is the current Inductor backend), but may introduce numerical differences. To improve accuracy, try the Torch Inductor precision flag: `export TORCHINDUCTOR_EMULATE_PRECISION_CASTS=1`.

**Note**: Performance characteristics vary by model. For example, `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` shows similar speed with `use_inductor=True` and `use_inductor=False`, making the accuracy-preserving option preferable.
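As a hypothetical summary, the three configurations above can be captured as keyword-argument dicts (the keys mirror the vLLM `LLM(...)` arguments named in the bullets; the dict and its names are a sketch for illustration, not the script's code):

```python
# Keys mirror the vLLM LLM(...) arguments discussed above (illustrative only).
CONFIGS = {
    "eager": {"enforce_eager": True},
    "cuda_graph_no_inductor": {
        "enforce_eager": False,
        "compilation_config": {"use_inductor": False},
    },
    "cuda_graph_inductor": {
        "enforce_eager": False,
        "compilation_config": {"use_inductor": True},
        # For accuracy, also consider: export TORCHINDUCTOR_EMULATE_PRECISION_CASTS=1
    },
}
print(sorted(CONFIGS))  # ['cuda_graph_inductor', 'cuda_graph_no_inductor', 'eager']
```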

**Why this matters:**

- **Debugging**: Helps identify which compilation settings cause numerical differences.
- **Configuration**: Shows which settings work best for your model.
- **Understanding**: Reveals how compilation affects model outputs.

**When to use:**

- **Model integration** - understand numerical behavior across vLLM configurations.
- **Debugging** - investigate differences between development and production.
- **Research** - study the impact of compilation strategies on precision.

**Interpreting results:**

- **Eager vs CUDA Graph failures are normal** - don't panic if this comparison fails.
- **Focus on patterns** - some models are more sensitive than others.
- **Use as guidance** - helps you choose reliable compilation settings.
- **Balance precision vs performance** - choose what works for your use case.