Commit 72978cf
Update on "[rl] Add CI for numerics test against vllm native inference"
Test cases:
1. Integration tests:
- single GPU, no compile + cudagraph
- multiple GPU (with TP), no compile + cudagraph
- multiple GPU, with compile + cudagraph
- These tests run on A10G (the default CI GPU type).
3. Numerics parity test: vLLM native model vs vLLM + TorchTitan wrapper.
- test_weights_match: max_diff <= 1e-5 (exact weight loading)
- test_attention_module: atol=1e-5 (TP=1)
- test_end_to_end_logits: atol=1e-3 (TP=1)
- The numerics tests only need to run with TP=1, because we assume that both TorchTitan and vLLM keep their multi-GPU implementations on par with single GPU. More numerics tests under parallelism can be added if needed.
- This test runs on H100 and uses the FA3 kernel for attention.
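
The tolerance checks listed above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the test code from this commit: the helper names `max_abs_diff` and `within_atol` and the sample values are hypothetical, and the real tests presumably compare tensors from the two model implementations rather than Python lists. Only the three tolerance values are taken from the commit message.

```python
from typing import Sequence

# Tolerances quoted in the commit message.
WEIGHTS_MAX_DIFF = 1e-5   # test_weights_match
ATTENTION_ATOL = 1e-5     # test_attention_module (TP=1)
LOGITS_ATOL = 1e-3        # test_end_to_end_logits (TP=1)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the largest element-wise absolute difference (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def within_atol(a: Sequence[float], b: Sequence[float], atol: float) -> bool:
    """True if every element pair differs by at most atol (hypothetical helper)."""
    return max_abs_diff(a, b) <= atol


# Made-up stand-ins for outputs of the native model and the wrapped model.
native_logits = [0.1200, -1.5000, 3.2500]
wrapped_logits = [0.1204, -1.5003, 3.2499]

# Differences of a few 1e-4 pass the looser end-to-end bound...
logits_ok = within_atol(native_logits, wrapped_logits, LOGITS_ATOL)
# ...but would fail the tighter per-module bound.
attention_ok = within_atol(native_logits, wrapped_logits, ATTENTION_ATOL)
```

The looser end-to-end bound reflects that small per-layer differences can accumulate through the full forward pass, while weights and a single attention module are expected to match much more tightly.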
[ghstack-poisoned]

2 files changed (+12, -2 lines) under torchtitan/experiments/rl:
- actors
- tests