v0.2.0
Feature
See #25. We speed up the P2P implementation so that its update time is close to that of broadcast. We also bind each GPU to its corresponding NUMA node to keep host-to-device (H2D) transfer speeds stable, which shortens the update duration. The updated test results are as follows.
| Model | Device Info | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|
| GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.12s | 3.47s (3.02GiB) | 4.12s (3.02GiB) |
| Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.33s | 6.22s (2.67GiB) | 7.10s (2.68GiB) |
| DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.17s | 10.19s (5.39GiB) | 11.80s (5.41GiB) |
| Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.33s | 14.36s (5.89GiB) | 17.49s (5.91GiB) |
| DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 0.80s | 11.33s (8.00GiB) | 11.81s (8.00GiB) |
| Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.22s | 16.04s (8.00GiB) | 16.75s (8.00GiB) |
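For readers unfamiliar with NUMA binding, here is a minimal sketch of the general idea on Linux. It is not the implementation used in this release. It assumes the GPU's PCI bus ID is already known in sysfs form (for example `0000:3b:00.0`), and all function names are hypothetical. It reads the NUMA node that the kernel reports for the device, then restricts the current process to that node's CPUs. This pins CPU affinity only. It does not set an explicit memory policy, so it relies on the kernel's default local allocation to keep host buffers near the GPU.

```python
import os
from pathlib import Path


def parse_cpulist(cpulist: str) -> set[int]:
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in cpulist.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_node_of_pci_device(pci_bus_id: str, sysfs_root: str = "/sys") -> int:
    """Return the NUMA node sysfs reports for a PCI device (-1 if unknown)."""
    # sysfs uses lowercase hex in PCI addresses.
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_bus_id.lower() / "numa_node"
    return int(path.read_text().strip())


def cpus_of_numa_node(node: int, sysfs_root: str = "/sys") -> set[int]:
    """Return the CPU ids that belong to the given NUMA node."""
    path = Path(sysfs_root) / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return parse_cpulist(path.read_text())


def bind_current_process_to_gpu_numa_node(pci_bus_id: str, sysfs_root: str = "/sys") -> int:
    """Pin the calling process to the CPUs local to the GPU's NUMA node.

    Returns the node id, or -1 if the kernel reports no NUMA information,
    in which case the affinity is left unchanged.
    """
    node = numa_node_of_pci_device(pci_bus_id, sysfs_root)
    if node < 0:
        return node
    os.sched_setaffinity(0, cpus_of_numa_node(node, sysfs_root))  # Linux only
    return node
```

In a one-process-per-GPU setup, each worker would call `bind_current_process_to_gpu_numa_node` once at startup, before allocating the pinned host buffers used for H2D copies.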
What's Changed
- fix: use correct type for _current_global_parameter_metas by @weixiao-huang in #33
- A more reasonable way to obtain RDMA devices by @specture724 in #36
- optimize _update_per_bucket_p2p logic by @specture724 in #28
Full Changelog: v0.1.3...v0.2.0