v0.2.0

@weixiao-huang weixiao-huang released this 30 Oct 02:27
a291782

Feature

See #25. We sped up the P2P implementation so that it now runs at a speed comparable to broadcast. We also bind each GPU to its corresponding NUMA node to keep H2D transfer speeds stable, which shortens the update duration. The updated test results are listed below.
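As an illustration of what binding a process to a GPU's NUMA node can look like on Linux, here is a minimal sketch. It is not the implementation from #25: the function names `parse_cpulist`, `gpu_numa_node`, and `bind_process_to_node` are hypothetical, and the sketch assumes the standard sysfs layout plus one worker process per GPU.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as '0-3,8,10-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def gpu_numa_node(pci_address: str) -> int:
    """Return the NUMA node of a PCI device given its sysfs-style address,
    e.g. '0000:3b:00.0' (lowercase, 4-digit domain). Assumes Linux sysfs."""
    node = int(Path(f"/sys/bus/pci/devices/{pci_address}/numa_node").read_text())
    # The kernel reports -1 when no NUMA affinity is known; fall back to node 0.
    return max(node, 0)


def bind_process_to_node(node: int) -> set[int]:
    """Pin the calling process to the CPUs of the given NUMA node (Linux only)."""
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text()
    cpus = parse_cpulist(cpulist)
    os.sched_setaffinity(0, cpus)  # 0 means the calling process
    return cpus
```

A worker would call `bind_process_to_node(gpu_numa_node(addr))` once at startup, before allocating host buffers. Note that this sketch only pins CPU affinity; it does not set a memory allocation policy, which a real implementation may also want to control.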

| Model | Device Info | GatherMetas | Update (Broadcast) | Update (P2P) |
| --- | --- | --- | --- | --- |
| GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.12s | 3.47s (3.02GiB) | 4.12s (3.02GiB) |
| Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.33s | 6.22s (2.67GiB) | 7.10s (2.68GiB) |
| DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.17s | 10.19s (5.39GiB) | 11.80s (5.41GiB) |
| Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.33s | 14.36s (5.89GiB) | 17.49s (5.91GiB) |
| DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 0.80s | 11.33s (8.00GiB) | 11.81s (8.00GiB) |
| Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.22s | 16.04s (8.00GiB) | 16.75s (8.00GiB) |

What's Changed

Full Changelog: v0.1.3...v0.2.0