Conversation
Commit: "…ph and parameter shapes into the cache directory naming"
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai>
Branch: …ecewise-compilation
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you! |
This PR enables piecewise `torch.compile` support for PLaMo2. The compilation is only enabled for attention layers, because enabling `torch.compile` for mamba layers requires a redesign of the mamba cache in upstream vLLM.

TODO: add benchmark
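The attention-only selection described above can be sketched as a conceptual toy in plain Python (this is not vLLM's actual mechanism; `fake_compile` and `Block` are hypothetical names standing in for `torch.compile` and the model's layer objects):

```python
# Conceptual sketch: piecewise compilation wraps only the attention blocks,
# leaving the mamba blocks eager because their cache layout blocks tracing.

def fake_compile(fn):
    # Stand-in for torch.compile; here it just tags the function.
    def wrapped(*args, **kwargs):
        return fn(*args, **kwargs)
    wrapped.compiled = True
    return wrapped

class Block:
    def __init__(self, kind):
        self.kind = kind                # "attention" or "mamba"
        self.forward = lambda x: x + 1  # placeholder layer computation

blocks = [Block("mamba"), Block("attention"), Block("mamba"), Block("attention")]
for b in blocks:
    if b.kind == "attention":
        b.forward = fake_compile(b.forward)

compiled_flags = [getattr(b.forward, "compiled", False) for b in blocks]
```

Only the attention blocks end up wrapped; the mamba blocks run unchanged.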
Illustration
The compilation result is exemplified using the QK-norm in the attention layer (see `vllm/model_executor/models/plamo2.py`, lines 422 to 427 at commit 7bc2e28).
This is the timeline of the RMSNorm(Q) and RMSNorm(K) ops:
This is the timeline of the same ops with compilation enabled:
The entries for the QK-norm are between `cutlass::Kernel2` (/`triton_tem_fused_mm`) and the one prior to `flash::flash_fwd_splitkv_kernel`. Without compilation, an individual kernel is launched for every torch op inside the QK-norm. The launch and memory-transaction overhead for these kernels is significantly higher than the actual computation. With compilation, these kernels are fused into a single Triton kernel, which drastically reduces the runtime of the QK-norm.
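As background, RMSNorm itself decomposes into a handful of small elementwise and reduction ops, which is why eager mode pays so many launch overheads. A minimal NumPy sketch (illustrative only, not vLLM's CUDA/Triton implementation) makes the per-op kernel count visible:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm as used in QK-norm. In eager mode, each line below maps to
    one or more separate GPU kernel launches; a compiler fuses them all."""
    variance = np.mean(x * x, axis=-1, keepdims=True)  # square + mean reduction
    inv_rms = 1.0 / np.sqrt(variance + eps)            # add, sqrt, reciprocal
    return x * inv_rms * weight                        # two elementwise multiplies

x = np.array([[3.0, 4.0]])
w = np.ones(2)
out = rms_norm(x, w, eps=0.0)  # scales x by 1/sqrt(mean(x**2)) = 1/sqrt(12.5)
```

Each intermediate (`variance`, `inv_rms`) is a full tensor round-trip to memory in eager mode; the fused Triton kernel keeps them in registers.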
Example
Example inference script with compilation configuration: https://github.com/pfn-attic/vllm-plamo2-plugin/blob/piecewise-compile/example.py
Output:
By setting `compilation_config=0` in the above script, the compilation can be disabled. I have confirmed that the model outputs are identical with compilation enabled and disabled.

Note: This PR depends on / includes the fix in vllm-project#14913.
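For context, the integer passed as `compilation_config` selects vLLM's compilation level. The sketch below mirrors the `CompilationLevel` enum as it existed around the time of this PR (treat the exact member names as an assumption; check `vllm/config.py` for the current definition):

```python
from enum import IntEnum

class CompilationLevel(IntEnum):
    # Mirrors vLLM's CompilationLevel (assumed layout, vllm/config.py circa
    # this PR): 0 disables compilation entirely, 3 is piecewise compilation.
    NO_COMPILATION = 0
    DYNAMO_AS_IS = 1
    DYNAMO_ONCE = 2
    PIECEWISE = 3
```

So `compilation_config=0` in the example script corresponds to `NO_COMPILATION`, while this PR's feature is exercised at level 3.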