Conversation
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Code Review
This pull request introduces FP8 LoRA dense kernels, including Triton implementations for shrink and expand operations with FP8 quantization support. The changes are extensive and add new files for FP8 kernel utilities and operations. My review focuses on correctness and maintainability. I've identified a critical bug in `lora_shrink_fp8_op.py` related to incorrect handling of return values and redundant stride calculations, which could lead to runtime errors. I've also pointed out areas with code duplication in the Triton kernels and confusing function naming, which could impact future maintenance. Addressing these points will improve the robustness and clarity of the new FP8 LoRA implementation.
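The "incorrect handling of return values" issue flagged above belongs to a common bug class. A purely illustrative sketch (the helper name and shapes below are hypothetical, not taken from the PR) of how a mismatched tuple unpack surfaces as a runtime error rather than a silent one:

```python
def _compute_meta(x):
    # Hypothetical helper: returns the 2-D shape of a row-major nested
    # list plus its row/column strides (in elements).
    M, K = len(x), len(x[0])
    return M, K, K, 1

# Bug pattern: the caller unpacks fewer values than the helper returns,
# which raises ValueError("too many values to unpack") at runtime.
try:
    M, K = _compute_meta([[0.0] * 4] * 3)
    caught = False
except ValueError:
    caught = True

# Fixed caller: unpack all four values; don't recompute strides locally.
M, K, stride_m, stride_k = _compute_meta([[0.0] * 4] * 3)
```

Keeping a single source of truth for shape/stride metadata avoids the redundant recomputation the review calls out.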
    _LORA_SCALE_PTR_DICT: dict[tuple[int, ...], tuple] = {}

    def _get_lora_scale_ptr(lora_scale_weights: list[torch.Tensor], device: torch.device):
The function `_get_lora_scale_ptr` is also defined in `vllm/lora/ops/triton_ops/lora_expand_fp8_op.py` with a different implementation. Having two different functions with the same name can be confusing and lead to maintenance issues.
Consider renaming these functions to be more specific about what they do (e.g., `_get_lora_a_scale_ptr` and `_get_lora_b_scale_ptr`) and potentially moving them to a shared utility file if appropriate.
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.