add fp8 quantization kernels #12
base: main
Conversation
Pull Request Overview
This PR adds FP8 quantization kernels to the vLLM XPU implementation, providing three different quantization methods for FP8 (float8) data types. The implementation includes comprehensive test coverage and follows the existing kernel architecture pattern.
- Adds static, dynamic per-tensor, and dynamic per-token FP8 quantization kernels
- Provides comprehensive test suite with reference implementations for validation
- Extends the dispatch utilities to support FP8_e5m2 data type in addition to existing FP8_e4m3fn
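The three schemes differ only in where the scale comes from: supplied by the caller, derived from the whole tensor, or derived per token (row). A minimal pure-Python sketch of that distinction (hypothetical helper names and list-based data; the real kernels operate on XPU tensors and cast to `torch.float8_e4m3fn`, whose finite maximum is 448.0):

```python
FP8_E4M3_MAX = 448.0  # finite maximum of float8_e4m3fn

def static_scaled_fp8_quant(x, scale):
    # Static: the scale is precomputed offline; each element is divided by
    # it and saturated to the fp8-representable range before the cast.
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in x]

def dynamic_scaled_fp8_quant(x):
    # Dynamic per-tensor: one scale for the whole tensor, derived from
    # its absolute maximum so the largest element maps to +/-448.
    scale = max(abs(v) for v in x) / FP8_E4M3_MAX
    return static_scaled_fp8_quant(x, scale), scale

def dynamic_per_token_scaled_fp8_quant(rows):
    # Dynamic per-token: an independent scale per row, which tolerates
    # per-token outliers better than a single tensor-wide scale.
    out, scales = [], []
    for row in rows:
        q, s = dynamic_scaled_fp8_quant(row)
        out.append(q)
        scales.append(s)
    return out, scales
```

This is a sketch of the scaling math only; the actual kernels also handle the narrowing cast, strides, and device dispatch.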
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
File | Description
---|---
tests/test_fp8_quant.py | Main test file with reference implementations and parametrized test cases
tests/register_ops.py | Python bindings for the three new FP8 quantization operations
tests/ops/fp8_quant_op.py | High-level Python wrapper function for FP8 quantization operations
csrc/xpu/torch_bindings.cpp | PyTorch operator registration for the three FP8 quantization kernels
csrc/xpu/quantization/fp8/utils.h | Common utilities and type definitions for FP8 quantization
csrc/xpu/quantization/fp8/fp8_quant.h | Core FP8 quantization kernel implementations and helper functions
csrc/xpu/quantization/fp8/fp8_quant.cpp | C++ implementation of the three FP8 quantization functions
csrc/xpu/ops.h | Function declarations for the new FP8 quantization operations
csrc/xpu/dispatch_utils.h | Extended dispatch macros to include FP8_e5m2 data type
CMakeLists.txt | Build configuration update to include the new FP8 quantization source file
Comments suppressed due to low confidence (1)

csrc/xpu/quantization/fp8/fp8_quant.h:73
- Commented-out debug printf statements should be removed to improve code cleanliness and maintainability.
  `static_cast<float>(input[i]), scale);`
```python
s_1 = as_float32_tensor(1.0)
s_512 = as_float32_tensor(512.0)

# For fp8, in order to match the cuda kernel output, we have to do exactly
```
The comment mentions 'cuda kernel output' but this is for XPU implementation. It should say 'xpu kernel output' to be consistent with the context.
Suggested change:
```diff
- # For fp8, in order to match the cuda kernel output, we have to do exactly
+ # For fp8, in order to match the xpu kernel output, we have to do exactly
```
```cpp
// quant_type_max_v<fp8_type>: %f, r: %f\n", token_id, tid, val, scale, x,
// static_cast<float>(quant_type_max_v<fp8_type>),
// r);
// sycl::ext::oneapi::experimental::printf("scaled_fp8_conversion: %f\n", r);
```
Commented-out debug printf statements should be removed to improve code cleanliness and maintainability.
Suggested change:
```diff
- // sycl::ext::oneapi::experimental::printf("scaled_fp8_conversion: %f\n", r);
```
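For context, the `scaled_fp8_conversion` helper these debug statements lived in scales the input value and saturates it to the fp8 range before the narrowing cast, so the cast cannot overflow. A hypothetical host-side sketch of that saturation step (variable names taken from the printf above; whether the scale is applied as a multiply by a reciprocal or a divide is an assumption here):

```python
FP8_E4M3_MAX = 448.0  # stands in for quant_type_max_v<fp8_type>

def scaled_fp8_conversion(val: float, scale: float) -> float:
    # Scale the input (assumed: multiply by a reciprocal scale), then
    # clamp to the representable fp8 range before the narrowing cast.
    x = val * scale
    r = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    return r  # the device kernel would cast r to fp8 at this point
```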
tests/ops/fp8_quant_op.py (outdated)
```python
# Add parent directory to Python path
# sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
```
Commented-out sys.path manipulation code should be removed if it's not needed, as it can be confusing and reduces code maintainability.
csrc/xpu/quantization/fp8/utils.h (outdated)
Hi @baodii, can we merge this file with quant_utils.h into one?
Signed-off-by: baodii <[email protected]>
…difference is too huge
Signed-off-by: baodii <[email protected]>
Force-pushed from 15d689b to 9ff5526
Signed-off-by: Zhu, Zufang <[email protected]>
Force-pushed from 2455acb to 059a372
Add a) static_scaled_fp8_quant, b) dynamic_scaled_fp8_quant, and c) dynamic_per_token_scaled_fp8_quant kernels.