Skip to content

add fp8 quantization kernels #12

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

baodii
Copy link
Collaborator

@baodii baodii commented Aug 13, 2025

add a) static_scaled_fp8_quant; b) dynamic_scaled_fp8_quant; c) dynamic_per_token_scaled_fp8_quant kernels.

@Copilot Copilot AI review requested due to automatic review settings August 13, 2025 04:54
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds FP8 quantization kernels to the vLLM XPU implementation, providing three different quantization methods for FP8 (float8) data types. The implementation includes comprehensive test coverage and follows the existing kernel architecture pattern.

  • Adds static, dynamic per-tensor, and dynamic per-token FP8 quantization kernels
  • Provides comprehensive test suite with reference implementations for validation
  • Extends the dispatch utilities to support FP8_e5m2 data type in addition to existing FP8_e4m3fn

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_fp8_quant.py Main test file with reference implementations and parametrized test cases
tests/register_ops.py Python bindings for the three new FP8 quantization operations
tests/ops/fp8_quant_op.py High-level Python wrapper function for FP8 quantization operations
csrc/xpu/torch_bindings.cpp PyTorch operator registration for the three FP8 quantization kernels
csrc/xpu/quantization/fp8/utils.h Common utilities and type definitions for FP8 quantization
csrc/xpu/quantization/fp8/fp8_quant.h Core FP8 quantization kernel implementations and helper functions
csrc/xpu/quantization/fp8/fp8_quant.cpp C++ implementation of the three FP8 quantization functions
csrc/xpu/ops.h Function declarations for the new FP8 quantization operations
csrc/xpu/dispatch_utils.h Extended dispatch macros to include FP8_e5m2 data type
CMakeLists.txt Build configuration update to include the new FP8 quantization source file
Comments suppressed due to low confidence (1)

csrc/xpu/quantization/fp8/fp8_quant.h:73

  • Commented-out debug printf statements should be removed to improve code cleanliness and maintainability.
        static_cast<float>(input[i]), scale);

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

s_1 = as_float32_tensor(1.0)
s_512 = as_float32_tensor(512.0)

# For fp8, in order to match the cuda kernel output, we have to do exactly
Copy link
Preview

Copilot AI Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions 'cuda kernel output' but this is for XPU implementation. It should say 'xpu kernel output' to be consistent with the context.

Suggested change
# For fp8, in order to match the cuda kernel output, we have to do exactly
# For fp8, in order to match the xpu kernel output, we have to do exactly

Copilot uses AI. Check for mistakes.

// quant_type_max_v<fp8_type>: %f, r: %f\n", token_id, tid, val, scale, x,
// static_cast<float>(quant_type_max_v<fp8_type>),
// r);
// sycl::ext::oneapi::experimental::printf("scaled_fp8_conversion: %f\n", r);
Copy link
Preview

Copilot AI Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out debug printf statements should be removed to improve code cleanliness and maintainability.

Suggested change
// sycl::ext::oneapi::experimental::printf("scaled_fp8_conversion: %f\n", r);

Copilot uses AI. Check for mistakes.


# Add parent directory to Python path
# sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

Copy link
Preview

Copilot AI Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out sys.path manipulation code should be removed if it's not needed, as it can be confusing and reduces code maintainability.

Suggested change

Copilot uses AI. Check for mistakes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @baodii, can we merge this file with quant_utils.h into one?

Signed-off-by: Zhu, Zufang <[email protected]>
Signed-off-by: Zhu, Zufang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants