Conversation
Summary of Changes

Hello @ouqingliang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Code Review
This pull request adds support for GLM 4.7, including fp8 per-channel quantization and bf16. The changes are comprehensive: they touch C++ kernels and Python bindings and add new benchmarks and tests. The C++ implementation shows good use of modern features and performance considerations. However, I've identified a few issues that should be addressed. There's a duplicated class definition in kt-kernel/python/utils/loader.py which needs to be resolved. Additionally, a C++ kernel function in kt-kernel/operators/amx/fp8-perchannel-moe.hpp appears to have a non-vectorized implementation that could be a performance bottleneck. I've also left comments on the new test and benchmark scripts regarding minor improvements for efficiency and code style.
```cpp
static inline void unpack_4nk_blocks(const uint8_t* src[4], uint8_t* dst, size_t dst_row_stride) {
    static constexpr int row_map[8] = {0, 16, 4, 20, 8, 24, 12, 28};
    constexpr int K_STEP = T::K_STEP;  // 32

    // Reinterpret as uint64 arrays for efficient access
    const uint64_t* src0 = reinterpret_cast<const uint64_t*>(src[0]);
    const uint64_t* src1 = reinterpret_cast<const uint64_t*>(src[1]);
    const uint64_t* src2 = reinterpret_cast<const uint64_t*>(src[2]);
    const uint64_t* src3 = reinterpret_cast<const uint64_t*>(src[3]);

    // Process all 32 rows, writing 128 bytes (4 x 32) per row
    for (int packed_i = 0; packed_i < 8; packed_i++) {
        const int base_row = row_map[packed_i];

        // Process 4 rows at a time
        for (int r = 0; r < 4; r++) {
            uint16_t* row_dst = reinterpret_cast<uint16_t*>(dst + (size_t)(base_row + r) * dst_row_stride);
            const int shift = r * 16;

            // Unroll: process all 4 blocks x 16 columns = 64 uint16 values
            // Block 0: columns 0-15
            for (int j = 0; j < 16; j++) {
                row_dst[j] = static_cast<uint16_t>(src0[8 * j + packed_i] >> shift);
            }
            // Block 1: columns 16-31
            for (int j = 0; j < 16; j++) {
                row_dst[16 + j] = static_cast<uint16_t>(src1[8 * j + packed_i] >> shift);
            }
            // Block 2: columns 32-47
            for (int j = 0; j < 16; j++) {
                row_dst[32 + j] = static_cast<uint16_t>(src2[8 * j + packed_i] >> shift);
            }
            // Block 3: columns 48-63
            for (int j = 0; j < 16; j++) {
                row_dst[48 + j] = static_cast<uint16_t>(src3[8 * j + packed_i] >> shift);
            }
        }
    }
}
```
The implementation of unpack_4nk_blocks uses scalar loops, which is likely to be a performance bottleneck. This function should be vectorized using AVX512 intrinsics, similar to how unpack_nk_block is implemented, to improve performance. The current implementation reads and writes data element by element, which underutilizes the CPU's vector processing capabilities.
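To make the access pattern concrete: per (block, row) pair, the scalar loops perform a strided gather of sixteen uint64 words, a right shift, and a narrowing cast to uint16, which is exactly the shape that AVX512 gathers plus a 64-to-16-bit truncating convert can express. A small NumPy model (hypothetical, with random data, using the row_map from the snippet above) shows the scalar loops and the sliced/vectorized form producing identical results:

```python
import numpy as np

# Model of unpack_4nk_blocks: 4 source blocks of 1024 bytes (128 uint64 each),
# unpacked into a 32-row x 64-column uint16 destination.
row_map = [0, 16, 4, 20, 8, 24, 12, 28]
rng = np.random.default_rng(0)
src = [rng.integers(0, 256, size=1024, dtype=np.uint8) for _ in range(4)]

dst_scalar = np.zeros((32, 64), dtype=np.uint16)
dst_vec = np.zeros((32, 64), dtype=np.uint16)

# Scalar reference: direct transcription of the C++ element-by-element loops.
for packed_i in range(8):
    base_row = row_map[packed_i]
    for r in range(4):
        shift = np.uint64(r * 16)
        for b in range(4):
            s64 = src[b].view(np.uint64)
            for j in range(16):
                val = (s64[8 * j + packed_i] >> shift) & np.uint64(0xFFFF)
                dst_scalar[base_row + r, b * 16 + j] = val

# Vectorized form: one strided slice + shift + narrowing cast per (block, row).
for packed_i in range(8):
    base_row = row_map[packed_i]
    for r in range(4):
        shift = np.uint64(r * 16)
        for b in range(4):
            col = src[b].view(np.uint64).reshape(16, 8)[:, packed_i]
            dst_vec[base_row + r, b * 16:(b + 1) * 16] = (col >> shift).astype(np.uint16)

assert np.array_equal(dst_scalar, dst_vec)
```

The vectorized form is a direct template for an intrinsics rewrite: each 16-element slice corresponds to two 8-lane uint64 vectors, one shift, and one truncating down-convert, mirroring how unpack_nk_block is structured.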
```python
class BF16SafeTensorLoader(SafeTensorLoader):
    """Loader for native BF16 expert weights (no quantization, no scales).

    Supported formats:
    - DeepSeek style: {base}.mlp.experts.{id}.{gate,up,down}_proj.weight
    - Mixtral/MiniMax style: {base}.block_sparse_moe.experts.{id}.{w1,w3,w2}.weight

    The format is auto-detected during initialization.
    """

    MOE_FORMATS = {
        "deepseek": ("{base}.mlp.experts", "gate_proj", "up_proj", "down_proj"),
        "mixtral": ("{base}.block_sparse_moe.experts", "w1", "w3", "w2"),
    }

    def __init__(self, file_path: str):
        super().__init__(file_path)
        self._detected_format = None
        self._detect_format()

    def _detect_format(self):
        """Auto-detect the MoE naming format by checking tensor keys."""
        sample_keys = list(self.tensor_file_map.keys())[:1000]

        for fmt_name, (path_tpl, gate, up, down) in self.MOE_FORMATS.items():
            for key in sample_keys:
                if ".experts." in key and f".{gate}.weight" in key:
                    if "block_sparse_moe.experts" in key and fmt_name == "mixtral":
                        self._detected_format = fmt_name
                        print(f"[BF16SafeTensorLoader] Detected format: {fmt_name}")
                        return
                    elif "mlp.experts" in key and "block_sparse_moe" not in key and fmt_name == "deepseek":
                        self._detected_format = fmt_name
                        print(f"[BF16SafeTensorLoader] Detected format: {fmt_name}")
                        return

        self._detected_format = "deepseek"
        print("[BF16SafeTensorLoader] No MoE format detected, defaulting to: deepseek")

    def _get_experts_prefix(self, base_key: str) -> str:
        """Get the experts prefix based on detected format."""
        path_tpl, _, _, _ = self.MOE_FORMATS[self._detected_format]
        return path_tpl.format(base=base_key)

    def _get_proj_names(self):
        """Get projection names (gate, up, down) based on detected format."""
        _, gate, up, down = self.MOE_FORMATS[self._detected_format]
        return gate, up, down

    def load_tensor(self, key: str, device: str = "cpu"):
        if key not in self.tensor_file_map:
            raise KeyError(f"Key {key} not found in Safetensor files")
        file = self.tensor_file_map[key]
        f = self.file_handle_map.get(file)
        if f is None:
            raise FileNotFoundError(f"File {file} not found in Safetensor files")
        tensor = f.get_tensor(key)
        if device == "cpu":
            return tensor
        return tensor.to(device)

    def load_experts(self, base_key: str, device: str = "cpu"):
        """Load BF16 expert weights (no scales needed)."""
        experts_prefix = self._get_experts_prefix(base_key)
        gate_name, up_name, down_name = self._get_proj_names()

        expert_count = 0
        while self.has_tensor(f"{experts_prefix}.{expert_count}.{gate_name}.weight"):
            expert_count += 1

        if expert_count == 0:
            raise ValueError(f"No experts found for key {experts_prefix}")

        gate_weights = [None] * expert_count
        up_weights = [None] * expert_count
        down_weights = [None] * expert_count

        for exp_id in range(expert_count):
            gate_w_key = f"{experts_prefix}.{exp_id}.{gate_name}.weight"
            up_w_key = f"{experts_prefix}.{exp_id}.{up_name}.weight"
            down_w_key = f"{experts_prefix}.{exp_id}.{down_name}.weight"

            gate_weights[exp_id] = self.load_tensor(gate_w_key, device).contiguous()
            up_weights[exp_id] = self.load_tensor(up_w_key, device).contiguous()
            down_weights[exp_id] = self.load_tensor(down_w_key, device).contiguous()

        return {
            "gate": gate_weights,
            "up": up_weights,
            "down": down_weights,
        }
```
This adds a new definition for BF16SafeTensorLoader, but another definition for the same class already exists starting at line 512. This duplication will lead to runtime errors or unexpected behavior. Please remove one of the definitions, presumably the older one if this new one is the intended implementation.
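For context, Python does not raise an error on duplicate class definitions: a later `class` statement silently rebinds the module-level name, so whichever definition appears last in the file is the only one ever used. A minimal illustration (with a hypothetical name, not the actual loader):

```python
# Two class statements with the same name in one module: the second
# binding silently replaces the first, with no warning or error.
class BF16Loader:
    def variant(self):
        return "first definition"

class BF16Loader:  # re-definition rebinds the name
    def variant(self):
        return "second definition"

loader = BF16Loader()
assert loader.variant() == "second definition"
```

This is why the duplication is dangerous even if the code imports cleanly: the earlier definition is dead code, and which one "wins" depends only on file order.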
```python
fp8_weights = torch.randint(0, 256, (e, n, k), dtype=torch.uint8, device="cuda").to("cpu").contiguous()

# Generate random per-channel scales (one per output row)
# Use reasonable scale range (e.g., 2^-8 to 2^8)
exponents = torch.randint(-8, 9, (e, n), dtype=torch.int32, device="cuda").to("cpu").contiguous()
scales = (2.0 ** exponents.float()).to(torch.float32).contiguous()
```
Creating tensors on a CUDA device and then immediately moving them to the CPU is inefficient. It's better to create these tensors directly on the CPU to avoid unnecessary device-to-host data transfers, especially since they are used for CPU-side operations.
Suggested change:

```diff
-fp8_weights = torch.randint(0, 256, (e, n, k), dtype=torch.uint8, device="cuda").to("cpu").contiguous()
+fp8_weights = torch.randint(0, 256, (e, n, k), dtype=torch.uint8, device="cpu").contiguous()
 # Generate random per-channel scales (one per output row)
 # Use reasonable scale range (e.g., 2^-8 to 2^8)
-exponents = torch.randint(-8, 9, (e, n), dtype=torch.int32, device="cuda").to("cpu").contiguous()
+exponents = torch.randint(-8, 9, (e, n), dtype=torch.int32, device="cpu").contiguous()
 scales = (2.0 ** exponents.float()).to(torch.float32).contiguous()
```
```python
gate_q = (
    torch.randint(0, 256, (expert_num * per_mat_weight_bytes,), dtype=torch.uint8, device="cuda")
    .to("cpu")
    .contiguous()
)
```
Creating a tensor on the CUDA device and then immediately moving it to the CPU is inefficient. It's better to create the tensor directly on the CPU to avoid an unnecessary device-to-host data transfer. This applies to all tensor creations in this function.
Suggested change:

```python
gate_q = (
    torch.randint(0, 256, (expert_num * per_mat_weight_bytes,), dtype=torch.uint8, device="cpu")
    .contiguous()
)
```

```diff
 with torch.inference_mode(mode=True):
     for i in range(validation_iter):
-        torch.manual_seed(100 + i)
+        torch.manual_seed(114514 + i)
```
```python
import os
import sys

sys.path.insert(0, os.path.dirname(__file__) + "/../build")
```
```python
val = min(val, FP8_E4M3_MAX)

# Find exponent
import math

if val < 2**-9:  # Subnormal threshold
```
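The snippet above clamps to FP8_E4M3_MAX and branches on a subnormal threshold. For readers following the quantization path, here is a self-contained sketch of a scalar E4M3 round-to-nearest quantizer (a hypothetical reference, not the PR's exact helper; the name quantize_e4m3 and its structure are mine). It saturates at the E4M3 maximum of 448, uses a fixed step of 2**-9 in the subnormal range, and 8 steps per binade (3 mantissa bits) in the normal range:

```python
import math

FP8_E4M3_MAX = 448.0

def quantize_e4m3(val: float) -> float:
    """Round toward the nearest representable FP8 E4M3 value.

    Sketch only: saturating at +/-448, flushing magnitudes below the
    smallest subnormal (2**-9) to zero; NaN/inf handling is omitted.
    """
    sign = -1.0 if val < 0 else 1.0
    mag = min(abs(val), FP8_E4M3_MAX)
    if mag < 2**-9:            # below the smallest subnormal -> zero
        return 0.0
    if mag < 2**-6:            # subnormal range: fixed step of 2**-9
        step = 2**-9
    else:                      # normal range: 3 mantissa bits per binade
        step = 2 ** (math.floor(math.log2(mag)) - 3)
    q = round(mag / step) * step
    return sign * min(q, FP8_E4M3_MAX)
```

Under this sketch, values above 448 saturate to 448, magnitudes below 2**-9 flush to zero, and in-range values snap to the nearest multiple of the local step (e.g. 0.3 snaps to 0.3125, a representable E4M3 value).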
What does this PR do?
Support GLM 4.7, including fp8 per-channel quantization and bf16.
Before submitting