
add accuracy and performance test #1643

Merged
ovowei merged 9 commits into main from add-ci-djw
Nov 27, 2025

Conversation

@ovowei
Collaborator

@ovowei ovowei commented Nov 26, 2025

add accuracy and performance test

@ovowei ovowei changed the title from "add accuracy and performace test" to "add accuracy and performance test" Nov 26, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @ovowei, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the project's testing capabilities by introducing a new, robust CI testing framework. It integrates Pytest and adds a comprehensive suite of accuracy and performance tests for various AMX quantized Mixture-of-Experts operations. Concurrently, the model quantization script has been improved with better memory management options and clearer documentation, ensuring both the reliability and efficiency of the kernel's core functionalities.

Highlights

  • New CI Testing Framework: A comprehensive CI testing framework has been introduced, adapted from SGLang, enabling hardware-aware test registration (CPU, CUDA, AMD) and execution with timeout control. This framework facilitates organized and robust testing of the kernel's functionalities.
  • Pytest Integration: Pytest is now integrated into the project, with a dedicated pytest.ini configuration file. This setup defines test paths, naming conventions, and specific markers for different hardware backends (CPU, CUDA, AMD), as well as for slow-running or model-dependent tests.
  • Accuracy Tests for Quantized MoE Operations: New accuracy tests have been added for various AMX quantized Mixture-of-Experts (MoE) operations, including INT4, INT4_1, INT4_1K (group quantization), and INT8. These tests compare the kernel's output against PyTorch reference implementations to ensure correctness.
  • Performance Benchmarks for Quantized MoE Operations: Dedicated performance benchmark tests have been implemented for AMX INT4, INT4_1, INT4_1K, and INT8 MoE operations. These benchmarks measure key metrics such as bandwidth and FLOPS, providing insights into the efficiency of the optimized operations.
  • Enhanced Memory Management for Quantization Script: The convert_gpu_weights.py script has been updated with new command-line arguments (--max_gpu_memory and --max_cpu_memory) to allow users to explicitly control memory allocation during GPTQ quantization. This enhancement helps prevent Out-Of-Memory (OOM) errors and explicitly disables unsupported disk offloading.
  • Updated Documentation for Quantization: The scripts/README.md file now includes a critical 'Precision Warning' regarding potential accuracy degradation when quantizing directly from FP8 to INT4/INT8. Additionally, a new 'Memory Management' section provides detailed guidance on using the new memory-related arguments in the quantization script.
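As background for the benchmark bullet above, the bandwidth and FLOPS figures for an MoE layer can be estimated from first principles. This is a hedged sketch — the function, parameter names, and example dimensions below are illustrative, not the PR's actual benchmark code:

```python
def moe_metrics(tokens, topk, hidden_size, intermediate_size, weight_bytes, elapsed_s):
    """Rough throughput estimates for one MoE forward pass.

    Each routed token runs three GEMMs (gate, up, down), each costing
    roughly 2 * hidden_size * intermediate_size FLOPs.
    """
    flops = 2 * tokens * topk * 3 * hidden_size * intermediate_size
    gflops = flops / elapsed_s / 1e9
    bandwidth_gbs = weight_bytes / elapsed_s / 1e9  # expert weights read once
    return gflops, bandwidth_gbs


# Illustrative numbers: 1 token, top-8 routing, INT8 weights for the
# 8 touched experts, 1 ms per forward pass (all values hypothetical).
gflops, bw = moe_metrics(tokens=1, topk=8, hidden_size=7168,
                         intermediate_size=2048,
                         weight_bytes=3 * 8 * 7168 * 2048,
                         elapsed_s=1e-3)
```

In the decode phase of MoE inference the weight-read term usually dominates, which is why benchmarks like these report bandwidth alongside FLOPS.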
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/kt-kernel-tests.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@ovowei ovowei added the run-ci label Nov 26, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive testing framework for accuracy and performance, which is a great addition. The framework includes test registration, a test runner with timeout capabilities, and hardware-specific test execution. The changes to convert_gpu_weights.py to improve memory management and error handling are also very valuable. However, I've found several critical issues in the new test files. The CPU accuracy and benchmark tests have an unnecessary and problematic dependency on CUDA, which will cause them to fail in CPU-only environments. Additionally, the test runner utility has a couple of bugs related to exception handling and race conditions that could make test failures difficult to debug or cause the runner itself to crash. I've left detailed comments with suggestions on how to fix these issues. Once they are addressed, this will be a solid contribution to the project's testing infrastructure.

Comment on lines +106 to +132
        gate_proj = (
            torch.randn(
                (expert_num, intermediate_size, hidden_size),
                dtype=torch.bfloat16,
                device="cuda",
            )
            .to("cpu")
            .contiguous()
        )
        up_proj = (
            torch.randn(
                (expert_num, intermediate_size, hidden_size),
                dtype=torch.bfloat16,
                device="cuda",
            )
            .to("cpu")
            .contiguous()
        )
        down_proj = (
            torch.randn(
                (expert_num, hidden_size, intermediate_size),
                dtype=torch.bfloat16,
                device="cuda",
            )
            .to("cpu")
            .contiguous()
        )

critical

This CPU test has a dependency on CUDA. Tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create the tensors directly on the CPU by using device="cpu".

        gate_proj = (
            torch.randn(
                (expert_num, intermediate_size, hidden_size),
                dtype=torch.bfloat16,
                device="cpu",
            )
            .contiguous()
        )
        up_proj = (
            torch.randn(
                (expert_num, intermediate_size, hidden_size),
                dtype=torch.bfloat16,
                device="cpu",
            )
            .contiguous()
        )
        down_proj = (
            torch.randn(
                (expert_num, hidden_size, intermediate_size),
                dtype=torch.bfloat16,
                device="cpu",
            )
            .contiguous()
        )

Comment on lines +197 to +202
        input_tensor = (
            torch.randn((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cuda").to("cpu").contiguous()
        )
        output_tensor = (
            torch.empty((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cuda").to("cpu").contiguous()
        )

critical

This CPU benchmark test has a dependency on CUDA. The input_tensor and output_tensor are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create these tensors directly on the CPU.

        input_tensor = (
            torch.randn((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cpu").contiguous()
        )
        output_tensor = (
            torch.empty((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cpu").contiguous()
        )


Comment on lines +158 to +172
            gate_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cuda")
                .to("cpu")
                .contiguous()
            )
            up_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cuda")
                .to("cpu")
                .contiguous()
            )
            down_proj = (
                torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cuda")
                .to("cpu")
                .contiguous()
            )

critical

This CPU benchmark test has a dependency on CUDA. The gate_proj, up_proj, and down_proj tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create these tensors directly on the CPU.

            gate_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )
            up_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )
            down_proj = (
                torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )

Comment on lines +165 to +179
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cuda")
                .to("cpu")
                .contiguous()
            )
            down_proj = (
                torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cuda")
                .to("cpu")
                .contiguous()
            )
            config = kt_kernel_ext.moe.MOEConfig(expert_num, num_experts_per_tok, hidden_size, intermediate_size, 0)
            config.max_len = max_len
            config.gate_proj = gate_proj.data_ptr()
            config.up_proj = up_proj.data_ptr()
            config.down_proj = down_proj.data_ptr()
            config.pool = CPUInfer.backend_

critical

This CPU benchmark test has a dependency on CUDA. The gate_proj, up_proj, and down_proj tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create these tensors directly on the CPU.

            gate_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )
            up_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )
            down_proj = (
                torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )

Comment on lines +54 to +75
def run_with_timeout(
    func: Callable,
    args: tuple = (),
    kwargs: Optional[dict] = None,
    timeout: float = None,
):
    """Run a function with timeout."""
    ret_value = []

    def _target_func():
        ret_value.append(func(*args, **(kwargs or {})))

    t = threading.Thread(target=_target_func)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        raise TimeoutError()

    if not ret_value:
        raise RuntimeError()

    return ret_value[0]

high

The current implementation of run_with_timeout swallows exceptions raised by the wrapped function func. If func raises an exception, it is not propagated, and a generic RuntimeError is raised instead, losing the original traceback. This makes debugging test failures very difficult. The function should be modified to capture and re-raise exceptions from the target thread.

def run_with_timeout(
    func: Callable,
    args: tuple = (),
    kwargs: Optional[dict] = None,
    timeout: float = None,
):
    """Run a function with timeout."""
    ret_value = [None]
    exception = [None]

    def _target_func():
        try:
            ret_value[0] = func(*args, **(kwargs or {}))
        except Exception as e:
            exception[0] = e

    t = threading.Thread(target=_target_func)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        raise TimeoutError()

    if exception[0]:
        raise exception[0]

    return ret_value[0]
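The suggested fix can be checked standalone. The following runnable sketch mirrors the suggestion above (it is an illustration, not the PR's actual file) and shows that a worker-thread exception now reaches the caller:

```python
import threading
from typing import Callable, Optional


def run_with_timeout(
    func: Callable,
    args: tuple = (),
    kwargs: Optional[dict] = None,
    timeout: Optional[float] = None,
):
    """Run func in a worker thread; re-raise its exception in the caller."""
    ret_value = [None]
    exception = [None]

    def _target_func():
        try:
            ret_value[0] = func(*args, **(kwargs or {}))
        except Exception as e:  # captured here instead of being swallowed
            exception[0] = e

    t = threading.Thread(target=_target_func)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        raise TimeoutError()
    if exception[0]:
        raise exception[0]
    return ret_value[0]
```

With this version, `run_with_timeout(lambda: 1 / 0)` raises `ZeroDivisionError` with its original traceback in the caller, rather than the opaque `RuntimeError` the original code produced.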

                else:
                    passed_tests.append(filename)
            except TimeoutError:
                kill_process_tree(process.pid)

high

There is a race condition here. If the timeout occurs before subprocess.Popen is called in the run_one_file thread, the process variable will still be None. This will cause kill_process_tree(process.pid) to raise an AttributeError, crashing the test runner. You should add a check to ensure process is not None before attempting to kill it.

            if process:
                kill_process_tree(process.pid)
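A minimal runnable illustration of why the guard matters. The names `process`, `on_timeout`, and the stubbed `kill_process_tree` are hypothetical stand-ins for the runner's code, which launches the test via subprocess.Popen on a worker thread:

```python
process = None  # set by the worker thread only after subprocess.Popen returns


def kill_process_tree(pid):
    """Stand-in for the runner's helper; must not be reached in this demo."""
    raise AssertionError("no process was ever started")


def on_timeout():
    # The guard from the suggestion: if the timeout fired before Popen
    # assigned `process`, skip the kill instead of crashing on None.pid.
    if process:
        kill_process_tree(process.pid)


on_timeout()  # safe even though no subprocess was ever started
```

Without the `if process:` check, the same call would raise `AttributeError: 'NoneType' object has no attribute 'pid'` and crash the runner instead of reporting a timeout.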

Comment on lines +69 to +71
        assert (
            est_time is not None
        ), "esimation_time is required and should be a constant"

medium

There's a typo in the assertion message. esimation_time should be estimation_time.

        assert est_time is not None, "estimation_time is required and should be a constant"

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@ovowei ovowei merged commit fef6dd9 into main Nov 27, 2025
9 checks passed
KMSorSMS pushed a commit that referenced this pull request Dec 11, 2025