
add accuracy and performance test #1643

Merged
ovowei merged 9 commits into main from add-ci-djw
Nov 27, 2025

Conversation

@ovowei
Collaborator

@ovowei ovowei commented Nov 26, 2025

add accuracy and performance test

@ovowei ovowei changed the title from "add accuracy and performace test" to "add accuracy and performance test" Nov 26, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @ovowei, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the project's testing capabilities by introducing a new, robust CI testing framework. It integrates Pytest and adds a comprehensive suite of accuracy and performance tests for various AMX quantized Mixture-of-Experts operations. Concurrently, the model quantization script has been improved with better memory management options and clearer documentation, ensuring both the reliability and efficiency of the kernel's core functionalities.

Highlights

  • New CI Testing Framework: A comprehensive CI testing framework has been introduced, adapted from SGLang, enabling hardware-aware test registration (CPU, CUDA, AMD) and execution with timeout control. This framework facilitates organized and robust testing of the kernel's functionalities.
  • Pytest Integration: Pytest is now integrated into the project, with a dedicated pytest.ini configuration file. This setup defines test paths, naming conventions, and specific markers for different hardware backends (CPU, CUDA, AMD), as well as for slow-running or model-dependent tests.
  • Accuracy Tests for Quantized MoE Operations: New accuracy tests have been added for various AMX quantized Mixture-of-Experts (MoE) operations, including INT4, INT4_1, INT4_1K (group quantization), and INT8. These tests compare the kernel's output against PyTorch reference implementations to ensure correctness.
  • Performance Benchmarks for Quantized MoE Operations: Dedicated performance benchmark tests have been implemented for AMX INT4, INT4_1, INT4_1K, and INT8 MoE operations. These benchmarks measure key metrics such as bandwidth and FLOPS, providing insights into the efficiency of the optimized operations.
  • Enhanced Memory Management for Quantization Script: The convert_gpu_weights.py script has been updated with new command-line arguments (--max_gpu_memory and --max_cpu_memory) to allow users to explicitly control memory allocation during GPTQ quantization. This enhancement helps prevent Out-Of-Memory (OOM) errors and explicitly disables unsupported disk offloading.
  • Updated Documentation for Quantization: The scripts/README.md file now includes a critical 'Precision Warning' regarding potential accuracy degradation when quantizing directly from FP8 to INT4/INT8. Additionally, a new 'Memory Management' section provides detailed guidance on using the new memory-related arguments in the quantization script.
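As background for the benchmark bullet above, the bandwidth and FLOPS figures for an MoE layer can be estimated from first principles. This is a hedged sketch — the function, parameter names, and example dimensions below are illustrative, not the PR's actual benchmark code:

```python
def moe_metrics(tokens, topk, hidden_size, intermediate_size, weight_bytes, elapsed_s):
    """Rough throughput estimates for one MoE forward pass.

    Each routed token runs three GEMMs (gate, up, down), each costing
    roughly 2 * hidden_size * intermediate_size FLOPs.
    """
    flops = 2 * tokens * topk * 3 * hidden_size * intermediate_size
    gflops = flops / elapsed_s / 1e9
    bandwidth_gbs = weight_bytes / elapsed_s / 1e9  # expert weights read once
    return gflops, bandwidth_gbs


# Illustrative numbers: 1 token, top-8 routing, INT8 weights for the
# 8 touched experts, 1 ms per forward pass (all values hypothetical).
gflops, bw = moe_metrics(tokens=1, topk=8, hidden_size=7168,
                         intermediate_size=2048,
                         weight_bytes=3 * 8 * 7168 * 2048,
                         elapsed_s=1e-3)
```

In the decode phase of MoE inference the weight-read term usually dominates, which is why benchmarks like these report bandwidth alongside FLOPS.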
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/kt-kernel-tests.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@ovowei ovowei added the run-ci label Nov 26, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive testing framework for accuracy and performance, which is a great addition. The framework includes test registration, a test runner with timeout capabilities, and hardware-specific test execution. The changes to convert_gpu_weights.py to improve memory management and error handling are also very valuable. However, I've found several critical issues in the new test files. The CPU accuracy and benchmark tests have an unnecessary and problematic dependency on CUDA, which will cause them to fail in CPU-only environments. Additionally, the test runner utility has a couple of bugs related to exception handling and race conditions that could make test failures difficult to debug or cause the runner itself to crash. I've left detailed comments with suggestions on how to fix these issues. Once they are addressed, this will be a solid contribution to the project's testing infrastructure.

Comment on lines +106 to +132
        gate_proj = (
            torch.randn(
                (expert_num, intermediate_size, hidden_size),
                dtype=torch.bfloat16,
                device="cuda",
            )
            .to("cpu")
            .contiguous()
        )
        up_proj = (
            torch.randn(
                (expert_num, intermediate_size, hidden_size),
                dtype=torch.bfloat16,
                device="cuda",
            )
            .to("cpu")
            .contiguous()
        )
        down_proj = (
            torch.randn(
                (expert_num, hidden_size, intermediate_size),
                dtype=torch.bfloat16,
                device="cuda",
            )
            .to("cpu")
            .contiguous()
        )

critical

This CPU test has a dependency on CUDA. Tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create the tensors directly on the CPU by using device="cpu".

        gate_proj = (
            torch.randn(
                (expert_num, intermediate_size, hidden_size),
                dtype=torch.bfloat16,
                device="cpu",
            )
            .contiguous()
        )
        up_proj = (
            torch.randn(
                (expert_num, intermediate_size, hidden_size),
                dtype=torch.bfloat16,
                device="cpu",
            )
            .contiguous()
        )
        down_proj = (
            torch.randn(
                (expert_num, hidden_size, intermediate_size),
                dtype=torch.bfloat16,
                device="cpu",
            )
            .contiguous()
        )

Comment on lines +197 to +202
        input_tensor = (
            torch.randn((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cuda").to("cpu").contiguous()
        )
        output_tensor = (
            torch.empty((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cuda").to("cpu").contiguous()
        )

critical

This CPU benchmark test has a dependency on CUDA. The input_tensor and output_tensor are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create these tensors directly on the CPU.

        input_tensor = (
            torch.randn((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cpu").contiguous()
        )
        output_tensor = (
            torch.empty((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cpu").contiguous()
        )


Comment on lines +158 to +172
            gate_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cuda")
                .to("cpu")
                .contiguous()
            )
            up_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cuda")
                .to("cpu")
                .contiguous()
            )
            down_proj = (
                torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cuda")
                .to("cpu")
                .contiguous()
            )

critical

This CPU benchmark test has a dependency on CUDA. The gate_proj, up_proj, and down_proj tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create these tensors directly on the CPU.

            gate_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )
            up_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )
            down_proj = (
                torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )

Comment on lines +165 to +179
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cuda")
                .to("cpu")
                .contiguous()
            )
            down_proj = (
                torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cuda")
                .to("cpu")
                .contiguous()
            )
            config = kt_kernel_ext.moe.MOEConfig(expert_num, num_experts_per_tok, hidden_size, intermediate_size, 0)
            config.max_len = max_len
            config.gate_proj = gate_proj.data_ptr()
            config.up_proj = up_proj.data_ptr()
            config.down_proj = down_proj.data_ptr()
            config.pool = CPUInfer.backend_

critical

This CPU benchmark test has a dependency on CUDA. The gate_proj, up_proj, and down_proj tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create these tensors directly on the CPU.

            gate_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )
            up_proj = (
                torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )
            down_proj = (
                torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cpu")
                .contiguous()
            )

Comment on lines +54 to +75
def run_with_timeout(
    func: Callable,
    args: tuple = (),
    kwargs: Optional[dict] = None,
    timeout: float = None,
):
    """Run a function with timeout."""
    ret_value = []

    def _target_func():
        ret_value.append(func(*args, **(kwargs or {})))

    t = threading.Thread(target=_target_func)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        raise TimeoutError()

    if not ret_value:
        raise RuntimeError()

    return ret_value[0]

high

The current implementation of run_with_timeout swallows exceptions raised by the wrapped function func. If func raises an exception, it is not propagated, and a generic RuntimeError is raised instead, losing the original traceback. This makes debugging test failures very difficult. The function should be modified to capture and re-raise exceptions from the target thread.

def run_with_timeout(
    func: Callable,
    args: tuple = (),
    kwargs: Optional[dict] = None,
    timeout: float = None,
):
    """Run a function with timeout."""
    ret_value = [None]
    exception = [None]

    def _target_func():
        try:
            ret_value[0] = func(*args, **(kwargs or {}))
        except Exception as e:
            exception[0] = e

    t = threading.Thread(target=_target_func)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        raise TimeoutError()

    if exception[0]:
        raise exception[0]

    return ret_value[0]
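The suggested fix can be checked standalone. The following runnable sketch mirrors the suggestion above (it is an illustration, not the PR's actual file) and shows that a worker-thread exception now reaches the caller:

```python
import threading
from typing import Callable, Optional


def run_with_timeout(
    func: Callable,
    args: tuple = (),
    kwargs: Optional[dict] = None,
    timeout: Optional[float] = None,
):
    """Run func in a worker thread; re-raise its exception in the caller."""
    ret_value = [None]
    exception = [None]

    def _target_func():
        try:
            ret_value[0] = func(*args, **(kwargs or {}))
        except Exception as e:  # captured here instead of being swallowed
            exception[0] = e

    t = threading.Thread(target=_target_func)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        raise TimeoutError()
    if exception[0]:
        raise exception[0]
    return ret_value[0]
```

With this version, `run_with_timeout(lambda: 1 / 0)` raises `ZeroDivisionError` with its original traceback in the caller, rather than the opaque `RuntimeError` the original code produced.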

                else:
                    passed_tests.append(filename)
            except TimeoutError:
                kill_process_tree(process.pid)

high

There is a race condition here. If the timeout occurs before subprocess.Popen is called in the run_one_file thread, the process variable will still be None. This will cause kill_process_tree(process.pid) to raise an AttributeError, crashing the test runner. You should add a check to ensure process is not None before attempting to kill it.

            if process:
                kill_process_tree(process.pid)
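A minimal runnable illustration of why the guard matters. The names `process`, `on_timeout`, and the stubbed `kill_process_tree` are hypothetical stand-ins for the runner's code, which launches the test via subprocess.Popen on a worker thread:

```python
process = None  # set by the worker thread only after subprocess.Popen returns


def kill_process_tree(pid):
    """Stand-in for the runner's helper; must not be reached in this demo."""
    raise AssertionError("no process was ever started")


def on_timeout():
    # The guard from the suggestion: if the timeout fired before Popen
    # assigned `process`, skip the kill instead of crashing on None.pid.
    if process:
        kill_process_tree(process.pid)


on_timeout()  # safe even though no subprocess was ever started
```

Without the `if process:` check, the same call would raise `AttributeError: 'NoneType' object has no attribute 'pid'` and crash the runner instead of reporting a timeout.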

Comment on lines +69 to +71
        assert (
            est_time is not None
        ), "esimation_time is required and should be a constant"

medium

There's a typo in the assertion message. esimation_time should be estimation_time.

        assert est_time is not None, "estimation_time is required and should be a constant"

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@ovowei ovowei merged commit fef6dd9 into main Nov 27, 2025
9 checks passed
KMSorSMS pushed a commit that referenced this pull request Dec 11, 2025