Conversation
Summary of Changes

Hello @ovowei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly upgrades the project's testing capabilities by introducing a new, robust CI testing framework. It integrates Pytest and adds a comprehensive suite of accuracy and performance tests for various AMX quantized Mixture-of-Experts operations. Concurrently, the model quantization script has been improved with better memory management options and clearer documentation, ensuring both the reliability and efficiency of the kernel's core functionality.
Code Review
This pull request introduces a comprehensive testing framework for accuracy and performance, which is a great addition. The framework includes test registration, a test runner with timeout capabilities, and hardware-specific test execution. The changes to convert_gpu_weights.py to improve memory management and error handling are also very valuable.

However, I've found several critical issues in the new test files. The CPU accuracy and benchmark tests have an unnecessary and problematic dependency on CUDA, which will cause them to fail in CPU-only environments. Additionally, the test runner utility has a couple of bugs related to exception handling and race conditions that could make test failures difficult to debug or cause the runner itself to crash.

I've left detailed comments with suggestions on how to fix these issues. Once they are addressed, this will be a solid contribution to the project's testing infrastructure.
gate_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cuda",
    )
    .to("cpu")
    .contiguous()
)
up_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cuda",
    )
    .to("cpu")
    .contiguous()
)
down_proj = (
    torch.randn(
        (expert_num, hidden_size, intermediate_size),
        dtype=torch.bfloat16,
        device="cuda",
    )
    .to("cpu")
    .contiguous()
)
This CPU test has a dependency on CUDA. Tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create the tensors directly on the CPU by using device="cpu".
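As an aside, since every weight tensor here is built the same way, a small helper would keep the fix in one place. This is only a sketch (`make_weight` is not part of the PR); note that a freshly allocated `torch.randn` tensor is already contiguous, so the `.contiguous()` call is a cheap no-op once the `.to("cpu")` copy is gone:

```python
import torch

def make_weight(shape, dtype=torch.bfloat16, device="cpu"):
    # Hypothetical helper, not in the PR: allocate directly on the target
    # device so CPU-only hosts never touch CUDA. A fresh randn tensor is
    # already contiguous, so .contiguous() returns it unchanged.
    return torch.randn(shape, dtype=dtype, device=device).contiguous()

# Illustrative sizes only; the real tests take these from their configs.
expert_num, intermediate_size, hidden_size = 4, 64, 32
gate_proj = make_weight((expert_num, intermediate_size, hidden_size))
up_proj = make_weight((expert_num, intermediate_size, hidden_size))
down_proj = make_weight((expert_num, hidden_size, intermediate_size))
```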
gate_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cpu",
    )
    .contiguous()
)
up_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cpu",
    )
    .contiguous()
)
down_proj = (
    torch.randn(
        (expert_num, hidden_size, intermediate_size),
        dtype=torch.bfloat16,
        device="cpu",
    )
    .contiguous()
)

input_tensor = (
    torch.randn((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cuda").to("cpu").contiguous()
)
output_tensor = (
    torch.empty((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cuda").to("cpu").contiguous()
)
This CPU benchmark test has a dependency on CUDA. The input_tensor and output_tensor are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create these tensors directly on the CPU.
input_tensor = (
    torch.randn((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cpu").contiguous()
)
output_tensor = (
    torch.empty((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cpu").contiguous()
)

input_tensor = (
    torch.randn((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cuda").to("cpu").contiguous()
)
output_tensor = (
    torch.empty((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cuda").to("cpu").contiguous()
)
This CPU benchmark test has a dependency on CUDA. The input_tensor and output_tensor are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create these tensors directly on the CPU.
input_tensor = (
    torch.randn((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cpu").contiguous()
)
output_tensor = (
    torch.empty((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device="cpu").contiguous()
)

gate_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cuda",
    )
    .to("cpu")
    .contiguous()
)
up_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cuda",
    )
    .to("cpu")
    .contiguous()
)
down_proj = (
    torch.randn(
        (expert_num, hidden_size, intermediate_size),
        dtype=torch.bfloat16,
        device="cuda",
    )
    .to("cpu")
    .contiguous()
)
This CPU test has a dependency on CUDA. Tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create the tensors directly on the CPU by using device="cpu".
gate_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cpu",
    )
    .contiguous()
)
up_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cpu",
    )
    .contiguous()
)
down_proj = (
    torch.randn(
        (expert_num, hidden_size, intermediate_size),
        dtype=torch.bfloat16,
        device="cpu",
    )
    .contiguous()
)

gate_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cuda",
    )
    .to("cpu")
    .contiguous()
)
up_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cuda",
    )
    .to("cpu")
    .contiguous()
)
down_proj = (
    torch.randn(
        (expert_num, hidden_size, intermediate_size),
        dtype=torch.bfloat16,
        device="cuda",
    )
    .to("cpu")
    .contiguous()
)
This CPU test has a dependency on CUDA. Tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create the tensors directly on the CPU by using device="cpu".
gate_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cpu",
    )
    .contiguous()
)
up_proj = (
    torch.randn(
        (expert_num, intermediate_size, hidden_size),
        dtype=torch.bfloat16,
        device="cpu",
    )
    .contiguous()
)
down_proj = (
    torch.randn(
        (expert_num, hidden_size, intermediate_size),
        dtype=torch.bfloat16,
        device="cpu",
    )
    .contiguous()
)

gate_proj = (
    torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cuda")
    .to("cpu")
    .contiguous()
)
up_proj = (
    torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cuda")
    .to("cpu")
    .contiguous()
)
down_proj = (
    torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cuda")
    .to("cpu")
    .contiguous()
)
This CPU benchmark test has a dependency on CUDA. The gate_proj, up_proj, and down_proj tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create these tensors directly on the CPU.
gate_proj = (
    torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
    .contiguous()
)
up_proj = (
    torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
    .contiguous()
)
down_proj = (
    torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cpu")
    .contiguous()
)

    torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cuda")
    .to("cpu")
    .contiguous()
)
down_proj = (
    torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cuda")
    .to("cpu")
    .contiguous()
)
config = kt_kernel_ext.moe.MOEConfig(expert_num, num_experts_per_tok, hidden_size, intermediate_size, 0)
config.max_len = max_len
config.gate_proj = gate_proj.data_ptr()
config.up_proj = up_proj.data_ptr()
config.down_proj = down_proj.data_ptr()
config.pool = CPUInfer.backend_
This CPU benchmark test has a dependency on CUDA. The gate_proj, up_proj, and down_proj tensors are created on device="cuda" and then moved to the CPU. This will cause the test to fail in a CPU-only environment. Please create these tensors directly on the CPU.
gate_proj = (
    torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
    .contiguous()
)
up_proj = (
    torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device="cpu")
    .contiguous()
)
down_proj = (
    torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device="cpu")
    .contiguous()
)

def run_with_timeout(
    func: Callable,
    args: tuple = (),
    kwargs: Optional[dict] = None,
    timeout: float = None,
):
    """Run a function with timeout."""
    ret_value = []

    def _target_func():
        ret_value.append(func(*args, **(kwargs or {})))

    t = threading.Thread(target=_target_func)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        raise TimeoutError()

    if not ret_value:
        raise RuntimeError()

    return ret_value[0]
The current implementation of run_with_timeout swallows exceptions raised by the wrapped function func. If func raises an exception, it is not propagated, and a generic RuntimeError is raised instead, losing the original traceback. This makes debugging test failures very difficult. The function should be modified to capture and re-raise exceptions from the target thread.
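For comparison, the standard library's concurrent.futures gives the same propagate-or-timeout behavior with less bookkeeping. A sketch (not a drop-in for this PR, and with the caveat that the executor's shutdown still waits for a timed-out call to finish before control returns):

```python
import concurrent.futures

def run_with_timeout(func, args=(), kwargs=None, timeout=None):
    """Run func in a worker thread, re-raising its exception with the original traceback."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(func, *args, **(kwargs or {}))
        # result() re-raises any exception func raised; if the deadline passes,
        # it raises concurrent.futures.TimeoutError (an alias of the builtin
        # TimeoutError on Python 3.11+). Caveat: leaving the with-block calls
        # shutdown(wait=True), which blocks until a timed-out call finishes.
        return future.result(timeout=timeout)
```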
def run_with_timeout(
    func: Callable,
    args: tuple = (),
    kwargs: Optional[dict] = None,
    timeout: float = None,
):
    """Run a function with timeout."""
    ret_value = [None]
    exception = [None]

    def _target_func():
        try:
            ret_value[0] = func(*args, **(kwargs or {}))
        except Exception as e:
            exception[0] = e

    t = threading.Thread(target=_target_func)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        raise TimeoutError()
    if exception[0]:
        raise exception[0]
    return ret_value[0]

    else:
        passed_tests.append(filename)
except TimeoutError:
    kill_process_tree(process.pid)
There is a race condition here. If the timeout occurs before subprocess.Popen is called in the run_one_file thread, the process variable will still be None. This will cause kill_process_tree(process.pid) to raise an AttributeError, crashing the test runner. You should add a check to ensure process is not None before attempting to kill it.
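To see the guard in isolation, here is a hedged sketch; `kill_if_started` is a hypothetical stand-in for the fixed cleanup path, using `terminate()` rather than the runner's `kill_process_tree`:

```python
import subprocess

def kill_if_started(process):
    # Hypothetical sketch of the guarded cleanup: 'process' is still None if
    # the timeout fired before Popen() returned, so check before touching .pid.
    if process is not None and process.poll() is None:
        process.terminate()
        process.wait()
```

With this shape, calling the cleanup in the pre-Popen window (when the variable is still None) is a safe no-op instead of an AttributeError.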
if process:
    kill_process_tree(process.pid)

assert (
    est_time is not None
), "esimation_time is required and should be a constant"
Warning: Gemini encountered an error creating the review. You can try again by commenting.
add accuracy and performance test