Skip to content

Conversation

@liaozhaoyan
Copy link

No description provided.

@stmatengss stmatengss requested a review from Copilot June 29, 2025 17:27
@stmatengss stmatengss self-assigned this Jun 29, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a new RDMA-based transfer engine for both training and inference, along with end-to-end test scripts and sample configuration/model metadata.

  • Introduces MooncakeTransferEngine wrapper with sync and batch transfer APIs
  • Implements MooncakeTraining (server) and MooncakeInference (client) using ZMQ and the transfer engine
  • Adds test harnesses (torch.yaml, test_training.py, test_inference.py) and a sample weight file (little.txt)

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
mooncake-wheel/tests/torch.yaml YAML config for IPs, GPUs, ports, and bulk settings
mooncake-wheel/tests/test_training.py Training-side script enqueuing and transferring tensors
mooncake-wheel/tests/test_inference.py Inference-side script receiving tensors and registering memory
mooncake-wheel/tests/little.txt Sample model tensor metadata for tests
mooncake-wheel/mooncake/transfer_engine/transfer_engine.py Wrapper around Mooncake.engine TransferEngine
mooncake-wheel/mooncake/transfer_engine/new_tensor.py Utilities for tensor creation and slicing
mooncake-wheel/mooncake/transfer_engine/mooncake_training.py Threaded training engine sending tensors via ZMQ and RDMA
mooncake-wheel/mooncake/transfer_engine/mooncake_inference.py Inference client receiving tensors via ZMQ and RDMA
mooncake-wheel/mooncake/transfer_engine/init.py Package init stub for backward compatibility
Comments suppressed due to low confidence (2)

mooncake-wheel/mooncake/transfer_engine/transfer_engine.py:76

  • Update the docstring for batch_transfer_sync to reflect its batch behavior (e.g., mention handling multiple buffers and addresses).
        """Synchronously transfer data to the specified address."""

mooncake-wheel/tests/torch.yaml:6

  • [nitpick] The bulk_size field is declared but not used by the test scripts; consider removing it or wiring it into the code to avoid confusion.
bulk_size: 5 # 5GB

from queue import Queue
from new_tensor import get_dtype_str

from transfer_engine import MooncakeTransferEngine
Copy link

Copilot AI Jun 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a relative import for the local transfer_engine module to avoid import errors (e.g., from .transfer_engine import MooncakeTransferEngine).

Suggested change
from transfer_engine import MooncakeTransferEngine
from .transfer_engine import MooncakeTransferEngine

Copilot uses AI. Check for mistakes.
import torch
import zmq

from transfer_engine import MooncakeTransferEngine
Copy link

Copilot AI Jun 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a relative import for the local transfer_engine module to avoid import errors (e.g., from .transfer_engine import MooncakeTransferEngine).

Suggested change
from transfer_engine import MooncakeTransferEngine
from .transfer_engine import MooncakeTransferEngine

Copilot uses AI. Check for mistakes.
Comment on lines +69 to +70
logger.error("Mooncake Transfer Engine Return Error.")
raise RuntimeError("Mooncake Transfer Engine Return Error.")
Copy link

Copilot AI Jun 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Differentiate error messages for transfer_sync and batch_transfer_sync failures, and include context (e.g., function name or return code) to aid debugging.

Suggested change
logger.error("Mooncake Transfer Engine Return Error.")
raise RuntimeError("Mooncake Transfer Engine Return Error.")
logger.error(f"Mooncake Transfer Engine `transfer_sync` failed with return code {ret}.")
raise RuntimeError(f"Mooncake Transfer Engine `transfer_sync` failed with return code {ret}.")

Copilot uses AI. Check for mistakes.
def get_dtype(dtype: str)-> torch.dtype:
return dtype_map[dtype]

def get_dtype_str(dtype: torch.dtype)-> str:
Copy link

Copilot AI Jun 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add explicit fallback or error handling when a dtype key is not found in dtype_map to avoid returning None silently.

Copilot uses AI. Check for mistakes.
from transfer_engine import MooncakeTransferEngine
from new_tensor import create_tensor

class MooncakeInference:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these class names do not precisely reveal what they are actually doing. Maybe the usage is for RL training and inference, but the names are very confusing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will fix

@doujiang24
Copy link
Collaborator

@liaozhaoyan Could you please add some doc for it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants