add transfer engine for training and inference. #569

liaozhaoyan · 2025-06-29T17:02:28Z

No description provided.

Copilot

Pull Request Overview

This PR adds a new RDMA-based transfer engine for both training and inference, along with end-to-end test scripts and sample configuration/model metadata.

Introduces MooncakeTransferEngine wrapper with sync and batch transfer APIs
Implements MooncakeTraining (server) and MooncakeInference (client) using ZMQ and the transfer engine
Adds test harnesses (torch.yaml, test_training.py, test_inference.py) and a sample weight file (little.txt)

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

File	Description
mooncake-wheel/tests/torch.yaml	YAML config for IPs, GPUs, ports, and bulk settings
mooncake-wheel/tests/test_training.py	Training-side script enqueuing and transferring tensors
mooncake-wheel/tests/test_inference.py	Inference-side script receiving tensors and registering memory
mooncake-wheel/tests/little.txt	Sample model tensor metadata for tests
mooncake-wheel/mooncake/transfer_engine/transfer_engine.py	Wrapper around Mooncake.engine TransferEngine
mooncake-wheel/mooncake/transfer_engine/new_tensor.py	Utilities for tensor creation and slicing
mooncake-wheel/mooncake/transfer_engine/mooncake_training.py	Threaded training engine sending tensors via ZMQ and RDMA
mooncake-wheel/mooncake/transfer_engine/mooncake_inference.py	Inference client receiving tensors via ZMQ and RDMA
mooncake-wheel/mooncake/transfer_engine/__init__.py	Package init stub for backward compatibility

Comments suppressed due to low confidence (2)

mooncake-wheel/mooncake/transfer_engine/transfer_engine.py:76

Update the docstring for batch_transfer_sync to reflect its batch behavior (e.g., mention handling multiple buffers and addresses).

        """Synchronously transfer data to the specified address."""
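One possible shape for a docstring that reflects the batch behavior is sketched below. The excerpt shows only the one-line docstring, so the signature, the parameter names, and the `batch_write` method on the stand-in engine are all assumptions made for illustration, not the PR's actual code or the real Mooncake API.

```python
from typing import Protocol, Sequence


class BatchEngine(Protocol):
    """Hypothetical stand-in for the wrapped engine (not the real Mooncake API)."""

    def batch_write(self, session_id: str, buffers: Sequence[int],
                    peer_addresses: Sequence[int],
                    lengths: Sequence[int]) -> int: ...


def batch_transfer_sync(engine: BatchEngine, session_id: str,
                        buffers: Sequence[int],
                        peer_buffer_addresses: Sequence[int],
                        lengths: Sequence[int]) -> int:
    """Synchronously transfer several buffers to a remote session in one call.

    Args:
        engine: Object that performs the transfer (assumed interface).
        session_id: Identifier of the remote peer.
        buffers: Local source addresses, one per transfer.
        peer_buffer_addresses: Remote destination addresses, one per transfer.
        lengths: Number of bytes to copy for each transfer.

    Returns:
        The return code reported by the engine.

    Raises:
        ValueError: If the three sequences do not have the same length.
    """
    if not (len(buffers) == len(peer_buffer_addresses) == len(lengths)):
        raise ValueError(
            "buffers, peer_buffer_addresses and lengths must have equal length"
        )
    return engine.batch_write(session_id, buffers, peer_buffer_addresses, lengths)
```

Checking the sequence lengths up front is a design choice of this sketch; it turns a mismatched batch into a clear Python error before anything reaches the engine.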

mooncake-wheel/tests/torch.yaml:6

[nitpick] The bulk_size field is declared but not used by the test scripts; consider removing it or wiring it into the code to avoid confusion.

bulk_size: 5 # 5GB
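If the field is kept, one way to wire it in is sketched below. This is a guess at the intent: it assumes `bulk_size` is a per-batch byte budget in GiB and that tensors are grouped so each batch stays under it. The helper names and the already-parsed `config` dict are hypothetical and do not come from the PR.

```python
from typing import Iterable, List, Tuple

GIB = 1024 ** 3  # assumption: "5GB" in the config comment means 5 GiB


def bulk_limit_bytes(config: dict) -> int:
    """Convert the `bulk_size` config value (in GiB) to bytes."""
    return int(config["bulk_size"] * GIB)


def group_into_bulks(tensors: Iterable[Tuple[str, int]],
                     limit: int) -> List[List[str]]:
    """Greedily group (name, nbytes) pairs so each group fits within `limit`."""
    bulks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, nbytes in tensors:
        if nbytes > limit:
            raise ValueError(f"tensor {name!r} ({nbytes} bytes) exceeds bulk limit")
        # Start a new bulk when adding this tensor would overflow the budget.
        if current and used + nbytes > limit:
            bulks.append(current)
            current, used = [], 0
        current.append(name)
        used += nbytes
    if current:
        bulks.append(current)
    return bulks
```

With `config = {"bulk_size": 5}`, three tensors of 3 GiB, 3 GiB and 1 GiB would be split into two bulks, because the first two cannot share a 5 GiB budget.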

Copilot · 2025-06-29T17:28:31Z

mooncake-wheel/mooncake/transfer_engine/mooncake_training.py

+from queue import Queue
+from new_tensor import get_dtype_str
+
+from transfer_engine import MooncakeTransferEngine


Use a relative import for the local transfer_engine module to avoid import errors (e.g., from .transfer_engine import MooncakeTransferEngine).

Suggested change

-from transfer_engine import MooncakeTransferEngine
+from .transfer_engine import MooncakeTransferEngine
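The difference between the two import styles can be reproduced with a throwaway package. The sketch below builds a hypothetical `demo_pkg` in a temporary directory (the package and file names are illustrative, not the PR's layout) and tries both forms when the module is imported as part of a package. It assumes no top-level module named `transfer_engine` is otherwise importable.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create a small package containing both import styles."""
    pkg = root / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "transfer_engine.py").write_text(
        "class MooncakeTransferEngine:\n    pass\n")
    (pkg / "absolute_style.py").write_text(
        "from transfer_engine import MooncakeTransferEngine\n")
    (pkg / "relative_style.py").write_text(
        "from .transfer_engine import MooncakeTransferEngine\n")


def try_import(module_name: str) -> str:
    """Return 'ok' or a short description of the import failure."""
    try:
        importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        return f"ModuleNotFoundError: {exc.name}"
    return "ok"


def compare_import_styles() -> dict:
    """Import both demo modules as package members and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            return {
                "absolute": try_import("demo_pkg.absolute_style"),
                "relative": try_import("demo_pkg.relative_style"),
            }
        finally:
            # Undo the path change and forget the demo modules.
            sys.path.remove(tmp)
            for name in [n for n in sys.modules
                         if n == "demo_pkg" or n.startswith("demo_pkg.")]:
                del sys.modules[name]


if __name__ == "__main__":
    print(compare_import_styles())
```

The absolute form only resolves when the directory holding `transfer_engine.py` is itself on `sys.path`, for example when a file in that directory is run directly as a script. The relative form resolves whenever the module is imported through its package.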

Copilot · 2025-06-29T17:28:32Z

mooncake-wheel/mooncake/transfer_engine/mooncake_inference.py

+import torch
+import zmq
+
+from transfer_engine import MooncakeTransferEngine


Use a relative import for the local transfer_engine module to avoid import errors (e.g., from .transfer_engine import MooncakeTransferEngine).

Suggested change

-from transfer_engine import MooncakeTransferEngine
+from .transfer_engine import MooncakeTransferEngine

Copilot · 2025-06-29T17:28:32Z

mooncake-wheel/mooncake/transfer_engine/transfer_engine.py

+            logger.error("Mooncake Transfer Engine Return Error.")
+            raise RuntimeError("Mooncake Transfer Engine Return Error.")


Differentiate error messages for transfer_sync and batch_transfer_sync failures, and include context (e.g., function name or return code) to aid debugging.

Suggested change

-            logger.error("Mooncake Transfer Engine Return Error.")
-            raise RuntimeError("Mooncake Transfer Engine Return Error.")
+            logger.error(f"Mooncake Transfer Engine `transfer_sync` failed with return code {ret}.")
+            raise RuntimeError(f"Mooncake Transfer Engine `transfer_sync` failed with return code {ret}.")
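To keep the two call sites from drifting apart, the check could be factored into one helper that takes the operation name. This is a minimal sketch: the helper name `raise_on_failure` is hypothetical, and it assumes a negative return code signals failure, which the excerpt does not show.

```python
import logging

logger = logging.getLogger(__name__)


def raise_on_failure(operation: str, ret: int) -> None:
    """Log and raise with the operation name and return code on failure.

    Assumption: a negative `ret` means the transfer failed.
    """
    if ret < 0:
        message = (
            f"Mooncake Transfer Engine `{operation}` failed "
            f"with return code {ret}."
        )
        logger.error(message)
        raise RuntimeError(message)
```

Each wrapper method would then call `raise_on_failure("transfer_sync", ret)` or `raise_on_failure("batch_transfer_sync", ret)`, so the log line and the exception always carry the same text.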

Copilot · 2025-06-29T17:28:33Z

mooncake-wheel/mooncake/transfer_engine/new_tensor.py

+def get_dtype(dtype: str)-> torch.dtype:
+    return dtype_map[dtype]
+
+def get_dtype_str(dtype: torch.dtype)-> str:


Add explicit fallback or error handling when a dtype key is not found in dtype_map to avoid returning None silently.
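A sketch of explicit error handling for both lookups follows. If `dtype_map` is a plain dict, `dtype_map[dtype]` raises a bare `KeyError` rather than returning `None`, while a `.get()` lookup would return `None` silently; the body of `get_dtype_str` is not shown in the excerpt, so either may apply. The helper `strict_lookup` is hypothetical, and the map values are string stand-ins for the `torch.dtype` objects the PR would use, so the example runs without PyTorch.

```python
from typing import Dict, Hashable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def strict_lookup(mapping: Dict[K, V], key: K, what: str) -> V:
    """Return mapping[key], or raise ValueError listing the supported keys."""
    try:
        return mapping[key]
    except KeyError:
        supported = ", ".join(sorted(str(k) for k in mapping))
        raise ValueError(
            f"Unsupported {what} {key!r}; supported values: {supported}"
        ) from None


# Stand-in values: in the PR these would be torch.dtype objects.
dtype_map = {
    "float16": "torch.float16",
    "bfloat16": "torch.bfloat16",
    "float32": "torch.float32",
}
# Reverse map built once so both directions stay in sync.
dtype_str_map = {value: name for name, value in dtype_map.items()}


def get_dtype(dtype: str):
    return strict_lookup(dtype_map, dtype, "dtype name")


def get_dtype_str(dtype) -> str:
    return strict_lookup(dtype_str_map, dtype, "dtype")
```

Deriving the reverse map from `dtype_map` means a dtype added in one direction cannot be forgotten in the other.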

ShangmingCai · 2025-06-30T03:48:04Z

mooncake-wheel/mooncake/transfer_engine/mooncake_inference.py

+from transfer_engine import MooncakeTransferEngine
+from new_tensor import create_tensor
+
+class MooncakeInference:


I think these class names do not precisely reveal what they are actually doing. Maybe the usage is for RL training and inference, but the names are very confusing.

Sure. Will fix

doujiang24 · 2025-07-03T15:20:46Z

@liaozhaoyan Could you please add some doc for it?

stmatengss requested a review from Copilot June 29, 2025 17:27

stmatengss self-assigned this Jun 29, 2025

Copilot AI reviewed Jun 29, 2025

add transfer engine for training and inference.

857123e

liaozhaoyan force-pushed the main branch from 7ca6478 to 857123e Compare June 29, 2025 17:30

ShangmingCai reviewed Jun 30, 2025
