-
Notifications
You must be signed in to change notification settings - Fork 405
add transfer engine for training and inference. #569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a new RDMA-based transfer engine for both training and inference, along with end-to-end test scripts and sample configuration/model metadata.
- Introduces
MooncakeTransferEnginewrapper with sync and batch transfer APIs - Implements
MooncakeTraining(server) andMooncakeInference(client) using ZMQ and the transfer engine - Adds test harnesses (
torch.yaml,test_training.py,test_inference.py) and a sample weight file (little.txt)
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| mooncake-wheel/tests/torch.yaml | YAML config for IPs, GPUs, ports, and bulk settings |
| mooncake-wheel/tests/test_training.py | Training-side script enqueuing and transferring tensors |
| mooncake-wheel/tests/test_inference.py | Inference-side script receiving tensors and registering memory |
| mooncake-wheel/tests/little.txt | Sample model tensor metadata for tests |
| mooncake-wheel/mooncake/transfer_engine/transfer_engine.py | Wrapper around Mooncake.engine TransferEngine |
| mooncake-wheel/mooncake/transfer_engine/new_tensor.py | Utilities for tensor creation and slicing |
| mooncake-wheel/mooncake/transfer_engine/mooncake_training.py | Threaded training engine sending tensors via ZMQ and RDMA |
| mooncake-wheel/mooncake/transfer_engine/mooncake_inference.py | Inference client receiving tensors via ZMQ and RDMA |
| mooncake-wheel/mooncake/transfer_engine/init.py | Package init stub for backward compatibility |
Comments suppressed due to low confidence (2)
mooncake-wheel/mooncake/transfer_engine/transfer_engine.py:76
- Update the docstring for
batch_transfer_syncto reflect its batch behavior (e.g., mention handling multiple buffers and addresses).
"""Synchronously transfer data to the specified address."""
mooncake-wheel/tests/torch.yaml:6
- [nitpick] The
bulk_sizefield is declared but not used by the test scripts; consider removing it or wiring it into the code to avoid confusion.
bulk_size: 5 # 5GB
| from queue import Queue | ||
| from new_tensor import get_dtype_str | ||
|
|
||
| from transfer_engine import MooncakeTransferEngine |
Copilot
AI
Jun 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a relative import for the local transfer_engine module to avoid import errors (e.g., from .transfer_engine import MooncakeTransferEngine).
| from transfer_engine import MooncakeTransferEngine | |
| from .transfer_engine import MooncakeTransferEngine |
| import torch | ||
| import zmq | ||
|
|
||
| from transfer_engine import MooncakeTransferEngine |
Copilot
AI
Jun 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a relative import for the local transfer_engine module to avoid import errors (e.g., from .transfer_engine import MooncakeTransferEngine).
| from transfer_engine import MooncakeTransferEngine | |
| from .transfer_engine import MooncakeTransferEngine |
| logger.error("Mooncake Transfer Engine Return Error.") | ||
| raise RuntimeError("Mooncake Transfer Engine Return Error.") |
Copilot
AI
Jun 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Differentiate error messages for transfer_sync and batch_transfer_sync failures, and include context (e.g., function name or return code) to aid debugging.
| logger.error("Mooncake Transfer Engine Return Error.") | |
| raise RuntimeError("Mooncake Transfer Engine Return Error.") | |
| logger.error(f"Mooncake Transfer Engine `transfer_sync` failed with return code {ret}.") | |
| raise RuntimeError(f"Mooncake Transfer Engine `transfer_sync` failed with return code {ret}.") |
| def get_dtype(dtype: str)-> torch.dtype: | ||
| return dtype_map[dtype] | ||
|
|
||
| def get_dtype_str(dtype: torch.dtype)-> str: |
Copilot
AI
Jun 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add explicit fallback or error handling when a dtype key is not found in dtype_map to avoid returning None silently.
| from transfer_engine import MooncakeTransferEngine | ||
| from new_tensor import create_tensor | ||
|
|
||
| class MooncakeInference: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these class names do not precisely reveal what they are actually doing. Maybe the usage is for RL training and inference, but the names are very confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will fix
|
@liaozhaoyan Could you please add some doc for it? |
No description provided.