Commit c041d61

Merge branch 'main' into jinlong/nvmeof-upstream
2 parents 2c27250 + cadaabc commit c041d61

120 files changed: 5314 additions, 2080 deletions


.github/ISSUE_TEMPLATE/3.performance_dicussion.yml

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-name: "⚙️ Preformance discussions"
+name: "⚙️ Performance discussions"
 description: "Questions about Mooncake's performance"
 title: "[Performance]: "
 labels: ["performance"]

.github/workflows/ci.yml

Lines changed: 92 additions & 13 deletions

@@ -91,14 +91,6 @@ jobs:
           python ./bootstrap_server.py &
         shell: bash

-      - name: Start Mooncake Master
-        run: |
-          export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
-          # Set a small kv lease ttl to make the test faster.
-          # Must be consistent with the client test parameters.
-          mooncake_master --default_kv_lease_ttl=500 &
-        shell: bash
-
       - name: Test (in build env)
         run: |
           cd build
@@ -107,11 +99,6 @@ jobs:
           MC_METADATA_SERVER=http://127.0.0.1:8080/metadata DEFAULT_KV_LEASE_TTL=500 make test -j ARGS="-V"
         shell: bash

-      - name: Stop Mooncake Master Service
-        run: |
-          pkill mooncake_master || true
-        shell: bash
-
       - name: Generate Python version tag
         id: generate_tag_build
         run: |
@@ -307,6 +294,98 @@ jobs:
           name: mooncake-wheel-ubuntu-py${{ steps.generate_tag_flags.outputs.python_version_tag }}
           path: mooncake-wheel/dist-py${{ steps.generate_tag_flags.outputs.python_version_tag }}/*.whl

+  build-with-ep:
+    runs-on: ubuntu-22.04
+    strategy:
+      matrix:
+        python-version: ['3.10', '3.12']
+    env:
+      BUILD_WITH_EP: "1"
+      SCCACHE_GHA_ENABLED: "true"
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Free up disk space
+        run: |
+          sudo rm -rf /usr/share/dotnet
+          sudo rm -rf /opt/ghc
+          sudo rm -rf /opt/hostedtoolcache/CodeQL
+
+      - name: Install CUDA Toolkit
+        uses: Jimver/[email protected]
+        with:
+          cuda: '12.8.1'
+          linux-local-args: '["--toolkit"]'
+          method: 'network'
+          sub-packages: '["nvcc", "nvrtc-dev"]'
+          non-cuda-sub-packages: '["libcusparse-dev", "libcublas-dev", "libcusolver-dev"]'
+
+      - name: Run sccache-cache
+        uses: mozilla-actions/[email protected]
+
+      - name: Configure sccache
+        uses: actions/github-script@v7
+        with:
+          script: |
+            core.exportVariable('ACTIONS_RESULTS_URL', process.env.ACTIONS_RESULTS_URL || '');
+            core.exportVariable('ACTIONS_RUNTIME_TOKEN', process.env.ACTIONS_RUNTIME_TOKEN || '');
+
+      - name: Run sccache stat for check
+        shell: bash
+        run: ${SCCACHE_PATH} --show-stats
+
+      - name: Install dependencies
+        run: |
+          sudo apt update -y
+          sudo bash -x dependencies.sh -y
+          pip install toml-cli # for updating the version
+          pip install torch==2.8.0
+        shell: bash
+
+      - name: Build transfer engine with EP
+        run: |
+          mkdir build
+          cd build
+          export PATH=/usr/local/nvidia/bin:/usr/local/nvidia/lib64:$PATH
+          export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH
+          cmake .. -DUSE_ETCD=ON -DUSE_REDIS=ON -DUSE_HTTP=ON -DUSE_CUDA=ON -DWITH_STORE=ON -DWITH_P2P_STORE=ON -DWITH_EP=ON -DWITH_METRICS=ON -DBUILD_UNIT_TESTS=ON -DBUILD_EXAMPLES=ON -DENABLE_SCCACHE=ON -DUSE_CUDA=OFF -DUSE_MNNVL=OFF -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs"
+          make -j
+          sudo make install
+        shell: bash
+
+      - name: Build nvlink_allocator.so
+        run: |
+          mkdir -p build/mooncake-transfer-engine/nvlink-allocator
+          cd mooncake-transfer-engine/nvlink-allocator
+          bash build.sh --ci-build ../../build/mooncake-transfer-engine/nvlink-allocator/
+        shell: bash
+
+      - name: Generate Python version tag
+        id: generate_tag_flags
+        run: |
+          echo "python_version_tag=$(echo ${{ matrix.python-version }} | tr -d '.')" >> $GITHUB_OUTPUT
+        shell: bash
+
+      - name: Build Python wheel
+        run: |
+          BASE_VERSION=$(toml get --toml-path mooncake-wheel/pyproject.toml project.version | tr -d '"')
+          toml set --toml-path mooncake-wheel/pyproject.toml project.version "${BASE_VERSION}+ep"
+          # Build wheel with specific Python version
+          PYTHON_VERSION=${{ matrix.python-version }} OUTPUT_DIR=dist-py${{ steps.generate_tag_flags.outputs.python_version_tag }} ./scripts/build_wheel.sh
+        shell: bash
+
+      - name: Upload Python wheel artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: mooncake-wheel-ubuntu-py${{ steps.generate_tag_flags.outputs.python_version_tag }}+ep
+          path: mooncake-wheel/dist-py${{ steps.generate_tag_flags.outputs.python_version_tag }}/*.whl
+
   build-docker:
     name: Build Docker Image
     runs-on: ubuntu-22.04
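
The `build-with-ep` job relies on two small string transformations: `tr -d '.'` turns the matrix Python version into a tag such as `310`, and the `toml set` call appends `+ep` to the project version. A minimal Python sketch of the same transformations follows. The helper names `python_version_tag` and `ep_version` are hypothetical and exist only for illustration; they are not part of the repository or the workflow.

```python
def python_version_tag(python_version: str) -> str:
    """Mirror `echo <version> | tr -d '.'`: drop every dot from the version string."""
    return python_version.replace(".", "")


def ep_version(base_version: str) -> str:
    """Append the `+ep` local version label, as the workflow does via toml-cli."""
    return f"{base_version}+ep"


if __name__ == "__main__":
    # The matrix in the workflow covers these two interpreter versions.
    for version in ("3.10", "3.12"):
        print(version, "->", python_version_tag(version))
    # 0.3.7 is the version mentioned in the doc/en/build.md change of this commit.
    print(ep_version("0.3.7"))
```

The `+ep` suffix is a local version label, so the EP wheel shares the base version with the regular wheel while remaining distinguishable in artifact names.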

.github/workflows/release.yaml

Lines changed: 97 additions & 1 deletion

@@ -95,8 +95,104 @@ jobs:
           name: mooncake-wheel-py${{ steps.generate_tag_release.outputs.python_version_tag }}
           path: mooncake-wheel/dist-py${{ steps.generate_tag_release.outputs.python_version_tag }}/*.whl

+  build-with-ep:
+    runs-on: ubuntu-22.04
+    permissions:
+      contents: write
+    strategy:
+      matrix:
+        python-version: ['3.10', '3.12']
+    env:
+      BUILD_WITH_EP: "1"
+    steps:
+      - name: Checkout source
+        uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Free up disk space
+        run: |
+          sudo rm -rf /usr/share/dotnet
+          sudo rm -rf /opt/ghc
+          sudo rm -rf /opt/hostedtoolcache/CodeQL
+
+      - name: Install CUDA Toolkit
+        uses: Jimver/[email protected]
+        with:
+          cuda: '12.8.1'
+          linux-local-args: '["--toolkit"]'
+          method: 'network'
+          sub-packages: '["nvcc", "nvrtc-dev"]'
+          non-cuda-sub-packages: '["libcusparse-dev", "libcublas-dev", "libcusolver-dev"]'
+
+      - name: Run sccache-cache
+        uses: mozilla-actions/[email protected]
+
+      - name: Configure sccache
+        uses: actions/github-script@v7
+        with:
+          script: |
+            core.exportVariable('ACTIONS_RESULTS_URL', process.env.ACTIONS_RESULTS_URL || '');
+            core.exportVariable('ACTIONS_RUNTIME_TOKEN', process.env.ACTIONS_RUNTIME_TOKEN || '');
+
+      - name: Run sccache stat for check
+        shell: bash
+        run: ${SCCACHE_PATH} --show-stats
+
+      - name: Configure project
+        run: |
+          sudo apt update -y
+          sudo bash -x dependencies.sh -y
+          pip install toml-cli # for updating the version
+          pip install torch==2.8.0
+          mkdir build
+          cd build
+          cmake .. -DUSE_HTTP=ON -DUSE_ETCD=ON -DUSE_CUDA=ON -DWITH_EP=ON -DSTORE_USE_ETCD=ON -DENABLE_SCCACHE=ON -DCMAKE_BUILD_TYPE=Release
+        shell: bash
+
+      - name: Build project
+        run: |
+          cd build
+          make -j
+          sudo make install
+        shell: bash
+
+      - name: Build nvlink_allocator.so
+        run: |
+          mkdir -p build/mooncake-transfer-engine/nvlink-allocator
+          cd mooncake-transfer-engine/nvlink-allocator
+          bash build.sh --ci-build ../../build/mooncake-transfer-engine/nvlink-allocator/
+        shell: bash
+
+      - name: Generate Python version tag
+        id: generate_tag_release
+        run: |
+          echo "python_version_tag=$(echo ${{ matrix.python-version }} | tr -d '.')" >> $GITHUB_OUTPUT
+        shell: bash
+
+      - name: Build Python wheel
+        run: |
+          BASE_VERSION=$(toml get --toml-path mooncake-wheel/pyproject.toml project.version | tr -d '"')
+          toml set --toml-path mooncake-wheel/pyproject.toml project.version "${BASE_VERSION}+ep"
+          # Set LD_LIBRARY_PATH for wheel building
+          export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
+          PYTHON_VERSION=${{ matrix.python-version }} OUTPUT_DIR=dist-py${{ steps.generate_tag_release.outputs.python_version_tag }} ./scripts/build_wheel.sh
+        env:
+          VERSION: ${{ env.VERSION }}
+
+      - name: Upload Python wheel artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: mooncake-wheel-py${{ steps.generate_tag_release.outputs.python_version_tag }}+ep
+          path: mooncake-wheel/dist-py${{ steps.generate_tag_release.outputs.python_version_tag }}/*.whl
+
   publish-release:
-    needs: build
+    needs:
+      - build
+      - build-with-ep
     runs-on: ubuntu-22.04
     permissions:
       contents: write

.typos.toml

Lines changed: 3 additions & 4 deletions

@@ -1,8 +1,7 @@
 [default]
-extend-ignore-words = ["CANN"]
-
-[files]
-extend-exclude = ["mooncake-ep/csrc/*.h"]
+extend-ignore-words = ["CANN", "ASO", "fre"]

 [default.extend-words]
 CANN = "CANN"
+ASO = "ASO"
+fre = "fre"

CMakeLists.txt

Lines changed: 7 additions & 0 deletions

@@ -15,6 +15,7 @@ endif()
 option(WITH_STORE "build mooncake store library and sample code" ON)
 option(WITH_P2P_STORE "build p2p store library and sample code" OFF)
 option(WITH_RUST_EXAMPLE "build the Rust interface and sample code for the transfer engine" OFF)
+option(WITH_EP "build mooncake with expert parallelism support" OFF)

 add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/extern/pybind11)
 set(PYTHON_EXECUTABLE "python3")
@@ -51,6 +52,12 @@ if (WITH_STORE)
   include_directories(mooncake-store/include)
 endif()

+if (WITH_EP)
+  message(STATUS "Mooncake EP will be built")
+  add_subdirectory(mooncake-ep)
+  include_directories(mooncake-ep/include)
+endif()
+
 add_subdirectory(mooncake-integration)

 if (WITH_P2P_STORE)

doc/en/build.md

Lines changed: 5 additions & 0 deletions

@@ -6,6 +6,11 @@ This document describes how to build Mooncake from source.
 ```bash
 pip3 install mooncake-transfer-engine --upgrade
 ```
+- To install with the Mooncake Backend and Mooncake EP support, use the following command:
+```bash
+# replace torch2.8.0 with the corresponding version
+pip3 install mooncake-transfer-engine==0.3.7+ep --upgrade
+```

 ## Automatic


doc/en/ep-backend.md

Lines changed: 70 additions & 0 deletions

@@ -0,0 +1,70 @@
+# Mooncake EP & Mooncake Backend
+
+## Overview
+
+Mooncake EP is an adaption of [DeepEP](https://github.com/deepseek-ai/DeepEP) that supports **fault tolerance** and fast data transfer with **IBGDA**, designed as a critical component for large-scale, latency-sensitive MoE (Mixture of Experts) inference. Mooncake EP aims to retain full compatibility with the DeepEP API, with the addition of an `active_ranks` tensor passed to both the `dispatch` and `combine` functions to capture information about rank activeness. By integrating with the EPLB module, Mooncake EP ensures fault tolerance during MoE inference, enabling robust performance even in large-scale, fault-prone environments.
+
+Mooncake Backend is a PyTorch distributed backend (a replacement for NCCL and Gloo) that provides **fault-tolerant collective communication primitives** and can be seamlessly integrated into machine learning systems. Built with the [Transfer Engine](transfer-engine.md), Mooncake Backend ensures that collective communications can continue even in the event of rank failures. Furthermore, it reports these failures to the upper layers of the system, allowing for graceful error handling without disrupting ongoing operations.
+
+## Usage
+
+### Mooncake EP
+
+> **Note:** Mooncake EP currently supports only the low-latency transfer mode.
+
+The API is largely consistent with DeepEP's, with only minor differences in a few parameters. Mooncake EP exposes a `Buffer` that can be imported from `mooncake.mooncake_ep_buffer`. For example, refer to `mooncake-wheel/tests/test_mooncake_ep.py`.
+
+#### Buffer.get_buffer_size_hint()
+
+**Signature:**
+
+```python
+@staticmethod
+def get_ep_buffer_size_hint(num_max_dispatch_tokens_per_rank: int, hidden: int, num_ranks: int, num_experts: int) -> int
+```
+
+Calculates the number of bytes to pre-allocate for data transfer.
+
+#### Buffer.\_\_init\_\_()
+
+**Signature:**
+
+```python
+def __init__(self, group: dist.ProcessGroup, num_ep_buffer_bytes: int = 0)
+```
+
+The constructor. Ensure that only one instance is created.
+
+- **group**: Must be a Mooncake Backend process group.
+- **num_ep_buffer_bytes**: The number of bytes acquired with `Buffer.get_buffer_size_hint()`
+
+#### Buffer.dispatch/Buffer.combine
+
+**Signature:** Similar to DeepEP's `low_latency_dispatch`/`low_latency_combine`, with two additional parameters:
+
+- **active_ranks**: A tensor of shape `(num_ranks,)` containing values of 0 or 1. The indices of the broken ranks will be set to 0.
+- **timeout_us**: The timeout in microseconds for a rank to be considered broken. Set to -1 for infinite timeout.
+
+### Mooncake Backend
+
+Basic usage:
+
+```python
+import torch
+import torch.distributed as dist
+from mooncake import ep
+
+active_ranks = torch.ones((world_size,), dtype=torch.int32, device="cuda")
+dist.init_process_group(
+    backend="mooncake",
+    rank=rank,
+    world_size=world_size,
+    pg_options=ep.MooncakeBackendOptions(active_ranks),
+)
+
+dist.all_gather(...)  # Standard API usage
+assert active_ranks.all()  # Verify that no ranks are broken
+```
+
+For a full example, see `mooncake-wheel/tests/test_mooncake_backend.py`.
+
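
The added documentation describes `active_ranks` as a 0/1 vector of length `num_ranks` in which broken ranks are set to 0. The sketch below shows how a caller might interpret such a vector after a collective or a `dispatch`/`combine` call. It uses plain Python sequences so that it runs without a GPU; with a real tensor one would presumably pass `active_ranks.tolist()`. The helper names `broken_ranks` and `all_active` are hypothetical and are not part of the Mooncake API.

```python
from typing import List, Sequence


def broken_ranks(active_ranks: Sequence[int]) -> List[int]:
    """Return the indices whose entry is 0, i.e. the ranks reported as broken."""
    return [rank for rank, active in enumerate(active_ranks) if active == 0]


def all_active(active_ranks: Sequence[int]) -> bool:
    """True when every rank is still marked active (all entries equal 1)."""
    return all(value == 1 for value in active_ranks)


if __name__ == "__main__":
    # Hypothetical state of a 4-rank group in which rank 2 timed out.
    state = [1, 1, 0, 1]
    print("broken:", broken_ranks(state))
    print("healthy:", all_active(state))
```

Checking the vector after each call, rather than asserting on it, lets the upper layer decide how to react to a failed rank.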

doc/en/error-code.md

Lines changed: 3 additions & 2 deletions

@@ -35,18 +35,19 @@ Mooncake Store may generate various types of errors during execution. For most A
 | Segment Selection | SHARD_INDEX_OUT_OF_RANGE (-100)| Shard index is out of bounds |
 | | SEGMENT_NOT_FOUND (-101) | No available segments found |
 | | SEGMENT_ALREADY_EXISTS (-102) | Segment already exists |
-| Handle Selection | NO_AVAILABLE_HANDLE (-200) | Memory allocation failed due to insufficient space. |
+| Handle Selection | NO_AVAILABLE_HANDLE (-200) | Memory allocation failed due to insufficient space |
 | Version | INVALID_VERSION (-300) | Invalid version |
 | Key | INVALID_KEY (-400) | Invalid key |
 | Engine | WRITE_FAIL (-500) | Write operation failed |
 | Parameter | INVALID_PARAMS (-600) | Invalid parameters |
 | Engine Operation | INVALID_WRITE (-700) | Invalid write operation |
 | | INVALID_READ (-701) | Invalid read operation |
 | | INVALID_REPLICA (-702) | Invalid replica operation |
-| | REPLICA_IS_NOT_READY (-703) | Replica is not ready |
+| Object | REPLICA_IS_NOT_READY (-703) | Replica is not ready |
 | | OBJECT_NOT_FOUND (-704) | Object not found |
 | | OBJECT_ALREADY_EXISTS (-705) | Object already exists |
 | | OBJECT_HAS_LEASE (-706) | Object has lease |
+| | LEASE_EXPIRED (-707) | Lease expired before data transfer completed |
 | Transfer | TRANSFER_FAIL (-800) | Transfer operation failed |
 | RPC | RPC_FAIL (-900) | RPC operation failed |
 | High Availability | ETCD_OPERATION_ERROR (-1000) | etcd operation failed |
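
The "Object" rows of the table above, including the newly added `LEASE_EXPIRED (-707)`, can be mirrored on the client side for readable logging. The following is only a sketch: the enum name `StoreErrorCode` and the helper `describe` are hypothetical, and only the names and numeric values come from the table.

```python
from enum import IntEnum


class StoreErrorCode(IntEnum):
    """Subset of the documented error codes (values copied from the table)."""
    NO_AVAILABLE_HANDLE = -200
    REPLICA_IS_NOT_READY = -703
    OBJECT_NOT_FOUND = -704
    OBJECT_ALREADY_EXISTS = -705
    OBJECT_HAS_LEASE = -706
    LEASE_EXPIRED = -707
    TRANSFER_FAIL = -800
    RPC_FAIL = -900


def describe(code: int) -> str:
    """Map a numeric code to its symbolic name, with a fallback for unknown codes."""
    try:
        return StoreErrorCode(code).name
    except ValueError:
        return f"UNKNOWN({code})"


if __name__ == "__main__":
    print(describe(-707))
    print(describe(-1))
```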

doc/en/heterogeneous_ascend.md

Lines changed: 3 additions & 0 deletions

@@ -9,6 +9,9 @@ Heterogeneous Ascend Transport is a high-performance data transmission library d

 > Current version only supports WRITE semantics. READ semantics will be implemented in future releases.

+## Enhanced HBM-to-DRAM Data Transfer Optimization
+The copy bandwidth from HBM to DRAM is constrained by the size of data blocks. Small data blocks smaller than 2MB result in underutilized bandwidth. We have implemented an optimization using "data aggregation + pipeline parallelism": first, small data blocks are aggregated into 8MB blocks within HBM before being transferred to DRAM, while data copying and RDMA transmission are executed in parallel. This solution effectively hides the HBM-DRAM copy latency and significantly reduces the overall transmission time.
+
 ## Build Instructions
 The `USE_ASCEND_HETEROGENEOUS` compilation option has been added to `mooncake-common/common.cmake` to control this feature:

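
The "data aggregation" half of the optimization described in the added section can be illustrated with a small packing routine that groups consecutive small blocks into batches of at most 8 MB. This is a sketch under stated assumptions, not the transport's implementation: the function name `aggregate_blocks`, the greedy packing policy, and the handling of oversized blocks are all hypothetical, and the pipelining of copy and RDMA transmission is not modeled.

```python
from typing import List

MB = 1024 * 1024


def aggregate_blocks(block_sizes: List[int], target: int = 8 * MB) -> List[List[int]]:
    """Greedily pack consecutive block sizes into batches of at most `target` bytes.

    A block larger than `target` ends up alone in its own batch.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in block_sizes:
        # Flush the current batch if adding this block would exceed the target.
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Ten 1 MB blocks are grouped into one full 8 MB batch and one 2 MB remainder.
    for batch in aggregate_blocks([1 * MB] * 10):
        print(len(batch), "blocks,", sum(batch) // MB, "MB")
```

In a pipelined design, each completed batch could be handed to the RDMA stage while the next batch is still being assembled, which is where the latency hiding described above would come from.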
