Mooncake EP is an adaptation of [DeepEP](https://github.com/deepseek-ai/DeepEP) that supports **fault tolerance** and fast data transfer with **IBGDA**, designed as a critical component for large-scale, latency-sensitive MoE (Mixture of Experts) inference. Mooncake EP aims to retain full compatibility with the DeepEP API, with the addition of an `active_ranks` tensor passed to both the `dispatch` and `combine` functions to record which ranks are active. By integrating with the EPLB module, Mooncake EP provides fault tolerance during MoE inference, keeping performance robust even in large-scale, fault-prone environments.
Mooncake Backend is a PyTorch distributed backend (a replacement for NCCL and Gloo) that provides **fault-tolerant collective communication primitives** and can be seamlessly integrated into machine learning systems. Built with the [Transfer Engine](transfer-engine.md), Mooncake Backend ensures that collective communications can continue even in the event of rank failures. Furthermore, it reports these failures to the upper layers of the system, allowing for graceful error handling without disrupting ongoing operations.
## Usage
### Mooncake EP
> **Note:** Mooncake EP currently supports only the low-latency transfer mode.
The API is largely consistent with DeepEP's, with only minor differences in a few parameters. Mooncake EP exposes a `Buffer` class that can be imported from `mooncake.mooncake_ep_buffer`. For a usage example, see `mooncake-wheel/tests/test_mooncake_ep.py`.
#### Buffer.get_buffer_size_hint()
**Signature:**
```python
@staticmethod
def get_ep_buffer_size_hint(num_max_dispatch_tokens_per_rank: int, hidden: int, num_ranks: int, num_experts: int) -> int
```
Calculates the number of bytes to pre-allocate for data transfer.
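
As a rough illustration of how the hint might be consumed, the sketch below queries the size before any transfer starts and rejects a non-positive result. `bytes_to_preallocate` is a hypothetical helper, not part of Mooncake; it takes the `Buffer` class as an argument and calls the static method by the name shown in the signature above, passing arguments positionally in the documented order. The numeric values in the usage comment are arbitrary.

```python
def bytes_to_preallocate(buffer_cls, num_max_dispatch_tokens_per_rank,
                         hidden, num_ranks, num_experts):
    """Hypothetical helper: ask a Buffer-like class for its size hint.

    `buffer_cls` is assumed to expose the static method
    `get_ep_buffer_size_hint` with the signature documented above.
    """
    size = buffer_cls.get_ep_buffer_size_hint(
        num_max_dispatch_tokens_per_rank, hidden, num_ranks, num_experts
    )
    # Guard against a nonsensical hint before allocating anything.
    if size <= 0:
        raise ValueError(f"unexpected buffer size hint: {size}")
    return size


# Usage, assuming Mooncake is installed (values are illustrative only):
#   from mooncake.mooncake_ep_buffer import Buffer
#   nbytes = bytes_to_preallocate(Buffer, 128, 7168, 8, 256)
```

Passing the class in keeps the helper importable on machines without Mooncake, which is convenient for unit tests that substitute a stub.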
File: `doc/en/heterogeneous_ascend.md` (3 additions, 0 deletions)
@@ -9,6 +9,9 @@ Heterogeneous Ascend Transport is a high-performance data transmission library d
> The current version supports only WRITE semantics. READ semantics will be implemented in future releases.
## Enhanced HBM-to-DRAM Data Transfer Optimization
The copy bandwidth from HBM to DRAM is constrained by the size of the data blocks: blocks smaller than 2 MB leave the bandwidth underutilized. We have implemented an optimization that combines data aggregation with pipeline parallelism. First, small data blocks are aggregated into 8 MB blocks within HBM before being transferred to DRAM. Second, data copying and RDMA transmission are executed in parallel. This hides the HBM-to-DRAM copy latency and significantly reduces the overall transmission time.
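
The two ideas can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Ascend implementation: byte strings stand in for HBM blocks, and `copy_to_dram` and `rdma_send` are hypothetical callables supplied by the caller. Only the 8 MB aggregation size comes from the text above; the queue depth of 2 is an arbitrary choice.

```python
import queue
import threading

AGG_BLOCK_BYTES = 8 * 1024 * 1024  # aggregation size named in the text


def aggregate(blocks, limit=AGG_BLOCK_BYTES):
    """Pack small blocks, in order, into batches of at most `limit` bytes.

    A single block larger than `limit` is emitted on its own.
    """
    batch, size = [], 0
    for block in blocks:
        if batch and size + len(block) > limit:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(block)
        size += len(block)
    if batch:
        yield b"".join(batch)


def pipelined_transfer(blocks, copy_to_dram, rdma_send,
                       limit=AGG_BLOCK_BYTES, depth=2):
    """Overlap the copy stage with the send stage via a bounded queue.

    `copy_to_dram` and `rdma_send` are hypothetical stand-ins for the
    real copy and RDMA calls. Returns the number of batches sent.
    """
    staged = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def copier():
        try:
            for chunk in aggregate(blocks, limit):
                staged.put(copy_to_dram(chunk))
        finally:
            staged.put(done)  # always unblock the sender

    worker = threading.Thread(target=copier, daemon=True)
    worker.start()
    sent = 0
    while (item := staged.get()) is not done:
        rdma_send(item)  # runs while the copier stages the next batch
        sent += 1
    worker.join()
    return sent
```

The bounded queue is what provides the overlap: while one aggregated batch is being sent, the copier thread is already staging the next one, so the copy cost is paid mostly in the shadow of the send.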
## Build Instructions
The `USE_ASCEND_HETEROGENEOUS` compilation option has been added to `mooncake-common/common.cmake` to control this feature: