Commit 7e4e3e6

prov/efa: Add efa fabrics comparison doc

This doc compares the features and implementations between efa and efa-direct fabrics.

Signed-off-by: Shi Jin <sjina@amazon.com>

File tree

1 file changed

+283
-0
lines changed

1 file changed

+283
-0
lines changed
Lines changed: 283 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,283 @@
# EFA vs EFA-Direct Fabric Comparison

## Overview

The Libfabric EFA provider offers an interface to the Elastic Fabric Adapter (EFA) NIC produced by AWS. The EFA NIC supports both two-sided and one-sided RDMA using a proprietary protocol called [Scalable Reliable Datagram (SRD)](https://ieeexplore.ieee.org/document/9167399). The EFA provider in libfabric offers two distinct fabric types: `efa` and `efa-direct`. Both fabrics provide the RDM (reliable datagram) endpoint type, but they differ in their implementation approach and code path complexity.
The **`efa` fabric** implements a comprehensive set of [wire protocols](efa_rdm_protocol_v4.md) that include emulations to support capabilities beyond what the EFA device natively provides. This allows broader libfabric feature support and application compatibility, but results in a more complex code path with additional protocol overhead.

The **`efa-direct` fabric** offers a more direct approach that mostly exposes only what the EFA NIC hardware natively supports. This results in a more compact and efficient code path with reduced protocol overhead, but requires applications to work within the constraints of the hardware capabilities.
## Basic Workflow

The data transfer path in libfabric can be roughly divided into two categories: work request (WR) post and completion polling. Operations like `fi_send`/`fi_recv`/`fi_write`/`fi_read` fall into the first category, while `fi_cq_read` and its variants fall into the second category. The WR post can be further divided into Tx (`fi_send`/`fi_write`/`fi_read`) and Rx (`fi_recv`) post.
### EFA-Direct Workflow

EFA-direct provides a straightforward, direct mapping to hardware operations:

**Tx Post:**

- Constructs a Work Queue Entry (WQE) directly from application calls (`fi_*` functions)
- Maintains a 1-to-1 mapping between WQE and libfabric call
- Only performs two operations before data is sent over the wire:
  1. Construct the WQE
  2. Ring the doorbell (when required)

**Rx Post:**

- No internal Rx buffers: each `fi_recv` call is constructed as a WQE and posted directly to the device
- User buffers from `fi_recv` calls are directly used by hardware
- Zero-copy receive path with direct data placement

**Completion Polling:**

- Maintains a 1-to-1 mapping between device completions and libfabric completions
- Polls the device CQ directly
- Generates a libfabric CQ entry from each device completion
### EFA Workflow

EFA fabric implements a more complex, layered approach with protocol emulation:
**Tx Post:**

- Allocates an internal data structure called `efa_rdm_ope` (EFA-RDM operational entry)
- Maintains a 1-to-1 mapping between `efa_rdm_ope` and libfabric call (`fi_*` functions)
- Chooses the appropriate protocol based on operation type and message size
- Allocates `efa_rdm_pke` (EFA-RDM packet entry) structures from a buffer pool
- Each packet entry is a 128-byte data structure allocated from ~8KB buffers (comparable to the device MTU size) to support staging wire data from the application when necessary
- Each `pke` corresponds to a WQE that interacts with the EFA device
- One operation entry can map to multiple packet entries (e.g., a 16KB message can be sent via 2 packet entries)
- **Note**: For RMA operations (`fi_read`/`fi_write`), this workflow still applies, but when device RDMA is available, the data goes directly to/from user buffers without internal staging or copying. Since the efa fabric supports unlimited RMA sizes, a libfabric message larger than the device's max RDMA size consumes multiple packet entries.
**Rx Post:**

- Pre-posts internal Rx buffers to the device for incoming data from peers
- User buffers from `fi_recv` calls are queued in an internal libfabric queue (not posted to the device)
- On device completion, searches the internal queue to find the matching Rx buffer
- Copies data from the packet entry to the matched user buffer
- **Note**: One exception is the "zero-copy receive" mode of the efa fabric. See [efa_rdm_protocol_v4.md](efa_rdm_protocol_v4.md) for details.
**Completion Polling:**

- Polls the device CQ for completions of packet entries posted to the EFA device
- Finds the corresponding operation entries stored in the packet entry structures
- Uses counters and metadata in the operation entry to track completion progress
- Generates a libfabric completion when the operation entry has all required data
### Workflow Comparison Diagrams

```mermaid
sequenceDiagram
    title EFA-Direct Workflow - Simple Direct Path
    participant App as Application
    participant EFADirect as EFA-Direct
    participant Device as EFA Device

    Note over App,Device: Tx Post (fi_send)
    App->>EFADirect: fi_send()
    EFADirect->>EFADirect: Construct WQE
    EFADirect->>Device: Ring doorbell
    Device->>Device: Send data over wire

    Note over App,Device: Rx Post (fi_recv)
    App->>EFADirect: fi_recv()
    EFADirect->>EFADirect: Construct WQE
    EFADirect->>Device: Ring doorbell
    Device->>Device: Receive data (zero-copy)

    Note over App,Device: Completion Polling
    App->>EFADirect: fi_cq_read()
    EFADirect->>Device: Poll device CQ
    Device->>EFADirect: Device completion
    EFADirect->>App: Return completion (1:1 mapping)
```
```mermaid
sequenceDiagram
    title EFA Workflow - Complex Layered Path
    participant App as Application
    participant EFA as EFA Fabric
    participant Queue as Internal Rx Queue
    participant Device as EFA Device

    Note over App,Device: Tx Post (fi_send)
    App->>EFA: fi_send()
    EFA->>EFA: Alloc efa_rdm_ope
    EFA->>EFA: Choose protocol
    EFA->>EFA: Alloc efa_rdm_pke (128B from ~8KB pool)
    EFA->>EFA: Stage data (if necessary)
    EFA->>EFA: Construct WQE
    EFA->>Device: Ring doorbell
    Device->>Device: Send data over wire

    Note over App,Device: Rx Post (fi_recv)
    App->>EFA: fi_recv()
    EFA->>Queue: Queue user buffer
    Note over EFA,Device: Internal Rx buffers already posted
    Device->>Device: Receive data to internal buffer

    Note over App,Device: Completion Polling
    App->>EFA: fi_cq_read()
    EFA->>Device: Poll device CQ
    Device->>EFA: Device completion (pke)
    EFA->>EFA: Find ope from pke
    EFA->>Queue: Search Rx queue for match
    EFA->>EFA: Copy from pke to user buffer
    EFA->>App: Return completion
```
## Feature Support Matrix

### Key

✓ = well supported

\* = limited support

❌ = not supported

R = required mode bit

O = optional mode bit

` ` (no mark) = not applicable or not needed

***
| **Endpoint Types** |efa|efa-direct|
| -------------------------- |:-:|:--------:|
| `FI_EP_RDM` |✓ |✓ |
| `FI_EP_DGRAM` |✓ |❌ |
| `FI_EP_MSG` |❌ |❌ |

- Both support FI_EP_RDM for reliable datagram.
- FI_EP_DGRAM is only supported by the efa fabric. Though it uses the same code path as efa-direct, it is kept in the efa fabric for backward compatibility.
- Neither supports the MSG endpoint type today.
| **Primary Caps** |efa|efa-direct|
| ------------------ |:-:|:--------:|
| `FI_ATOMIC` |✓ |❌ |
| `FI_DIRECTED_RECV` |✓ |❌ |
| `FI_HMEM` |✓ |✓ |
| `FI_MSG` |✓ |✓ |
| `FI_MULTICAST` |❌ |❌ |
| `FI_NAMED_RX_CTX` |❌ |❌ |
| `FI_RMA` |✓ |✓ |
| `FI_TAGGED` |✓ |❌ |
| **Primary Mods** |efa|efa-direct|
| ------------------ |:-:|:--------:|
| `FI_READ` |✓ |✓ |
| `FI_RECV` |✓ |✓ |
| `FI_REMOTE_READ` |✓ |✓ |
| `FI_REMOTE_WRITE` |✓ |✓ |
| `FI_SEND` |✓ |✓ |
| `FI_WRITE` |✓ |✓ |
| **Secondary Caps** |efa|efa-direct|
| ------------------ |:-:|:--------:|
| `FI_FENCE` | | |
| `FI_MULTI_RECV` |✓ |❌ |
| `FI_LOCAL_COMM` | | |
| `FI_REMOTE_COMM` | | |
| `FI_RMA_EVENT` | | |
| `FI_RMA_PMEM` | | |
| `FI_SHARED_AV` | | |
| `FI_SOURCE` | | |
| `FI_SOURCE_ERR` | | |
Feature comparison:

- **FI_MSG**: Both support. efa supports unlimited (UINT64_MAX) message sizes; efa-direct supports message sizes up to the device limit (MTU, ~8 KB). Both support up to 2 IOVs per message.
- **FI_RMA**: Both support. efa supports unlimited (UINT64_MAX) message sizes; efa-direct supports message sizes up to the device limit (max RDMA size, ~1 GB). efa-direct only supports 1 IOV per RMA message; efa supports 2 IOVs, consistent with the FI_MSG IOV limit.
- **FI_TAGGED**: efa provides support through software emulation; efa-direct lacks support.
- **FI_ATOMIC**: efa provides support through software emulation; efa-direct lacks support.
- **FI_DIRECTED_RECV**: efa provides support through software emulation; efa-direct lacks support.
- **FI_HMEM**: Both support HMEM operations. efa has software emulation when NIC-GPU peer-to-peer is unavailable; efa-direct requires peer-to-peer support. See the feature comparison for the FI_MR_HMEM mode bit.
- **FI_MULTI_RECV**: efa provides support; efa-direct lacks support.
| **Modes** |efa|efa-direct|
| ------------------------- |:-:|:--------:|
| `FI_ASYNC_IOV` | | |
| `FI_BUFFERED_RECV` | | |
| `FI_CONTEXT` | | |
| `FI_CONTEXT2` | |R |
| `FI_LOCAL_MR (compat)` | | |
| `FI_MSG_PREFIX` | | |
| `FI_RX_CQ_DATA` | |O |

Feature comparison:

- **FI_CONTEXT2**: efa-direct requires this mode; the efa fabric doesn't.
- **FI_MSG_PREFIX**: The efa fabric DGRAM endpoint requires FI_MSG_PREFIX due to the 40-byte prefix requirement of the IBV_QPT_UD spec.
- **FI_RX_CQ_DATA**: efa-direct accepts this optional mode, meaning operations carrying CQ data consume an Rx buffer on the responder side.
| **MR Modes** |efa|efa-direct|
| ------------------------- |:-:|:--------:|
| `FI_MR_ALLOCATED` |R |R |
| `FI_MR_ENDPOINT` | | |
| `FI_MR_HMEM` |R |R |
| `FI_MR_LOCAL` | |R |
| `FI_MR_PROV_KEY` |R |R |
| `FI_MR_MMU_NOTIFY` | | |
| `FI_MR_RAW` | | |
| `FI_MR_RMA_EVENT` | | |
| `FI_MR_VIRT_ADDR` |R |R |
| `FI_MR_BASIC (compat)` | | |
| `FI_MR_SCALABLE (compat)` | | |
Feature comparison:

- **FI_MR_LOCAL**: efa-direct requires this mode, forcing applications to provide memory descriptors for all operations, while the efa fabric supports both local and non-local MR modes.
- **FI_MR_HMEM**: Required by both fabrics - each HMEM buffer must be registered before data transfer operations. The efa fabric additionally supports registering an HMEM buffer without peer-to-peer (p2p) support. When p2p is not available, an HMEM buffer cannot be registered with the EFA NIC directly; the efa fabric instead emulates an lkey/rkey generated by Libfabric and uses an internal MR map to verify the key on the target side. It uses the device memcpy API to copy data between the HMEM buffer and a bounce buffer during Tx and Rx operations, and uses the bounce buffer to send data over the NIC.
| **Other Libfabric Features** |efa|efa-direct|
| ---------------------------- |:-:|:--------:|
| FI_RM_ENABLED |\* |\* |
| fi_counter |✓ |✓ |
| fi_cancel support |\* |❌ |
| message ordering |\* |❌ |

- **Counters**: Both fabrics support local and remote operation counters.
- **fi_cancel support**: efa provides limited support (non-zero-copy-receive mode only); efa-direct has no support.
- **message ordering**: The efa fabric supports FI_ORDER_SAS, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, and FI_ORDER_ATOMIC_WAW ordering; efa-direct doesn't support any ordering.
- **FI_RM_ENABLED**: Both fabrics provide limited resource management support. FI_RM_ENABLED requires the provider to protect resources (including the CQ) from being overrun, but today neither efa nor efa-direct has that protection.
| **EFA provider specific Features and Restrictions** |efa|efa-direct|
| ---------------------------- |:-:|:--------:|
| Unsolicited write recv |✓ |✓ |
| FI_OPT_EFA_HOMOGENEOUS_PEERS option |✓ | |
| Peer AV entry on the RMA target side | | R |
| GPU Direct Async (GDA) domain ops extension |❌ |✓ |
| Data path direct |✓ |✓ |
| Util CQ bypass |✓ |✓ |

- **Unsolicited write recv**: This feature allows the EFA device to avoid consuming an Rx buffer on the target side for RDMA-write-with-immediate-data operations. Both efa and efa-direct support it. However, if an application wants to turn this feature off, then with efa-direct the application needs to support FI_RX_CQ_DATA and maintain the Rx buffers itself. The efa fabric doesn't have such a requirement, because it has internal Rx buffers that can be consumed.
- **Homogeneous peers option**: efa supports the FI_OPT_EFA_HOMOGENEOUS_PEERS configuration that skips handshake establishment between local and peer; efa-direct is unaffected by this option.
- **Peer AV entry on the RMA target side**: For the efa-direct fabric, the target side of an RMA operation must insert the initiator side's address into the AV before the RMA operation is kicked off, due to a current device limitation. The same limitation applies to the efa fabric when the FI_OPT_EFA_HOMOGENEOUS_PEERS option is set to true.
- **GPU Direct Async extension**: efa-direct provides query operations for address, queue pair, and completion queue attributes. The efa fabric doesn't support these operations.
- **Data Path Direct**: A recent improvement that implements WQE post and CQ poll directly in Libfabric without the rdma-core API. It is now enabled in both fabrics.
- **Util CQ Bypass**: Another improvement that removes the CQE staging in the util CQ; more details are in the [util_cq_bypass doc](util_cq_bypass.md).
