# EFA vs EFA-Direct Fabric Comparison

## Overview

The libfabric EFA provider offers an interface to the Elastic Fabric Adapter (EFA) NIC produced by AWS. The EFA NIC supports both two-sided and one-sided RDMA using a proprietary protocol called [Scalable Reliable Datagram (SRD)](https://ieeexplore.ieee.org/document/9167399). The EFA provider in libfabric offers two distinct fabric types: `efa` and `efa-direct`. Both fabrics provide the RDM (reliable datagram) endpoint type, but they differ in their implementation approach and code path complexity.

The **`efa` fabric** implements a comprehensive set of [wire protocols](efa_rdm_protocol_v4.md) that include emulations to support capabilities beyond what the EFA device natively provides. This allows broader libfabric feature support and application compatibility, but results in a more complex code path with additional protocol overhead.

The **`efa-direct` fabric** takes a more direct approach that mostly exposes only what the EFA NIC hardware natively supports. This results in a more compact and efficient code path with reduced protocol overhead, but requires applications to work within the constraints of the hardware capabilities.

## Basic Workflow

The data transfer path in libfabric can be roughly divided into two categories: work request (WR) post and completion polling. Operations like `fi_send`/`fi_recv`/`fi_write`/`fi_read` fall into the first category, while `fi_cq_read` and its variants fall into the second category. The WR post can be further divided into Tx (`fi_send`/`fi_write`/`fi_read`) and Rx (`fi_recv`) post.

### EFA-Direct Workflow

EFA-direct provides a straightforward, direct mapping to hardware operations:

**Tx Post:**
- Constructs the Work Queue Entry (WQE) directly from application calls (`fi_*` functions)
- Maintains a 1-to-1 mapping between WQE and libfabric call
- Performs only two operations before data is sent over the wire:
  1. Construct the WQE
  2. Ring the doorbell (when required)

**Rx Post:**
- No internal Rx buffers - each `fi_recv` call is constructed as a WQE and posted directly to the device
- User buffers from `fi_recv` calls are used directly by the hardware
- Zero-copy receive path with direct data placement

**Completion Polling:**
- Maintains a 1-to-1 mapping between device completions and libfabric completions
- Polls the device CQ directly
- Generates the libfabric CQ entry from the device completion

### EFA Workflow

The efa fabric implements a more complex, layered approach with protocol emulation:

**Tx Post:**
- Allocates an internal data structure called `efa_rdm_ope` (EFA-RDM operational entry)
- Maintains a 1-to-1 mapping between `efa_rdm_ope` and the libfabric call (`fi_*` functions)
- Chooses an appropriate protocol based on operation type and message size
- Allocates `efa_rdm_pke` (EFA-RDM packet entry) structures from a buffer pool
- Each packet entry is a 128-byte data structure allocated from a pool backed by ~8 KB buffers (comparable to the device MTU size), used to stage wire data from the application when necessary
- Each `pke` corresponds to a WQE that interacts with the EFA device
- One operation entry can map to multiple packet entries (e.g., a 16 KB message can be sent via 2 packet entries)
- **Note**: For RMA operations (`fi_read`/`fi_write`), this workflow still applies, but when device RDMA is available, the data goes directly to/from user buffers without internal staging or copying. Since the efa fabric supports unlimited size for RMA, when the libfabric message is larger than the max RDMA size of the device, it consumes multiple packet entries.

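The one-ope-to-many-pke mapping above boils down to segmenting the payload over MTU-sized staging buffers. A minimal sketch, assuming an ~8 KB segment size (the real provider derives this from the device's reported MTU, and the zero-byte case is an illustrative assumption):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative staging-buffer size, comparable to the device MTU (~8 KB). */
#define STAGING_BUF_SIZE (8ULL * 1024)

/* Number of packet entries one operation entry needs when its payload is
 * segmented into MTU-sized staging buffers (ceiling division). */
static uint64_t pkt_entries_needed(uint64_t msg_size, uint64_t seg_size)
{
    if (msg_size == 0)
        return 1; /* assume even a zero-byte send consumes one packet entry */
    return (msg_size + seg_size - 1) / seg_size;
}
```

With these assumptions, a 16 KB message needs 2 packet entries, matching the example above.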
**Rx Post:**
- Pre-posts internal Rx buffers to the device for incoming data from peers
- User buffers from `fi_recv` calls are queued in an internal libfabric queue (not posted to the device)
- On a device completion, searches the internal queue to find the matching Rx buffer
- Copies data from the packet entry to the matched user buffer
- **Note**: One exception is the "zero-copy receive" mode of the efa fabric. See [efa_rdm_protocol_v4.md](efa_rdm_protocol_v4.md) for details.

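The queue-and-copy receive path above can be sketched as a simplified model. This assumes FIFO matching with no tag or address matching, and the structure names are illustrative, not the provider's actual data structures:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of the internal queue of user Rx buffers. */
struct rx_entry {
    void *buf;               /* user buffer from fi_recv() */
    size_t len;
    struct rx_entry *next;
};

struct rx_queue {
    struct rx_entry *head, **tail;
};

static void rx_queue_init(struct rx_queue *q) { q->head = NULL; q->tail = &q->head; }

/* fi_recv: the user buffer is queued internally, not posted to the device. */
static void rx_queue_post(struct rx_queue *q, struct rx_entry *e)
{
    e->next = NULL;
    *q->tail = e;
    q->tail = &e->next;
}

/* On a device completion, pop the first queued buffer (FIFO matching here)
 * and copy the payload out of the internal packet entry. */
static struct rx_entry *rx_queue_match_and_copy(struct rx_queue *q,
                                                const void *pkt_data, size_t pkt_len)
{
    struct rx_entry *e = q->head;
    if (!e || pkt_len > e->len)
        return NULL; /* no match: real code would handle the unexpected message */
    q->head = e->next;
    if (!q->head)
        q->tail = &q->head;
    memcpy(e->buf, pkt_data, pkt_len);
    return e;
}
```

The extra copy in `rx_queue_match_and_copy` is exactly the overhead that efa-direct's direct posting of user buffers avoids.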
**Completion Polling:**
- Polls the device CQ for completion of packet entries posted to the EFA device
- Finds the corresponding operation entries stored in the packet entry structures
- Uses counters and metadata in the operation entry to track completion progress
- Generates a libfabric completion when the operation entry has all required data

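The counter-based progress tracking above can be sketched as follows. The field names are illustrative, not the actual `efa_rdm_ope` layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified operation entry: one per libfabric call. */
struct op_entry {
    size_t total_len;   /* bytes the operation must transfer     */
    size_t bytes_acked; /* bytes confirmed by device completions */
};

/* Called once per device completion of a packet entry belonging to op.
 * Returns true when the operation is fully done, i.e. a libfabric CQ
 * entry should now be generated for the application. */
static bool op_handle_pkt_completion(struct op_entry *op, size_t pkt_bytes)
{
    op->bytes_acked += pkt_bytes;
    return op->bytes_acked >= op->total_len;
}
```

For the 16 KB / 2-packet example above, the first device completion leaves the operation pending and only the second produces a libfabric completion, in contrast to efa-direct's 1:1 mapping.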
### Workflow Comparison Diagrams

```mermaid
sequenceDiagram
    title EFA-Direct Workflow - Simple Direct Path
    participant App as Application
    participant EFADirect as EFA-Direct
    participant Device as EFA Device

    Note over App,Device: Tx Post (fi_send)
    App->>EFADirect: fi_send()
    EFADirect->>EFADirect: Construct WQE
    EFADirect->>Device: Ring doorbell
    Device->>Device: Send data over wire

    Note over App,Device: Rx Post (fi_recv)
    App->>EFADirect: fi_recv()
    EFADirect->>EFADirect: Construct WQE
    EFADirect->>Device: Ring doorbell
    Device->>Device: Receive data (zero-copy)

    Note over App,Device: Completion Polling
    App->>EFADirect: fi_cq_read()
    EFADirect->>Device: Poll device CQ
    Device->>EFADirect: Device completion
    EFADirect->>App: Return completion (1:1 mapping)
```

```mermaid
sequenceDiagram
    title EFA Workflow - Complex Layered Path
    participant App as Application
    participant EFA as EFA Fabric
    participant Queue as Internal Rx Queue
    participant Device as EFA Device

    Note over App,Device: Tx Post (fi_send)
    App->>EFA: fi_send()
    EFA->>EFA: Alloc efa_rdm_ope
    EFA->>EFA: Choose protocol
    EFA->>EFA: Alloc efa_rdm_pke (128B from ~8KB pool)
    EFA->>EFA: Stage data (if necessary)
    EFA->>EFA: Construct WQE
    EFA->>Device: Ring doorbell
    Device->>Device: Send data over wire

    Note over App,Device: Rx Post (fi_recv)
    App->>EFA: fi_recv()
    EFA->>Queue: Queue user buffer
    Note over EFA,Device: Internal Rx buffers already posted
    Device->>Device: Receive data to internal buffer

    Note over App,Device: Completion Polling
    App->>EFA: fi_cq_read()
    EFA->>Device: Poll device CQ
    Device->>EFA: Device completion (pke)
    EFA->>EFA: Find ope from pke
    EFA->>Queue: Search Rx queue for match
    EFA->>EFA: Copy from pke to user buffer
    EFA->>App: Return completion
```

## Feature Support Matrix

### Key

✓ = well supported

\* = limited support

❌ = not supported

R = required mode bit

O = optional mode bit

` ` (no mark) = not applicable or not needed

***

| **Endpoint Types** |efa|efa-direct|
| -------------------------- |:-:|:--------:|
| `FI_EP_RDM` |✓ |✓ |
| `FI_EP_DGRAM` |✓ |❌ |
| `FI_EP_MSG` |❌|❌ |

- Both support `FI_EP_RDM` for reliable datagram.

- `FI_EP_DGRAM` is only supported by the efa fabric. Though it uses the same code path as efa-direct, it is kept in the efa fabric for backward compatibility.

- Neither supports the MSG endpoint type today.

| **Primary Caps** |efa|efa-direct|
| ------------------ |:-:|:--------:|
| `FI_ATOMIC` |✓ |❌ |
| `FI_DIRECTED_RECV` |✓ |❌ |
| `FI_HMEM` |✓ |✓ |
| `FI_MSG` |✓ |✓ |
| `FI_MULTICAST` |❌|❌ |
| `FI_NAMED_RX_CTX` |❌|❌ |
| `FI_RMA` |✓ |✓ |
| `FI_TAGGED` |✓ |❌ |

| **Primary Mods** |efa|efa-direct|
| ------------------ |:-:|:--------:|
| `FI_READ` |✓ |✓ |
| `FI_RECV` |✓ |✓ |
| `FI_REMOTE_READ` |✓ |✓ |
| `FI_REMOTE_WRITE` |✓ |✓ |
| `FI_SEND` |✓ |✓ |
| `FI_WRITE` |✓ |✓ |

| **Secondary Caps** |efa|efa-direct|
| ------------------ |:-:|:--------:|
| `FI_FENCE` |❌|❌ |
| `FI_MULTI_RECV` |✓ |❌ |
| `FI_LOCAL_COMM` |✓ |✓ |
| `FI_REMOTE_COMM` |✓ |✓ |
| `FI_RMA_EVENT` |❌|❌ |
| `FI_RMA_PMEM` |❌|❌ |
| `FI_SHARED_AV` |❌|❌ |
| `FI_SOURCE` |✓ |✓ |
| `FI_SOURCE_ERR` |❌ |❌ |

Feature comparison:
- **FI_MSG**: Both support it. efa supports an unlimited (UINT64_MAX) message size; efa-direct supports message sizes up to the device limit (MTU, ~8 KB). Both support up to 2 IOVs per message.
- **FI_RMA**: Both support it. efa supports an unlimited (UINT64_MAX) message size; efa-direct supports message sizes up to the device limit (max_rdma_size, ~1 GB). efa-direct supports only 1 IOV per RMA message; efa supports multiple (2) IOVs, consistent with its FI_MSG IOV limit.
- **FI_TAGGED**: efa provides support through software emulation; efa-direct lacks support.
- **FI_ATOMIC**: efa provides support through software emulation; efa-direct lacks support.
- **FI_DIRECTED_RECV**: efa provides support through software emulation; efa-direct lacks support.
- **FI_HMEM**: Both support HMEM operations. efa falls back to software emulation when NIC-GPU peer-to-peer is unavailable; efa-direct requires peer-to-peer support. See the feature comparison for the FI_MR_HMEM mode bit.
- **FI_MULTI_RECV**: efa provides support; efa-direct lacks support.

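Applications targeting both fabrics can guard their posts against these differing limits by checking the attributes reported in `fi_info` (e.g. `ep_attr->max_msg_size`, `tx_attr->rma_iov_limit`). A hedged sketch of such a check, with the approximate limits from the text hard-coded in place of queried values:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Approximate efa-direct limits from the text; real code should take
 * these from the fi_info returned by fi_getinfo() instead. */
#define EFA_DIRECT_MAX_RMA_SIZE  (1ULL << 30) /* ~1 GB */
#define EFA_DIRECT_RMA_IOV_LIMIT 1

/* Would this RMA post fit within the fabric's advertised limits? */
static bool rma_post_fits(uint64_t len, size_t iov_count,
                          uint64_t max_rma_size, size_t rma_iov_limit)
{
    return len <= max_rma_size && iov_count <= rma_iov_limit;
}
```

On the efa fabric the same check trivially passes for any size, since the provider segments large transfers across packet entries internally.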
| **Modes** |efa|efa-direct|
| ------------------------- |:-:|:--------:|
| `FI_ASYNC_IOV` | | |
| `FI_BUFFERED_RECV` | | |
| `FI_CONTEXT` | | |
| `FI_CONTEXT2` | |R |
| `FI_LOCAL_MR (compat)` | | |
| `FI_MSG_PREFIX` | | |
| `FI_RX_CQ_DATA` | |O |

Feature comparison:
- **FI_CONTEXT2**: efa-direct requires this mode; the efa fabric does not.
- **FI_MSG_PREFIX**: The efa fabric's DGRAM endpoint requires FI_MSG_PREFIX due to the 40-byte prefix requirement of the IBV_QPT_UD spec.
- **FI_RX_CQ_DATA**: efa-direct accepts this optional mode, meaning operations carrying CQ data consume an Rx buffer on the responder side.

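Because efa-direct sets FI_CONTEXT2, each data transfer call must pass a `struct fi_context2` that stays valid until its completion is read. A common pattern is embedding it at the start of the application's own request structure, so the `op_context` returned by `fi_cq_read` can be cast straight back to the request. The context layout below matches libfabric's definition in `rdma/fabric.h` but is redefined here so the sketch is self-contained; `struct app_request` is an illustrative assumption:

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as libfabric's struct fi_context2 (rdma/fabric.h):
 * provider-owned scratch space the application must not touch while
 * the operation is in flight. */
struct fi_context2 {
    void *internal[8];
};

/* Illustrative per-request structure. Placing the context first means
 * the pointer the provider hands back in the completion entry is also
 * a valid pointer to the whole request. */
struct app_request {
    struct fi_context2 ctx; /* passed as the `context` argument of fi_send() etc. */
    void *user_buf;
    size_t len;
};
```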
| **MR Modes** |efa|efa-direct|
| ------------------------- |:-:|:--------:|
| `FI_MR_ALLOCATED` |R |R |
| `FI_MR_ENDPOINT` | | |
| `FI_MR_HMEM` | | |
| `FI_MR_LOCAL` | |R |
| `FI_MR_PROV_KEY` |R |R |
| `FI_MR_MMU_NOTIFY` | | |
| `FI_MR_RAW` | | |
| `FI_MR_RMA_EVENT` | | |
| `FI_MR_VIRT_ADDR` |R |R |
| `FI_MR_BASIC (compat)` |✓ |✓ |
| `FI_MR_SCALABLE (compat)` |❌|❌ |

Feature comparison:
- **FI_MR_LOCAL**: efa-direct requires this mode, forcing applications to provide memory descriptors for all operations, while the efa fabric supports both local and non-local MR modes.
- **FI_MR_HMEM**: Required by both fabrics when FI_HMEM is requested - each HMEM buffer must be registered before data transfer operations. The efa fabric additionally supports registering an HMEM buffer without peer-to-peer support. When peer-to-peer is unavailable, an HMEM buffer cannot be registered with the EFA NIC directly, so the efa fabric emulates an lkey/rkey generated by libfabric and uses an internal MR map to verify the key on the target side. It uses the device memcpy API to copy data between the HMEM buffer and a bounce buffer during Tx and Rx operations, and sends the bounce buffer's contents over the NIC.

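The Tx side of that emulation amounts to a staging copy before the wire transfer. In the sketch below, a plain host `memcpy` stands in for the device memcpy API (e.g. a GPU runtime copy) the provider would actually use, and all names are illustrative:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the device memcpy API that the efa fabric uses to reach
 * HMEM when NIC-GPU peer-to-peer is unavailable. Host memcpy here so
 * the sketch runs anywhere. */
static void device_memcpy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Tx-side emulation: stage (part of) the HMEM payload into a host
 * bounce buffer, which is what actually gets handed to the NIC.
 * Returns the number of bytes staged; a larger message needs more
 * segments, one bounce buffer each. */
static size_t stage_for_send(void *bounce, size_t bounce_len,
                             const void *hmem_buf, size_t len)
{
    size_t n = len < bounce_len ? len : bounce_len;
    device_memcpy(bounce, hmem_buf, n);
    return n;
}
```

The Rx side mirrors this: data lands in the bounce buffer first and is then copied out to the HMEM buffer with the same device memcpy.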
| **Other Libfabric Features** |efa|efa-direct|
| ---------------------------- |:-:|:--------:|
| FI_RM_ENABLED |\*|\* |
| fi_counter |✓ |✓ |
| fi_cancel support |\* |❌ |
| message ordering |\* |❌ |

- **Counters**: Both fabrics support local and remote operation counters.
- **fi_cancel support**: efa provides limited support (non-zero-copy-receive mode only); efa-direct has no support.
- **message ordering**: The efa fabric supports FI_ORDER_SAS, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, and FI_ORDER_ATOMIC_WAW ordering; efa-direct doesn't support any ordering.
- **FI_RM_ENABLED**: Both fabrics provide limited resource management support. FI_RM_ENABLED requires the provider to protect resources (including the CQ) from being overrun, but today neither efa nor efa-direct has that protection.

| **EFA provider specific Features and Restrictions** |efa|efa-direct|
| ---------------------------- |:-:|:--------:|
| Unsolicited write recv | ✓ | ✓ |
| FI_OPT_EFA_HOMOGENEOUS_PEERS option |✓ | |
| Peer AV entry on the RMA target side | | R |
| GPU Direct Async (GDA) domain ops extension |❌|✓ |
| Data path direct | ✓ | ✓ |
| Util CQ bypass | ❌ | ✓ |

- **Unsolicited write recv**: This feature allows the EFA device to avoid consuming an Rx buffer on the target side for RDMA write with immediate data operations. Both efa and efa-direct support it. However, if an application wants to turn this feature off, with efa-direct it must support FI_RX_CQ_DATA and maintain the Rx buffers itself. The efa fabric has no such requirement, because it has internal Rx buffers that can be consumed.
- **Homogeneous peers option**: efa supports the FI_OPT_EFA_HOMOGENEOUS_PEERS option, which skips the handshake establishment between the local endpoint and its peers; efa-direct is unaffected by this option.
- **Peer AV entry on the RMA target side**: For the efa-direct fabric, the target side of an RMA operation must insert the initiator side's address into its AV before the RMA operation is kicked off, due to a current device limitation. The same limitation applies to the efa fabric when the FI_OPT_EFA_HOMOGENEOUS_PEERS option is set to true.
- **GPU Direct Async extension**: efa-direct provides query operations for address, queue pair, and completion queue attributes. The efa fabric doesn't support these operations.
- **Data Path Direct**: A recent improvement that implements WQE post and CQ poll directly in libfabric without the rdma-core API. It is now enabled in both fabrics.
- **Util CQ Bypass**: Another improvement that removes CQE staging in the util CQ; more details are in the [util_cq_bypass doc](util_cq_bypass.md).