|
| 1 | +# Generic NVMeoF Transport |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +NVMeoFGenericTransport is a more complete TransferEngine Transport built on the NVMeoF protocol. It is designed to eventually replace the existing NVMeoFTransport and to let TransferEngine manage and access file Segments. |
| 6 | + |
| 7 | +Compared to the legacy NVMeoFTransport, NVMeoFGenericTransport offers the following advantages: |
| 8 | + |
| 9 | +- **More Complete:** Provides a full set of management interfaces consistent with memory Segments, including registering/unregistering local files, mounting/unmounting remote files, etc. |
| 10 | +- **More Generic:** No longer depends on cuFile, and can be deployed and used in environments without CUDA support. |
| 11 | +- **Higher Performance:** Supports multi-threaded I/O and Direct I/O, fully leveraging the performance potential of NICs and SSDs. |
| 12 | +- **More Reliable:** Ensures that unavailability of a single file or storage device does not affect the availability of others, through a more flexible multi-file management scheme. |
| 13 | + |
| 14 | +## Component Support |
| 15 | + |
| 16 | +Both TransferEngine and Mooncake Store fully support NVMeoFGenericTransport. The relevant APIs are listed below: |
| 17 | + |
| 18 | +### TransferEngine Support |
| 19 | + |
| 20 | +`TransferEngine` now supports registering and reading/writing file segments. This mainly includes adding fields related to file management and access in `SegmentDesc` and `TransferRequest`, and introducing interfaces for registering and unregistering files. |
| 21 | + |
| 22 | +#### SegmentDesc |
| 23 | + |
| 24 | +To support file registration management, the `file_buffers` field has been added to `SegmentDesc`. |
| 25 | + |
| 26 | +```cpp |
| 27 | +using FileBufferID = uint32_t; |
| 28 | +struct FileBufferDesc { |
| 29 | + FileBufferID id; // File ID, used to identify the file within a Segment |
| 30 | + std::string path; // File path on the owning node |
| 31 | + std::size_t size; // Available space size of the file |
| 32 | + std::size_t align; // For future usage. |
| 33 | +}; |
| 34 | + |
| 35 | +struct SegmentDesc { |
| 36 | + std::string name; |
| 37 | + std::string protocol; |
| 38 | + // Generic file buffers. |
| 39 | + std::vector<FileBufferDesc> file_buffers; |
| 40 | + |
| 41 | + // Other fields... |
| 42 | +}; |
| 43 | +``` |
| 44 | + |
| 45 | +#### TransferRequest |
| 46 | + |
| 47 | +To support multi-file registration and access, the `file_id` field has been added to `TransferRequest` to identify the file to be read from or written to. |
| 48 | + |
| 49 | +```cpp |
| 50 | +struct TransferRequest { |
| 51 | + enum OpCode { READ, WRITE }; |
| 52 | + OpCode opcode; |
| 53 | + void *source; |
| 54 | + SegmentID target_id; |
| 55 | + uint64_t target_offset; // When accessing a file, target_offset indicates the offset within the target file |
| 56 | + size_t length; |
| 57 | + int advise_retry_cnt = 0; |
| 58 | + FileBufferID file_id; // Target file ID, required only when accessing files, used with target_id to locate the target file |
| 59 | +}; |
| 60 | +``` |
| 61 | + |
| 62 | +`file_id` is the ID assigned by the target `TransferEngine` when registering the target file, and can be obtained from the `SegmentDesc` of the target `Segment`. |
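The lookup described above can be sketched in Python. The `FileBufferDesc` and `SegmentDesc` classes below are hypothetical stand-ins that mirror the C++ structs, not the Mooncake Python bindings, and the bounds check in `fits_in_file` is an assumption about how `target_offset` and `length` relate to the registered `size`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileBufferDesc:
    """Hypothetical Python mirror of the C++ FileBufferDesc."""
    id: int          # file ID within the Segment
    path: str        # file path on the owning node
    size: int        # available space in bytes
    align: int = 0   # reserved for future use


@dataclass
class SegmentDesc:
    """Hypothetical Python mirror holding only the fields used here."""
    name: str
    protocol: str
    file_buffers: List[FileBufferDesc] = field(default_factory=list)


def find_file_id(segment: SegmentDesc, path: str) -> Optional[int]:
    """Return the ID of the file registered under `path`, or None if absent."""
    for buf in segment.file_buffers:
        if buf.path == path:
            return buf.id
    return None


def fits_in_file(segment: SegmentDesc, file_id: int, target_offset: int, length: int) -> bool:
    """Assumed sanity check: the request must lie inside the registered size."""
    for buf in segment.file_buffers:
        if buf.id == file_id:
            return target_offset >= 0 and length >= 0 and target_offset + length <= buf.size
    return False
```

An initiator would resolve the ID once per target file and then reuse it in every `TransferRequest` aimed at that file.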
| 63 | + |
| 64 | +#### installTransport |
| 65 | + |
| 66 | +```cpp |
| 67 | +Transport *installTransport(const std::string &proto, void **args) |
| 68 | +``` |
| 69 | + |
| 70 | +The `TransferEngine::installTransport` interface now supports directly passing the `args` parameter to the `install` interface of the corresponding Transport, enabling Transport-specific initialization parameters. |
| 71 | + |
| 72 | +For `NVMeoFGenericTransport`, if the current TransferEngine instance does not need to share local files, the `args` parameter can be `nullptr`. Otherwise, `args` must be a valid pointer array whose first element is a pointer to a null-terminated string containing the NVMeoF Target configuration parameters. For example: |
| 73 | + |
| 74 | +```cpp |
| 75 | +// NVMeoF Target configuration parameters |
| 76 | +const char *trid_str = "trtype=<tcp|rdma> adrfam=<ipv4|ipv6> traddr=<Listen address> trsvcid=<Listen port>"; |
| 77 | + |
| 78 | +// Arguments for installTransport |
| 79 | +void **args = (void **)&trid_str; |
| 80 | +``` |
| 81 | + |
| 82 | +#### registerLocalFile |
| 83 | + |
| 84 | +```cpp |
| 85 | +int registerLocalFile(const std::string &path, size_t size, FileBufferID &id); |
| 86 | +``` |
| 87 | + |
| 88 | +Registers a local file into TransferEngine, enabling cross-node access. The file can be a regular file or a block device file. **Note: Registering a block device file may corrupt or completely destroy the data on the device. Use with caution!** |
| 89 | + |
| 90 | +- `path`: File path, can be any regular file or block device file such as `/dev/nvmeXnY`; |
| 91 | +- `size`: Available space size of the file, can be less than or equal to the physical size; |
| 92 | +- `id`: ID assigned by `TransferEngine` to the file, used to distinguish each file when multiple files are registered; |
| 93 | +- Return value: Returns 0 on success, otherwise returns a negative error code; |
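Because `size` may not exceed the physical size, it can help to check a candidate value before registering. The following is a minimal sketch; `physical_size` and `checked_size` are hypothetical helper names, not part of TransferEngine. Seeking to the end of the file is used instead of `stat` on the assumption that it also reports the capacity of Linux block devices.

```python
import os


def physical_size(path: str) -> int:
    """Return the size in bytes by seeking to the end of the file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)


def checked_size(path: str, requested: int) -> int:
    """Return `requested` if it is positive and fits in the file, else raise."""
    actual = physical_size(path)
    if requested <= 0 or requested > actual:
        raise ValueError(
            f"requested size {requested} is outside (0, {actual}] for {path}"
        )
    return requested
```

The returned value is what would be passed as the `size` argument. The check only reads metadata, so it does not touch the file contents.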
| 94 | + |
| 95 | +#### unregisterLocalFile |
| 96 | + |
| 97 | +```cpp |
| 98 | +int unregisterLocalFile(const std::string &path); |
| 99 | +``` |
| 100 | + |
| 101 | +Unregisters a local file. |
| 102 | + |
| 103 | +- `path`: File path, must match the path used during registration; |
| 104 | + |
| 105 | +### Mooncake Store Support |
| 106 | + |
| 107 | +Mooncake Store now supports using files as shared storage space for storing objects. This capability is based on two newly added interfaces: |
| 108 | + |
| 109 | +#### MountFileSegment |
| 110 | + |
| 111 | +```cpp |
| 112 | +tl::expected<void, ErrorCode> MountFileSegment(const std::string& path); |
| 113 | +``` |
| 114 | +
|
| 115 | +Mounts the local file at `path` as part of the shared storage space. |
| 116 | +
|
| 117 | +#### UnmountFileSegment |
| 118 | +
|
| 119 | +```cpp |
| 120 | +tl::expected<void, ErrorCode> UnmountFileSegment(const std::string& path); |
| 121 | +``` |
| 122 | + |
| 123 | +Unmounts a previously mounted file. |
| 124 | + |
| 125 | +### Mooncake Store Python API |
| 126 | + |
| 127 | +The Mooncake Store Python API now supports specifying a set of local files as shared storage space. |
| 128 | + |
| 129 | +#### setup_with_files |
| 130 | + |
| 131 | +```python |
| 132 | +def setup_with_files( |
| 133 | + local_hostname: str, |
| 134 | + metadata_server: str, |
| 135 | + files: List[str], |
| 136 | + local_buffer_size: int, |
| 137 | + protocol: str, |
| 138 | + protocol_arg: str, |
| 139 | + master_server_addr: str |
| 140 | + ): |
| 141 | + pass |
| 142 | +``` |
| 143 | + |
| 144 | +Starts a Mooncake Store Client instance and registers the specified files as shared storage space. |
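The following sketch assumes that `protocol_arg` takes the same space-separated `key=value` string used for the NVMeoF Target configuration elsewhere in this document (for example in `--protocol-args`). `build_trid` is a hypothetical helper, not part of the Mooncake API.

```python
def build_trid(trtype: str, adrfam: str, traddr: str, trsvcid: int) -> str:
    """Format NVMeoF Target parameters as a space-separated key=value string."""
    if trtype not in ("tcp", "rdma"):
        raise ValueError(f"unsupported trtype: {trtype!r}")
    if adrfam not in ("ipv4", "ipv6"):
        raise ValueError(f"unsupported adrfam: {adrfam!r}")
    return f"trtype={trtype} adrfam={adrfam} traddr={traddr} trsvcid={trsvcid}"


# An instance that only reads remote files shares nothing, so it would
# pass an empty string here (as the decode example in this document does).
protocol_arg = build_trid("tcp", "ipv4", "127.0.0.1", 4420)
```

Validating `trtype` and `adrfam` up front surfaces typos before the Target is configured.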
| 145 | + |
| 146 | +## Running Tests |
| 147 | + |
| 148 | +Users can test NVMeoFGenericTransport at both the TransferEngine and Mooncake Store levels. |
| 149 | + |
| 150 | +### Environment Requirements |
| 151 | + |
| 152 | +In addition to the original compilation and runtime environment of the Mooncake project, NVMeoFGenericTransport has additional requirements: |
| 153 | + |
| 154 | +#### Kernel Version and Drivers |
| 155 | + |
| 156 | +NVMeoFGenericTransport currently relies on the Linux kernel's nvme and nvmet driver suite, including the following kernel modules: |
| 157 | + |
| 158 | +- NVMeoF RDMA: Requires Linux kernel 4.8 or later. Load the drivers: |
| 159 | + |
| 160 | +```bash |
| 161 | +# Initiator driver, required for accessing remote files |
| 162 | +modprobe nvme_rdma |
| 163 | + |
| 164 | +# Target driver, required for sharing local files |
| 165 | +modprobe nvmet_rdma |
| 166 | +``` |
| 167 | + |
| 168 | +- NVMeoF TCP: Requires Linux kernel 5.0 or later. Load the drivers: |
| 169 | + |
| 170 | +```bash |
| 171 | +# Initiator driver, required for accessing remote files |
| 172 | +modprobe nvme_tcp |
| 173 | + |
| 174 | +# Target driver, required for sharing local files |
| 175 | +modprobe nvmet_tcp |
| 176 | +``` |
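The kernel and driver requirements above can be checked before starting a test. This is a rough preflight sketch with hypothetical helper names. It reads module names from text in `/proc/modules` format, so drivers compiled directly into the kernel would be reported as missing even though they are available.

```python
import re

# Minimum kernel versions and module names as listed in this section.
REQUIREMENTS = {
    "rdma": {"min_kernel": (4, 8), "initiator": "nvme_rdma", "target": "nvmet_rdma"},
    "tcp": {"min_kernel": (5, 0), "initiator": "nvme_tcp", "target": "nvmet_tcp"},
}


def parse_kernel_version(release):
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def loaded_modules(proc_modules_text):
    """Collect module names (first column) from /proc/modules-style text."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def preflight(trtype, roles, release, proc_modules_text):
    """Return a list of problems; an empty list means every check passed."""
    req = REQUIREMENTS[trtype]
    problems = []
    if parse_kernel_version(release) < req["min_kernel"]:
        major, minor = req["min_kernel"]
        problems.append(f"kernel {release} is older than {major}.{minor}")
    loaded = loaded_modules(proc_modules_text)
    for role in roles:  # "initiator" and/or "target"
        module = req[role]
        if module not in loaded:
            problems.append(f"module {module} is not loaded")
    return problems
```

On a live system the inputs would come from `platform.release()` and the contents of `/proc/modules`.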
| 177 | + |
| 178 | +#### Dependencies |
| 179 | + |
| 180 | +NVMeoFGenericTransport depends on the following third-party libraries: |
| 181 | + |
| 182 | +```bash |
| 183 | +apt install -y libaio-dev libnvme-dev |
| 184 | +``` |
| 185 | + |
| 186 | +### Build Options |
| 187 | + |
| 188 | +To enable NVMeoFGenericTransport, the `USE_NVMEOF_GENERIC` build option must be turned on: |
| 189 | + |
| 190 | +```bash |
| 191 | +cmake .. -DUSE_NVMEOF_GENERIC=ON |
| 192 | +``` |
| 193 | + |
| 194 | +### Runtime Options |
| 195 | + |
| 196 | +NVMeoFGenericTransport supports configuring the following runtime options via environment variables: |
| 197 | + |
| 198 | +- `MC_NVMEOF_GENERIC_DIRECT_IO`: Use Direct I/O when reading/writing NVMeoF SSDs. Disabled by default. Enabling this option can significantly improve performance, but requires that buffer addresses, offsets on the SSD, and I/O lengths all meet alignment requirements (typically 512-byte alignment; 4 KiB alignment is recommended). |
| 199 | +- `MC_NVMEOF_GENERIC_NUM_WORKERS`: Number of threads used for reading/writing NVMeoF SSDs. Default is 8. |
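The Direct I/O alignment requirement can be expressed as a small check. This is an illustrative sketch with hypothetical helper names. It defaults to the recommended 4 KiB alignment, and the actual minimum depends on the device.

```python
def is_direct_io_aligned(buffer_addr: int, offset: int, length: int,
                         alignment: int = 4096) -> bool:
    """True if the address, offset, and length are all multiples of `alignment`."""
    if alignment <= 0 or (alignment & (alignment - 1)) != 0:
        raise ValueError("alignment must be a positive power of two")
    mask = alignment - 1
    # OR-ing the three values keeps any low bit that is set in at least one of them.
    return ((buffer_addr | offset | length) & mask) == 0


def round_up(value: int, alignment: int = 4096) -> int:
    """Round `value` up to the next multiple of `alignment`."""
    return (value + alignment - 1) // alignment * alignment
```

For instance, a 65536-byte request at offset 512 fails the 4 KiB check but passes with `alignment=512`.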
| 200 | + |
| 201 | +### TransferEngine Testing |
| 202 | + |
| 203 | +After enabling the `USE_NVMEOF_GENERIC` option and completing the build, an executable named `transfer_engine_nvmeof_generic_bench` can be found under `build/mooncake-transfer-engine/example`. This program can be used to test the performance of NVMeoFGenericTransport. |
| 204 | + |
| 205 | +#### Start Metadata Service |
| 206 | + |
| 207 | +Start the metadata service the same way as for the `transfer_engine_bench` test tool. Refer to [transfer-engine.md](./transfer-engine.md#example-transfer-engine-bench) for details. |
| 208 | + |
| 209 | +Assume the metadata service address is `http://127.0.0.1:8080/metadata` (using HTTP metadata service as an example). |
| 210 | + |
| 211 | +#### Start Target |
| 212 | + |
| 213 | +**Note: Registering a file may corrupt or completely destroy its existing data. Use with extreme caution!** |
| 214 | + |
| 215 | +```bash |
| 216 | +./build/mooncake-transfer-engine/example/transfer_engine_nvmeof_generic_bench \ |
| 217 | + --local_server_name=127.0.0.1:8081 \ |
| 218 | + --metadata_server=http://127.0.0.1:8080/metadata \ |
| 219 | + --mode=target \ |
| 220 | + --trtype=tcp \ |
| 221 | + --traddr=127.0.0.1 \ |
| 222 | + --trsvcid=4420 \ |
| 223 | + --files="/path/to/file0 /path/to/file1 ..." |
| 224 | +``` |
| 225 | + |
| 226 | +#### Start Initiator |
| 227 | + |
| 228 | +```bash |
| 229 | +./build/mooncake-transfer-engine/example/transfer_engine_nvmeof_generic_bench \ |
| 230 | + --local_server_name=127.0.0.1:8082 \ |
| 231 | + --metadata_server=http://127.0.0.1:8080/metadata \ |
| 232 | + --mode=initiator \ |
| 233 | + --operation=read \ |
| 234 | + --segment_id=127.0.0.1:8081 \ |
| 235 | + --batch_size=4096 \ |
| 236 | + --block_size=65536 \ |
| 237 | + --duration=30 \ |
| 238 | + --threads=1 \ |
| 239 | + --report_unit=GB |
| 240 | +``` |
| 241 | + |
| 242 | +#### Loopback Mode |
| 243 | + |
| 244 | +For quick validation, loopback mode can also be used to test on a single machine: |
| 245 | + |
| 246 | +```bash |
| 247 | +./build/mooncake-transfer-engine/example/transfer_engine_nvmeof_generic_bench \ |
| 248 | + --local_server_name=127.0.0.1:8081 \ |
| 249 | + --metadata_server=http://127.0.0.1:8080/metadata \ |
| 250 | + --mode=loopback \ |
| 251 | + --operation=read \ |
| 252 | + --segment_id=127.0.0.1:8081 \ |
| 253 | + --batch_size=4096 \ |
| 254 | + --block_size=65536 \ |
| 255 | + --duration=30 \ |
| 256 | + --threads=1 \ |
| 257 | + --report_unit=GB \ |
| 258 | + --trtype=tcp \ |
| 259 | + --traddr=127.0.0.1 \ |
| 260 | + --trsvcid=4420 \ |
| 261 | + --files="/path/to/file0 /path/to/file1 ..." |
| 262 | +``` |
| 263 | + |
| 264 | +#### Performance Tuning |
| 265 | + |
| 266 | +- For workloads involving many files, increasing `MC_NVMEOF_GENERIC_NUM_WORKERS` appropriately usually improves performance. |
| 267 | +- If `--block_size` is a multiple of 4 KiB, set the environment variable `MC_NVMEOF_GENERIC_DIRECT_IO=on` to significantly boost performance on SSD devices. |
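The tuning options can be combined with the loopback command in a small launcher. The sketch below only builds the argument list and environment. `bench_invocation` is a hypothetical helper, the addresses and ports are the example values from this document, and the 4 KiB check is a conservative assumption based on the recommended alignment.

```python
import os


def bench_invocation(bench_path, files, direct_io=False, num_workers=8, block_size=65536):
    """Build (argv, env) for a loopback run of transfer_engine_nvmeof_generic_bench."""
    if direct_io and block_size % 4096 != 0:
        raise ValueError("Direct I/O expects block_size to be a multiple of 4096")
    env = dict(os.environ)  # copy so the caller's environment is untouched
    env["MC_NVMEOF_GENERIC_NUM_WORKERS"] = str(num_workers)
    if direct_io:
        env["MC_NVMEOF_GENERIC_DIRECT_IO"] = "on"
    else:
        env.pop("MC_NVMEOF_GENERIC_DIRECT_IO", None)  # fall back to the default
    argv = [
        bench_path,
        "--local_server_name=127.0.0.1:8081",
        "--metadata_server=http://127.0.0.1:8080/metadata",
        "--mode=loopback",
        "--operation=read",
        "--segment_id=127.0.0.1:8081",
        "--batch_size=4096",
        f"--block_size={block_size}",
        "--duration=30",
        "--threads=1",
        "--report_unit=GB",
        "--trtype=tcp",
        "--traddr=127.0.0.1",
        "--trsvcid=4420",
        "--files=" + " ".join(files),  # space-separated list in one argument
    ]
    return argv, env
```

The result can be handed to `subprocess.run(argv, env=env, check=True)`. Passing a list keeps the space-separated `--files` value intact without shell quoting.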
| 268 | + |
| 269 | +### Mooncake Store Testing |
| 270 | + |
| 271 | +Use `mooncake-store/tests/stress_cluster_benchmark.py` to test the performance of Mooncake Store based on NVMeoFGenericTransport. |
| 272 | + |
| 273 | +#### Start Metadata Service |
| 274 | + |
| 275 | +Follow the instructions in [transfer-engine.md](./transfer-engine.md#example-transfer-engine-bench) and [mooncake-store-preview.md](./mooncake-store-preview.md#starting-the-master-service) to start the metadata service and Master service respectively. |
| 276 | + |
| 277 | +#### Start Prefill Instance |
| 278 | + |
| 279 | +```bash |
| 280 | +python3 ../mooncake-store/tests/stress_cluster_benchmark.py \ |
| 281 | + --local-hostname=127.0.0.1:8081 \ |
| 282 | + --role=prefill \ |
| 283 | + --protocol=nvmeof_generic \ |
| 284 | + --protocol-args="trtype=tcp adrfam=ipv4 traddr=127.0.0.1 trsvcid=4420" \ |
| 285 | + --local-buffer-size=1024 \ |
| 286 | + --files="/path/to/file0 /path/to/file1 ..." |
| 287 | +``` |
| 288 | + |
| 289 | +#### Start Decode Instance |
| 290 | + |
| 291 | +```bash |
| 292 | +python3 ../mooncake-store/tests/stress_cluster_benchmark.py \ |
| 293 | + --local-hostname=127.0.0.1:8082 \ |
| 294 | + --role=decode \ |
| 295 | + --protocol=nvmeof_generic \ |
| 296 | + --protocol-args="" \ |
| 297 | + --local-buffer-size=1024 \ |
| 298 | + --files="" |
| 299 | +``` |
| 300 | + |
| 301 | +#### Performance Tuning |
| 302 | + |
| 303 | +- For workloads involving many files, increasing `MC_NVMEOF_GENERIC_NUM_WORKERS` appropriately usually improves performance. |
| 304 | +- Mooncake Store cannot yet guarantee that the buffers it allocates meet Direct I/O alignment requirements, so Direct I/O is not currently supported. |