Commit 8ada073

[Doc]: Add documents for nvmeof_generic transport
Signed-off-by: Jinlong Chen <[email protected]>
1 parent 6273dc1 commit 8ada073

2 files changed

+608
-0
lines changed

doc/en/nvmeof_generic_transport.md

Lines changed: 304 additions & 0 deletions
@@ -0,0 +1,304 @@
1+
# Generic NVMeoF Transport
2+
3+
## Overview
4+
5+
NVMeoFGenericTransport is a more complete TransferEngine Transport built on the NVMeoF protocol. It is designed to eventually replace the existing NVMeoFTransport and gives TransferEngine the ability to manage and access file Segments.
6+
7+
Compared to the legacy NVMeoFTransport, NVMeoFGenericTransport offers the following advantages:
8+
9+
- **More Complete:** Provides a full set of management interfaces consistent with memory Segments, including registering/unregistering local files, mounting/unmounting remote files, etc.
10+
- **More Generic:** No longer depends on cuFile, and can be deployed and used in environments without CUDA support.
11+
- **Higher Performance:** Supports multi-threaded I/O and Direct I/O, fully leveraging the performance potential of NICs and SSDs.
12+
- **More Reliable:** A more flexible multi-file management scheme ensures that the unavailability of a single file or storage device does not affect the availability of the others.
13+
14+
## Component Support
15+
16+
Both TransferEngine and Mooncake Store have added full support for NVMeoFGenericTransport. The relevant API interfaces are listed below:
17+
18+
### TransferEngine Support
19+
20+
`TransferEngine` now supports registering and reading/writing file segments. This mainly includes adding fields related to file management and access in `SegmentDesc` and `TransferRequest`, and introducing interfaces for registering and unregistering files.
21+
22+
#### SegmentDesc
23+
24+
To support file registration management, the `file_buffers` field has been added to `SegmentDesc`.
25+
26+
```cpp
using FileBufferID = uint32_t;
struct FileBufferDesc {
    FileBufferID id;    // File ID, used to identify the file within a Segment
    std::string path;   // File path on the owning node
    std::size_t size;   // Available space size of the file
    std::size_t align;  // For future usage.
};

struct SegmentDesc {
    std::string name;
    std::string protocol;
    // Generic file buffers.
    std::vector<FileBufferDesc> file_buffers;

    // Other fields...
};
```
44+
45+
#### TransferRequest
46+
47+
To support multi-file registration and access, the `file_id` field has been added to `TransferRequest` to identify the file to be read from or written to.
48+
49+
```cpp
struct TransferRequest {
    enum OpCode { READ, WRITE };
    OpCode opcode;
    void *source;
    SegmentID target_id;
    uint64_t target_offset;  // When accessing a file, target_offset indicates the offset within the target file
    size_t length;
    int advise_retry_cnt = 0;
    FileBufferID file_id;    // Target file ID, required only when accessing files, used with target_id to locate the target file
};
```
61+
62+
`file_id` is the ID assigned by the target `TransferEngine` when registering the target file, and can be obtained from the `SegmentDesc` of the target `Segment`.
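As a minimal sketch of how an initiator might resolve `file_id` and fill in a request, the snippet below redeclares simplified copies of the structures shown above so that it compiles on its own. The helper names `findFileBuffer` and `makeFileRead`, the `SegmentID` alias, and the bounds check are assumptions made for illustration; they are not part of the TransferEngine API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified local copies of the documented types (illustration only).
using FileBufferID = uint32_t;
using SegmentID = uint64_t;  // assumption: the real alias may differ

struct FileBufferDesc {
    FileBufferID id;
    std::string path;
    std::size_t size;
    std::size_t align;
};

struct SegmentDesc {
    std::string name;
    std::string protocol;
    std::vector<FileBufferDesc> file_buffers;
};

struct TransferRequest {
    enum OpCode { READ, WRITE };
    OpCode opcode;
    void *source;
    SegmentID target_id;
    uint64_t target_offset;
    std::size_t length;
    int advise_retry_cnt = 0;
    FileBufferID file_id;
};

// Hypothetical helper: find a registered file by path, or nullptr if absent.
const FileBufferDesc *findFileBuffer(const SegmentDesc &desc, const std::string &path) {
    for (const auto &fb : desc.file_buffers) {
        if (fb.path == path) return &fb;
    }
    return nullptr;
}

// Hypothetical helper: build a READ request for [offset, offset + length) of a file.
TransferRequest makeFileRead(SegmentID target, const FileBufferDesc &file,
                             void *dst, uint64_t offset, std::size_t length) {
    // Stay within the registered size of the file.
    assert(offset <= file.size && length <= file.size - offset);
    TransferRequest req;
    req.opcode = TransferRequest::READ;
    req.source = dst;
    req.target_id = target;
    req.target_offset = offset;
    req.length = length;
    req.file_id = file.id;
    return req;
}
```

In a real application the `SegmentDesc` would come from the metadata of the opened target segment rather than being built by hand.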
63+
64+
#### installTransport
65+
66+
```cpp
Transport *installTransport(const std::string &proto, void **args);
```
69+
70+
The `TransferEngine::installTransport` interface now supports directly passing the `args` parameter to the `install` interface of the corresponding Transport, enabling Transport-specific initialization parameters.
71+
72+
For `NVMeoFGenericTransport`, if the current TransferEngine instance does not need to share local files, the `args` parameter can be `nullptr`. Otherwise, `args` must be a valid pointer array whose first element is a `char *` pointing to a string that contains the NVMeoF Target configuration parameters. For example:
73+
74+
```cpp
// NVMeoF Target configuration parameters
// (const is required: a string literal cannot bind to a plain char * in C++11 and later)
const char *trid_str = "trtype=<tcp|rdma> adrfam=<ipv4|ipv6> traddr=<Listen address> trsvcid=<Listen port>";

// Arguments for installTransport: args[0] is the configuration string
void **args = (void **)&trid_str;
```
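The configuration string can also be assembled at runtime. The following is a minimal sketch under the assumption that the four space-separated `key=value` fields shown above are all that is needed; `makeTridString` and `fillInstallArgs` are hypothetical helper names, not Mooncake functions.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: assemble the space-separated "key=value" string
// from the four fields shown in the example above.
std::string makeTridString(const std::string &trtype, const std::string &adrfam,
                           const std::string &traddr, const std::string &trsvcid) {
    assert(trtype == "tcp" || trtype == "rdma");
    assert(adrfam == "ipv4" || adrfam == "ipv6");
    return "trtype=" + trtype + " adrfam=" + adrfam +
           " traddr=" + traddr + " trsvcid=" + trsvcid;
}

// Hypothetical helper: fill a one-element pointer array for installTransport.
// Only a pointer is stored, so `trid` must stay alive while `args` is in use.
void fillInstallArgs(const std::string &trid, void *args[1]) {
    args[0] = const_cast<char *>(trid.c_str());
}
```

The resulting `args` array can then be passed as the second argument of `installTransport`.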
81+
82+
#### registerLocalFile
83+
84+
```cpp
int registerLocalFile(const std::string &path, size_t size, FileBufferID &id);
```
87+
88+
Registers a local file into TransferEngine, enabling cross-node access. The file can be a regular file or a block device file. **Note: Using a block device file for registration may cause data corruption or complete loss on the device—use with caution!**
89+
90+
- `path`: File path, can be any regular file or block device file such as `/dev/nvmeXnY`;
91+
- `size`: Available space size of the file, can be less than or equal to the physical size;
92+
- `id`: ID assigned by `TransferEngine` to the file, used to distinguish each file when multiple files are registered;
93+
- Return value: Returns 0 on success, otherwise returns a negative error code;
94+
95+
#### unregisterLocalFile
96+
97+
```cpp
int unregisterLocalFile(const std::string &path);
```
100+
101+
Unregisters a local file.
102+
103+
- `path`: File path, must match the path used during registration;
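Because the path passed to `unregisterLocalFile` must match the one used for registration, one option is to tie both calls to a single object. The sketch below is an illustration only: `ScopedFileRegistration` is a hypothetical wrapper, and `FakeEngine` is a stand-in that merely mimics the two documented signatures so the example compiles without Mooncake headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <utility>

using FileBufferID = uint32_t;

// Hypothetical stand-in for TransferEngine (not the real class).
struct FakeEngine {
    std::set<std::string> registered;
    FileBufferID next_id = 0;

    int registerLocalFile(const std::string &path, std::size_t /*size*/, FileBufferID &id) {
        assert(!path.empty());
        if (!registered.insert(path).second) return -1;  // already registered
        id = next_id++;
        return 0;
    }
    int unregisterLocalFile(const std::string &path) {
        return registered.erase(path) == 1 ? 0 : -1;
    }
};

// Registers a file on construction and unregisters the same path on destruction.
template <typename Engine>
class ScopedFileRegistration {
public:
    ScopedFileRegistration(Engine &engine, std::string path, std::size_t size)
        : engine_(engine), path_(std::move(path)) {
        ok_ = engine_.registerLocalFile(path_, size, id_) == 0;
    }
    ~ScopedFileRegistration() {
        if (ok_) engine_.unregisterLocalFile(path_);  // identical path string
    }
    ScopedFileRegistration(const ScopedFileRegistration &) = delete;
    ScopedFileRegistration &operator=(const ScopedFileRegistration &) = delete;

    bool ok() const { return ok_; }
    FileBufferID id() const { return id_; }

private:
    Engine &engine_;
    std::string path_;
    bool ok_ = false;
    FileBufferID id_ = 0;
};
```

Since the wrapper is templated on the engine type, it works with any class that exposes the two member functions with these signatures.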
104+
105+
### Mooncake Store Support
106+
107+
Mooncake Store now supports using files as shared storage space for storing objects. This capability is based on two newly added interfaces:
108+
109+
#### MountFileSegment
110+
111+
```cpp
tl::expected<void, ErrorCode> MountFileSegment(const std::string& path);
```
114+
115+
Mounts the local file at `path` as part of the shared storage space.
116+
117+
#### UnmountFileSegment
118+
119+
```cpp
tl::expected<void, ErrorCode> UnmountFileSegment(const std::string& path);
```
122+
123+
Unmounts a previously mounted file.
124+
125+
### Mooncake Store Python API
126+
127+
The Mooncake Store Python API now supports specifying a set of local files as shared storage space.
128+
129+
#### setup_with_files
130+
131+
```python
from typing import List

def setup_with_files(
    local_hostname: str,
    metadata_server: str,
    files: List[str],
    local_buffer_size: int,
    protocol: str,
    protocol_arg: str,
    master_server_addr: str
):
    pass
```
143+
144+
Starts a Mooncake Store Client instance and registers the specified files as shared storage space.
145+
146+
## Running Tests
147+
148+
Users can test NVMeoFGenericTransport at both the TransferEngine and Mooncake Store levels.
149+
150+
### Environment Requirements
151+
152+
In addition to the original compilation and runtime environment of the Mooncake project, NVMeoFGenericTransport has additional requirements:
153+
154+
#### Kernel Version and Drivers
155+
156+
NVMeoFGenericTransport currently relies on the Linux kernel's nvme and nvmet driver suite, including the following kernel modules:
157+
158+
- NVMeoF RDMA: Requires Linux kernel 4.8 or higher. Load the drivers:
159+
160+
```bash
# Initiator driver, required for accessing remote files
modprobe nvme_rdma

# Target driver, required for sharing local files
modprobe nvmet_rdma
```
167+
168+
- NVMeoF TCP: Requires Linux kernel 5.0 or higher. Load the drivers:
169+
170+
```bash
# Initiator driver, required for accessing remote files
modprobe nvme_tcp

# Target driver, required for sharing local files
modprobe nvmet_tcp
```
177+
178+
#### Dependencies
179+
180+
NVMeoFGenericTransport depends on the following third-party libraries:
181+
182+
```bash
apt install -y libaio-dev libnvme-dev
```
185+
186+
### Build Options
187+
188+
To enable NVMeoFGenericTransport, the `USE_NVMEOF_GENERIC` build option must be turned on:
189+
190+
```bash
cmake .. -DUSE_NVMEOF_GENERIC=ON
```
193+
194+
### Runtime Options
195+
196+
NVMeoFGenericTransport supports configuring the following runtime options via environment variables:
197+
198+
- `MC_NVMEOF_GENERIC_DIRECT_IO`: Use Direct I/O when reading/writing NVMeoF SSDs. Disabled by default. Enabling this option can significantly improve performance, but it requires the buffer address, the offset on the SSD, and the I/O length to all meet the alignment requirement (typically 512 bytes; 4 KiB alignment is recommended).
199+
- `MC_NVMEOF_GENERIC_NUM_WORKERS`: Number of threads used for reading/writing NVMeoF SSDs. Default is 8.
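The Direct I/O alignment requirement can be checked before a request is submitted. The following is a minimal sketch, assuming a power-of-two alignment and defaulting to the recommended 4 KiB; `isAligned` and `isDirectIoAligned` are hypothetical helper names, not functions provided by Mooncake.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if `value` is a multiple of `alignment` (which must be a power of two).
inline bool isAligned(std::uint64_t value, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value & (alignment - 1)) == 0;
}

// Hypothetical pre-flight check for one Direct I/O request: the buffer address,
// the offset on the SSD, and the I/O length must all be aligned.
inline bool isDirectIoAligned(const void *buffer, std::uint64_t offset,
                              std::size_t length, std::uint64_t alignment = 4096) {
    return isAligned(reinterpret_cast<std::uintptr_t>(buffer), alignment) &&
           isAligned(offset, alignment) &&
           isAligned(length, alignment);
}
```

A request that fails this check would need to be realigned, or issued with Direct I/O disabled.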
200+
201+
### TransferEngine Testing
202+
203+
After enabling the `USE_NVMEOF_GENERIC` option and completing the build, an executable named `transfer_engine_nvmeof_generic_bench` can be found under `build/mooncake-transfer-engine/example`. This program can be used to test the performance of NVMeoFGenericTransport.
204+
205+
#### Start Metadata Service
206+
207+
Same as for the `transfer_engine_bench` test tool. Refer to [transfer-engine.md](./transfer-engine.md#example-transfer-engine-bench) for details.
208+
209+
Assume the metadata service address is `http://127.0.0.1:8080/metadata` (using HTTP metadata service as an example).
210+
211+
#### Start Target
212+
213+
**Note: After file registration, existing data may be corrupted or completely lost—use with extreme caution!!**
214+
215+
```bash
./build/mooncake-transfer-engine/example/transfer_engine_nvmeof_generic_bench \
    --local_server_name=127.0.0.1:8081 \
    --metadata_server=http://127.0.0.1:8080/metadata \
    --mode=target \
    --trtype=tcp \
    --traddr=127.0.0.1 \
    --trsvcid=4420 \
    --files="/path/to/file0 /path/to/file1 ..."
```
225+
226+
#### Start Initiator
227+
228+
```bash
./build/mooncake-transfer-engine/example/transfer_engine_nvmeof_generic_bench \
    --local_server_name=127.0.0.1:8082 \
    --metadata_server=http://127.0.0.1:8080/metadata \
    --mode=initiator \
    --operation=read \
    --segment_id=127.0.0.1:8081 \
    --batch_size=4096 \
    --block_size=65536 \
    --duration=30 \
    --threads=1 \
    --report_unit=GB
```
241+
242+
#### Loopback Mode
243+
244+
For quick validation, loopback mode can also be used to test on a single machine:
245+
246+
```bash
./build/mooncake-transfer-engine/example/transfer_engine_nvmeof_generic_bench \
    --local_server_name=127.0.0.1:8081 \
    --metadata_server=http://127.0.0.1:8080/metadata \
    --mode=loopback \
    --operation=read \
    --segment_id=127.0.0.1:8081 \
    --batch_size=4096 \
    --block_size=65536 \
    --duration=30 \
    --threads=1 \
    --report_unit=GB \
    --trtype=tcp \
    --traddr=127.0.0.1 \
    --trsvcid=4420 \
    --files="/path/to/file0 /path/to/file1 ..."
```
263+
264+
#### Performance Tuning
265+
266+
- For workloads involving many files, increasing `MC_NVMEOF_GENERIC_NUM_WORKERS` appropriately usually improves performance.
267+
- If `--block_size` is a multiple of 4 KiB, set the environment variable `MC_NVMEOF_GENERIC_DIRECT_IO=on` to significantly boost performance on SSD devices.
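Applications that issue their own transfers with Direct I/O enabled also need aligned buffers, not just an aligned block size. The following is a minimal sketch for Linux, using POSIX `posix_memalign`; `roundUp` and `allocAlignedBuffer` are hypothetical helper names, and the 4 KiB default follows the recommendation above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>   // std::free
#include <stdlib.h>  // posix_memalign (POSIX)

// Rounds `size` up to the next multiple of `alignment` (which must be a power of two).
inline std::size_t roundUp(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

// Allocates a buffer whose address and capacity are multiples of `alignment`.
// Returns nullptr on failure; release the buffer with std::free().
inline void *allocAlignedBuffer(std::size_t size, std::size_t alignment = 4096) {
    void *ptr = nullptr;
    if (posix_memalign(&ptr, alignment, roundUp(size, alignment)) != 0) return nullptr;
    return ptr;
}
```

For example, a 65536-byte block size as used in the commands above is already a multiple of 4096, so `roundUp` leaves it unchanged.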
268+
269+
### Mooncake Store Testing
270+
271+
Use `mooncake-store/tests/stress_cluster_benchmark.py` to test the performance of Mooncake Store based on NVMeoFGenericTransport.
272+
273+
#### Start Metadata Service
274+
275+
Follow the instructions in [transfer-engine.md](./transfer-engine.md#example-transfer-engine-bench) and [mooncake-store-preview.md](./mooncake-store-preview.md#starting-the-master-service) to start the metadata service and Master service respectively.
276+
277+
#### Start Prefill Instance
278+
279+
```bash
python3 ../mooncake-store/tests/stress_cluster_benchmark.py \
    --local-hostname=127.0.0.1:8081 \
    --role=prefill \
    --protocol=nvmeof_generic \
    --protocol-args="trtype=tcp adrfam=ipv4 traddr=127.0.0.1 trsvcid=4420" \
    --local-buffer-size=1024 \
    --files="/path/to/file0 /path/to/file1 ..."
```
288+
289+
#### Start Decode Instance
290+
291+
```bash
python3 ../mooncake-store/tests/stress_cluster_benchmark.py \
    --local-hostname=127.0.0.1:8082 \
    --role=decode \
    --protocol=nvmeof_generic \
    --protocol-args="" \
    --local-buffer-size=1024 \
    --files=""
```
300+
301+
#### Performance Tuning
302+
303+
- For workloads involving many files, increasing `MC_NVMEOF_GENERIC_NUM_WORKERS` appropriately usually improves performance.
304+
- Mooncake Store currently cannot guarantee allocation of buffers that meet Direct I/O alignment requirements; therefore, Direct I/O is not currently supported.
