
Commit 1a2a617

Merge pull request ceph#54671 from baum/ceph-nvmeof-mon
mon: add NVMe-oF gateway monitor and HA

Reviewed-by: Josh Durgin <[email protected]>
Reviewed-by: Radoslaw Zarzynski <[email protected]>

2 parents 32c4e9b + 6911df2, commit 1a2a617

74 files changed (+4627, -90 lines)

.gitmodules

Lines changed: 8 additions & 1 deletion
@@ -78,4 +78,11 @@
[submodule "src/BLAKE3"]
path = src/BLAKE3
url = https://github.com/BLAKE3-team/BLAKE3.git
-
+[submodule "src/boost_redis"]
+path = src/boost_redis
+url = https://github.com/boostorg/redis.git
+[submodule "src/nvmeof/gateway"]
+path = src/nvmeof/gateway
+url = https://github.com/ceph/ceph-nvmeof.git
+fetchRecurseSubmodules = false
+shallow = true

PendingReleaseNotes

Lines changed: 8 additions & 0 deletions
@@ -512,3 +512,11 @@ Relevant tracker: https://tracker.ceph.com/issues/57090
set using the `fs set` command. This flag prevents using a standby for another
file system (join_fs = X) when standby for the current filesystem is not available.
Relevant tracker: https://tracker.ceph.com/issues/61599
+* mon: add NVMe-oF gateway monitor and HA
+  This adds high availability support for the nvmeof Ceph service. High availability
+  means that even if a certain GW is down, there will be another available path
+  through which the initiator can continue its IO via another GW.
+  It also adds 2 new mon commands to notify the monitor about gateway creation/deletion:
+  - nvme-gw create
+  - nvme-gw delete
+  Relevant tracker: https://tracker.ceph.com/issues/64777
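As a rough illustration of how a client could issue these new mon commands from Python via the librados binding: only the `nvme-gw create` / `nvme-gw delete` command prefixes are confirmed by this change; the `id`, `pool`, and `group` arguments and their values below are assumptions for illustration.

```python
import json
import rados

# Minimal sketch using the librados Python binding; argument names beyond the
# command prefix are assumed, not taken from this commit excerpt.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
cmd = json.dumps({"prefix": "nvme-gw create",
                  "id": "client.nvmeof.mypool.mygroup.host1",   # assumed argument
                  "pool": "mypool",                              # assumed argument
                  "group": "mygroup"})                           # assumed argument
ret, outbuf, outs = cluster.mon_command(cmd, b"")
print(ret, outs)
cluster.shutdown()
```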

ceph.spec.in

Lines changed: 15 additions & 0 deletions
@@ -250,6 +250,7 @@ BuildRequires: gperf
BuildRequires: cmake > 3.5
BuildRequires: fuse-devel
BuildRequires: git
+BuildRequires: grpc-devel
%if 0%{?fedora} || 0%{?suse_version} > 1500 || 0%{?rhel} == 9 || 0%{?openEuler}
BuildRequires: gcc-c++ >= 11
%endif

@@ -642,6 +643,17 @@ system. One or more instances of ceph-mon form a Paxos part-time
parliament cluster that provides extremely reliable and durable storage
of cluster membership, configuration, and state.

+%package mon-client-nvmeof
+Summary: Ceph NVMeoF Gateway Monitor Client
+%if 0%{?suse_version}
+Group: System/Filesystems
+%endif
+Provides: ceph-test:/usr/bin/ceph-nvmeof-monitor-client
+Requires: librados2 = %{_epoch_prefix}%{version}-%{release}
+%description mon-client-nvmeof
+Ceph NVMeoF Gateway Monitor Client distributes Paxos ANA info
+to NVMeoF Gateway and provides beacons to the monitor daemon
+
%package mgr
Summary: Ceph Manager Daemon
%if 0%{?suse_version}

@@ -2077,6 +2089,9 @@ if [ $1 -ge 1 ] ; then
fi
fi

+%files mon-client-nvmeof
+%{_bindir}/ceph-nvmeof-monitor-client
+
%files fuse
%{_bindir}/ceph-fuse
%{_mandir}/man8/ceph-fuse.8*

doc/nvmeof/ha.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@

# Background

The nvmeof GW should support high availability. High availability means that even if a certain GW is down, there is another available path through which the initiator can continue its IO via another GW. This implies that there are initially at least 2 paths the nvme initiator can use to do IO to the namespace(s). Multipathing is achieved by connecting to a subsystem through more than 1 GW. This is native nvme initiator behavior, and it is done by connecting the nvme initiator to all relevant GWs (e.g. with the nvme connect-all command). Multipathing gives the initiator the option to use any one of the paths to write to the subsystem. This is necessary for HA, but not sufficient. The problem is that the initiator must not write to the same namespace(s) (i.e. volumes) from more than 1 path simultaneously. Writing simultaneously to the same namespace(s) will eventually result in data inconsistency, because there is no guarantee on the order in which the writes arrive at the namespace via the different GWs. There are many design options to solve this issue; the selected option that we implemented is discussed here.

The core idea is to provide Active-Standby access from the initiator to namespace(s). It means that at any point in time there is one (and only one) active path from the initiator to a namespace, while there are also standby path(s). The management of the Active-Standby states is done in a new component called NVMeofGwMon.

Namespaces in nvme belong to a subsystem. That is why the management of the entire Active-Standby state is done at the subsystem level. The implementation uses the nvme ANA protocol, which allows defining a state for each path. The state can be Optimized, Inaccessible, or Non-optimized. In our implementation we set the state to either Optimized (i.e. Active) or Inaccessible (i.e. Standby). The ANA protocol uses ANA groups to define the path states. So per path we can see different ANA groups, and per ANA group we can know whether the path is Optimized or Inaccessible. An ANA group is a collection of namespaces.

The NVMeofGwMon should manage the ANA groups in such a way that a particular group is always Optimized on exactly one path (i.e. GW) and Inaccessible on all the other paths (i.e. GWs). The NVMeofGwMon needs to track the liveness of all the GWs, and handle these cases:

1. GW disappeared.

2. GW reappeared.

The NVMeofGwMon should take the required actions when such events occur (see the sketch after this list), e.g.:

1. GW disappeared - the NVMeofGwMon should assign a new GW to be Optimized on this path, and then it needs to update all the GWs in the group to change their state accordingly. This is called Failover.

2. GW reappeared - the NVMeofGwMon should re-assign the returning GW to be Optimized on this path, and then it needs to update all the GWs in the group to change their state accordingly. This is called Failback.
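The following is a minimal, illustrative Python sketch of the Failover/Failback decision described above. The names (`AnaState`, `reassign_on_failover`, `reassign_on_failback`) are hypothetical and do not correspond to the actual Ceph classes, which are listed under Modules below; this only sketches the invariant that each ANA group is Optimized on exactly one live GW.

```python
from enum import Enum

class AnaState(Enum):
    OPTIMIZED = "optimized"
    INACCESSIBLE = "inaccessible"

def reassign_on_failover(owners, alive_gws):
    """owners: dict ana_group_id -> gw_id. Reassign groups whose owner disappeared
    so that every ANA group stays Optimized on exactly one live GW."""
    for grp, gw in list(owners.items()):
        if gw not in alive_gws and alive_gws:
            # pick the live GW that currently owns the fewest groups
            load = {g: sum(1 for o in owners.values() if o == g) for g in alive_gws}
            owners[grp] = min(load, key=load.get)
    return owners

def reassign_on_failback(owners, returning_gw, home_group):
    """Failback: the returning GW becomes Optimized again on its original ANA group."""
    owners[home_group] = returning_gw
    return owners

def ana_states_for(gw_id, owners):
    """Per-GW view: Optimized only for the groups this GW owns, Inaccessible elsewhere."""
    return {grp: (AnaState.OPTIMIZED if owner == gw_id else AnaState.INACCESSIBLE)
            for grp, owner in owners.items()}

# Example: 3 GWs, ANA groups 1..3; "gw-2" disappears (Failover), later returns (Failback).
owners = {1: "gw-1", 2: "gw-2", 3: "gw-3"}
owners = reassign_on_failover(owners, alive_gws={"gw-1", "gw-3"})
owners = reassign_on_failback(owners, returning_gw="gw-2", home_group=2)
print(ana_states_for("gw-1", owners))
```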
# Main design decisions

## HA environment setup requirements

It is assumed that between the nvmeof initiator (i.e. the nvmeof client) and the nvmeof target (i.e. the nvmeof ceph GW) there is full redundancy in the network connectivity. This means that the nvmeof initiator has 2 ethernet ports that are connected to the nvmeof target via a network with redundancy (e.g. 2 network switches).

Figure 1 - Full redundancy in the network connectivity

## Failover scenarios

The HA mode does not take care of situations where the network paths between the nvmeof initiator and the nvmeof target are broken. That case should be covered by the network configuration, which includes full redundancy of the network paths.

The following failover scenarios will be taken care of by the HA mode:

1. GW dead.

2. GW removed by cephadm.

3. Network partition between the gateway and rbd.

## Blocklisting

Whenever we failover a path, there is a danger that the peer that owned this path before might still be alive, or might be temporarily frozen, and it might still hold some inflight IOs that it is about to submit to Ceph. This might cause data inconsistencies, and therefore we always blocklist the peer before taking over any path. Blocklisting invalidates any inflight IO that the peer holds. Ceph blocklisting is built in a way that does not require the blocked node to acknowledge the operation: even if the blocked node is somehow still alive, it will not be able to use the blocklisted cluster context for any writes after the blocklist occurred.
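A small sketch of the ordering this section requires: blocklist the previous owner first, and only then hand the ANA group to the new GW. The helper names below are hypothetical placeholders, not Ceph APIs; in the real system the blocklist targets the RADOS client instance the old GW used.

```python
def blocklist_gw(gw_id, ana_group):
    # Placeholder: corresponds to Ceph blocklisting the RADOS client instance
    # that the old GW used for this ANA group, invalidating its inflight IO.
    print(f"blocklisting {gw_id} for ANA group {ana_group}")

def set_optimized(gw_id, ana_group):
    # Placeholder: corresponds to the monitor updating the map so that gw_id
    # becomes Optimized for ana_group and all GWs are notified.
    print(f"{gw_id} is now Optimized for ANA group {ana_group}")

def failover_group(ana_group, old_gw, new_gw):
    # Order matters: blocklist the previous owner before taking over the path.
    blocklist_gw(old_gw, ana_group)
    set_optimized(new_gw, ana_group)

failover_group(ana_group=2, old_gw="gw-2", new_gw="gw-1")
```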
## ANA states

It is not allowed to manipulate the ANA states externally (e.g. via the SPDK RPC), because doing so would invalidate the automatic HA solution.

## ANA groups

The HA solution will only use ANA groups 1..N, where N is the number of active GWs. It means that if we have 2 GWs we will use ANA groups 1 and 2, if we have 3 GWs we will use ANA groups 1, 2, 3, and so on. The idea is that each GW always owns one ANA group and is standby on the other ANA groups.

## Load Balancing

Optimal load balancing is achieved when the number of active (i.e. Optimized) namespaces is distributed evenly between all of the GWs. It means that every GW handles the same number of namespaces in the good-path IO situation (where all GWs are up and running). The code will automatically assign namespaces evenly across the GWs upon namespace creation, but the assignment can also be made manually when creating a new namespace. This assignment is persisted in the OMAP state and can be modified by another gRPC/CLI call.
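A minimal sketch of the automatic assignment described above, assuming the simple "least-loaded ANA group wins" policy; the function name and signature are illustrative, not the actual gateway code.

```python
from collections import Counter

def pick_ana_group(existing_namespaces, num_gws, requested_group=None):
    """Place a new namespace on the least-loaded ANA group (1..num_gws),
    unless the caller requested a group explicitly."""
    if requested_group is not None:
        return requested_group
    counts = Counter(existing_namespaces)  # ana_group -> namespace count
    return min(range(1, num_gws + 1), key=lambda grp: counts.get(grp, 0))

# Example: 3 GWs; groups 1 and 2 already hold one namespace each -> group 3 is chosen.
print(pick_ana_group(existing_namespaces=[1, 2], num_gws=3))
```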
## GW initialization

The nvmeof GW initialization is changed. The GW must get some initial data from the NVMeofGwMon to be able to complete its initialization. The initial data includes the ANA group id that it should own. Based on the ANA group id, the GW can tell which unique controller ids to use, and it knows on which ANA group id it should be Optimized. This means that the GW initialization sequence is delayed until it gets this initial data, and until this initial data is received, the gRPC/CLI and the SPDK initialization are on hold.
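An illustrative sketch of this startup gating, assuming a hypothetical `fetch_map` callable that returns the monitor-provided initial data (or `None` while it is not yet available); names and the example payload are assumptions, not the gateway's real interface.

```python
import time

def wait_for_initial_map(fetch_map, timeout=60.0, poll=1.0):
    """Block GW startup until the monitor has provided the initial data
    (the ANA group id this GW owns)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        initial = fetch_map()
        if initial is not None:
            return initial        # e.g. {"ana_group": 2}
        time.sleep(poll)          # gRPC/CLI and SPDK setup stay on hold meanwhile
    raise TimeoutError("no initial data from NVMeofGwMon")

print(wait_for_initial_map(lambda: {"ana_group": 2}))
```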
## Network partition

It is possible that the nvmeof GW monitor will think that a GW is down while in reality the GW is alive. This can happen, for example, in the case of a network partition. The problem in this case is that the monitor will decide to failover the ANA groups that "belong" to this GW to other GWs, but the GW will not know about it. For this reason, it was decided that the GW (i.e. the GW client in this case) will get heartbeats from the monitor every few seconds. In case the heartbeats stop (i.e. no heartbeat for a few cycles), the GW will commit suicide to avoid the case where the same ANA group is considered Optimized by more than one GW.
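A minimal GW-side watchdog sketch of this rule, assuming a fixed heartbeat interval and missed-cycle budget; the class and its parameters are illustrative, not the actual implementation.

```python
import sys
import time

class HeartbeatWatchdog:
    """If no heartbeat arrives from the monitor for `max_missed` cycles, exit,
    so the GW cannot keep serving an ANA group that was failed over elsewhere."""

    def __init__(self, interval_sec=2.0, max_missed=3):
        self.interval = interval_sec
        self.max_missed = max_missed
        self.last_seen = time.monotonic()

    def on_heartbeat(self):
        self.last_seen = time.monotonic()   # called on every monitor heartbeat

    def check(self):
        if time.monotonic() - self.last_seen > self.interval * self.max_missed:
            # Possible network partition: better to die than to risk two GWs
            # both acting as Optimized for the same ANA group.
            sys.exit("lost contact with NVMeofGwMon, shutting down")

wd = HeartbeatWatchdog()
wd.on_heartbeat()
wd.check()   # no-op while heartbeats are recent
```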
# Modules

There are changes in the Ceph code, and there are changes in the nvmeof GW code.

### Ceph code - new modules

#### MNVMeofGwMap

Description: Class that coordinates Gateway Failover/Failback

Main responsibilities:

- Coordinates the behavior of all Gateways in Ceph that are configured in HA mode.

- Implements stateful behavior for performing Failover/Failback by Gateways within the same subsystem.

- Supports independent state machines (per ANA group) within the same Gateway.

- Implements the blocklisting of Ceph entries used to block traffic related to specific ANA groups.

- Holds the GWMAP and GW_Created map databases.

#### MNVMeofGwBeacon

Description:

Main responsibilities:

#### NVMeofGwMon

Description: New monitor in the Paxos environment - used for monitoring Gateways in HA mode

Main responsibilities:

- Forwards inherited Paxos messages, aggregates the NVMeofGwMap object.

- Distributes maps to Paxos and broadcasts them to the GW Clients.

- Handles Beacon messages from the GW Clients and determines the Keep Alive timeout from the GW.

- Conveys Ceph commands (create/delete GW) to the embedded NVMeofGwMap object.

- Sends an immediate unicast Beacon Ack message in response to a Beacon, to ensure a symmetric handshake (see the sketch below).
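A minimal sketch of the beacon / keep-alive handling that these responsibilities describe; the class and method names are illustrative Python, not the actual C++ NVMeofGwMon interface.

```python
import time

class BeaconTracker:
    """Monitor-side view: record each GW beacon, ack it immediately, and on a
    periodic tick report GWs whose beacons stopped, so their ANA groups can be
    failed over."""

    def __init__(self, keepalive_sec=10.0):
        self.keepalive = keepalive_sec
        self.last_beacon = {}            # gw_id -> timestamp of last beacon

    def handle_beacon(self, gw_id, send_ack):
        self.last_beacon[gw_id] = time.monotonic()
        send_ack(gw_id)                  # immediate unicast ack: symmetric handshake

    def tick(self):
        now = time.monotonic()
        return [gw for gw, t in self.last_beacon.items()
                if now - t > self.keepalive]   # GWs considered down

tracker = BeaconTracker()
tracker.handle_beacon("gw-1", send_ack=lambda gw: print(f"ack -> {gw}"))
print(tracker.tick())   # [] while gw-1 is still within its keep-alive window
```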
#### NVMeofGw

Description:

Main responsibilities:

#### NVMeofGwClient

Description:

Main responsibilities:

#### NVMeofGwMonitorGroupClient

Description:

Main responsibilities:

### Ceph code - changed modules

#### MonCommands

#### Monitor

#### Message

# Sequence diagrams
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
use_shaman: True
tasks:
- install:
    extra_packages:
    - nvme-cli
- cephadm:
    watchdog_setup:
- cephadm.shell:
    host.a:
      # get state before nvmeof deployment
      - ceph orch status
      - ceph orch ps
      - ceph orch host ls
      - ceph orch device ls
      - ceph osd lspools
