
Commit 1a2a617

Merge pull request ceph#54671 from baum/ceph-nvmeof-mon
mon: add NVMe-oF gateway monitor and HA

Reviewed-by: Josh Durgin <[email protected]>
Reviewed-by: Radoslaw Zarzynski <[email protected]>

2 parents 32c4e9b + 6911df2, commit 1a2a617

74 files changed (+4627, -90 lines)

.gitmodules

Lines changed: 8 additions & 1 deletion
@@ -78,4 +78,11 @@
[submodule "src/BLAKE3"]
path = src/BLAKE3
url = https://github.com/BLAKE3-team/BLAKE3.git
-
+[submodule "src/boost_redis"]
+path = src/boost_redis
+url = https://github.com/boostorg/redis.git
+[submodule "src/nvmeof/gateway"]
+path = src/nvmeof/gateway
+url = https://github.com/ceph/ceph-nvmeof.git
+fetchRecurseSubmodules = false
+shallow = true

PendingReleaseNotes

Lines changed: 8 additions & 0 deletions
@@ -512,3 +512,11 @@ Relevant tracker: https://tracker.ceph.com/issues/57090
set using the `fs set` command. This flag prevents using a standby for another
file system (join_fs = X) when standby for the current filesystem is not available.
Relevant tracker: https://tracker.ceph.com/issues/61599
+* mon: add NVMe-oF gateway monitor and HA
+  This adds high availability support for the nvmeof Ceph service. High availability
+  means that even if a certain GW is down, there will be another available path
+  through which the initiator can continue its IO via another GW.
+  It also adds 2 new mon commands to notify the monitor about gateway creation/deletion:
+  - nvme-gw create
+  - nvme-gw delete
+  Relevant tracker: https://tracker.ceph.com/issues/64777
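As a rough illustration of how a client could issue these new mon commands from Python via the librados binding: only the `nvme-gw create` / `nvme-gw delete` command prefixes are confirmed by this change; the `id`, `pool`, and `group` arguments and their values below are assumptions for illustration.

```python
import json
import rados

# Minimal sketch using the librados Python binding; argument names beyond the
# command prefix are assumed, not taken from this commit excerpt.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
cmd = json.dumps({"prefix": "nvme-gw create",
                  "id": "client.nvmeof.mypool.mygroup.host1",   # assumed argument
                  "pool": "mypool",                              # assumed argument
                  "group": "mygroup"})                           # assumed argument
ret, outbuf, outs = cluster.mon_command(cmd, b"")
print(ret, outs)
cluster.shutdown()
```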

ceph.spec.in

Lines changed: 15 additions & 0 deletions
@@ -250,6 +250,7 @@ BuildRequires: gperf
BuildRequires: cmake > 3.5
BuildRequires: fuse-devel
BuildRequires: git
+BuildRequires: grpc-devel
%if 0%{?fedora} || 0%{?suse_version} > 1500 || 0%{?rhel} == 9 || 0%{?openEuler}
BuildRequires: gcc-c++ >= 11
%endif

@@ -642,6 +643,17 @@ system. One or more instances of ceph-mon form a Paxos part-time
parliament cluster that provides extremely reliable and durable storage
of cluster membership, configuration, and state.

+%package mon-client-nvmeof
+Summary: Ceph NVMeoF Gateway Monitor Client
+%if 0%{?suse_version}
+Group: System/Filesystems
+%endif
+Provides: ceph-test:/usr/bin/ceph-nvmeof-monitor-client
+Requires: librados2 = %{_epoch_prefix}%{version}-%{release}
+%description mon-client-nvmeof
+Ceph NVMeoF Gateway Monitor Client distributes Paxos ANA info
+to NVMeoF Gateway and provides beacons to the monitor daemon
+
%package mgr
Summary: Ceph Manager Daemon
%if 0%{?suse_version}

@@ -2077,6 +2089,9 @@ if [ $1 -ge 1 ] ; then
fi
fi

+%files mon-client-nvmeof
+%{_bindir}/ceph-nvmeof-monitor-client
+
%files fuse
%{_bindir}/ceph-fuse
%{_mandir}/man8/ceph-fuse.8*

doc/nvmeof/ha.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@

# Background

The nvmeof GW should support high availability. High availability means that even if a certain GW is down, there is another available path through which the initiator can continue its IO via another GW. This implies that there are initially at least 2 paths the nvme initiator can use to do IO to the namespace(s). Multipathing is achieved by connecting to a subsystem through more than 1 GW. This is native nvme initiator behavior, and it is done by connecting the nvme initiator to all relevant GWs (e.g. with the nvme connect-all command). Multipathing gives the initiator the option to use any one of the paths to write to the subsystem. This is necessary for HA, but not sufficient. The problem is that the initiator must not write to the same namespace(s) (i.e. volumes) from more than 1 path simultaneously. Writing simultaneously to the same namespace(s) will eventually result in data inconsistency, because there is no guarantee on the order in which the writes arrive at the namespace via the different GWs. There are many design options to solve this issue; the selected option that we implemented is discussed here.

The core idea is to provide Active-Standby access from the initiator to namespace(s). It means that at any point in time there is one (and only one) active path from the initiator to a namespace, while there are also standby path(s). The management of the Active-Standby states is done in a new component called NVMeofGwMon.

Namespaces in nvme belong to a subsystem. That is why the management of the entire Active-Standby state is done at the subsystem level. The implementation uses the nvme ANA protocol, which allows defining a state for each path. The state can be Optimized, Inaccessible, or Non-optimized. In our implementation we set the state to either Optimized (i.e. Active) or Inaccessible (i.e. Standby). The ANA protocol uses ANA groups to define the path states. So per path we can see different ANA groups, and per ANA group we can know whether the path is Optimized or Inaccessible. An ANA group is a collection of namespaces.

The NVMeofGwMon should manage the ANA groups in such a way that a particular group is always Optimized on exactly one path (i.e. GW) and Inaccessible on all the other paths (i.e. GWs). The NVMeofGwMon needs to track the liveness of all the GWs, and handle these cases:

1. GW disappeared.

2. GW reappeared.

The NVMeofGwMon should take the required actions when such events occur (see the sketch after this list), e.g.:

1. GW disappeared - the NVMeofGwMon should assign a new GW to be Optimized on this path, and then it needs to update all the GWs in the group to change their state accordingly. This is called Failover.

2. GW reappeared - the NVMeofGwMon should re-assign the returning GW to be Optimized on this path, and then it needs to update all the GWs in the group to change their state accordingly. This is called Failback.
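The following is a minimal, illustrative Python sketch of the Failover/Failback decision described above. The names (`AnaState`, `reassign_on_failover`, `reassign_on_failback`) are hypothetical and do not correspond to the actual Ceph classes, which are listed under Modules below; this only sketches the invariant that each ANA group is Optimized on exactly one live GW.

```python
from enum import Enum

class AnaState(Enum):
    OPTIMIZED = "optimized"
    INACCESSIBLE = "inaccessible"

def reassign_on_failover(owners, alive_gws):
    """owners: dict ana_group_id -> gw_id. Reassign groups whose owner disappeared
    so that every ANA group stays Optimized on exactly one live GW."""
    for grp, gw in list(owners.items()):
        if gw not in alive_gws and alive_gws:
            # pick the live GW that currently owns the fewest groups
            load = {g: sum(1 for o in owners.values() if o == g) for g in alive_gws}
            owners[grp] = min(load, key=load.get)
    return owners

def reassign_on_failback(owners, returning_gw, home_group):
    """Failback: the returning GW becomes Optimized again on its original ANA group."""
    owners[home_group] = returning_gw
    return owners

def ana_states_for(gw_id, owners):
    """Per-GW view: Optimized only for the groups this GW owns, Inaccessible elsewhere."""
    return {grp: (AnaState.OPTIMIZED if owner == gw_id else AnaState.INACCESSIBLE)
            for grp, owner in owners.items()}

# Example: 3 GWs, ANA groups 1..3; "gw-2" disappears (Failover), later returns (Failback).
owners = {1: "gw-1", 2: "gw-2", 3: "gw-3"}
owners = reassign_on_failover(owners, alive_gws={"gw-1", "gw-3"})
owners = reassign_on_failback(owners, returning_gw="gw-2", home_group=2)
print(ana_states_for("gw-1", owners))
```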
# Main design decisions

## HA environment setup requirements

It is assumed that between the nvmeof initiator (i.e. the nvmeof client) and the nvmeof target (i.e. the nvmeof ceph GW) there is full redundancy in the network connectivity. This means that the nvmeof initiator has 2 ethernet ports that are connected to the nvmeof target via a network with redundancy (e.g. 2 network switches).

Figure 1 - Full redundancy in the network connectivity

## Failover scenarios

The HA mode does not take care of situations where the network paths between the nvmeof initiator and the nvmeof target are broken. That case should be covered by the network configuration, which includes full redundancy of the network paths.

The following failover scenarios will be taken care of by the HA mode:

1. GW dead.

2. GW removed by cephadm.

3. Network partition between the gateway and rbd.

## Blocklisting

Whenever we failover a path, there is a danger that the peer that owned this path before might still be alive, or might be temporarily frozen, and it might still hold some inflight IOs that it is about to submit to Ceph. This might cause data inconsistencies, and therefore we always blocklist the peer before taking over any path. Blocklisting invalidates any inflight IO that the peer holds. Ceph blocklisting is built in a way that does not require the blocked node to acknowledge the operation: even if the blocked node is somehow still alive, it will not be able to use the blocklisted cluster context for any writes after the blocklist occurred.
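A small sketch of the ordering this section requires: blocklist the previous owner first, and only then hand the ANA group to the new GW. The helper names below are hypothetical placeholders, not Ceph APIs; in the real system the blocklist targets the RADOS client instance the old GW used.

```python
def blocklist_gw(gw_id, ana_group):
    # Placeholder: corresponds to Ceph blocklisting the RADOS client instance
    # that the old GW used for this ANA group, invalidating its inflight IO.
    print(f"blocklisting {gw_id} for ANA group {ana_group}")

def set_optimized(gw_id, ana_group):
    # Placeholder: corresponds to the monitor updating the map so that gw_id
    # becomes Optimized for ana_group and all GWs are notified.
    print(f"{gw_id} is now Optimized for ANA group {ana_group}")

def failover_group(ana_group, old_gw, new_gw):
    # Order matters: blocklist the previous owner before taking over the path.
    blocklist_gw(old_gw, ana_group)
    set_optimized(new_gw, ana_group)

failover_group(ana_group=2, old_gw="gw-2", new_gw="gw-1")
```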
## ANA states

It is not allowed to manipulate the ANA states externally (e.g. via the SPDK RPC), because doing so would invalidate the automatic HA solution.

## ANA groups

The HA solution will only use ANA groups 1..N, where N is the number of active GWs. It means that if we have 2 GWs we will use ANA groups 1 and 2, if we have 3 GWs we will use ANA groups 1, 2, 3, and so on. The idea is that each GW always owns one ANA group and is standby on the other ANA groups.

## Load Balancing

Optimal load balancing is achieved when the number of active (i.e. Optimized) namespaces is distributed evenly between all of the GWs. It means that every GW handles the same number of namespaces in the good-path IO situation (where all GWs are up and running). The code will automatically assign namespaces evenly across the GWs upon namespace creation, but the assignment can also be made manually when creating a new namespace. This assignment is persisted in the OMAP state and can be modified by another gRPC/CLI call.
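A minimal sketch of the automatic assignment described above, assuming the simple "least-loaded ANA group wins" policy; the function name and signature are illustrative, not the actual gateway code.

```python
from collections import Counter

def pick_ana_group(existing_namespaces, num_gws, requested_group=None):
    """Place a new namespace on the least-loaded ANA group (1..num_gws),
    unless the caller requested a group explicitly."""
    if requested_group is not None:
        return requested_group
    counts = Counter(existing_namespaces)  # ana_group -> namespace count
    return min(range(1, num_gws + 1), key=lambda grp: counts.get(grp, 0))

# Example: 3 GWs; groups 1 and 2 already hold one namespace each -> group 3 is chosen.
print(pick_ana_group(existing_namespaces=[1, 2], num_gws=3))
```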
## GW initialization

The nvmeof GW initialization is changed. The GW must get some initial data from the NVMeofGwMon to be able to complete its initialization. The initial data includes the ANA group id that it should own. Based on the ANA group id, the GW can tell which unique controller ids to use, and it knows on which ANA group id it should be Optimized. This means that the GW initialization sequence is delayed until it gets this initial data, and until this initial data is received, the gRPC/CLI and the SPDK initialization are on hold.
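An illustrative sketch of this startup gating, assuming a hypothetical `fetch_map` callable that returns the monitor-provided initial data (or `None` while it is not yet available); names and the example payload are assumptions, not the gateway's real interface.

```python
import time

def wait_for_initial_map(fetch_map, timeout=60.0, poll=1.0):
    """Block GW startup until the monitor has provided the initial data
    (the ANA group id this GW owns)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        initial = fetch_map()
        if initial is not None:
            return initial        # e.g. {"ana_group": 2}
        time.sleep(poll)          # gRPC/CLI and SPDK setup stay on hold meanwhile
    raise TimeoutError("no initial data from NVMeofGwMon")

print(wait_for_initial_map(lambda: {"ana_group": 2}))
```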
## Network partition

It is possible that the nvmeof GW monitor will think that a GW is down while in reality the GW is alive. This can happen, for example, in the case of a network partition. The problem in this case is that the monitor will decide to failover the ANA groups that "belong" to this GW to other GWs, but the GW will not know about it. For this reason, it was decided that the GW (i.e. the GW client in this case) will get heartbeats from the monitor every few seconds. In case the heartbeats stop (i.e. no heartbeat for a few cycles), the GW will commit suicide to avoid the case where the same ANA group is considered Optimized by more than one GW.
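A minimal GW-side watchdog sketch of this rule, assuming a fixed heartbeat interval and missed-cycle budget; the class and its parameters are illustrative, not the actual implementation.

```python
import sys
import time

class HeartbeatWatchdog:
    """If no heartbeat arrives from the monitor for `max_missed` cycles, exit,
    so the GW cannot keep serving an ANA group that was failed over elsewhere."""

    def __init__(self, interval_sec=2.0, max_missed=3):
        self.interval = interval_sec
        self.max_missed = max_missed
        self.last_seen = time.monotonic()

    def on_heartbeat(self):
        self.last_seen = time.monotonic()   # called on every monitor heartbeat

    def check(self):
        if time.monotonic() - self.last_seen > self.interval * self.max_missed:
            # Possible network partition: better to die than to risk two GWs
            # both acting as Optimized for the same ANA group.
            sys.exit("lost contact with NVMeofGwMon, shutting down")

wd = HeartbeatWatchdog()
wd.on_heartbeat()
wd.check()   # no-op while heartbeats are recent
```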
# Modules

There are changes in the Ceph code, and there are changes in the nvmeof GW code.

### Ceph code - new modules

#### MNVMeofGwMap

Description: Class that coordinates Gateway Failover/Failback

Main responsibilities:

- Coordinates the behavior of all Gateways in Ceph that are configured in HA mode.

- Implements stateful behavior for performing Failover/Failback by Gateways within the same subsystem.

- Supports independent state machines (per ANA group) within the same Gateway.

- Implements the blocklisting of Ceph entries used to block traffic related to specific ANA groups.

- Holds the GWMAP and GW_Created map databases.

#### MNVMeofGwBeacon

Description:

Main responsibilities:

#### NVMeofGwMon

Description: New monitor in the Paxos environment - used for monitoring Gateways in HA mode

Main responsibilities:

- Forwards inherited Paxos messages, aggregates the NVMeofGwMap object.

- Distributes maps to Paxos and broadcasts them to the GW Clients.

- Handles Beacon messages from the GW Clients and determines the Keep Alive timeout from the GW.

- Conveys Ceph commands (create/delete GW) to the embedded NVMeofGwMap object.

- Sends an immediate unicast Beacon Ack message in response to a Beacon, to ensure a symmetric handshake (see the sketch below).
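A minimal sketch of the beacon / keep-alive handling that these responsibilities describe; the class and method names are illustrative Python, not the actual C++ NVMeofGwMon interface.

```python
import time

class BeaconTracker:
    """Monitor-side view: record each GW beacon, ack it immediately, and on a
    periodic tick report GWs whose beacons stopped, so their ANA groups can be
    failed over."""

    def __init__(self, keepalive_sec=10.0):
        self.keepalive = keepalive_sec
        self.last_beacon = {}            # gw_id -> timestamp of last beacon

    def handle_beacon(self, gw_id, send_ack):
        self.last_beacon[gw_id] = time.monotonic()
        send_ack(gw_id)                  # immediate unicast ack: symmetric handshake

    def tick(self):
        now = time.monotonic()
        return [gw for gw, t in self.last_beacon.items()
                if now - t > self.keepalive]   # GWs considered down

tracker = BeaconTracker()
tracker.handle_beacon("gw-1", send_ack=lambda gw: print(f"ack -> {gw}"))
print(tracker.tick())   # [] while gw-1 is still within its keep-alive window
```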
#### NVMeofGw

Description:

Main responsibilities:

#### NVMeofGwClient

Description:

Main responsibilities:

#### NVMeofGwMonitorGroupClient

Description:

Main responsibilities:

### Ceph code - changed modules

#### MonCommands

#### Monitor

#### Message

# Sequence diagrams
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
use_shaman: True
tasks:
- install:
    extra_packages:
    - nvme-cli
- cephadm:
    watchdog_setup:
- cephadm.shell:
    host.a:
      # get state before nvmeof deployment
      - ceph orch status
      - ceph orch ps
      - ceph orch host ls
      - ceph orch device ls
      - ceph osd lspools
