Commit 5843c6b

leonidc authored and Alexander Indenbaum committed
mon: add NVMe-oF gateway monitor and HA
- gateway submodule

Fixes: https://tracker.ceph.com/issues/64777

This PR adds high availability (HA) support for the nvmeof Ceph service. High availability means that even when a given GW is down, the initiator still has another available path and can continue IO through another GW. High availability is achieved by running an nvmeof service consisting of at least 2 nvmeof GWs in the Ceph cluster. The host (initiator) sees every GW as a separate path to the NVMe namespaces (volumes).

The implementation consists of the following main modules:
- NVMeofGWMon: a PaxosService. It is a monitor that tracks the status of the running nvmeof services and takes action when services fail and when they are restored.
- NVMeofGwMonitorClient: an agent that runs as part of each nvmeof GW. It sends beacons to the monitor to signal that the GW is alive. As part of the beacon, the client also sends information about the service, which the monitor uses to make decisions and perform operations.
- MNVMeofGwBeacon: the structure used by the client and the monitor to send and receive beacons.
- MNVMeofGwMap: the map that tracks the status of the nvmeof GWs. It also defines the new role of every GW, so when GWs go down or are restored, the map reflects the role each GW takes on as a result. The map is distributed to the NVMeofGwMonitorClient on each GW, which updates the GW with the required changes.

The PR also adds 3 new mon commands:
- nvme-gw create
- nvme-gw delete
- nvme-gw show

The commands are used by cephadm to inform the monitor that a new GW has been deployed. The monitor updates the map accordingly and tracks this GW until it is deleted.

Signed-off-by: Leonid Chernin <[email protected]>
Signed-off-by: Alexander Indenbaum <[email protected]>
1 parent 3f4aee2 commit 5843c6b
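
The failover and failback behaviour described in the commit message can be pictured with a small sketch. This is not the NVMeofGwMap code from this commit: `GwState`, `GwMap`, `mark_down` and `mark_up` are hypothetical names, and the "hand orphaned groups to the least-loaded survivor" rule is an assumption chosen only to illustrate how a map can assign each GW a new role when a peer fails or returns.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified stand-in for the per-gateway state kept in the map.
struct GwState {
  bool alive;                   // last known liveness, driven by beacons
  int home_group;               // ANA group this gateway owns when healthy
  std::set<int> active_groups;  // ANA groups it currently serves
};

using GwMap = std::map<std::string, GwState>;

// Return the live gateway serving the fewest groups, or nullptr if none is alive.
GwState* least_loaded_alive(GwMap& gws) {
  GwState* best = nullptr;
  for (auto& entry : gws) {
    GwState& gw = entry.second;
    if (!gw.alive) continue;
    if (best == nullptr || gw.active_groups.size() < best->active_groups.size()) {
      best = &gw;
    }
  }
  return best;
}

// Failover: move every group served by `failed` to surviving gateways.
void mark_down(GwMap& gws, const std::string& failed) {
  auto it = gws.find(failed);
  assert(it != gws.end());
  it->second.alive = false;
  std::set<int> orphaned;
  orphaned.swap(it->second.active_groups);
  for (int group : orphaned) {
    if (GwState* target = least_loaded_alive(gws)) {
      target->active_groups.insert(group);
    }
    // With no survivor, the group stays unserved until its home gateway returns.
  }
}

// Failback: a restored gateway reclaims its home group from whoever holds it.
void mark_up(GwMap& gws, const std::string& restored) {
  auto it = gws.find(restored);
  assert(it != gws.end());
  const int home = it->second.home_group;
  for (auto& entry : gws) {
    entry.second.active_groups.erase(home);
  }
  it->second.alive = true;
  it->second.active_groups.insert(home);
}
```

With two gateways that each own one group, calling `mark_down(gws, "gw1")` leaves the second gateway serving both groups, and `mark_up(gws, "gw1")` returns each gateway to its own group, which matches the "another available path" behaviour the commit message describes.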

37 files changed, 3885 additions and 10 deletions

.gitmodules

Lines changed: 8 additions & 1 deletion
@@ -78,4 +78,11 @@
 [submodule "src/BLAKE3"]
 	path = src/BLAKE3
 	url = https://github.com/BLAKE3-team/BLAKE3.git
-
+[submodule "src/boost_redis"]
+	path = src/boost_redis
+	url = https://github.com/boostorg/redis.git
+[submodule "src/nvmeof/gateway"]
+	path = src/nvmeof/gateway
+	url = https://github.com/ceph/ceph-nvmeof.git
+	fetchRecurseSubmodules = false
+	shallow = true

PendingReleaseNotes

Lines changed: 8 additions & 0 deletions
@@ -506,3 +506,11 @@ Relevant tracker: https://tracker.ceph.com/issues/57090
   set using the `fs set` command. This flag prevents using a standby for another
   file system (join_fs = X) when standby for the current filesystem is not available.
   Relevant tracker: https://tracker.ceph.com/issues/61599
+* mon: add NVMe-oF gateway monitor and HA
+  This PR adds high availability support for the nvmeof Ceph service. High availability
+  means that even in the case that a certain GW is down, there will be another available
+  path for the initiator to be able to continue the IO through another GW.
+  It is also adding 2 new mon commands, to notify monitor about the gateway creation/deletion:
+  - nvme-gw create
+  - nvme-gw delete
+  Relevant tracker: https://tracker.ceph.com/issues/64777

ceph.spec.in

Lines changed: 15 additions & 0 deletions
@@ -250,6 +250,7 @@ BuildRequires: gperf
 BuildRequires: cmake > 3.5
 BuildRequires: fuse-devel
 BuildRequires: git
+BuildRequires: grpc-devel
 %if 0%{?fedora} || 0%{?suse_version} > 1500 || 0%{?rhel} == 9 || 0%{?openEuler}
 BuildRequires: gcc-c++ >= 11
 %endif
@@ -642,6 +643,17 @@ system. One or more instances of ceph-mon form a Paxos part-time
 parliament cluster that provides extremely reliable and durable storage
 of cluster membership, configuration, and state.
 
+%package mon-client-nvmeof
+Summary: Ceph NVMeoF Gateway Monitor Client
+%if 0%{?suse_version}
+Group: System/Filesystems
+%endif
+Provides: ceph-test:/usr/bin/ceph-nvmeof-monitor-client
+Requires: librados2 = %{_epoch_prefix}%{version}-%{release}
+%description mon-client-nvmeof
+Ceph NVMeoF Gateway Monitor Client distributes Paxos ANA info
+to NVMeoF Gateway and provides beacons to the monitor daemon
+
 %package mgr
 Summary: Ceph Manager Daemon
 %if 0%{?suse_version}
@@ -2077,6 +2089,9 @@ if [ $1 -ge 1 ] ; then
 fi
 fi
 
+%files mon-client-nvmeof
+%{_bindir}/ceph-nvmeof-monitor-client
+
 %files fuse
 %{_bindir}/ceph-fuse
 %{_mandir}/man8/ceph-fuse.8*

src/CMakeLists.txt

Lines changed: 112 additions & 0 deletions
@@ -305,6 +305,12 @@ endif(WITH_BLKIN)
 
 if(WITH_JAEGER)
   find_package(thrift 0.13.0 REQUIRED)
+
+  if(EXISTS "/etc/redhat-release" OR EXISTS "/etc/fedora-release")
+    # absl is installed as grpc build dependency on RPM based systems
+    add_definitions(-DHAVE_ABSEIL)
+  endif()
+
   include(BuildOpentelemetry)
   build_opentelemetry()
   add_library(jaeger_base INTERFACE)
@@ -875,6 +881,112 @@ if(WITH_FUSE)
   install(PROGRAMS mount.fuse.ceph DESTINATION ${CMAKE_INSTALL_SBINDIR})
 endif(WITH_FUSE)
 
+# NVMEOF GATEWAY MONITOR CLIENT
+# Supported on RPM-based platforms only, depends on grpc devel libraries/tools
+if(EXISTS "/etc/redhat-release" OR EXISTS "/etc/fedora-release")
+  option(WITH_NVMEOF_GATEWAY_MONITOR_CLIENT "build nvmeof gateway monitor client" ON)
+else()
+  option(WITH_NVMEOF_GATEWAY_MONITOR_CLIENT "build nvmeof gateway monitor client" OFF)
+endif()
+
+if(WITH_NVMEOF_GATEWAY_MONITOR_CLIENT)
+
+  # Find Protobuf installation
+  # Looks for protobuf-config.cmake file installed by Protobuf's cmake installation.
+  option(protobuf_MODULE_COMPATIBLE TRUE)
+  find_package(Protobuf REQUIRED)
+
+  set(_REFLECTION grpc++_reflection)
+  if(CMAKE_CROSSCOMPILING)
+    find_program(_PROTOBUF_PROTOC protoc)
+  else()
+    set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
+  endif()
+
+  # Find gRPC installation
+  # Looks for gRPCConfig.cmake file installed by gRPC's cmake installation.
+  find_package(gRPC CONFIG REQUIRED)
+  message(STATUS "Using gRPC ${gRPC_VERSION}")
+  set(_GRPC_GRPCPP gRPC::grpc++)
+  if(CMAKE_CROSSCOMPILING)
+    find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
+  else()
+    set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:gRPC::grpc_cpp_plugin>)
+  endif()
+
+  # Gateway Proto file
+  get_filename_component(nvmeof_gateway_proto "nvmeof/gateway/control/proto/gateway.proto" ABSOLUTE)
+  get_filename_component(nvmeof_gateway_proto_path "${nvmeof_gateway_proto}" PATH)
+
+  # Generated sources
+  set(nvmeof_gateway_proto_srcs "${CMAKE_CURRENT_BINARY_DIR}/gateway.pb.cc")
+  set(nvmeof_gateway_proto_hdrs "${CMAKE_CURRENT_BINARY_DIR}/gateway.pb.h")
+  set(nvmeof_gateway_grpc_srcs "${CMAKE_CURRENT_BINARY_DIR}/gateway.grpc.pb.cc")
+  set(nvmeof_gateway_grpc_hdrs "${CMAKE_CURRENT_BINARY_DIR}/gateway.grpc.pb.h")
+
+  add_custom_command(
+    OUTPUT "${nvmeof_gateway_proto_srcs}" "${nvmeof_gateway_proto_hdrs}" "${nvmeof_gateway_grpc_srcs}" "${nvmeof_gateway_grpc_hdrs}"
+    COMMAND ${_PROTOBUF_PROTOC}
+    ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
+      --cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
+      -I "${nvmeof_gateway_proto_path}"
+      --experimental_allow_proto3_optional
+      --plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN_EXECUTABLE}"
+      "${nvmeof_gateway_proto}"
+    DEPENDS "${nvmeof_gateway_proto}")
+
+
+  # Monitor Proto file
+  get_filename_component(nvmeof_monitor_proto "nvmeof/gateway/control/proto/monitor.proto" ABSOLUTE)
+  get_filename_component(nvmeof_monitor_proto_path "${nvmeof_monitor_proto}" PATH)
+
+  # Generated sources
+  set(nvmeof_monitor_proto_srcs "${CMAKE_CURRENT_BINARY_DIR}/monitor.pb.cc")
+  set(nvmeof_monitor_proto_hdrs "${CMAKE_CURRENT_BINARY_DIR}/monitor.pb.h")
+  set(nvmeof_monitor_grpc_srcs "${CMAKE_CURRENT_BINARY_DIR}/monitor.grpc.pb.cc")
+  set(nvmeof_monitor_grpc_hdrs "${CMAKE_CURRENT_BINARY_DIR}/monitor.grpc.pb.h")
+
+  add_custom_command(
+    OUTPUT "${nvmeof_monitor_proto_srcs}" "${nvmeof_monitor_proto_hdrs}" "${nvmeof_monitor_grpc_srcs}" "${nvmeof_monitor_grpc_hdrs}"
+    COMMAND ${_PROTOBUF_PROTOC}
+    ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
+      --cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
+      -I "${nvmeof_monitor_proto_path}"
+      --experimental_allow_proto3_optional
+      --plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN_EXECUTABLE}"
+      "${nvmeof_monitor_proto}"
+    DEPENDS "${nvmeof_monitor_proto}")
+
+  # Include generated *.pb.h files
+  include_directories("${CMAKE_CURRENT_BINARY_DIR}")
+
+  set(ceph_nvmeof_monitor_client_srcs
+    ${nvmeof_gateway_proto_srcs}
+    ${nvmeof_gateway_proto_hdrs}
+    ${nvmeof_gateway_grpc_srcs}
+    ${nvmeof_gateway_grpc_hdrs}
+    ${nvmeof_monitor_proto_srcs}
+    ${nvmeof_monitor_proto_hdrs}
+    ${nvmeof_monitor_grpc_srcs}
+    ${nvmeof_monitor_grpc_hdrs}
+    ceph_nvmeof_monitor_client.cc
+    nvmeof/NVMeofGwClient.cc
+    nvmeof/NVMeofGwMonitorGroupClient.cc
+    nvmeof/NVMeofGwMonitorClient.cc)
+  add_executable(ceph-nvmeof-monitor-client ${ceph_nvmeof_monitor_client_srcs})
+  add_dependencies(ceph-nvmeof-monitor-client ceph-common)
+  target_link_libraries(ceph-nvmeof-monitor-client
+    client
+    mon
+    global-static
+    ceph-common
+    ${_REFLECTION}
+    ${_GRPC_GRPCPP}
+    )
+  install(TARGETS ceph-nvmeof-monitor-client DESTINATION bin)
+endif()
+# END OF NVMEOF GATEWAY MONITOR CLIENT
+
 if(WITH_DOKAN)
   add_subdirectory(dokan)
 endif(WITH_DOKAN)

src/ceph_nvmeof_monitor_client.cc

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
+// vim: ts=8 sw=2 smarttab
+/*
+ * Ceph - scalable distributed file system
+ *
+ * Copyright (C) 2023 IBM Inc
+ *
+ * Author: Alexander Indenbaum <[email protected]>
+ *
+ * This is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License version 2.1, as published by the Free Software
+ * Foundation. See file COPYING.
+ *
+ */
+
+#include <pthread.h>
+
+#include "include/types.h"
+#include "include/compat.h"
+#include "common/config.h"
+#include "common/ceph_argparse.h"
+#include "common/errno.h"
+#include "common/pick_address.h"
+#include "global/global_init.h"
+
+#include "nvmeof/NVMeofGwMonitorClient.h"
+
+static void usage()
+{
+  std::cout << "usage: ceph-nvmeof-monitor-client\n"
+               " --gateway-name <GW_NAME>\n"
+               " --gateway-address <GW_ADDRESS>\n"
+               " --gateway-pool <CEPH_POOL>\n"
+               " --gateway-group <GW_GROUP>\n"
+               " --monitor-group-address <MONITOR_GROUP_ADDRESS>\n"
+               " [flags]\n"
+            << std::endl;
+  generic_server_usage();
+}
+
+/**
+ * A short main() which just instantiates a Nvme and
+ * hands over control to that.
+ */
+int main(int argc, const char **argv)
+{
+  ceph_pthread_setname(pthread_self(), "ceph-nvmeof-monitor-client");
+
+  auto args = argv_to_vec(argc, argv);
+  if (args.empty()) {
+    std::cerr << argv[0] << ": -h or --help for usage" << std::endl;
+    exit(1);
+  }
+  if (ceph_argparse_need_usage(args)) {
+    usage();
+    exit(0);
+  }
+
+  auto cct = global_init(nullptr, args, CEPH_ENTITY_TYPE_CLIENT,
+                         CODE_ENVIRONMENT_UTILITY, // maybe later use CODE_ENVIRONMENT_DAEMON,
+                         CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);
+
+  pick_addresses(g_ceph_context, CEPH_PICK_ADDRESS_PUBLIC);
+
+  global_init_daemonize(g_ceph_context);
+  global_init_chdir(g_ceph_context);
+  common_init_finish(g_ceph_context);
+
+  NVMeofGwMonitorClient gw_monitor_client(argc, argv);
+  int rc = gw_monitor_client.init();
+  if (rc != 0) {
+    std::cerr << "Error in initialization: " << cpp_strerror(rc) << std::endl;
+    return rc;
+  }
+
+  return gw_monitor_client.main(args);
+}
+

src/common/options/global.yaml.in

Lines changed: 7 additions & 0 deletions
@@ -1755,6 +1755,13 @@ options:
   default: 500
   services:
   - mon
+- name: mon_max_nvmeof_epochs
+  type: int
+  level: advanced
+  desc: max number of nvmeof gateway maps to store
+  default: 500
+  services:
+  - mon
 - name: mon_max_osd
   type: int
   level: advanced
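
The `mon_max_nvmeof_epochs` option caps how many gateway map epochs are stored. As a rough illustration of what such a cap implies, the sketch below keeps a bounded history and drops the oldest epochs first. `EpochHistory` is a hypothetical class written for this note; it is not the monitor's Paxos trimming code, and the oldest-first policy is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical history of encoded gateway maps, keyed by epoch.
class EpochHistory {
 public:
  explicit EpochHistory(std::size_t max_epochs) : max_epochs_(max_epochs) {
    assert(max_epochs_ > 0);
  }

  // Store a new epoch, then drop the oldest epochs beyond the cap.
  void store(std::uint64_t epoch, std::string encoded_map) {
    maps_[epoch] = std::move(encoded_map);
    while (maps_.size() > max_epochs_) {
      maps_.erase(maps_.begin());  // std::map iterates in ascending epoch order
    }
  }

  std::size_t size() const { return maps_.size(); }
  bool contains(std::uint64_t epoch) const { return maps_.count(epoch) != 0; }

 private:
  std::size_t max_epochs_;
  std::map<std::uint64_t, std::string> maps_;
};
```

With the default of 500, a history built this way would hold the 500 most recent epochs and discard anything older.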

src/common/options/mon.yaml.in

Lines changed: 34 additions & 0 deletions
@@ -72,6 +72,25 @@ options:
   default: 30
   services:
   - mon
+- name: mon_nvmeofgw_beacon_grace
+  type: secs
+  level: advanced
+  desc: Period in seconds from last beacon to monitor marking a NVMeoF gateway as
+    failed
+  default: 10
+  services:
+  - mon
+- name: mon_nvmeofgw_set_group_id_retry
+  type: uint
+  level: advanced
+  desc: Retry wait time in microsecond for set group id between the monitor client
+    and gateway
+  long_desc: The monitor server determines the gateway's group ID. If the monitor client
+    receives a monitor group ID assignment before the gateway is fully up during
+    initialization, a retry is required.
+  default: 1000
+  services:
+  - mon
 - name: mon_mgr_inactive_grace
   type: int
   level: advanced
@@ -1341,3 +1360,18 @@ options:
   with_legacy: true
   see_also:
   - osd_heartbeat_use_min_delay_socket
+- name: nvmeof_mon_client_disconnect_panic
+  type: secs
+  level: advanced
+  desc: The duration, expressed in seconds, after which the nvmeof gateway
+    should trigger a panic if it loses connection to the monitor
+  default: 100
+  services:
+  - mon
+- name: nvmeof_mon_client_tick_period
+  type: secs
+  level: advanced
+  desc: Period in seconds of nvmeof gateway beacon messages to monitor
+  default: 2
+  services:
+  - mon
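
The three timing options in this file interact: the gateway sends a beacon every `nvmeof_mon_client_tick_period`, the monitor marks a gateway failed after `mon_nvmeofgw_beacon_grace` without a beacon, and the gateway panics after `nvmeof_mon_client_disconnect_panic` without monitor contact. The sketch below only models that arithmetic with the default values. The function names are hypothetical, and whether the real checks use a strict or non-strict comparison is an assumption.

```cpp
#include <cassert>
#include <chrono>

using Seconds = std::chrono::seconds;

// Defaults copied from the option definitions above.
constexpr Seconds kBeaconGrace{10};       // mon_nvmeofgw_beacon_grace
constexpr Seconds kTickPeriod{2};         // nvmeof_mon_client_tick_period
constexpr Seconds kDisconnectPanic{100};  // nvmeof_mon_client_disconnect_panic

// Monitor side (assumed rule): a gateway counts as failed once more than
// `grace` has elapsed since its last beacon.
bool gateway_timed_out(Seconds last_beacon, Seconds now,
                       Seconds grace = kBeaconGrace) {
  assert(now >= last_beacon);
  return now - last_beacon > grace;
}

// How many whole beacon periods fit into the grace window.
long long beacons_per_grace(Seconds grace = kBeaconGrace,
                            Seconds tick = kTickPeriod) {
  assert(tick.count() > 0);
  return grace.count() / tick.count();
}

// Gateway side (assumed rule): give up once the monitor has been unreachable
// for longer than the panic limit.
bool should_panic(Seconds last_monitor_contact, Seconds now,
                  Seconds limit = kDisconnectPanic) {
  assert(now >= last_monitor_contact);
  return now - last_monitor_contact > limit;
}
```

With the defaults, five beacon periods fit into the grace window, so a single delayed beacon should not by itself trigger failover, while the gateway-side panic limit is ten times longer than the monitor-side grace.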
