Commit bd0160d

Merge pull request ceph#59982 from rkachach/fix_issue_mgmt_gw_high_availability
Adding HA support for mgmt-gateway and oauth2-proxy services

Reviewed-by: Adam king <[email protected]>
Reviewed-by: Anthony D'Atri <[email protected]>
Reviewed-by: Juan Miguel Olmo Martínez <[email protected]>
2 parents adc6f2d + 4b9d6a3 commit bd0160d

13 files changed, with 303 additions and 112 deletions

doc/cephadm/services/mgmt-gateway.rst

Lines changed: 49 additions & 3 deletions
@@ -49,6 +49,55 @@ monitoring `mgmt-gateway` takes care of handling HA when several instances of Pr
 available. The reverse proxy will automatically detect healthy instances and use them to process user requests.
 
 
+High Availability for mgmt-gateway service
+==========================================
+
+In addition to providing high availability for the underlying backend services, the mgmt-gateway
+service itself can be configured for high availability, ensuring that the system remains resilient
+even if certain core components for the service fail.
+
+Multiple mgmt-gateway instances can be deployed in an active/standby configuration using keepalived
+for seamless failover. The `oauth2-proxy` service can be deployed as multiple stateless instances,
+with nginx acting as a load balancer across them using round-robin strategy. This setup removes
+single points of failure and enhances the resilience of the entire system.
+
+In this setup, the underlying internal services follow the same high availability mechanism. Instead of
+directly accessing the `mgmt-gateway` internal endpoint, services use the virtual IP specified in the spec.
+This ensures that the high availability mechanism for `mgmt-gateway` is transparent to other services.
+
+Example Configuration for High Availability
+
+To deploy the mgmt-gateway in a high availability setup, here is an example of the specification files required:
+
+`mgmt-gateway` Configuration:
+
+.. code-block:: yaml
+
+    service_type: mgmt-gateway
+    placement:
+      label: mgmt
+    spec:
+      enable_auth: true
+      virtual_ip: 192.168.100.220
+
+`Ingress` Configuration for Keepalived:
+
+.. code-block:: yaml
+
+    service_type: ingress
+    service_id: ingress-mgmt-gw
+    placement:
+      label: mgmt
+    virtual_ip: 192.168.100.220
+    backend_service: mgmt-gateway
+    keepalive_only: true
+
+The number of deployed instances is determined by the number of hosts with the mgmt label.
+The ingress is configured in `keepalive_only` mode, with labels ensuring that any changes to
+the mgmt-gateway daemons are replicated to the corresponding keepalived instances. Additionally,
+the `virtual_ip` parameter must be identical in both specifications.
+
+
 Accessing services with mgmt-gateway
 ====================================
 
@@ -123,9 +172,6 @@ The specification can then be applied by running the following command:
 Limitations
 ===========
 
-A non-exhaustive list of important limitations for the mgmt-gateway service follows:
-
-* High-availability configurations and clustering for the mgmt-gateway service itself are currently not supported.
 * Services must bind to the appropriate ports based on the applications being proxied. Ensure that there
   are no port conflicts that might disrupt service availability.
 
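
The documentation added above requires that `virtual_ip` be identical in the `mgmt-gateway` and `ingress` specifications, and that the ingress runs in `keepalive_only` mode against the `mgmt-gateway` backend. A minimal sketch of a pre-flight check for those constraints follows. The helper `check_ha_specs` is hypothetical and not part of cephadm; the specs are modelled as plain dictionaries mirroring the YAML examples.

```python
from typing import Any, Dict, List

# Plain-dict mirrors of the two YAML examples from the documentation.
MGMT_GW_SPEC: Dict[str, Any] = {
    "service_type": "mgmt-gateway",
    "placement": {"label": "mgmt"},
    "spec": {"enable_auth": True, "virtual_ip": "192.168.100.220"},
}
INGRESS_SPEC: Dict[str, Any] = {
    "service_type": "ingress",
    "service_id": "ingress-mgmt-gw",
    "placement": {"label": "mgmt"},
    "virtual_ip": "192.168.100.220",
    "backend_service": "mgmt-gateway",
    "keepalive_only": True,
}


def check_ha_specs(mgmt_gw: Dict[str, Any], ingress: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of problems, empty if the pair is consistent."""
    problems: List[str] = []
    gw_vip = mgmt_gw.get("spec", {}).get("virtual_ip")
    ingress_vip = ingress.get("virtual_ip")
    if gw_vip is None:
        problems.append("mgmt-gateway spec has no virtual_ip")
    elif gw_vip != ingress_vip:
        problems.append(f"virtual_ip mismatch: {gw_vip!r} vs {ingress_vip!r}")
    if not ingress.get("keepalive_only"):
        problems.append("ingress is not in keepalive_only mode")
    if ingress.get("backend_service") != mgmt_gw.get("service_type"):
        problems.append("ingress backend_service does not point at mgmt-gateway")
    if ingress.get("placement") != mgmt_gw.get("placement"):
        problems.append("placements differ, so keepalived may not follow the gateways")
    return problems


print(check_ha_specs(MGMT_GW_SPEC, INGRESS_SPEC))
```

Comparing placements is a deliberate simplification: the documentation only states that a shared label keeps the keepalived instances aligned with the mgmt-gateway daemons.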

doc/cephadm/services/oauth2-proxy.rst

Lines changed: 5 additions & 4 deletions
@@ -42,8 +42,10 @@ a secure and flexible authentication mechanism.
 
 High availability
 ==============================
-`oauth2-proxy` is designed to integrate with an external IDP hence login high availability is not the responsibility of this
-service. In squid release high availability for the service itself is not supported yet.
+In general, `oauth2-proxy` is used in conjunction with the `mgmt-gateway`. The `oauth2-proxy` service can be deployed as multiple
+stateless instances, with the `mgmt-gateway` (nginx reverse-proxy) handling load balancing across these instances using a round-robin strategy.
+Since oauth2-proxy integrates with an external identity provider (IDP), ensuring high availability for login is managed externally
+and not the responsibility of this service.
 
 
 Accessing services with oauth2-proxy
@@ -70,8 +72,7 @@ An `oauth2-proxy` service can be applied using a specification. An example in YA
     service_type: oauth2-proxy
     service_id: auth-proxy
     placement:
-      hosts:
-        - ceph0
+      label: mgmt
     spec:
       https_address: "0.0.0.0:4180"
       provider_display_name: "My OIDC Provider"
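
The updated text says that nginx balances requests across stateless `oauth2-proxy` instances with a round-robin strategy. The following is a minimal sketch of what round-robin selection means, not of how nginx implements it. The endpoint strings and the `round_robin` helper are illustrative assumptions.

```python
from itertools import cycle
from typing import Iterator, List


def round_robin(endpoints: List[str]) -> Iterator[str]:
    """Yield endpoints in a fixed rotation; each call to next() picks the following one."""
    if not endpoints:
        raise ValueError("at least one oauth2-proxy endpoint is required")
    return cycle(endpoints)


# Hypothetical endpoints for two oauth2-proxy daemons on hosts labelled mgmt.
picker = round_robin(["ceph0:4180", "ceph1:4180"])
for _ in range(4):
    print(next(picker))
```

Because the instances are stateless, any of them can serve any request, which is what makes this simple rotation sufficient.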

src/pybind/mgr/cephadm/module.py

Lines changed: 45 additions & 37 deletions
@@ -822,30 +822,33 @@ def _get_security_config(self) -> Tuple[bool, bool, bool]:
         security_enabled = self.secure_monitoring_stack or mgmt_gw_enabled
         return security_enabled, mgmt_gw_enabled, oauth2_proxy_enabled
 
-    def get_mgmt_gw_internal_endpoint(self) -> Optional[str]:
+    def _get_mgmt_gw_endpoint(self, is_internal: bool) -> Optional[str]:
         mgmt_gw_daemons = self.cache.get_daemons_by_service('mgmt-gateway')
         if not mgmt_gw_daemons:
             return None
 
         dd = mgmt_gw_daemons[0]
         assert dd.hostname is not None
-        mgmt_gw_addr = self.get_fqdn(dd.hostname)
-        mgmt_gw_internal_endpoint = build_url(scheme='https', host=mgmt_gw_addr, port=MgmtGatewayService.INTERNAL_SERVICE_PORT)
-        return f'{mgmt_gw_internal_endpoint}/internal'
+        mgmt_gw_spec = cast(MgmtGatewaySpec, self.spec_store['mgmt-gateway'].spec)
+        mgmt_gw_addr = mgmt_gw_spec.virtual_ip if mgmt_gw_spec.virtual_ip is not None else self.get_fqdn(dd.hostname)
 
-    def get_mgmt_gw_external_endpoint(self) -> Optional[str]:
-        mgmt_gw_daemons = self.cache.get_daemons_by_service('mgmt-gateway')
-        if not mgmt_gw_daemons:
-            return None
+        if is_internal:
+            mgmt_gw_port: Optional[int] = MgmtGatewayService.INTERNAL_SERVICE_PORT
+            protocol = 'https'
+            endpoint_suffix = '/internal'
+        else:
+            mgmt_gw_port = dd.ports[0] if dd.ports else None
+            protocol = 'http' if mgmt_gw_spec.disable_https else 'https'
+            endpoint_suffix = ''
 
-        dd = mgmt_gw_daemons[0]
-        assert dd.hostname is not None
-        mgmt_gw_port = dd.ports[0] if dd.ports else None
-        mgmt_gw_addr = self.get_fqdn(dd.hostname)
-        mgmt_gw_spec = cast(MgmtGatewaySpec, self.spec_store['mgmt-gateway'].spec)
-        protocol = 'http' if mgmt_gw_spec.disable_https else 'https'
-        mgmt_gw_external_endpoint = build_url(scheme=protocol, host=mgmt_gw_addr, port=mgmt_gw_port)
-        return mgmt_gw_external_endpoint
+        mgmt_gw_endpoint = build_url(scheme=protocol, host=mgmt_gw_addr, port=mgmt_gw_port)
+        return f'{mgmt_gw_endpoint}{endpoint_suffix}'
+
+    def get_mgmt_gw_internal_endpoint(self) -> Optional[str]:
+        return self._get_mgmt_gw_endpoint(is_internal=True)
+
+    def get_mgmt_gw_external_endpoint(self) -> Optional[str]:
+        return self._get_mgmt_gw_endpoint(is_internal=False)
 
     def _get_cephadm_binary_path(self) -> str:
         import hashlib
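
The hunk above folds the internal and external endpoint getters into one helper that prefers the spec's `virtual_ip` over the daemon's FQDN. The selection logic can be sketched standalone as below. The function name, the simplified URL formatting (standing in for `build_url`, with no IPv6 bracket handling), and the port constant are assumptions for illustration rather than the cephadm API.

```python
from typing import Optional

# Assumption: placeholder standing in for MgmtGatewayService.INTERNAL_SERVICE_PORT.
INTERNAL_SERVICE_PORT = 29443


def mgmt_gw_endpoint(
    fqdn: str,
    virtual_ip: Optional[str],
    is_internal: bool,
    external_port: Optional[int] = None,
    disable_https: bool = False,
) -> str:
    """Pick host, port, scheme and suffix the same way the refactored helper does."""
    # The virtual IP wins when configured, so callers never target a single host.
    host = virtual_ip if virtual_ip is not None else fqdn
    if is_internal:
        port: Optional[int] = INTERNAL_SERVICE_PORT
        scheme, suffix = "https", "/internal"
    else:
        port = external_port
        scheme, suffix = ("http" if disable_https else "https"), ""
    netloc = host if port is None else f"{host}:{port}"
    return f"{scheme}://{netloc}{suffix}"


print(mgmt_gw_endpoint("ceph0.example.com", "192.168.100.220", is_internal=True))
print(mgmt_gw_endpoint("ceph0.example.com", None, is_internal=False, external_port=443))
```

With a virtual IP set, both endpoints stay valid across a keepalived failover, which matches the documentation's claim that HA is transparent to other services.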
@@ -3004,8 +3007,16 @@ def get_daemon_names(daemons: List[str]) -> List[str]:
                     daemon_names.append(dd.name())
             return daemon_names
 
-        alertmanager_user, alertmanager_password = self._get_alertmanager_credentials()
-        prometheus_user, prometheus_password = self._get_prometheus_credentials()
+        prom_cred_hash = None
+        alertmgr_cred_hash = None
+        security_enabled, mgmt_gw_enabled, _ = self._get_security_config()
+        if security_enabled:
+            alertmanager_user, alertmanager_password = self._get_alertmanager_credentials()
+            prometheus_user, prometheus_password = self._get_prometheus_credentials()
+            if prometheus_user and prometheus_password:
+                prom_cred_hash = f'{utils.md5_hash(prometheus_user + prometheus_password)}'
+            if alertmanager_user and alertmanager_password:
+                alertmgr_cred_hash = f'{utils.md5_hash(alertmanager_user + alertmanager_password)}'
 
         deps = []
         if daemon_type == 'haproxy':
@@ -3052,9 +3063,10 @@ def get_daemon_names(daemons: List[str]) -> List[str]:
             else:
                 deps = [self.get_mgr_ip()]
         elif daemon_type == 'prometheus':
-            # for prometheus we add the active mgr as an explicit dependency,
-            # this way we force a redeploy after a mgr failover
-            deps.append(self.get_active_mgr().name())
+            if not mgmt_gw_enabled:
+                # for prometheus we add the active mgr as an explicit dependency,
+                # this way we force a redeploy after a mgr failover
+                deps.append(self.get_active_mgr().name())
             deps.append(str(self.get_module_option_ex('prometheus', 'server_port', 9283)))
             deps.append(str(self.service_discovery_port))
             # prometheus yaml configuration file (generated by prometheus.yml.j2) contains
@@ -3071,22 +3083,20 @@ def get_daemon_names(daemons: List[str]) -> List[str]:
             deps += [d.name() for d in self.cache.get_daemons_by_service('ceph-exporter')]
             deps += [d.name() for d in self.cache.get_daemons_by_service('mgmt-gateway')]
             deps += [d.name() for d in self.cache.get_daemons_by_service('oauth2-proxy')]
-            security_enabled, _, _ = self._get_security_config()
-            if security_enabled:
-                if prometheus_user and prometheus_password:
-                    deps.append(f'{hash(prometheus_user + prometheus_password)}')
-                if alertmanager_user and alertmanager_password:
-                    deps.append(f'{hash(alertmanager_user + alertmanager_password)}')
+            if prom_cred_hash is not None:
+                deps.append(prom_cred_hash)
+            if alertmgr_cred_hash is not None:
+                deps.append(alertmgr_cred_hash)
         elif daemon_type == 'grafana':
             deps += get_daemon_names(['prometheus', 'loki', 'mgmt-gateway', 'oauth2-proxy'])
-            security_enabled, _, _ = self._get_security_config()
-            if security_enabled and prometheus_user and prometheus_password:
-                deps.append(f'{hash(prometheus_user + prometheus_password)}')
+            if prom_cred_hash is not None:
+                deps.append(prom_cred_hash)
         elif daemon_type == 'alertmanager':
-            deps += get_daemon_names(['mgr', 'alertmanager', 'snmp-gateway', 'mgmt-gateway', 'oauth2-proxy'])
-            security_enabled, _, _ = self._get_security_config()
-            if security_enabled and alertmanager_user and alertmanager_password:
-                deps.append(f'{hash(alertmanager_user + alertmanager_password)}')
+            deps += get_daemon_names(['alertmanager', 'snmp-gateway', 'mgmt-gateway', 'oauth2-proxy'])
+            if not mgmt_gw_enabled:
+                deps += get_daemon_names(['mgr'])
+            if alertmgr_cred_hash is not None:
+                deps.append(alertmgr_cred_hash)
         elif daemon_type == 'promtail':
             deps += get_daemon_names(['loki'])
         elif daemon_type in ['ceph-exporter', 'node-exporter']:
@@ -3098,9 +3108,7 @@ def get_daemon_names(daemons: List[str]) -> List[str]:
                 deps.append(build_url(host=dd.hostname, port=port).lstrip('/'))
             deps = sorted(deps)
         elif daemon_type == 'mgmt-gateway':
-            # url_prefix for monitoring daemons depends on the presence of mgmt-gateway
-            # while dashboard urls depend on the mgr daemons
-            deps += get_daemon_names(['mgr', 'grafana', 'prometheus', 'alertmanager', 'oauth2-proxy'])
+            deps = MgmtGatewayService.get_dependencies(self)
         else:
             # this daemon type doesn't need deps mgmt
             pass
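
The dependency hunks above replace the built-in `hash()` with `utils.md5_hash` when fingerprinting credentials. The commit does not state the motivation, but one relevant property is that Python salts `hash()` of strings per process unless `PYTHONHASHSEED` is fixed, so the same credentials yield a different dependency string after a mgr restart or failover. A digest-based fingerprint is stable across processes. The sketch below assumes `utils.md5_hash` behaves like a hex MD5 digest of the UTF-8 input; `stable_cred_hash` is a hypothetical stand-in.

```python
import hashlib


def stable_cred_hash(user: str, password: str) -> str:
    """Fingerprint credentials so the value is identical in every Python process.

    Assumption: this mirrors what utils.md5_hash does. It is a change detector
    for the dependency list, not a secure way to store a password.
    """
    return hashlib.md5((user + password).encode("utf-8")).hexdigest()


fingerprint = stable_cred_hash("admin", "secret")
print(fingerprint)  # same hex string on every run, unlike hash("adminsecret")
```

A stable fingerprint means a daemon is redeployed only when the credentials actually change.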

src/pybind/mgr/cephadm/services/ingress.py

Lines changed: 10 additions & 1 deletion
@@ -241,7 +241,12 @@ def keepalived_generate_config(
         if spec.keepalived_password:
             password = spec.keepalived_password
 
-        daemons = self.mgr.cache.get_daemons_by_service(spec.service_name())
+        if spec.keepalive_only:
+            # when keepalive_only instead of haproxy, we have to monitor the backend service daemons
+            if spec.backend_service is not None:
+                daemons = self.mgr.cache.get_daemons_by_service(spec.backend_service)
+        else:
+            daemons = self.mgr.cache.get_daemons_by_service(spec.service_name())
 
         if not daemons and not spec.keepalive_only:
             raise OrchestratorError(
@@ -297,6 +302,10 @@ def _get_valid_interface_and_ip(vip: str, host: str) -> Tuple[str, str]:
                     port = d.ports[1]  # monitoring port
                     host_ip = d.ip or self.mgr.inventory.get_addr(d.hostname)
                     script = f'/usr/bin/curl {build_url(scheme="http", host=host_ip, port=port)}/health'
+                elif d.daemon_type == 'mgmt-gateway':
+                    mgmt_gw_port = d.ports[0] if d.ports else None
+                    host_ip = d.ip or self.mgr.inventory.get_addr(d.hostname)
+                    script = f'/usr/bin/curl -k {build_url(scheme="https", host=host_ip, port=mgmt_gw_port)}/health'
         assert script
 
         states = []
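
The hunk above teaches the keepalived config generator to probe a `mgmt-gateway` daemon over HTTPS on its first port, using `curl -k` so that a self-signed certificate does not fail the check. A minimal standalone sketch of that branching follows. `health_check_script` is a hypothetical helper, and the plain `host:port` formatting is a simplification of `build_url` that ignores IPv6 addresses.

```python
from typing import List, Optional


def health_check_script(daemon_type: str, host_ip: str, ports: List[int]) -> Optional[str]:
    """Return the curl command keepalived would run, or None for unknown daemon types."""
    if daemon_type == "haproxy":
        if len(ports) < 2:
            return None  # the monitoring port is the second entry
        return f"/usr/bin/curl http://{host_ip}:{ports[1]}/health"
    if daemon_type == "mgmt-gateway":
        # -k skips certificate verification, since the gateway may use a self-signed cert.
        netloc = f"{host_ip}:{ports[0]}" if ports else host_ip
        return f"/usr/bin/curl -k https://{netloc}/health"
    return None


print(health_check_script("mgmt-gateway", "192.168.100.11", [443]))
print(health_check_script("haproxy", "192.168.100.11", [8080, 1967]))
```

If the probe fails on the active node, keepalived can move the virtual IP to a standby whose gateway still answers `/health`.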

src/pybind/mgr/cephadm/services/mgmt_gateway.py

Lines changed: 42 additions & 24 deletions
@@ -1,10 +1,12 @@
 import logging
-from typing import List, Any, Tuple, Dict, cast, Optional
+from typing import List, Any, Tuple, Dict, cast, TYPE_CHECKING
 
 from orchestrator import DaemonDescription
 from ceph.deployment.service_spec import MgmtGatewaySpec, GrafanaSpec
 from cephadm.services.cephadmservice import CephadmService, CephadmDaemonDeploySpec, get_dashboard_endpoints
 
+if TYPE_CHECKING:
+    from ..module import CephadmOrchestrator
 
 logger = logging.getLogger(__name__)
 
@@ -36,10 +38,11 @@ def get_active_daemon(self, daemon_descrs: List[DaemonDescription]) -> DaemonDes
         # if empty list provided, return empty Daemon Desc
         return DaemonDescription()
 
-    def get_oauth2_service_url(self) -> Optional[str]:
-        # TODO(redo): check how can we create several servers for HA
-        oauth2_servers = self.get_service_endpoints('oauth2-proxy')
-        return f'https://{oauth2_servers[0]}' if oauth2_servers else None
+    def get_mgmt_gw_ips(self, svc_spec: MgmtGatewaySpec, daemon_spec: CephadmDaemonDeploySpec) -> List[str]:
+        mgmt_gw_ips = [self.mgr.inventory.get_addr(daemon_spec.host)]
+        if svc_spec.virtual_ip is not None:
+            mgmt_gw_ips.append(svc_spec.virtual_ip)
+        return mgmt_gw_ips
 
     def config_dashboard(self, daemon_descrs: List[DaemonDescription]) -> None:
         # we adjust the standby behaviour so rev-proxy can pick correctly the active instance
@@ -56,9 +59,9 @@ def get_external_certificates(self, svc_spec: MgmtGatewaySpec, daemon_spec: Ceph
                 key = svc_spec.ssl_certificate_key
             else:
                 # not provided on the spec, let's generate self-sigend certificates
-                addr = self.mgr.inventory.get_addr(daemon_spec.host)
+                ips = self.get_mgmt_gw_ips(svc_spec, daemon_spec)
                 host_fqdn = self.mgr.get_fqdn(daemon_spec.host)
-                cert, key = self.mgr.cert_mgr.generate_cert(host_fqdn, addr)
+                cert, key = self.mgr.cert_mgr.generate_cert(host_fqdn, ips)
             # save certificates
             if cert and key:
                 self.mgr.cert_key_store.save_cert('mgmt_gw_cert', cert)
@@ -67,23 +70,33 @@ def get_external_certificates(self, svc_spec: MgmtGatewaySpec, daemon_spec: Ceph
                 logger.error("Failed to obtain certificate and key from mgmt-gateway.")
         return cert, key
 
-    def get_internal_certificates(self, daemon_spec: CephadmDaemonDeploySpec) -> Tuple[str, str]:
-        node_ip = self.mgr.inventory.get_addr(daemon_spec.host)
+    def get_internal_certificates(self, svc_spec: MgmtGatewaySpec, daemon_spec: CephadmDaemonDeploySpec) -> Tuple[str, str]:
+        ips = self.get_mgmt_gw_ips(svc_spec, daemon_spec)
         host_fqdn = self.mgr.get_fqdn(daemon_spec.host)
-        return self.mgr.cert_mgr.generate_cert(host_fqdn, node_ip)
+        return self.mgr.cert_mgr.generate_cert(host_fqdn, ips)
 
-    def get_mgmt_gateway_deps(self) -> List[str]:
-        # url_prefix for the following services depends on the presence of mgmt-gateway
-        deps: List[str] = []
-        deps += [d.name() for d in self.mgr.cache.get_daemons_by_service('prometheus')]
-        deps += [d.name() for d in self.mgr.cache.get_daemons_by_service('alertmanager')]
-        deps += [d.name() for d in self.mgr.cache.get_daemons_by_service('grafana')]
-        deps += [d.name() for d in self.mgr.cache.get_daemons_by_service('oauth2-proxy')]
+    def get_service_discovery_endpoints(self) -> List[str]:
+        sd_endpoints = []
         for dd in self.mgr.cache.get_daemons_by_service('mgr'):
-            # we consider mgr a dep even if the dashboard is disabled
-            # in order to be consistent with _calc_daemon_deps().
-            deps.append(dd.name())
+            assert dd.hostname is not None
+            addr = dd.ip if dd.ip else self.mgr.inventory.get_addr(dd.hostname)
+            sd_endpoints.append(f"{addr}:{self.mgr.service_discovery_port}")
+        return sd_endpoints
 
+    @staticmethod
+    def get_dependencies(mgr: "CephadmOrchestrator") -> List[str]:
+        # url_prefix for the following services depends on the presence of mgmt-gateway
+        deps = [
+            f'{d.name()}:{d.ports[0]}' if d.ports else d.name()
+            for service in ['prometheus', 'alertmanager', 'grafana', 'oauth2-proxy']
+            for d in mgr.cache.get_daemons_by_service(service)
+        ]
+        # dashboard and service discovery urls depend on the mgr daemons
+        deps += [
+            f'{d.name()}'
+            for service in ['mgr']
+            for d in mgr.cache.get_daemons_by_service(service)
+        ]
         return deps
 
     def generate_config(self, daemon_spec: CephadmDaemonDeploySpec) -> Tuple[Dict[str, Any], List[str]]:
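
The new `get_dependencies` static method builds the dependency list from daemon names, appending the first port for the proxied services so that a port change also changes the list. A minimal sketch with stand-in objects shows the resulting strings. `FakeDaemon`, `gateway_dependencies`, and the dictionary replacing `mgr.cache` are illustrative assumptions, not cephadm types.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeDaemon:
    """Stand-in for DaemonDescription with only the attributes used here."""
    daemon_name: str
    ports: List[int] = field(default_factory=list)

    def name(self) -> str:
        return self.daemon_name


def gateway_dependencies(daemons_by_service: Dict[str, List[FakeDaemon]]) -> List[str]:
    """Mirror the comprehension in get_dependencies, using a dict in place of mgr.cache."""
    deps = [
        f"{d.name()}:{d.ports[0]}" if d.ports else d.name()
        for service in ["prometheus", "alertmanager", "grafana", "oauth2-proxy"]
        for d in daemons_by_service.get(service, [])
    ]
    # mgr daemons are tracked by name only, as in the committed code.
    deps += [d.name() for d in daemons_by_service.get("mgr", [])]
    return sorted(deps)


cache = {
    "prometheus": [FakeDaemon("prometheus.ceph0", [9095])],
    "oauth2-proxy": [FakeDaemon("oauth2-proxy.ceph1")],
    "mgr": [FakeDaemon("mgr.ceph0.abc", [8443])],
}
print(gateway_dependencies(cache))
```

Since cephadm reconfigures a daemon when its dependency list changes, including the port means the nginx configuration is regenerated when a backend moves to another port.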
@@ -94,6 +107,8 @@ def generate_config(self, daemon_spec: CephadmDaemonDeploySpec) -> Tuple[Dict[st
         prometheus_endpoints = self.get_service_endpoints('prometheus')
         alertmanager_endpoints = self.get_service_endpoints('alertmanager')
         grafana_endpoints = self.get_service_endpoints('grafana')
+        oauth2_proxy_endpoints = self.get_service_endpoints('oauth2-proxy')
+        service_discovery_endpoints = self.get_service_discovery_endpoints()
         try:
             grafana_spec = cast(GrafanaSpec, self.mgr.spec_store['grafana'].spec)
             grafana_protocol = grafana_spec.protocol
@@ -104,7 +119,9 @@ def generate_config(self, daemon_spec: CephadmDaemonDeploySpec) -> Tuple[Dict[st
             'dashboard_endpoints': dashboard_endpoints,
             'prometheus_endpoints': prometheus_endpoints,
             'alertmanager_endpoints': alertmanager_endpoints,
-            'grafana_endpoints': grafana_endpoints
+            'grafana_endpoints': grafana_endpoints,
+            'oauth2_proxy_endpoints': oauth2_proxy_endpoints,
+            'service_discovery_endpoints': service_discovery_endpoints
         }
         server_context = {
             'spec': svc_spec,
@@ -117,11 +134,12 @@ def generate_config(self, daemon_spec: CephadmDaemonDeploySpec) -> Tuple[Dict[st
             'prometheus_endpoints': prometheus_endpoints,
             'alertmanager_endpoints': alertmanager_endpoints,
             'grafana_endpoints': grafana_endpoints,
-            'oauth2_proxy_url': self.get_oauth2_service_url(),
+            'service_discovery_endpoints': service_discovery_endpoints,
+            'enable_oauth2_proxy': bool(oauth2_proxy_endpoints),
         }
 
         cert, key = self.get_external_certificates(svc_spec, daemon_spec)
-        internal_cert, internal_pkey = self.get_internal_certificates(daemon_spec)
+        internal_cert, internal_pkey = self.get_internal_certificates(svc_spec, daemon_spec)
         daemon_config = {
             "files": {
                 "nginx.conf": self.mgr.template.render(self.SVC_TEMPLATE_PATH, main_context),
@@ -136,7 +154,7 @@ def generate_config(self, daemon_spec: CephadmDaemonDeploySpec) -> Tuple[Dict[st
             daemon_config["files"]["nginx.crt"] = cert
             daemon_config["files"]["nginx.key"] = key
 
-        return daemon_config, sorted(self.get_mgmt_gateway_deps())
+        return daemon_config, sorted(MgmtGatewayService.get_dependencies(self.mgr))
 
     def pre_remove(self, daemon: DaemonDescription) -> None:
         """
