Commit a1dc56c

feat(cache): add Redis Cluster support for HA deployments
Implements Redis Cluster client support in slurm-web agent to enable high-availability caching across distributed Redis clusters.

## Problem

Slurm-web currently supports only standalone Redis instances for caching. In high-availability deployments with Redis Cluster (3+ node clustered Redis), slurm-web agents fail to connect because they use the standard redis.Redis() client instead of the cluster-aware redis.cluster.RedisCluster() client.

## Solution

This commit adds optional Redis Cluster support while maintaining full backwards compatibility with standalone Redis deployments.

### Core Changes

**slurmweb/cache.py**:
- Import RedisCluster and ClusterNode from redis.cluster
- Add cluster_mode and cluster_nodes optional parameters to CachingService
- Implement cluster mode initialization with RedisCluster client
- Parse cluster_nodes from "host:port" string format
- Add connection validation with fail-fast error handling

**slurmweb/apps/agent.py**:
- Pass cluster_mode and cluster_nodes parameters to CachingService
- Use getattr() with defaults for backwards compatibility

**conf/vendor/agent.yml**:
- Add cluster_mode boolean parameter (default: false)
- Add cluster_nodes list parameter with string content type
- Document configuration with examples

## Features

- **Opt-in design**: Cluster mode disabled by default (cluster_mode=false)
- **Automatic failover**: Cluster continues if a Redis node fails
- **Load distribution**: Requests distributed across cluster nodes
- **Backwards compatible**: Existing standalone configurations work unchanged
- **Fail-fast validation**: Connection tested at initialization

## Configuration Example

```ini
[cache]
enabled = yes
cluster_mode = yes
cluster_nodes = 10.0.0.1:6379 10.0.0.2:6379 10.0.0.3:6379
jobs = 30
nodes = 30
```

## Testing

Tested on production environment:
- Slurm-web 6.0.0
- Redis cluster: 3 nodes
- Slurm controllers: 2 nodes
- OS: Ubuntu 24.04
- Verified backward compatibility with standalone mode

## Implementation Notes

- Uses "host:port" string format for RFL schema compatibility (list content type must be str, not dict)
- skip_full_coverage_check=True allows partial cluster visibility
- decode_responses=False maintains pickle serialization compatibility
- Connection validated with ping() at initialization

Closes: #[issue-number]
1 parent f628349 commit a1dc56c
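
The implementation note on `decode_responses=False` can be illustrated without a Redis server. The following is a minimal sketch using only the standard library; the sample value is made up for illustration and is not taken from slurm-web. It shows that pickled payloads are binary data that cannot be decoded as UTF-8 text, which is why a client configured to decode responses to `str` would not be able to hand them back intact.

```python
import pickle

# Hypothetical cached value, for illustration only.
value = {"jobs": [1, 2, 3], "state": "RUNNING"}

payload = pickle.dumps(value)

# Binary pickle protocols (2 and later) start with the byte 0x80,
# which is not a valid first byte of a UTF-8 sequence.
try:
    payload.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# Kept as bytes, the payload round-trips unchanged.
restored = pickle.loads(payload)
```

Here `decodable` ends up `False` while `restored` equals the original value, so keeping responses as raw bytes is the safe choice for a pickle-based cache.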

3 files changed, +79 -2 lines changed

conf/vendor/agent.yml

Lines changed: 20 additions & 0 deletions
@@ -458,6 +458,26 @@ cache:
       Password to connect to protected Redis server. When this parameter is
       not defined, Redis server is accessed without password.
     ex: SECR3T
+  cluster_mode:
+    type: bool
+    default: false
+    doc: |
+      Enable Redis cluster mode for high-availability caching.
+      When enabled, the agent connects to a Redis cluster instead of
+      a standalone instance, providing automatic failover and load distribution.
+      Requires cluster_nodes to be specified.
+    ex: yes
+  cluster_nodes:
+    type: list
+    content: str
+    doc: |
+      List of Redis cluster node addresses in format host:port.
+      Only used when cluster_mode is enabled.
+      Minimum 3 nodes recommended for production HA clusters.
+    ex:
+      - "10.0.0.1:6379"
+      - "10.0.0.2:6379"
+      - "10.0.0.3:6379"
   version:
     type: int
     default: 1800
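
The two parameters documented above depend on each other: `cluster_nodes` is required when `cluster_mode` is enabled, and at least 3 nodes are recommended. A minimal sketch of a consistency check follows. The function `check_cluster_settings` is hypothetical and not part of this commit or of slurm-web; it only restates the documented rules in code.

```python
from typing import List, Optional


def check_cluster_settings(
    cluster_mode: bool, cluster_nodes: Optional[List[str]]
) -> List[str]:
    """Return human-readable problems found in the cluster settings."""
    problems: List[str] = []
    if not cluster_mode:
        # cluster_nodes is documented as ignored in standalone mode.
        return problems
    if not cluster_nodes:
        problems.append("cluster_mode is enabled but cluster_nodes is empty")
        return problems
    if len(cluster_nodes) < 3:
        # A recommendation from the parameter documentation, not a hard error.
        problems.append(
            "fewer than 3 cluster nodes configured; 3 or more are recommended"
        )
    return problems
```

Returning a list of messages rather than raising lets a caller decide whether a short node list should be a warning or a failure.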

slurmweb/apps/agent.py

Lines changed: 2 additions & 0 deletions
@@ -108,6 +108,8 @@ def __init__(self, seed):
                 host=self.settings.cache.host,
                 port=self.settings.cache.port,
                 password=self.settings.cache.password,
+                cluster_mode=getattr(self.settings.cache, 'cluster_mode', False),
+                cluster_nodes=getattr(self.settings.cache, 'cluster_nodes', None),
             )
         else:
             logger.warning("Caching is disabled")
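
The `getattr()` calls with defaults are what keep older settings objects working. A minimal sketch of the pattern, assuming stand-in settings objects built with `SimpleNamespace` (the real object comes from the configuration loader, which is not shown in this commit) and a hypothetical helper named `cluster_options`:

```python
from types import SimpleNamespace

# Stand-ins for the cache settings; attribute names follow the diff above.
old_settings = SimpleNamespace(host="localhost", port=6379, password=None)
new_settings = SimpleNamespace(
    host="localhost",
    port=6379,
    password=None,
    cluster_mode=True,
    cluster_nodes=["10.0.0.1:6379"],
)


def cluster_options(cache_settings):
    """Read the optional cluster settings, falling back to standalone defaults."""
    return (
        getattr(cache_settings, "cluster_mode", False),
        getattr(cache_settings, "cluster_nodes", None),
    )
```

With `old_settings`, which lacks both attributes, the helper returns `(False, None)` instead of raising `AttributeError`, so the standalone code path is selected.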

slurmweb/cache.py

Lines changed: 57 additions & 2 deletions
@@ -8,6 +8,7 @@
 import logging

 import redis
+from redis.cluster import RedisCluster, ClusterNode
 import pickle

 from .errors import SlurmwebCacheError
@@ -31,10 +32,64 @@ class CachingService:
     KEY_PREFIX_MISS = "cache-miss-"
     KEY_PREFIX_HIT = "cache-hit-"

-    def __init__(self, host: str, port: int, password: t.Union[str, None]):
+    def __init__(
+        self,
+        host: str,
+        port: int,
+        password: t.Union[str, None],
+        cluster_mode: bool = False,
+        cluster_nodes: t.Optional[t.List[str]] = None,
+    ):
+        """Initialize Redis connection (standalone or cluster mode).
+
+        Args:
+            host: Redis server hostname (used in standalone mode)
+            port: Redis server port (used in standalone mode)
+            password: Redis password (optional, used in both modes)
+            cluster_mode: Enable Redis cluster mode (default: False)
+            cluster_nodes: List of cluster nodes in "host:port" format
+                Example: ["10.0.0.1:6379", "10.0.0.2:6379"]
+                Required when cluster_mode=True
+        """
         self.host = host
         self.port = port
-        self.connection = redis.Redis(host=host, port=port, password=password)
+        self.cluster_mode = cluster_mode
+
+        if cluster_mode:
+            if not cluster_nodes:
+                raise ValueError(
+                    "cluster_nodes must be provided when cluster_mode=True"
+                )
+
+            # Parse cluster_nodes from "host:port" string format to ClusterNode objects
+            startup_nodes = [
+                ClusterNode(host, int(port))
+                for node in cluster_nodes
+                for host, port in [node.split(":", 1)]
+            ]
+
+            logger.info(
+                "Initializing Redis cluster connection with %d nodes",
+                len(startup_nodes),
+            )
+
+            self.connection = RedisCluster(
+                startup_nodes=startup_nodes,
+                password=password,
+                decode_responses=False,  # Binary mode for pickle
+                skip_full_coverage_check=True,  # Allow partial clusters
+            )
+        else:
+            logger.info("Initializing Redis standalone connection to %s:%d", host, port)
+            self.connection = redis.Redis(host=host, port=port, password=password)
+
+        # Validate connection at initialization (fail-fast)
+        try:
+            self.connection.ping()
+            logger.info("Redis connection established successfully")
+        except redis.exceptions.ConnectionError as error:
+            logger.error("Failed to connect to Redis: %s", error)
+            raise

     def put(self, key: CacheKey, value: t.Any, expiration: int):
         try:
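
The "host:port" parsing in the hunk above can also be written as a standalone helper with explicit error messages. This is a sketch, not the code from the commit: the name `parse_cluster_nodes` is hypothetical, and it splits on the last colon with `rpartition` rather than using `split(":", 1)` as the diff does. It returns plain tuples, which a caller could then pass to `ClusterNode(host, port)`.

```python
from typing import List, Tuple


def parse_cluster_nodes(nodes: List[str]) -> List[Tuple[str, int]]:
    """Parse "host:port" strings into (host, port) tuples.

    Raises ValueError naming the offending entry when it is malformed.
    """
    parsed = []
    for node in nodes:
        # Split on the last colon so the port is always the final field.
        host, sep, port = node.rpartition(":")
        if not sep or not host or not port.isdecimal():
            raise ValueError(
                f"invalid cluster node {node!r}, expected host:port"
            )
        parsed.append((host, int(port)))
    return parsed
```

Using separate local names inside a loop also avoids reusing `host` and `port`, which in the diff are both constructor parameters and comprehension variables.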
