@krhoward-amd

Implements Redis Cluster client support in slurm-web agent to enable
high-availability caching across distributed Redis clusters.

## Problem

Slurm-web currently supports only standalone Redis instances for caching.
In high-availability deployments with Redis Cluster (3+ node clustered Redis),
slurm-web agents fail to connect because they use the standard redis.Redis()
client instead of the cluster-aware redis.cluster.RedisCluster() client.

## Solution

This commit adds optional Redis Cluster support while maintaining full
backwards compatibility with standalone Redis deployments.

### Core Changes

**slurmweb/cache.py**:
- Import RedisCluster and ClusterNode from redis.cluster
- Add cluster_mode and cluster_nodes optional parameters to CachingService
- Implement cluster mode initialization with RedisCluster client
- Parse cluster_nodes from "host:port" string format
- Add connection validation with fail-fast error handling
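
The "host:port" parsing step listed above could look roughly like the sketch below. It is not the code from this PR: `parse_cluster_nodes` is a hypothetical helper name, and the strict validation is an assumption about what fail-fast handling might mean here.

```python
from typing import List, Tuple


def parse_cluster_nodes(entries: List[str]) -> List[Tuple[str, int]]:
    """Turn "host:port" strings into (host, port) pairs.

    Hypothetical helper for illustration, not the PR's implementation.
    """
    nodes = []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue  # tolerate blank lines from multi-line config values
        # rpartition splits on the last colon, so the port is always the tail.
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"invalid cluster node {entry!r}, expected host:port")
        nodes.append((host, int(port)))
    return nodes


nodes = parse_cluster_nodes(["10.0.0.1:6379", "10.0.0.2:6379", "10.0.0.3:6379"])
# Each pair could then be wrapped as redis.cluster.ClusterNode(host, port)
# and handed to RedisCluster(startup_nodes=[...]).
```

Rejecting malformed entries at parse time keeps configuration mistakes from surfacing later as opaque connection errors.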

**slurmweb/apps/agent.py**:
- Pass cluster_mode and cluster_nodes parameters to CachingService
- Use getattr() with defaults for backwards compatibility
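
As a rough illustration of the `getattr()` fallback described above, the sketch below reads the two new options from a settings object that may predate them. The `cache_cluster_options` function and the `SimpleNamespace` objects are stand-ins invented for this example; the agent's real settings object and call site may differ.

```python
from types import SimpleNamespace


def cache_cluster_options(cache_settings):
    """Read the cluster options, falling back to standalone defaults.

    `cache_settings` stands in for the agent's parsed [cache] section;
    the function name and return shape are illustrative only.
    """
    return {
        "cluster_mode": getattr(cache_settings, "cluster_mode", False),
        "cluster_nodes": getattr(cache_settings, "cluster_nodes", None),
    }


# Settings loaded from an older configuration without the new keys.
legacy = SimpleNamespace(enabled=True, host="localhost", port=6379)
# Settings loaded from a configuration that opts in to cluster mode.
clustered = SimpleNamespace(
    enabled=True, cluster_mode=True, cluster_nodes=["10.0.0.1:6379"]
)

legacy_options = cache_cluster_options(legacy)
clustered_options = cache_cluster_options(clustered)
```

With the defaults in place, a configuration that never mentions the new keys resolves to standalone mode without raising `AttributeError`.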

**conf/vendor/agent.yml**:
- Add cluster_mode boolean parameter (default: false)
- Add cluster_nodes list parameter with string content type
- Document configuration with examples

## Features

- **Opt-in design**: Cluster mode disabled by default (cluster_mode=false)
- **Automatic failover**: Cluster continues if a Redis node fails
- **Load distribution**: Requests distributed across cluster nodes
- **Backwards compatible**: Existing standalone configurations work unchanged
- **Fail-fast validation**: Connection tested at initialization
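
The fail-fast behaviour can be pictured with a minimal sketch like the following. It assumes only that the client exposes a `ping()` method, as redis-py clients do; `CacheUnavailable`, `validate_connection` and the stub clients are hypothetical names used to keep the example runnable without a Redis server.

```python
class CacheUnavailable(RuntimeError):
    """Hypothetical error raised when the cache cannot be reached at startup."""


def validate_connection(client):
    """Call ping() once and convert any failure into a startup error."""
    try:
        client.ping()
    except Exception as err:
        # A real implementation would more likely catch redis.exceptions.RedisError.
        raise CacheUnavailable(f"unable to reach Redis: {err}") from err
    return client


class DownClient:
    """Stub standing in for a client whose nodes are unreachable."""

    def ping(self):
        raise ConnectionError("connection refused")


class UpClient:
    """Stub standing in for a healthy client."""

    def ping(self):
        return True


healthy = validate_connection(UpClient())
try:
    validate_connection(DownClient())
    startup_failed = False
except CacheUnavailable:
    startup_failed = True
```

Failing at initialization means a misconfigured node list is reported when the agent starts rather than on the first cached request.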

## Configuration Example

```ini
[cache]
enabled = yes
cluster_mode = yes
cluster_nodes =
    10.0.0.1:6379
    10.0.0.2:6379
    10.0.0.3:6379
jobs = 30
nodes = 30
```
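
To show how the multi-line `cluster_nodes` value breaks down into individual entries, here is a small sketch using the standard library `configparser`. This is an assumption made for illustration: Slurm-web loads its settings through RFL, so the actual parsing path may differ.

```python
import configparser

SAMPLE = """
[cache]
enabled = yes
cluster_mode = yes
cluster_nodes =
    10.0.0.1:6379
    10.0.0.2:6379
    10.0.0.3:6379
jobs = 30
nodes = 30
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)
cache = parser["cache"]

# "yes"/"no" are accepted boolean spellings in configparser.
cluster_mode = cache.getboolean("cluster_mode", fallback=False)
# Indented continuation lines are joined into one value; split() separates them.
cluster_nodes = cache.get("cluster_nodes", fallback="").split()
jobs_value = cache.getint("jobs")
```

Each indented line becomes one "host:port" string, which matches the string list format the PR describes.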

## Testing

Tested in a production environment:
- Slurm-web 6.0.0
- Redis cluster: 3 nodes
- Slurm controllers: 2 nodes
- OS: Ubuntu 24.04
- Verified backward compatibility with standalone mode

## Implementation Notes

- Uses "host:port" string format for RFL schema compatibility (list content type must be str, not dict)
- skip_full_coverage_check=True allows partial cluster visibility
- decode_responses=False maintains pickle serialization compatibility
- Connection validated with ping() at initialization
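
The `decode_responses=False` note can be illustrated without a Redis server. The sketch below assumes the cache stores values with `pickle`, as the note states, and shows why such payloads have to stay as raw bytes; the sample dictionary is made up for the example.

```python
import pickle

payload = pickle.dumps({"jobs": [101, 102], "state": "RUNNING"})

# A client configured to decode responses would try to turn every reply
# into text; binary pickle data is not valid UTF-8, so that attempt fails.
try:
    payload.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# Left as raw bytes, the value round-trips unchanged.
restored = pickle.loads(payload)
```

Keeping responses as bytes lets the same serialization path serve both the standalone and the cluster client.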

Closes: #[issue-number]
@github-actions

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@krhoward-amd
Author

krhoward-amd commented Jan 27, 2026 via email

________________________________ I have read the CLA Document and I hereby sign the CLA Kris Howard

@rezib
Contributor

rezib commented Jan 27, 2026

Hello @krhoward-amd, thank you very much for your interest in Slurm-web and your contribution!

Unfortunately, it seems the CLA assistant bot failed miserably to parse your message sent by email 😕 Could you please copy/paste the line into a new comment posted from the GitHub web interface to sign the CLA?
