Commit e69965d

Update Slurm CI to use pitt-crc/Slurm-Test-Environment with multi-version testing
Co-authored-by: transientlunatic <4365778+transientlunatic@users.noreply.github.com>
1 parent 2715ce0 commit e69965d

2 files changed: +217 -164 lines changed
.github/workflows/SLURM_TESTING_README.md

Lines changed: 134 additions & 65 deletions
@@ -2,47 +2,68 @@
 
 ## Current Status
 
-The Slurm testing workflow (`.github/workflows/slurm-tests.yml`) is currently set to **manual trigger** (`workflow_dispatch`) due to the complexity of setting up Slurm in GitHub Actions CI environment.
+The Slurm testing workflow (`.github/workflows/slurm-tests.yml`) uses the **pitt-crc/Slurm-Test-Environment** Docker images for automated Slurm testing in GitHub Actions.
 
-## Why Manual?
+Repository: https://github.com/pitt-crc/Slurm-Test-Environment
 
-1. **Docker Image Availability**: The original workflow used a non-existent Docker image (`ghcr.io/natejenkins/slurm-docker-cluster:23.11.7`)
-2. **Complexity**: Running Slurm requires:
-   - Privileged container access
-   - Munge authentication setup
-   - Multiple Slurm daemons (controller, compute nodes)
-   - Proper networking configuration
-3. **Maintenance**: Community Slurm Docker images may become outdated or unavailable
+## How It Works
 
-## Running Slurm Tests Locally
+The workflow:
+1. Uses pre-built Slurm Docker images from `ghcr.io/pitt-crc/test-env`
+2. Tests against multiple Slurm versions (23.02.5, 23.11.10)
+3. Automatically starts Slurm services via the image's entrypoint
+4. Runs comprehensive asimov tests with actual Slurm job submission
 
-### Option 1: Using Docker
+## Test Matrix
 
-```bash
-# Pull a Slurm Docker image
-docker pull nathanhess/slurm:latest
+The CI tests against multiple Slurm versions to ensure compatibility:
+- Slurm 23.02.5 (older stable)
+- Slurm 23.11.10 (newer stable)
 
-# Run the container
-docker run -it --privileged --hostname slurmctld nathanhess/slurm:latest /bin/bash
+Additional versions can be added by updating the matrix in the workflow file.
 
-# Inside the container:
-# 1. Start munge
-munged
+## Running Tests Locally
 
-# 2. Start Slurm daemons
-slurmctld
-slurmd
+### Option 1: Using the Same Docker Image
 
-# 3. Verify Slurm is running
-sinfo
-squeue
+```bash
+# Pull the test environment
+docker pull ghcr.io/pitt-crc/test-env:23.02.5
 
-# 4. Run asimov tests
+# Run interactively
+docker run -it ghcr.io/pitt-crc/test-env:23.02.5 /bin/bash
+
+# Inside the container, the entrypoint has already started Slurm
+# Run your tests
 cd /path/to/asimov
 python -m unittest tests.test_scheduler
 ```
 
-### Option 2: Install Slurm on Ubuntu
+### Option 2: Using Docker Compose
+
+Create a `docker-compose.yml`:
+
+```yaml
+version: '3'
+services:
+  slurm-test:
+    image: ghcr.io/pitt-crc/test-env:23.02.5
+    volumes:
+      - .:/workspace
+    working_dir: /workspace
+    command: /bin/bash -c "/usr/local/bin/entrypoint.sh && /bin/bash"
+    stdin_open: true
+    tty: true
+```
+
+Then run:
+```bash
+docker-compose run slurm-test
+```
+
+### Option 3: Install Slurm on Ubuntu
+
+For local development without Docker:
 
 ```bash
 # Install Slurm
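
The "Test Matrix" and "Option 1" lines in the hunk above pair each Slurm version with an image tag under `ghcr.io/pitt-crc/test-env`. As a minimal sketch of reproducing that matrix locally, the helper below builds one `docker run` argument list per version. The function names, the `/workspace` mount (borrowed from the compose file), and the assumption that the image's entrypoint starts Slurm before running the given command are illustrative and not taken from the workflow file.

```python
import shlex
from pathlib import Path

# Image repository named in the README; the tag is the Slurm version.
IMAGE_REPO = "ghcr.io/pitt-crc/test-env"


def docker_test_command(slurm_version, workspace, test_target="tests.test_scheduler"):
    """Build a `docker run` argument list for one Slurm version.

    Hypothetical helper: mounts `workspace` at /workspace and runs the
    unittest target from there, mirroring the compose file's volume and
    working_dir settings.
    """
    workspace = Path(workspace).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{workspace}:/workspace",
        "-w", "/workspace",
        f"{IMAGE_REPO}:{slurm_version}",
        "python", "-m", "unittest", test_target,
    ]


def matrix_commands(versions, workspace):
    """Map each Slurm version to its `docker run` argument list."""
    return {version: docker_test_command(version, workspace) for version in versions}


if __name__ == "__main__":
    # Print the commands instead of running them, so this works without Docker.
    for version, cmd in matrix_commands(["23.02.5", "23.11.10"], ".").items():
        print(version, "->", shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine with Docker installed. Adding a version to the list plays the same role as adding an entry to the workflow matrix.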
@@ -61,59 +82,107 @@ sudo systemctl start slurmd
 sinfo
 ```
 
-### Option 3: Unit Tests Only
+### Option 4: Unit Tests Only (No Slurm Required)
 
-The unit tests for Slurm scheduler can run without a real Slurm installation:
+The unit tests for Slurm scheduler use mocking and don't require a real Slurm installation:
 
 ```bash
 cd /path/to/asimov
 python -m unittest tests.test_scheduler.SlurmSchedulerTests -v
 ```
 
-These tests use mocking and don't require an actual Slurm cluster.
+## What Gets Tested
+
+The CI workflow tests:
+
+1. **Slurm Detection**: Verifies `asimov init` correctly detects Slurm
+2. **Scheduler Unit Tests**: All 30 unit tests for the scheduler abstraction
+3. **Job Submission**: Actual Slurm job submission and monitoring
+4. **DAG Translation**: HTCondor DAG to Slurm batch script conversion
+5. **Integration**: End-to-end workflow with asimov commands
+
+## Customizing the Test Environment
+
+To test with a different Slurm version:
+
+1. Check available versions at: https://github.com/pitt-crc/Slurm-Test-Environment/pkgs/container/test-env
+2. Update the matrix in `.github/workflows/slurm-tests.yml`:
+
+   ```yaml
+   matrix:
+     slurm_version:
+       - "20.11.9"
+       - "22.05.11"
+       - "23.02.5"
+       - "23.11.10"
+   ```
+
+## Building Your Own Slurm Test Image
 
-## Enabling Automatic CI Testing
+If you need custom Slurm configuration:
 
-To enable automatic Slurm testing in CI:
+```dockerfile
+FROM ghcr.io/pitt-crc/test-env:23.02.5
 
-1. **Build Your Own Slurm Container**:
-   ```dockerfile
-   FROM ubuntu:22.04
-   RUN apt-get update && apt-get install -y \
-       slurm-wlm munge sudo python3 python3-pip git
-   # Add your Slurm configuration
-   COPY slurm.conf /etc/slurm/slurm.conf
-   # Add startup script
-   COPY start-slurm.sh /start-slurm.sh
-   RUN chmod +x /start-slurm.sh
-   CMD ["/start-slurm.sh"]
-   ```
+# Add your custom Slurm configuration
+COPY my-slurm.conf /etc/slurm/slurm.conf
 
-2. **Push to Container Registry**:
-   ```bash
-   docker build -t your-org/slurm-test:latest .
-   docker push your-org/slurm-test:latest
-   ```
+# Add custom setup
+COPY setup-script.sh /usr/local/bin/custom-setup.sh
+RUN chmod +x /usr/local/bin/custom-setup.sh
+```
+
+Then build and push to your registry:
+```bash
+docker build -t your-org/slurm-test:custom .
+docker push your-org/slurm-test:custom
+```
+
+Update the workflow to use your image:
+```yaml
+container:
+  image: your-org/slurm-test:custom
+```
+
+## Advantages of This Approach
+
+1. **Reliable**: Uses maintained Docker images specifically designed for CI testing
+2. **Versioned**: Test against multiple Slurm versions
+3. **Pre-configured**: Slurm services start automatically via entrypoint
+4. **No Manual Setup**: No need to manually start munge, slurmctld, slurmd
+5. **Fast**: Images are optimized for quick startup in CI
+6. **Maintained**: The pitt-crc project actively maintains these images
+
+## Troubleshooting
 
-3. **Update Workflow**:
-   - Edit `.github/workflows/slurm-tests.yml`
-   - Change `image: nathanhess/slurm:latest` to `image: your-org/slurm-test:latest`
-   - Change `on: workflow_dispatch` to `on: [push, pull_request]`
+### Container Fails to Start
+
+Check the workflow logs for entrypoint errors. The entrypoint script should handle Slurm service startup automatically.
+
+### Slurm Commands Not Found
+
+Ensure the entrypoint has been called:
+```bash
+/usr/local/bin/entrypoint.sh
+```
+
+### Jobs Stay in Pending State
+
+Check node status:
+```bash
+sinfo -N -o "%N %t %C"
+```
 
-## Alternative: Use Existing Public Images
+If nodes are down, the entrypoint may not have completed successfully.
 
-Community-maintained Slurm images (use at your own risk):
-- `nathanhess/slurm:latest` - Basic Slurm installation
-- `agaveapi/slurm:latest` - Includes controller and compute nodes
-- `xenonmiddleware/slurm:latest` - For Xenon middleware testing
+### Permission Errors
 
-Update the workflow file to use one of these images if they meet your requirements.
+The test environment runs as root by default. If you encounter permission issues, check file ownership in the workspace.
 
-## Testing Strategy
+## References
 
-The current testing strategy prioritizes:
-1. **Unit tests** - Mock-based tests that don't require Slurm (always run)
-2. **Integration tests** - Manual or local testing with real Slurm
-3. **CI tests** - Manual trigger when Slurm container is available
+- [Slurm Test Environment Repository](https://github.com/pitt-crc/Slurm-Test-Environment)
+- [Available Docker Images](https://github.com/pitt-crc/Slurm-Test-Environment/pkgs/container/test-env)
+- [Slurm Documentation](https://slurm.schedmd.com/)
+- [GitHub Actions Container Jobs](https://docs.github.com/en/actions/using-jobs/running-jobs-in-a-container)
 
-This ensures the code is well-tested without blocking CI on Slurm setup complexity.
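
The "Unit Tests Only" option in the diff above relies on mocking so that no Slurm installation is needed. The sketch below shows how such a test might look. `submit_job` is a hypothetical stand-in written for this example and is not asimov's scheduler code. The only Slurm behaviour assumed is that `sbatch --parsable` prints the job ID, optionally followed by `;cluster`.

```python
import subprocess
import unittest
from unittest import mock


def submit_job(script_path):
    """Submit a batch script with sbatch and return the job ID as an int.

    Hypothetical helper for illustration only. `sbatch --parsable`
    prints "jobid" or "jobid;cluster" on stdout.
    """
    result = subprocess.run(
        ["sbatch", "--parsable", script_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip().split(";")[0])


class SubmitJobTests(unittest.TestCase):
    def test_returns_job_id_without_slurm(self):
        # Replace subprocess.run so no real sbatch binary is needed.
        fake = subprocess.CompletedProcess(
            args=[], returncode=0, stdout="12345\n", stderr=""
        )
        with mock.patch("subprocess.run", return_value=fake) as run:
            self.assertEqual(submit_job("job.sh"), 12345)
        # Check that the helper built the command line we expect.
        run.assert_called_once_with(
            ["sbatch", "--parsable", "job.sh"],
            capture_output=True,
            text=True,
            check=True,
        )
```

Saved as a module, this runs with `python -m unittest` on any machine, which is what makes the mock-based option suitable when no Slurm container is available.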

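
The "Jobs Stay in Pending State" step in the diff above checks nodes with `sinfo -N -o "%N %t %C"`, where `%N` is the node name, `%t` the compact node state, and `%C` the CPU counts as allocated/idle/other/total. As a rough sketch, the helpers below turn that output into a list of nodes that look unavailable. The function names and the sample output are made up for illustration, and the list of problem states is an assumption that is not exhaustive.

```python
# Compact state prefixes treated as "not schedulable" in this sketch (assumption).
PROBLEM_PREFIXES = ("down", "drain", "drng", "fail", "unk")


def parse_sinfo_nodes(output):
    """Parse `sinfo -N -o "%N %t %C"` output into a list of dicts."""
    nodes = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        name, state, cpus = line.split()
        allocated, idle, other, total = (int(n) for n in cpus.split("/"))
        nodes.append(
            {"name": name, "state": state, "idle_cpus": idle, "total_cpus": total}
        )
    return nodes


def unavailable_nodes(nodes):
    """Return names of nodes that are unresponsive or in a problem state.

    A trailing "*" on the state means the node is not responding.
    """
    return [
        node["name"]
        for node in nodes
        if node["state"].endswith("*") or node["state"].startswith(PROBLEM_PREFIXES)
    ]


# Made-up sample output for illustration, not captured from a real cluster.
SAMPLE = """NODELIST STATE CPUS(A/I/O/T)
c1 idle 0/4/0/4
c2 down* 0/0/4/4
"""

if __name__ == "__main__":
    print(unavailable_nodes(parse_sinfo_nodes(SAMPLE)))
```

If this reports every node, the entrypoint probably did not finish starting the Slurm daemons, which matches the README's hint about nodes being down.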