Commit e67d258

daniel-thom and claude authored
Add automated slurm tests (#182)
* Improve Slurm test framework: add timeout test, debug dumps, filtering, and cleanup

  - Remove buggy multi_node_single_worker test (contradicts its own intent)
  - Add timeout_detection test with slow_work.sh script and workflow
  - Add timeout assertion helpers (parse-logs and logs-analyze)
  - Extract print_workflow_message() helper in main.rs to deduplicate 5 JSON output blocks
  - Restore oom_auto_recovery_test and timeout_auto_recovery_test as manual torc watch tests
  - Fix result assertions to sort by attempt_id instead of using tail -1
  - Add --test PATTERN filter option to run_all.sh for running specific tests
  - Dump workflow debug info (summary, jobs, results) to file on assertion failure
  - Add .gitignore for slurm-tests output directories
  - Use | as sed delimiter in placeholder substitution to avoid path issues
  - Add set -E for ERR trap propagation in run_all.sh
  - Add is_server_alive() health check after workflow submission phase
  - Assert OOM return code is specifically 137 (SIGKILL), use assert_sacct_job_state helper
  - Add assert_slurm_stats_available() and check in multi_node_parallel test
  - Add assert_resource_metrics_db_has_data() and check in resource_monitoring test
  - Add shellcheck to pre-commit hook for slurm-tests shell scripts

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Handle all cases of Slurm errors in log files

* Improve Slurm log parsing: fix perf, accuracy, and error reporting

  - Use LazyLock<Regex> for static regex compilation instead of recompiling on every call in hot loops
  - Restore full file path in SlurmLogError.file (was changed to basename, losing the job_stdio/ subdirectory component)
  - Tighten extract_slurm_job_id_from_line regex to only match Slurm-specific patterns (StepId=, JobId=, SLURM_JOB_ID=, "slurm job", "batch job") instead of any "job N" occurrence
  - Change scan_file_for_slurm_errors to return Option<usize> for clearer semantics (None = couldn't open, Some(n) = errors found)
  - Warn when read_dir fails for main output directory instead of silently skipping

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
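The commit message mentions switching to `|` as the sed delimiter in placeholder substitution to avoid path issues. The sketch below shows why that helps. The placeholder `{{OUTPUT_DIR}}` and the helper `substitute_placeholder` are hypothetical names used only for illustration; they are not taken from the repository.

```bash
# Hypothetical helper: replace every occurrence of a placeholder with a value.
# With '/' as the delimiter, a value such as /scratch/run1 would end the sed
# expression early. Using '|' avoids that as long as the value has no '|'.
substitute_placeholder() {
    local placeholder="$1"
    local value="$2"
    sed "s|${placeholder}|${value}|g"
}

result=$(printf 'output: {{OUTPUT_DIR}}/logs\n' |
    substitute_placeholder '{{OUTPUT_DIR}}' '/scratch/run1')
echo "$result"
```

A value containing `|`, `&`, or a backslash would still need escaping, so this only addresses the slash-in-path case.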
1 parent ac5a946 commit e67d258

63 files changed, +2729 -831 lines changed

.cargo-husky/hooks/pre-commit

Lines changed: 9 additions & 1 deletion
@@ -1,7 +1,7 @@
 #!/bin/sh
 #
 # Pre-commit hook for torc project
-# Runs Rust formatting, linting, and markdown formatting
+# Runs Rust formatting, linting, markdown formatting, and shell script linting
 #

 set -e
@@ -14,3 +14,11 @@ cargo clippy --all --all-targets --all-features -- -D warnings

 echo '+dprint check'
 dprint check
+
+# Lint shell scripts if shellcheck is available
+if command -v shellcheck >/dev/null 2>&1; then
+    echo '+shellcheck slurm-tests/**/*.sh'
+    find slurm-tests -name '*.sh' -type f -exec shellcheck {} +
+else
+    echo 'shellcheck not found, skipping shell script linting'
+fi
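
The hook runs shellcheck through `find ... -exec shellcheck {} +`. As a rough illustration of what the `+` terminator does, the sketch below uses `echo` in place of shellcheck and a throwaway temporary directory (both are stand-ins, not part of the hook) to count how many times the command is invoked in each form.

```bash
# Scratch directory with two throwaway scripts (illustrative paths only).
tmp_dir=$(mktemp -d)
touch "$tmp_dir/a.sh" "$tmp_dir/b.sh"

# '{} +' appends as many paths as fit to one command line: echo runs once.
batched=$(find "$tmp_dir" -name '*.sh' -type f -exec echo {} + | wc -l | tr -d ' ')

# '{} \;' runs the command once per file: echo runs twice.
per_file=$(find "$tmp_dir" -name '*.sh' -type f -exec echo {} \; | wc -l | tr -d ' ')

echo "batched invocations: $batched, per-file invocations: $per_file"
rm -rf "$tmp_dir"
```

The `+` form also matters for a hook that runs under `set -e`: POSIX specifies that `find` exits non-zero when a command run via `-exec ... {} +` fails, whereas with `\;` a failing command only makes that primary evaluate to false.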

GEMINI.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# GEMINI.md
+
+This file provides guidance to Gemini CLI when working with code in this repository. It establishes
+foundational mandates that take precedence over general defaults.
+
+## Development Lifecycle
+
+1. **Research**: Map the codebase and validate assumptions using `grep_search` and `read_file`.
+2. **Strategy**: Formulate a plan and share a concise summary.
+3. **Execution**:
+   - **Plan**: Define implementation and testing strategy.
+   - **Act**: Apply targeted changes.
+   - **Validate**: Run tests and quality checks.
+
+## Code Quality Requirements
+
+All changes MUST pass these checks:
+
+```bash
+# Rust formatting
+cargo fmt -- --check
+
+# Rust linting (MUST pass with no warnings)
+cargo clippy --all --all-targets --all-features -- -D warnings
+
+# Markdown formatting
+dprint check
+```
+
+## Component-Specific Guidance
+
+### Rust Client (`src/client/`)
+
+- Command handlers are in `src/client/commands/`.
+- Use the `Tabled` trait for tabular CLI output.
+- Follow the established `\x1b[1;36m` (cyan) for commands and `\x1b[1;32m` (green) for categories in
+  help templates.

api/openapi.yaml

Lines changed: 6 additions & 0 deletions
@@ -2878,6 +2878,12 @@ paths:
               schema:
                 $ref: "#/components/schemas/not_found_error_response"
           description: Resource requirements not found
+        "422":
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/default_error_response"
+          description: Unprocessable content
         "500":
           content:
             application/json:

julia_client/Torc/src/api/apis/api_DefaultApi.jl

Lines changed: 1 addition & 0 deletions
@@ -4020,6 +4020,7 @@ const _returntypes_update_resource_requirements_DefaultApi = Dict{Regex,Type}(
     Regex("^" * replace("200", "x"=>".") * "\$") => ResourceRequirementsModel,
     Regex("^" * replace("403", "x"=>".") * "\$") => ForbiddenErrorResponse,
     Regex("^" * replace("404", "x"=>".") * "\$") => NotFoundErrorResponse,
+    Regex("^" * replace("422", "x"=>".") * "\$") => DefaultErrorResponse,
     Regex("^" * replace("500", "x"=>".") * "\$") => DefaultErrorResponse,
 )

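
Each key in the Julia return-type table is built by replacing every `x` in the status string with `.` and anchoring the result with `^` and `$`. The diff only shows literal codes such as `422`, so the idea that a key like `4xx` would act as a wildcard is an assumption. Under that assumption, the matching rule can be sketched in shell with a hypothetical helper `status_matches`:

```bash
# Hypothetical helper: does an HTTP status code match a pattern such as
# "422" or "4xx"? Each "x" becomes the regex wildcard ".", and the result is
# anchored with ^ and $, mirroring the Regex(...) keys in the Julia table.
status_matches() {
    local pattern="$1"
    local status="$2"
    local regex
    regex="^$(printf '%s' "$pattern" | sed 's/x/./g')\$"
    printf '%s\n' "$status" | grep -Eq "$regex"
}

status_matches "422" "422" && echo "422 matches 422"
status_matches "4xx" "404" && echo "4xx matches 404"
status_matches "4xx" "500" || echo "4xx does not match 500"
```

With only literal keys, as in this commit, the rule reduces to an exact match on the status code.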

python_client/src/torc/openapi_client/api/default_api.py

Lines changed: 3 additions & 0 deletions
@@ -33229,6 +33229,7 @@ def update_resource_requirements(
             '200': "ResourceRequirementsModel",
             '403': "ForbiddenErrorResponse",
             '404': "NotFoundErrorResponse",
+            '422': "DefaultErrorResponse",
             '500': "DefaultErrorResponse",
         }
         response_data = self.api_client.call_api(
@@ -33303,6 +33304,7 @@ def update_resource_requirements_with_http_info(
             '200': "ResourceRequirementsModel",
             '403': "ForbiddenErrorResponse",
             '404': "NotFoundErrorResponse",
+            '422': "DefaultErrorResponse",
             '500': "DefaultErrorResponse",
         }
         response_data = self.api_client.call_api(
@@ -33377,6 +33379,7 @@ def update_resource_requirements_without_preload_content(
             '200': "ResourceRequirementsModel",
             '403': "ForbiddenErrorResponse",
             '404': "NotFoundErrorResponse",
+            '422': "DefaultErrorResponse",
             '500': "DefaultErrorResponse",
         }
         response_data = self.api_client.call_api(

slurm-tests/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+output/
+torc_output/

slurm-tests/lib/server.sh

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
+#!/bin/bash
+# server.sh — Helpers for starting, stopping, and health-checking the torc server.
+#
+# Requires: TORC_BIN, TORC_SERVER_BIN, RUN_DIR to be set before sourcing.
+
+# start_server DB_PATH PORT HOST
+# Starts the torc server with a SQLite database at DB_PATH on PORT.
+# HOST is the hostname/IP the server binds to (must be reachable from compute nodes).
+# Sets SERVER_PID and TORC_API_URL.
+# Enables authentication with two users: "admin" and the current $USER.
+start_server() {
+    local db_path="$1"
+    local port="$2"
+    local host="$3"
+    local log_file="${RUN_DIR}/server.log"
+    local htpasswd_file="${RUN_DIR}/htpasswd"
+
+    # Generate a random password for test users
+    if [ -f /usr/share/dict/words ] && command -v shuf &>/dev/null; then
+        TORC_TEST_PASSWORD=$(shuf -n3 /usr/share/dict/words | tr '\n' '-' | sed 's/-$//')
+    else
+        TORC_TEST_PASSWORD=$(head -c 24 /dev/urandom | base64 | tr -dc 'a-zA-Z0-9' | head -c 16)
+    fi
+
+    # Create htpasswd file with admin and current user
+    echo "Creating auth users (admin, $USER)..."
+    "$TORC_HTPASSWD_BIN" add --file "$htpasswd_file" admin --password "$TORC_TEST_PASSWORD"
+    "$TORC_HTPASSWD_BIN" add --file "$htpasswd_file" "$USER" --password "$TORC_TEST_PASSWORD"
+
+    # Export password so all torc CLI calls authenticate automatically
+    export TORC_PASSWORD="$TORC_TEST_PASSWORD"
+
+    echo "Starting torc server on ${host}:${port} with database $db_path (auth enabled)..."
+    DATABASE_URL="sqlite:${db_path}" "$TORC_SERVER_BIN" run \
+        --host "$host" -p "$port" \
+        --require-auth \
+        --auth-file "$htpasswd_file" \
+        --admin-user admin \
+        --enforce-access-control \
+        --completion-check-interval-secs 5 \
+        >"$log_file" 2>&1 &
+    SERVER_PID=$!
+
+    export TORC_API_URL="http://${host}:${port}/torc-service/v1"
+
+    # Wait for server to become healthy
+    wait_for_server "$port" 30
+}
+
+# wait_for_server PORT TIMEOUT_SECONDS
+# Polls the server health endpoint until it responds or timeout.
+wait_for_server() {
+    local port="$1"
+    local timeout="${2:-30}"
+    local elapsed=0
+
+    while [ "$elapsed" -lt "$timeout" ]; do
+        if "$TORC_BIN" ping >/dev/null 2>&1; then
+            echo "Server is healthy (port $port)."
+            return 0
+        fi
+        sleep 1
+        elapsed=$((elapsed + 1))
+    done
+
+    echo "ERROR: Server did not become healthy within ${timeout}s."
+    if [ -f "${RUN_DIR}/server.log" ]; then
+        echo "Server log (last 20 lines):"
+        tail -20 "${RUN_DIR}/server.log"
+    fi
+    return 1
+}
+
+# stop_server
+# Kills the server process if running.
+stop_server() {
+    if [ -n "${SERVER_PID:-}" ] && kill -0 "$SERVER_PID" 2>/dev/null; then
+        echo "Stopping server (PID $SERVER_PID)..."
+        kill "$SERVER_PID" 2>/dev/null || true
+        wait "$SERVER_PID" 2>/dev/null || true
+        SERVER_PID=""
+    fi
+}
+
+# is_server_alive
+# Returns 0 if the server process is still running and responsive, 1 otherwise.
+is_server_alive() {
+    if [ -z "${SERVER_PID:-}" ]; then
+        return 1
+    fi
+    if ! kill -0 "$SERVER_PID" 2>/dev/null; then
+        return 1
+    fi
+    # Also verify the server is responsive
+    "$TORC_BIN" ping >/dev/null 2>&1
+}
+
+# find_free_port
+# Prints a free TCP port. Falls back to a random port in 10000-60000.
+find_free_port() {
+    if command -v python3 &>/dev/null; then
+        python3 -c "
+import socket
+s = socket.socket()
+s.bind(('', 0))
+print(s.getsockname()[1])
+s.close()
+" 2>/dev/null && return
+    fi
+    # Fallback: random port
+    echo $((RANDOM % 50000 + 10000))
+}
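
The fallback branch of `find_free_port` computes `RANDOM % 50000 + 10000`. Bash documents `$RANDOM` as an integer between 0 and 32767, so the modulo never wraps and the result appears to fall in 10000-42767 rather than the 10000-60000 stated in the comment. The sketch below uses a hypothetical helper `fallback_port`, which takes the random value as an argument instead of reading `$RANDOM`, so that both bounds can be checked directly.

```bash
# Hypothetical helper mirroring the fallback arithmetic in find_free_port.
# The random value is passed in so the extremes of $RANDOM can be tested.
fallback_port() {
    local random_value="$1"
    echo $((random_value % 50000 + 10000))
}

echo "lowest:  $(fallback_port 0)"
echo "highest: $(fallback_port 32767)"
```

Unlike the `python3` branch, which asks the kernel for an unused port, this fallback does not check that the chosen port is actually free.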
