
Commit 7578f71

Fix the watcher (#63)
* Fixes many behaviors in the watcher regarding recovery logic.
* Only collect dmesg output if failures occurred.
* Fix the slurm regenerate command to create optimal Slurm configurations for remaining work.
* Add optimizations to server database queries involving job completions.
* Make the poll intervals consistent between the job-runner and server.
* Improve the summary report.
* Add linting of code and docs.
1 parent e787fb5 commit 7578f71

176 files changed, +10069 -7174 lines changed


.cargo-husky/hooks/pre-commit

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+#!/bin/sh
+#
+# Pre-commit hook for torc project
+# Runs Rust formatting, linting, and markdown formatting
+#
+
+set -e
+
+echo '+cargo fmt -- --check'
+cargo fmt -- --check
+
+echo '+cargo clippy -- -D warnings'
+cargo clippy -- -D warnings
+
+echo '+dprint check'
+dprint check
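
Because the hook runs under `set -e`, a contributor who does not have `dprint` (or `cargo`) installed would see the commit rejected with only the shell's bare "not found" message. One way to make that failure easier to read is a small guard placed before the checks. This is a sketch, not part of the commit, and the helper name `require_tool` is hypothetical:

```bash
#!/bin/sh
# Hypothetical helper (not in the commit): report a missing tool by name
# instead of relying on the shell's generic "not found" error.
require_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "pre-commit: '$1' not found on PATH; install it or commit with --no-verify" >&2
        return 1
    fi
}
```

Calling `require_tool cargo` and `require_tool dprint` right after `set -e` would stop the hook early and name the missing dependency.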

.github/CROSS_COMPILATION_TEST_RESULTS.md

Lines changed: 23 additions & 15 deletions
@@ -2,14 +2,12 @@
 
 ## Test Summary
 
-**Date**: 2025-11-24
-**Status**: ✅ **LIKELY TO SUCCEED** with recommended changes
+**Date**: 2025-11-24 **Status**: ✅ **LIKELY TO SUCCEED** with recommended changes
 
 ## Binaries Tested
 
-`torc` (client + TUI + plot_resources) - Builds successfully
-`torc-server` - Builds successfully
-`torc-slurm-job-runner` - Builds successfully
+`torc` (client + TUI + plot_resources) - Builds successfully ✅ `torc-server` - Builds
+successfully ✅ `torc-slurm-job-runner` - Builds successfully
 
 ## Dependency Analysis
 
@@ -18,6 +16,7 @@
 **Status**: Should work across all platforms
 
 **Analysis**:
+
 - `plotly_kaleido` downloads pre-built binaries at build time
 - Supports all our target platforms:
 - ✅ Linux x86_64
@@ -33,19 +32,23 @@
 **Status**: Will likely fail on musl builds without changes
 
 **Current situation**:
+
 - `torc-server` depends directly on `openssl` crate
 - Main `Cargo.toml` has `openssl-sys` with `vendored` feature for dev-dependencies only
 - Musl builds require vendored OpenSSL or static linking
 
-**Issue**: Cross-compilation with `cross` tool might fail for musl targets if OpenSSL isn't vendored for release builds.
+**Issue**: Cross-compilation with `cross` tool might fail for musl targets if OpenSSL isn't vendored
+for release builds.
 
 ### 3. Other Dependencies - ✅ GOOD
 
 **SQLite** (via rusqlite):
+
 - Uses `bundled` feature - will compile from source
 - Works perfectly for all targets
 
 **TUI dependencies** (ratatui, crossterm):
+
 - Pure Rust, no native dependencies
 - Should work fine everywhere
 
@@ -84,25 +87,25 @@ Set environment variable in the GitHub workflow:
 
 The `cross` tool includes OpenSSL in its Docker images. The workflow may work as-is.
 
-**Recommendation**: Try the workflow as-is first. If it fails with OpenSSL linking errors, implement Option 1.
+**Recommendation**: Try the workflow as-is first. If it fails with OpenSSL linking errors, implement
+Option 1.
 
 ## Potential Issues to Watch
 
 ### 1. Kaleido Download Failures
 
-**Symptom**: Build fails with "failed to download kaleido"
-**Cause**: GitHub rate limiting or network issues
-**Solution**: Builds run on GitHub Actions should have good connectivity
+**Symptom**: Build fails with "failed to download kaleido" **Cause**: GitHub rate limiting or
+network issues **Solution**: Builds run on GitHub Actions should have good connectivity
 
 ### 2. Windows OpenSSL
 
-**Status**: Already handled correctly
-**Evidence**: Existing test workflow successfully builds on Windows using vcpkg
+**Status**: Already handled correctly **Evidence**: Existing test workflow successfully builds on
+Windows using vcpkg
 
 ### 3. Binary Size
 
-**Observation**: Musl builds with vendored OpenSSL will be larger
-**Mitigation**: Already using `--release` flag. Consider adding strip step:
+**Observation**: Musl builds with vendored OpenSSL will be larger **Mitigation**: Already using
+`--release` flag. Consider adding strip step:
 
 ```yaml
 - name: Strip binaries (Unix)
@@ -116,16 +119,19 @@ The `cross` tool includes OpenSSL in its Docker images. The workflow may work as
 ## Test Plan
 
 ### Phase 1: Test workflow as-is
+
 1. Push a test tag: `git tag v0.7.0-test && git push origin v0.7.0-test`
 2. Monitor GitHub Actions
 3. Check if all builds succeed
 
 ### Phase 2: If musl build fails with OpenSSL errors
+
 1. Implement Option 1 (vendored OpenSSL for musl)
 2. Push another test tag: `git tag v0.7.0-test2 && git push origin v0.7.0-test2`
 3. Verify success
 
 ### Phase 3: Validate binaries
+
 Download and test each binary:
 
 ```bash
@@ -144,7 +150,9 @@ Download and test each binary:
 
 ## Conclusion
 
-The workflow is well-designed and should work with high probability. The main risk is OpenSSL linking on musl, but this has known solutions. All other dependencies are either pure Rust or download pre-built binaries.
+The workflow is well-designed and should work with high probability. The main risk is OpenSSL
+linking on musl, but this has known solutions. All other dependencies are either pure Rust or
+download pre-built binaries.
 
 **Confidence Level**: 85% success without changes, 99% with OpenSSL vendoring
 

.github/RELEASE.md

Lines changed: 24 additions & 0 deletions
@@ -5,6 +5,7 @@ This document explains how to use the automated release build system for Torc.
 ## Overview
 
 The release workflow builds binaries for:
+
 - **macOS**: Apple Silicon (aarch64)
 - **Linux**:
 - x86_64-musl (static, works on all distros)
@@ -14,26 +15,32 @@ The release workflow builds binaries for:
 ## Binaries Produced
 
 Each platform build produces three binaries:
+
 1. `torc` - Unified CLI with all features
 2. `torc-server` - Standalone server
 3. `torc-slurm-job-runner` - Slurm job runner
 
 ## Triggering a Release
 
 ### Automatic (Recommended)
+
 Push a version tag to trigger the build:
+
 ```bash
 git tag v0.7.0
 git push origin v0.7.0
 ```
 
 This will:
+
 1. Build binaries for all platforms
 2. Create a draft GitHub release
 3. Upload all binaries to the release
 
 ### Manual
+
 You can also trigger builds manually from the GitHub Actions UI:
+
 1. Go to Actions → "Build Release Binaries"
 2. Click "Run workflow"
 3. Optionally specify a tag name
@@ -45,18 +52,21 @@ Manual builds create artifacts but don't create a GitHub release.
 ### Which Linux binary should users download?
 
 **For maximum compatibility (recommended for most users):**
+
 - Use `torc-x86_64-unknown-linux-musl.tar.gz`
 - This is a fully static binary that works on any Linux distro
 - No external dependencies required
 
 **For better performance on modern systems:**
+
 - Use `torc-x86_64-unknown-linux-gnu.tar.gz`
 - Built on Ubuntu 20.04, compatible with glibc 2.31+
 - Works on Ubuntu 20.04+, Debian 11+, RHEL 8+, etc.
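
The two recommendations above reduce to one rule: the `gnu` archive needs glibc 2.31 or newer, while the `musl` archive runs on any distro. A sketch of how an install script might apply that rule follows. The archive names and the 2.31 threshold come from the text; the function names, and the convention that an empty or unparsable version means "no usable glibc", are assumptions for illustration.

```bash
# Succeeds only for a non-empty string of decimal digits.
is_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
}

# Print the archive to download for a given glibc version string.
# $1: a version such as "2.35"; pass "" when glibc is absent (e.g. Alpine).
# Hypothetical helper: picks the gnu build only where it can run.
pick_torc_archive() {
    major="${1%%.*}"
    rest="${1#*.}"
    minor="${rest%%.*}"
    if is_number "$major" && is_number "$minor" &&
        { [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 31 ]; }; }; then
        echo "torc-x86_64-unknown-linux-gnu.tar.gz"
    else
        echo "torc-x86_64-unknown-linux-musl.tar.gz"
    fi
}
```

On glibc systems `getconf GNU_LIBC_VERSION` prints a string such as `glibc 2.35`, so `pick_torc_archive "$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')"` would be one way to call it. Always downloading the musl archive remains the simpler choice and matches the recommendation for most users.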
 
 ## Adding More Platforms
 
 ### Additional Linux Versions
+
 To support older distros, add entries to the matrix in `.github/workflows/release.yml`:
 
 ```yaml
@@ -67,6 +77,7 @@ To support older distros, add entries to the matrix in `.github/workflows/releas
 ```
 
 ### Intel macOS
+
 To also build for Intel Macs, add:
 
 ```yaml
@@ -77,6 +88,7 @@ To also build for Intel Macs, add:
 ```
 
 ### ARM64 Linux
+
 For ARM64 servers (like AWS Graviton), add:
 
 ```yaml
@@ -89,17 +101,23 @@ For ARM64 servers (like AWS Graviton), add:
 ## Troubleshooting
 
 ### Build fails with OpenSSL errors on Windows
+
 The workflow installs OpenSSL via vcpkg. If it fails:
+
 1. Check that vcpkg is available on the runner
 2. Verify the OpenSSL environment variables are set correctly
 
 ### Build fails with musl linking errors
+
 If you see linker errors with musl:
+
 1. Ensure `cross` is being used (set `use_cross: true`)
 2. Or ensure musl-tools are installed for native builds
 
 ### Binary size is too large
+
 Release binaries include debug symbols. To reduce size:
+
 1. Add strip step to workflow after build:
 ```yaml
 - name: Strip binaries (Unix)
@@ -135,13 +153,17 @@ Expand-Archive torc-x86_64-pc-windows-msvc.zip
 ## Automation Tips
 
 ### Auto-publish releases
+
 To automatically publish releases (instead of drafts), change in release.yml:
+
 ```yaml
 draft: false # Change from true
 ```
 
 ### Build on every push
+
 To build binaries on every push (for testing), add to `on:` section:
+
 ```yaml
 on:
 push:
@@ -150,7 +172,9 @@ on:
 ```
 
 ### Notification on failure
+
 Add a notification step at the end of the build job:
+
 ```yaml
 - name: Notify on failure
 if: failure()

.github/TESTING_WORKFLOW.md

Lines changed: 13 additions & 1 deletion
@@ -15,8 +15,9 @@ git push origin v0.7.0-test1
 Go to: https://github.com/NREL/torc/actions/workflows/release.yml
 
 You should see 4 build jobs running in parallel:
+
 - ✅ Build aarch64-apple-darwin (macOS Apple Silicon)
-- ⚠️ Build x86_64-unknown-linux-musl (Linux static)
+- ⚠️ Build x86_64-unknown-linux-musl (Linux static)
 - ✅ Build x86_64-unknown-linux-gnu (Linux glibc)
 - ✅ Build x86_64-pc-windows-msvc (Windows)
 
@@ -40,6 +41,7 @@ git push origin v0.7.0-test2
 ```
 
 **If kaleido download fails:**
+
 - This is usually a transient network issue
 - Re-run the failed job from GitHub Actions UI
 
@@ -62,13 +64,15 @@ tar xzf torc-aarch64-apple-darwin.tar.gz
 ### 5. Test on different Linux distributions
 
 **Test musl binary on Alpine:**
+
 ```bash
 docker run -it --rm -v $(pwd):/workspace alpine:latest sh
 cd /workspace
 ./torc --version
 ```
 
 **Test glibc binary on Ubuntu 20.04:**
+
 ```bash
 docker run -it --rm -v $(pwd):/workspace ubuntu:20.04 sh
 cd /workspace
@@ -77,6 +81,7 @@ apt-get update && apt-get install -y ca-certificates
 ```
 
 **Test glibc binary on Ubuntu 24.04:**
+
 ```bash
 docker run -it --rm -v $(pwd):/workspace ubuntu:24.04 sh
 cd /workspace
@@ -106,12 +111,14 @@ git push origin v0.7.0
 GitHub Actions has a 6-hour job limit. Our builds should complete in ~10-30 minutes per platform.
 
 If timing out:
+
 - Check if dependencies are being cached (cache hit logs)
 - Consider removing less important targets
 
 ### Wrong binaries in release
 
 Check the glob patterns in `create-release` job:
+
 ```yaml
 files: |
 artifacts/torc-aarch64-apple-darwin/*.tar.gz
@@ -129,6 +136,7 @@ files: |
 ### Testing without creating a release
 
 Use manual trigger:
+
 1. Go to Actions → "Build Release Binaries" → "Run workflow"
 2. Leave tag name empty or use "test"
 3. Artifacts will be created but no release
@@ -152,15 +160,18 @@ git push origin --delete v0.7.0-test1 v0.7.0-test2
 ### Faster builds with better caching
 
 The workflow already caches:
+
 - `~/.cargo/registry` (downloaded crates)
 - `~/.cargo/git` (git dependencies)
 - `target` (compiled artifacts)
 
 Cache is keyed by:
+
 - OS and target triple
 - Cargo.lock hash
 
 To bust cache (if needed):
+
 - Update Cargo.lock: `cargo update`
 - Or manually delete cache from GitHub UI
 
@@ -180,6 +191,7 @@ matrix:
 ## Next Steps
 
 After successful test:
+
 1. Document installation instructions for users
 2. Add checksums (sha256) to release notes
 3. Consider setting up a release schedule
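
For step 2, the checksums could be generated from the downloaded release archives before the release notes are written. A minimal sketch, assuming GNU coreutils `sha256sum` is available (macOS would need `shasum -a 256` instead); the function name and the `SHA256SUMS` output file are illustrative choices, not something the project has adopted:

```bash
# Write a SHA256SUMS file covering every .tar.gz and .zip archive in a directory.
# $1: directory holding the downloaded release archives (hypothetical layout).
write_checksums() {
    dir="$1"
    (
        cd "$dir" || exit 1
        for archive in *.tar.gz *.zip; do
            # A pattern that matched nothing stays literal, so skip non-files.
            if [ -f "$archive" ]; then
                sha256sum "$archive"
            fi
        done > SHA256SUMS
    )
}
```

Users who download the archives together with `SHA256SUMS` could then run `sha256sum -c SHA256SUMS` in that directory to verify them.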
