[Draft] Add enroot as potential option for deployment #224
kjain14 wants to merge 7 commits into SWE-agent:main
Pull Request Overview
This PR adds Enroot as a new deployment option for SWE-ReX, enabling deployment on HPC clusters using Slurm and Pyxis. The implementation follows the existing deployment pattern and integrates with the SWE-ReX runtime system.
- Implements `EnrootDeployment` class with Slurm job management capabilities
- Adds configuration support for Enroot-specific parameters like sbatch args, pyxis args, and cluster resources
- Includes robust job lifecycle management with cleanup handlers and signal handling
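
As a rough illustration of how sbatch args and pyxis args might be combined into a single job submission, here is a minimal sketch. It is not the code from this PR: `EnrootConfigSketch`, `build_sbatch_command`, and all field names are hypothetical. The only external pieces assumed are the standard `sbatch --parsable`, `--job-name`, and `--wrap` options and the `--container-image` flag that Pyxis adds to `srun`.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class EnrootConfigSketch:
    """Hypothetical stand-in for an Enroot deployment config (not the PR's class)."""

    image: str
    sbatch_args: list[str] = field(default_factory=list)  # e.g. ["--time=00:30:00"]
    pyxis_args: list[str] = field(default_factory=list)  # e.g. ["--container-mounts=/a:/b"]
    job_name: str = "swerex-enroot"


def build_sbatch_command(config: EnrootConfigSketch, inner_command: list[str]) -> list[str]:
    """Build an argv list that submits `inner_command` inside a Pyxis container."""
    # srun picks up the Pyxis plugin flags and starts the container on the allocated node.
    srun = ["srun", f"--container-image={config.image}", *config.pyxis_args, *inner_command]
    # --parsable makes sbatch print only the job ID, which helps with later cleanup.
    return [
        "sbatch",
        "--parsable",
        f"--job-name={config.job_name}",
        *config.sbatch_args,
        f"--wrap={shlex.join(srun)}",
    ]
```

The resulting list could be passed to `subprocess.run(cmd, capture_output=True, text=True)`, and the printed job ID kept so that a cleanup handler can later call `scancel` on it.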
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/swerex/deployment/enroot.py | New deployment implementation with Slurm job submission, container management, and runtime coordination |
| src/swerex/deployment/config.py | Configuration class for Enroot deployment parameters and integration with deployment config union |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Hi @kjain14, thank you very much for this; it looks like quite a bit of work went into it! Absolutely open to having more backends in SWE-ReX, especially if they don't interfere with any of the other deployments. Since this is going to be hard to test end-to-end (because it needs SLURM etc.), I'm curious whether you've tested this on your cluster.
Also, this is specific in that it uses both SLURM and Enroot, I guess? Totally open to expanding in this direction; I just need a bit more context.
Hi @klieret, yes, the idea was to get this to work on SLURM clusters. Enroot is the default container solution for SLURM clusters (it comes bundled with SLURM as far as I can tell): https://github.com/NVIDIA/pyxis. Running Docker on SLURM is a big challenge because it does not interact well with the SLURM scheduler, so supporting an Enroot backend seems to be the only way to get SWE-ReX/SWE-agent to run on these clusters. We have tested it on our clusters. The only part that I am not sure how to solve is the image caching (on our cluster we have a hardcoded path that I just changed to ./images). We probably want to be able to pass in the path somehow.
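
One possible way to make the image cache path configurable instead of hardcoded is sketched below. This is only an illustration under stated assumptions: the environment variable name `SWEREX_ENROOT_IMAGE_DIR`, both function names, and the file-naming scheme are hypothetical and do not come from this PR. The `./images` default mirrors the path mentioned in the comment above.

```python
import os
import re
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical environment variable name, chosen only for this sketch.
ENV_VAR = "SWEREX_ENROOT_IMAGE_DIR"


def resolve_image_cache_dir(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the cache directory: explicit argument, then env var, then ./images."""
    env = os.environ if env is None else env
    raw = explicit or env.get(ENV_VAR) or "./images"
    return Path(raw).expanduser()


def cached_image_path(image: str, cache_dir: Path) -> Path:
    """Map an image reference to a squashfs file name inside the cache directory."""
    # Replace characters that are awkward in file names (":" and "/") with "+".
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "+", image)
    return cache_dir / f"{safe}.sqsh"
```

With this precedence, a per-deployment config field could be passed as `explicit`, while a cluster-wide default could be set once through the environment.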
Going to close this for now, until it is a bit more stable on our end.
Hi @kjain14! We can also leave it open as a draft if you want; that's totally fine with me. I think having some more context on how it's being used etc. would be great before merging it in (especially because it's somewhat specific, I guess). The alternative that I'm also open to would be to add a page in the docs/README where we link to notable forks, blog posts, etc. What do you think?
Sure, I think either of these options works for us. Getting it merged would be nice eventually, but we are also testing it extensively on our end and wanted it to reach a more stable state first (feel free to make it a draft PR).
Goal: add Enroot as a potential deployment option.
This allows SWE-ReX to be run on HPC clusters.