When srun is disabled, assign GPU jobs a unique subset of devices and export CUDA/HIP/ROCR_VISIBLE_DEVICES to prevent all jobs defaulting to GPU0.
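A minimal sketch of the idea in this commit, assuming a simple in-memory pool of free device indices. The helper names `allocate_gpus` and `gpu_env` are hypothetical and are not taken from the torc codebase; only the three environment variable names come from the commit message.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: take `count` free GPU indices out of the pool.
/// Returns None (and leaves the pool untouched) if too few devices are free.
fn allocate_gpus(free: &mut BTreeSet<u32>, count: usize) -> Option<Vec<u32>> {
    if free.len() < count {
        return None;
    }
    // Lowest-numbered free devices first.
    let picked: Vec<u32> = free.iter().copied().take(count).collect();
    for id in &picked {
        free.remove(id);
    }
    Some(picked)
}

/// Build the environment variables that restrict a job to its own devices.
fn gpu_env(devices: &[u32]) -> Vec<(&'static str, String)> {
    let list = devices
        .iter()
        .map(|d| d.to_string())
        .collect::<Vec<_>>()
        .join(",");
    ["CUDA_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES"]
        .iter()
        .map(|name| (*name, list.clone()))
        .collect()
}
```

The resulting pairs could be passed to `std::process::Command::envs` when spawning the job, so that two concurrent jobs never see the same device list.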
Add a persisted task model for long-running server operations and support async workflow initialization.

- Add async_handles table and /tasks/{id} endpoint (404 on unauthorized to avoid enumeration)
- Prevent duplicate concurrent initialize_jobs tasks via partial unique index
- Add TaskStatus enum for type-safe task status handling
- Add CLI support (torc tasks wait) and integration tests
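To illustrate what a type-safe status might look like, here is a sketch of a `TaskStatus` enum with string conversion. The variant names and their string forms are assumptions made for illustration; the commit names the enum but this thread does not show its variants.

```rust
use std::str::FromStr;

/// Hypothetical variants; the real TaskStatus may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskStatus {
    Queued,
    Running,
    Succeeded,
    Failed,
}

impl TaskStatus {
    /// String form as it might be stored in the status column.
    fn as_str(self) -> &'static str {
        match self {
            TaskStatus::Queued => "queued",
            TaskStatus::Running => "running",
            TaskStatus::Succeeded => "succeeded",
            TaskStatus::Failed => "failed",
        }
    }

    /// A terminal task will not change state again, so polling can stop.
    fn is_terminal(self) -> bool {
        matches!(self, TaskStatus::Succeeded | TaskStatus::Failed)
    }
}

impl FromStr for TaskStatus {
    type Err = String;

    /// Reject unknown strings instead of silently accepting them.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "queued" => Ok(TaskStatus::Queued),
            "running" => Ok(TaskStatus::Running),
            "succeeded" => Ok(TaskStatus::Succeeded),
            "failed" => Ok(TaskStatus::Failed),
            other => Err(format!("unknown task status: {}", other)),
        }
    }
}
```

Parsing at the boundary means the rest of the code can match on the enum exhaustively rather than comparing raw strings.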
Remove the unreachable 403 response variant for GET /tasks/{id} so
authorization failures return 404 and do not leak existence.
Add a typed 404 variant for the Rust client get_task API.
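The two commits above can be sketched together: on the server side, a missing task and a task the caller may not see produce the same 404, and on the client side that 404 maps to a typed variant. The `Task` struct, the `owner` field, and the `GetTaskError` variants shown here are assumptions for illustration, not the actual torc types.

```rust
/// Hypothetical task record; the real async_handles schema is not shown here.
struct Task {
    id: i64,
    owner: String,
}

/// Server side: unauthorized and missing are indistinguishable to the caller,
/// so task IDs cannot be enumerated by probing for 403 vs 404.
fn get_task_status_code(task: Option<&Task>, caller: &str) -> u16 {
    match task {
        Some(t) if t.owner == caller => 200,
        _ => 404,
    }
}

/// Hypothetical client-side error type with a typed 404 variant.
#[derive(Debug, PartialEq)]
enum GetTaskError {
    NotFound,
    Unexpected(u16),
}

/// Client side: derive the error variant from the HTTP status code.
fn map_get_task_status(status: u16) -> Result<(), GetTaskError> {
    match status {
        200 => Ok(()),
        404 => Err(GetTaskError::NotFound),
        other => Err(GetTaskError::Unexpected(other)),
    }
}
```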
Includes pre-commit auto-fixes (whitespace/EOF/formatting) across the repo.
Retry transient get_task failures (network/5xx/408/429) with exponential backoff instead of exiting immediately. Add an access-control test ensuring unauthorized task polling returns 404 (not 403). Enforce async_handles.status via a SQLite CHECK constraint. Fix GetTaskError typing by mapping variants from HTTP status.
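The retry policy described in this commit could look roughly like the following. This is a sketch using only the standard library: `is_transient`, `backoff_delay`, and `retry` are hypothetical names, a failure is modeled as `Option<u16>` (None for a network error with no HTTP response), and the base delay and cap are caller-chosen assumptions.

```rust
use std::thread;
use std::time::Duration;

/// Network errors (no status), 408, 429, and 5xx are treated as transient.
fn is_transient(status: Option<u16>) -> bool {
    match status {
        None => true,
        Some(408) | Some(429) => true,
        Some(code) => (500..600).contains(&code),
    }
}

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Call `op` until it succeeds, fails permanently, or `max_attempts` is reached.
fn retry<T, F>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: F,
) -> Result<T, Option<u16>>
where
    F: FnMut() -> Result<T, Option<u16>>,
{
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(status) => {
                // Give up on permanent errors or when attempts are exhausted.
                if !is_transient(status) || attempt + 1 >= max_attempts {
                    return Err(status);
                }
                thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

A production version would likely add jitter to the delay so that many clients polling the same server do not retry in lockstep.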
[[Back to Model list]](../README.md#models) [[Back to API list]](../README.md#api-endpoints) [[Back to README]](../README.md)
These Julia and Python files are auto-generated by running bash make_api_clients.sh from the api directory (Docker required). Whatever you did to remove this extra whitespace needs to be disabled unless we fix it at the source. Otherwise, we'll be ping-ponging constantly.
shell
};
if let Some(v) = gpu_visible_devices {
Is this showing up because the branch is out-of-date with respect to main?
}
let _run_id = self.bump_run_id()?;
self.initialize_files()?;
self.initialize_jobs_async(false)
Does this mean that if the user runs
torc workflows initialize
torc workflows reinitialize
the operation returns immediately? I'm wondering if
- The initialize function should be changed to call this function and then wait until the operation is done.
- All CLI handlers should call initialize and not initialize_async.
There may be a problem currently. When the user calls torc run or torc submit, we call initialize, which will fail on huge workflows.
The main point is this: it is OK for the CLI commands to block for multiple minutes. If we need the async behavior, we could add torc workflows initialize --async.
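One way to read the suggestion above is a single entry point that always starts the async task and, unless the caller opts into async behavior, polls until the task finishes. The sketch below assumes a hypothetical `InitBackend` trait and `PollState` enum; neither is part of the torc API, and `FakeBackend` exists only to make the example self-contained.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical task states as seen by the poller.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PollState {
    Running,
    Succeeded,
    Failed,
}

/// Hypothetical abstraction over "start the task" and "check the task".
trait InitBackend {
    fn start_initialize(&mut self) -> u64;
    fn poll(&mut self, task_id: u64) -> PollState;
}

/// Start initialization; block until it completes unless `run_async` is set.
fn initialize<B: InitBackend>(
    backend: &mut B,
    interval: Duration,
    run_async: bool,
) -> Result<u64, String> {
    let task_id = backend.start_initialize();
    if run_async {
        // Caller gets the ID and checks the status later.
        return Ok(task_id);
    }
    loop {
        match backend.poll(task_id) {
            PollState::Succeeded => return Ok(task_id),
            PollState::Failed => return Err(format!("task {} failed", task_id)),
            PollState::Running => thread::sleep(interval),
        }
    }
}

/// Stand-in backend that reports success after a fixed number of polls.
struct FakeBackend {
    polls_until_done: u32,
    polls_seen: u32,
}

impl InitBackend for FakeBackend {
    fn start_initialize(&mut self) -> u64 {
        42
    }

    fn poll(&mut self, _task_id: u64) -> PollState {
        self.polls_seen += 1;
        if self.polls_seen >= self.polls_until_done {
            PollState::Succeeded
        } else {
            PollState::Running
        }
    }
}
```

With this shape, handlers such as run and submit would call the blocking path by default, and only an explicit flag would return the task ID immediately.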
Ask Codex to review the code using
Reinitializing a workflow can take a very long time, so it now runs asynchronously. It reuses the SSE workflow that is already implemented and returns an ID that the user can then use to check the status.