Reinitialization async #225

Open
nkeilbart wants to merge 4 commits into NatLabRockies:main from nkeilbart:reinitialization_async

Conversation

@nkeilbart
Collaborator

Reinitializing a workflow is a very lengthy process, so it now runs asynchronously. It reuses the SSE workflow already implemented and returns an ID that the user can then use to check the status.

When srun is disabled, assign GPU jobs a unique subset of devices and export CUDA/HIP/ROCR_VISIBLE_DEVICES to prevent all jobs defaulting to GPU0.
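
A minimal sketch of the device-partitioning idea in this commit message, not the code in this PR. `GpuPool` and `visible_devices_env` are hypothetical names, and the lowest-index-first policy is an assumption:

```rust
use std::collections::BTreeSet;

/// Hypothetical pool of free GPU indices on one node (not the PR's actual type).
struct GpuPool {
    free: BTreeSet<u32>,
}

impl GpuPool {
    fn new(num_gpus: u32) -> Self {
        GpuPool { free: (0..num_gpus).collect() }
    }

    /// Reserve `count` devices, lowest index first; None if too few are free.
    fn allocate(&mut self, count: usize) -> Option<Vec<u32>> {
        if count > self.free.len() {
            return None;
        }
        let picked: Vec<u32> = self.free.iter().take(count).copied().collect();
        for id in &picked {
            self.free.remove(id);
        }
        Some(picked)
    }

    /// Return devices to the pool when the job finishes.
    fn release(&mut self, devices: &[u32]) {
        self.free.extend(devices.iter().copied());
    }
}

/// Environment variables to export so the job only sees its own devices.
fn visible_devices_env(devices: &[u32]) -> Vec<(&'static str, String)> {
    const NAMES: [&str; 3] = [
        "CUDA_VISIBLE_DEVICES",
        "HIP_VISIBLE_DEVICES",
        "ROCR_VISIBLE_DEVICES",
    ];
    let list = devices
        .iter()
        .map(|d| d.to_string())
        .collect::<Vec<_>>()
        .join(",");
    NAMES.iter().map(|name| (*name, list.clone())).collect()
}
```

With four GPUs and two jobs asking for two devices each, the first job would get `0,1` and the second `2,3`, so neither falls back to GPU 0 by default.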
Add a persisted task model for long-running server operations and support async workflow initialization.

- Add async_handles table and /tasks/{id} endpoint (404 on unauthorized to avoid enumeration)
- Prevent duplicate concurrent initialize_jobs tasks via partial unique index
- Add TaskStatus enum for type-safe task status handling
- Add CLI support (torc tasks wait) and integration tests
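
To illustrate the task model described above, here is a rough in-memory sketch. The status names (`queued`, `running`, `succeeded`, `failed`) and the `TaskStore` type are assumptions for illustration; the PR does not list the real values here, and the actual uniqueness rule is enforced by a partial unique index in the database rather than in Rust:

```rust
use std::collections::HashMap;
use std::fmt;
use std::str::FromStr;

/// Status values are assumed; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskStatus {
    Queued,
    Running,
    Succeeded,
    Failed,
}

impl TaskStatus {
    /// A task is "active" until it reaches a terminal state.
    fn is_active(self) -> bool {
        matches!(self, TaskStatus::Queued | TaskStatus::Running)
    }
}

impl fmt::Display for TaskStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = match self {
            TaskStatus::Queued => "queued",
            TaskStatus::Running => "running",
            TaskStatus::Succeeded => "succeeded",
            TaskStatus::Failed => "failed",
        };
        f.write_str(s)
    }
}

impl FromStr for TaskStatus {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "queued" => Ok(TaskStatus::Queued),
            "running" => Ok(TaskStatus::Running),
            "succeeded" => Ok(TaskStatus::Succeeded),
            "failed" => Ok(TaskStatus::Failed),
            other => Err(format!("unknown task status: {other}")),
        }
    }
}

/// In-memory stand-in for the table plus its partial unique index:
/// at most one active task per (workflow_id, operation).
#[derive(Default)]
struct TaskStore {
    next_id: u64,
    tasks: HashMap<u64, (i64, String, TaskStatus)>,
}

impl TaskStore {
    /// Create a task, rejecting a duplicate of one that is still active.
    fn create(&mut self, workflow_id: i64, operation: &str) -> Result<u64, String> {
        let duplicate = self.tasks.values().any(|(wf, op, status)| {
            *wf == workflow_id && op == operation && status.is_active()
        });
        if duplicate {
            return Err(format!(
                "{operation} already in progress for workflow {workflow_id}"
            ));
        }
        self.next_id += 1;
        self.tasks.insert(
            self.next_id,
            (workflow_id, operation.to_string(), TaskStatus::Queued),
        );
        Ok(self.next_id)
    }

    fn set_status(&mut self, id: u64, status: TaskStatus) {
        if let Some(task) = self.tasks.get_mut(&id) {
            task.2 = status;
        }
    }
}
```

Once a task reaches a terminal status, a new `initialize_jobs` task for the same workflow is accepted again, which is the behavior a partial index over only the active rows would give.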
Remove the unreachable 403 response variant for GET /tasks/{id} so
authorization failures return 404 and do not leak existence.
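
The rule in this commit message can be sketched as follows. The `Task` struct and the owner-equality check are hypothetical simplifications; the server's real access control may be richer:

```rust
/// Hypothetical task record; field names are illustrative only.
struct Task {
    id: u64,
    owner: String,
}

/// HTTP status for GET /tasks/{id} under the "never 403" rule.
fn get_task_status_code(tasks: &[Task], id: u64, requester: &str) -> u16 {
    match tasks.iter().find(|t| t.id == id) {
        Some(task) if task.owner == requester => 200,
        // A missing task and a task owned by someone else are
        // indistinguishable to the caller.
        _ => 404,
    }
}
```

Because both failure cases produce the same response, a caller cannot probe IDs to learn which tasks exist.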

Add a typed 404 variant for the Rust client get_task API.

Includes pre-commit auto-fixes (whitespace/EOF/formatting) across the repo.
Retry transient get_task failures (network/5xx/408/429) with
exponential backoff instead of exiting immediately.
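
A minimal sketch of the retry policy described here, assuming a simplified `PollError` type rather than the generated client's error type. The doubling factor and the attempt limit are assumptions:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Simplified error for a polling call; not the generated client's type.
#[derive(Debug, PartialEq)]
enum PollError {
    Network(String),
    Http(u16),
}

/// Transient failures: network errors, any 5xx, 408, and 429.
fn is_transient(err: &PollError) -> bool {
    match err {
        PollError::Network(_) => true,
        PollError::Http(code) => {
            (500..=599).contains(code) || *code == 408 || *code == 429
        }
    }
}

/// Call `op` until it succeeds, fails permanently, or attempts run out.
/// The delay doubles after each transient failure.
fn retry_with_backoff<T, F>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: F,
) -> Result<T, PollError>
where
    F: FnMut() -> Result<T, PollError>,
{
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if is_transient(&e) && attempt < max_attempts => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            // Non-transient error, or the retry budget is exhausted.
            Err(e) => return Err(e),
        }
    }
}
```

A 404 is returned to the caller on the first attempt, while a 503 is retried until it succeeds or `max_attempts` is reached.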

Add an access-control test ensuring unauthorized task polling returns
404 (not 403).

Enforce async_handles.status via a SQLite CHECK constraint.
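
For readers unfamiliar with this kind of constraint, the helper below builds the general shape of a SQL CHECK clause. `status_check_clause` is a hypothetical function and the status values passed to it are placeholders, not the values in the PR's migration:

```rust
/// Build a CHECK clause restricting `column` to the given string values.
/// Single quotes inside a value are doubled, as SQL string literals require.
fn status_check_clause(column: &str, allowed: &[&str]) -> String {
    let quoted: Vec<String> = allowed
        .iter()
        .map(|v| format!("'{}'", v.replace('\'', "''")))
        .collect();
    format!("CHECK ({} IN ({}))", column, quoted.join(", "))
}
```

Calling it with `"status"` and `["queued", "running"]` yields `CHECK (status IN ('queued', 'running'))`, which the database then enforces on every insert and update regardless of which client writes the row.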

Fix GetTaskError typing by mapping variants from HTTP status.

[[Back to Model list]](../README.md#models) [[Back to API list]](../README.md#api-endpoints) [[Back to README]](../README.md)


Collaborator


These Julia and Python files are auto-generated by running bash make_api_clients.sh from the api directory (Docker required). Whatever you did to remove this extra whitespace needs to be disabled unless we fix it at the source. Otherwise, we'll be ping-ponging constantly.

shell
};

if let Some(v) = gpu_visible_devices {
Collaborator


Is this showing up because the branch is out-of-date with respect to main?

}
let _run_id = self.bump_run_id()?;
self.initialize_files()?;
self.initialize_jobs_async(false)
Collaborator


Does this mean that if the user runs

torc workflows initialize
torc workflows reinitialize

the operation returns immediately? I'm wondering if

  1. The initialize function should be changed to call this function and then wait until the operation is done.
  2. All CLI handlers should call initialize and not initialize_async.

There may be a problem currently. When the user calls torc run or torc submit, we call initialize, which will fail on huge workflows.

The main point is this: it is ok for the CLI commands to be blocked for multiple minutes. If we need the async behavior, we could add torc workflows initialize --async.
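
The suggestion above (a blocking `initialize` that starts the async task and waits for it) could look roughly like this. It is a sketch with an in-memory map standing in for the server; the signatures, the `State` enum, and the sleep that replaces real work are all assumptions:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Running,
    Done,
}

/// Stand-in for the server side: task id -> state.
type Tasks = Arc<Mutex<HashMap<u64, State>>>;

/// Start the work in the background and return the task id immediately.
fn initialize_async(tasks: &Tasks, id: u64, work: Duration) -> u64 {
    tasks.lock().unwrap().insert(id, State::Running);
    let shared = Arc::clone(tasks);
    thread::spawn(move || {
        thread::sleep(work); // placeholder for the real job initialization
        shared.lock().unwrap().insert(id, State::Done);
    });
    id
}

/// Blocking variant: start the async task, then poll until it finishes.
fn initialize(tasks: &Tasks, id: u64, work: Duration, poll: Duration) -> State {
    let task_id = initialize_async(tasks, id, work);
    loop {
        let state = *tasks.lock().unwrap().get(&task_id).unwrap();
        if state == State::Done {
            return state;
        }
        thread::sleep(poll);
    }
}
```

With this split, CLI handlers would call `initialize` and block, and only an explicit opt-in such as the proposed `--async` flag would call `initialize_async` directly.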

@daniel-thom
Collaborator

Ask Codex to review the code using .claude/skills/review-api/SKILL.md
