Add Slurm job state-based phase detection for transfer status
Derive transfer phase from sacct job states instead of relying purely on
file existence in the run directory. Correlates prepare and transfer jobs
by name convention ({name}-prepare / {name}) and enriches with file-based
shard progress when available. Handles edge cases like shards finishing
before the Slurm job exits, and completed jobs with partial shard failures.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
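
The correlation and phase derivation described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `normalize_state`, `pair_jobs`, and `derive_phase` are hypothetical names, the input rows are assumed to be dicts with `name` and `state` keys already parsed from sacct output, and the `"unknown"` fallback is a placeholder for cases the visible diff does not specify. The phase strings are the ones listed in the `check_status` tool description.

```python
from typing import Optional

# Slurm states that the tool description maps to the "failed" phase.
FAILED_STATES = {"FAILED", "CANCELLED", "TIMEOUT"}


def normalize_state(state: Optional[str]) -> Optional[str]:
    """Reduce an sacct State value such as 'CANCELLED by 1000' to its first word."""
    parts = (state or "").split()
    return parts[0].rstrip("+") if parts else None


def pair_jobs(rows: list) -> dict:
    """Group rows by the {name}-prepare / {name} naming convention."""
    groups: dict = {}
    for row in rows:
        name = row["name"]
        if name.endswith("-prepare"):
            base, role = name[: -len("-prepare")], "prepare"
        else:
            base, role = name, "transfer"
        entry = groups.setdefault(base, {"prepare": None, "transfer": None})
        entry[role] = normalize_state(row["state"])
    return groups


def derive_phase(prepare: Optional[str], transfer: Optional[str],
                 failed_shards: int = 0) -> str:
    """Map a prepare/transfer state pair to one phase string (assumed mapping)."""
    if transfer is None:
        if prepare == "RUNNING":
            return "preparing"
        if prepare == "COMPLETED":
            return "prepare_complete"  # prepare finished, no transfer job found
        return "unknown"  # placeholder: not specified in the visible diff
    if transfer in FAILED_STATES:
        return "failed"
    if transfer == "PENDING":
        return "waiting_to_start"
    if transfer == "RUNNING":
        return "transferring"
    if transfer == "COMPLETED":
        # File-based shard results refine an otherwise successful job.
        return "complete_with_failures" if failed_shards > 0 else "complete"
    return "unknown"  # placeholder for states not covered above
```

Calling `derive_phase(**group)` on each value returned by `pair_jobs` yields one phase per transfer name. The sketch does not cover the case where shards finish before the Slurm job exits, since the visible diff does not show how the commit handles it.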
src/xfer/slackbot/claude_agent.py: 16 additions & 12 deletions
@@ -23,6 +23,7 @@
     get_source_stats,
     get_transfer_progress,
     get_transfer_progress_by_job,
+    get_transfer_status_by_thread,
     list_buckets,
     submit_transfer,
 )
@@ -84,9 +85,20 @@
     "name": "check_status",
     "description": """Check the status of transfer jobs in this thread. Use this when the user asks about job status, progress, or wants to know if their transfer is complete.
 
-This tool finds all jobs associated with the current Slack thread and returns their status.
+This tool finds all jobs associated with the current Slack thread and returns their status. It correlates prepare and transfer jobs by name and derives a unified phase from Slurm job states, enriched with file-based shard progress when available.
 
-When the phase is "building_manifest", the prepare job is listing files at the source. This can take up to several days for large datasets and is normal. The response may include files_listed and bytes_listed if the JSONL writing phase has started, or prepare_phase/prepare_detail for finer-grained progress. Only flag a concern if the job has been in this phase for more than 48 hours with no observable progress.""",
+- "prepare_complete" — prepare finished but no transfer job found (anomaly)
+- "waiting_to_start" — prepare done, transfer job is queued
+- "transferring" — transfer job is running
+- "complete" — transfer finished successfully
+- "complete_with_failures" — transfer job completed but some shards failed
+- "failed" — transfer job FAILED/CANCELLED/TIMEOUT
+
+When the phase is "preparing", the prepare job may be listing files at the source. This can take up to several days for large datasets and is normal. The prepare job has a 4-day time limit. Only flag a concern if the prepare job has been running for more than 48 hours with no progress. The progress field may include files_listed, bytes_listed, or prepare_phase/prepare_detail for finer-grained tracking.""",
     "input_schema": {
         "type": "object",
         "properties": {
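
The 48-hour guidance in the tool description above could be expressed as a small predicate. This is only a sketch under stated assumptions: `should_flag_prepare` is a hypothetical helper that does not appear in this diff, and "no progress" is interpreted here as none of the progress fields named in the description being populated.

```python
from datetime import timedelta
from typing import Optional

# Threshold taken from the tool description; the prepare job itself has a 4-day limit.
STALL_THRESHOLD = timedelta(hours=48)

# Progress fields named in the tool description.
PROGRESS_FIELDS = ("files_listed", "bytes_listed", "prepare_phase", "prepare_detail")


def should_flag_prepare(phase: str, running_for: timedelta,
                        progress: Optional[dict]) -> bool:
    """Return True only for a long-running prepare job with no reported progress."""
    if phase != "preparing":
        return False
    # Assumption: any populated progress field counts as observable progress.
    has_progress = any((progress or {}).get(field) for field in PROGRESS_FIELDS)
    return running_for > STALL_THRESHOLD and not has_progress
```

For example, a prepare job that has run for 72 hours with an empty progress dict would be flagged, while the same job reporting a non-zero `files_listed` would not.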
@@ -446,16 +458,8 @@ def execute_tool(
                 return json.dumps(job.to_dict())
             return json.dumps({"error": f"Job {job_id} not found"})
         else:
-            # Get all jobs for this thread with progress info
-            jobs = get_jobs_by_thread(channel_id, thread_ts)
-            results = []
-            for job in jobs:
-                if job.work_dir:
-                    progress = get_transfer_progress_by_job(job.job_id)
-                    if progress:
-                        results.append(progress)
-                        continue
-                results.append(job.to_dict())
+            # Get grouped status for all transfers in this thread