Skip to content

Add Claude-powered Slack bot for data transfer requests#5

Open
fluidnumerics-joe wants to merge 25 commits intomainfrom
feature/slack-claude
Open

Add Claude-powered Slack bot for data transfer requests#5
fluidnumerics-joe wants to merge 25 commits intomainfrom
feature/slack-claude

Conversation

@fluidnumerics-joe
Copy link
Member

Summary

  • Adds a Slack bot that uses Claude to interpret natural language data transfer requests
  • Users can request transfers, check status, and cancel jobs via Slack mentions
  • Jobs are tracked via Slurm comments linking back to Slack threads
  • Two-phase job submission: prepare job (manifest→shard→render) then transfer array job

New Components

Component Purpose
src/xfer/slackbot/ Bot implementation (app, claude_agent, slurm_tools, config)
docs/slack-app-setup.md Slack app setup guide
examples/allowed_backends.yaml Backend allowlist example
tests/test_slackbot_dryrun.py Dry-run tests (no Slack/Slurm needed)

Features

  • Natural language transfer requests via @xfer-bot
  • Backend validation with support channel escalation for new backend requests
  • Progress tracking from Slurm state files
  • Configurable defaults via environment variables

Test plan

  • Dry-run tests pass (python tests/test_slackbot_dryrun.py)
  • Create Slack app and test with mock sbatch
  • End-to-end test on cluster with real transfers

🤖 Generated with Claude Code

fluidnumerics-joe and others added 25 commits February 4, 2026 14:58
Introduces a Slack bot that uses Claude to interpret user requests
and manage data transfers via xfer/Slurm.

Features:
- Natural language transfer requests via Slack mentions
- Two-phase job submission (prepare → transfer array job)
- Job tracking via Slurm comments (slack:channel/thread)
- Progress reporting from state files
- Backend validation with support channel escalation
- Configurable defaults via environment variables

Components:
- src/xfer/slackbot/ - Bot implementation
  - app.py: Slack Bolt event handlers
  - claude_agent.py: Claude API with tool definitions
  - slurm_tools.py: Slurm/xfer operations
  - config.py: Configuration dataclasses
- docs/slack-app-setup.md - Setup guide
- examples/allowed_backends.yaml - Backend config example
- tests/test_slackbot_dryrun.py - Dry-run tests

Also adds --sbatch-extras to xfer run command for comment injection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add automatic file size analysis to select optimal rclone flags based
  on file distribution (small files, large files, or mixed workloads)
- Add default --stats 600s --progress flags for ETA tracking in logs
- Add user-specified rclone flags support (appended to intelligent defaults)
- Add get_manifest_stats tool for previewing data before transfer
- Add xfer manifest analyze CLI command
- Update documentation with new capabilities

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update system prompt to instruct Claude to use Slack's mrkdwn format
- Add markdown_to_slack() converter as safety net for any standard
  markdown that slips through
- Apply converter to all bot responses before sending

Slack mrkdwn differs from standard markdown:
- Bold: *text* (not **text**)
- Links: <url|text> (not [text](url))
- No header support (converted to bold)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New features:
- Add check_path_exists tool to verify bucket/path accessibility
- If path doesn't exist, automatically notifies support channel
- Detects common errors: NoSuchBucket, AccessDenied, NoSuchKey

Bug fixes:
- Add uv sync to prepare script to ensure environment is up to date
- Use 'uv run xfer' instead of bare 'xfer' commands
- Add XFER_INSTALL_DIR config option for xfer repo location

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add array_concurrency parameter to submit_transfer tool
- Cap maximum at 64 to prevent overloading storage systems
- Users can request lower values (e.g., 32, 16) for gentler transfers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add detailed diagnostics when srun/rclone fails:
- Log the full command that was executed
- Capture all SLURM environment variables
- Print SLURM memory vars to stderr for quick debugging
- Include full stdout and stderr in error log

This helps diagnose environment-related failures like SLURM_MEM_*
variable conflicts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove colons from timestamp format to avoid 'Invalid argument' errors
when creating log files. Changes from ISO format (2026-02-05T00:21:57Z)
to compact format (20260205T002157Z).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use sbatch --export=NONE to avoid inheriting SLURM_* environment
variables from the bot's parent job. This fixes the SLURM_MEM_PER_CPU
vs SLURM_MEM_PER_NODE conflict when the bot runs as a Slurm job.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ensure clean environment for all submitted jobs:
- submit.sh template (submits array job from prepare.sh)
- slurm submit CLI command (direct submission)

All batch scripts already define their required environment variables
via export statements, so they are self-contained.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Slurm sets both SLURM_MEM_PER_NODE (from --mem=16G) and
SLURM_MEM_PER_CPU (from cluster DefMemPerCPU), causing srun to fail.
Unset these at the start of the script before running any srun commands.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Large transfers may need more time for manifest building and sharding.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The bot uses in-memory conversation history to track threads it has
participated in. When the bot restarts, this history is lost and it
stops responding to thread messages from other users asking for updates.

This adds a fallback that fetches thread history via the Slack API when
in-memory history is missing, reconstructing the conversation context
so the Claude agent can still identify the relevant transfer job.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a new tool that allows users to inspect running/completed jobs:
- File size distribution histogram from analysis.json
- Suggested rclone flags determined during manifest analysis
- Tail of prepare job stdout/stderr logs
- Extracted rclone commands that were run

This helps users debug issues and understand what the transfer is doing
without needing direct access to the cluster filesystem.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add table of all available bot tools
- Document read_job_logs capability for viewing histograms and logs
- Document thread-based conversation behavior
- Explain restart resilience with automatic context recovery
- Add troubleshooting entry for thread message issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Allows users to ask the bot what buckets are available at a given
backend, making it easier to discover data sources before transfers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds an early check using check_path_exists before creating the run
directory or submitting to Slurm, so users get immediate feedback
when the source path is invalid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The read_job_logs tool previously only returned prepare job logs, missing
the actual transfer errors in shard logs. Now reads failed shard state
files and their attempt logs, and includes recent shard logs for
non-failed jobs. Also clarifies tool response keys (prepare_stdout,
prepare_stderr) and provides explicit notes when analysis or logs are
missing so the agent can explain the situation to users.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thread replies now go through a fast triage call before responding,
so the bot stays silent when users are talking to each other rather
than addressing the bot. @mentions and DMs bypass triage entirely.
Skipped messages are still stored for conversation context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Record the Slack user ID in request.json when submitting transfer jobs
and enforce ownership checks in cancel_job so only the submitting user
can cancel their own jobs. Legacy jobs without a submitted_by field
remain cancellable by anyone for backwards compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split manifest generation into up to 4 concurrent rclone lsjson workers
to reduce listing time for large datasets. Add `manifest combine` CLI
command, prepare job phase tracking (progress.json), manifest build
progress reporting, and bump prepare resources (250G mem, 4-day limit).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…er job resources

Fix triage model ID (claude-haiku-4-20250414 → claude-haiku-4-5-20251001) that
was causing 404 errors and fail-open duplicate responses. Deduplicate @mention
messages in the thread handler since handle_mention already processes them.
Append S3-specific upload flags when destination is an S3 endpoint. Increase
transfer job defaults to 4-day wall clock, 16 CPUs, and 100G memory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ripts

The slurm render command received sbatch_extras with literal \n separators
but never converted them to actual newlines. This put all #SBATCH directives
on one line, causing Slurm to misparse the --comment value. Transfer array
jobs ended up with wrong/missing comments, so the bot's thread ownership
check failed with "does not belong to this thread".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Derive transfer phase from sacct job states instead of relying purely on
file existence in the run directory. Correlates prepare and transfer jobs
by name convention ({name}-prepare / {name}) and enriches with file-based
shard progress when available. Handles edge cases like shards finishing
before the Slurm job exits, and completed jobs with partial shard failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant