Add Claude-powered Slack bot for data transfer requests by fluidnumerics-joe · Pull Request #5 · FluidNumerics/xfer

fluidnumerics-joe · 2026-02-04T20:00:53Z

Summary

- Adds a Slack bot that uses Claude to interpret natural language data transfer requests
- Users can request transfers, check status, and cancel jobs via Slack mentions
- Jobs are tracked via Slurm comments linking back to Slack threads
- Two-phase job submission: a prepare job (manifest → shard → render), then a transfer array job

New Components

| Component | Purpose |
| --- | --- |
| `src/xfer/slackbot/` | Bot implementation (app, claude_agent, slurm_tools, config) |
| `docs/slack-app-setup.md` | Slack app setup guide |
| `examples/allowed_backends.yaml` | Backend allowlist example |
| `tests/test_slackbot_dryrun.py` | Dry-run tests (no Slack/Slurm needed) |

Features

- Natural language transfer requests via @xfer-bot
- Backend validation with support channel escalation for new backend requests
- Progress tracking from Slurm state files
- Configurable defaults via environment variables

Test plan

- Dry-run tests pass (`python tests/test_slackbot_dryrun.py`)
- Create Slack app and test with mock sbatch
- End-to-end test on cluster with real transfers

🤖 Generated with Claude Code

Introduces a Slack bot that uses Claude to interpret user requests and manage data transfers via xfer/Slurm.

Features:
- Natural language transfer requests via Slack mentions
- Two-phase job submission (prepare → transfer array job)
- Job tracking via Slurm comments (slack:channel/thread)
- Progress reporting from state files
- Backend validation with support channel escalation
- Configurable defaults via environment variables

Components:
- src/xfer/slackbot/ - Bot implementation
  - app.py: Slack Bolt event handlers
  - claude_agent.py: Claude API with tool definitions
  - slurm_tools.py: Slurm/xfer operations
  - config.py: Configuration dataclasses
- docs/slack-app-setup.md - Setup guide
- examples/allowed_backends.yaml - Backend config example
- tests/test_slackbot_dryrun.py - Dry-run tests

Also adds --sbatch-extras to the xfer run command for comment injection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
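The `slack:channel/thread` comment convention mentioned above could be encoded and decoded roughly as follows. This is a minimal sketch: only the `slack:` prefix and the channel/thread ordering come from the commit message, while the function names `make_job_comment` and `parse_job_comment` are hypothetical and not taken from the PR.

```python
from typing import Optional, Tuple

# Prefix taken from the commit message; everything else here is assumed.
COMMENT_PREFIX = "slack:"


def make_job_comment(channel: str, thread_ts: str) -> str:
    """Build the Slurm --comment value that links a job to a Slack thread."""
    return f"{COMMENT_PREFIX}{channel}/{thread_ts}"


def parse_job_comment(comment: str) -> Optional[Tuple[str, str]]:
    """Return (channel, thread_ts), or None if the comment is not ours."""
    if not comment.startswith(COMMENT_PREFIX):
        return None
    channel, sep, thread_ts = comment[len(COMMENT_PREFIX):].partition("/")
    if not sep or not channel or not thread_ts:
        return None
    return channel, thread_ts
```

A thread ownership check can then compare the parsed pair against the channel and thread of the incoming Slack message.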

- Add automatic file size analysis to select optimal rclone flags based on file distribution (small files, large files, or mixed workloads)
- Add default --stats 600s --progress flags for ETA tracking in logs
- Add user-specified rclone flags support (appended to intelligent defaults)
- Add get_manifest_stats tool for previewing data before transfer
- Add xfer manifest analyze CLI command
- Update documentation with new capabilities

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
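As an illustration of size-based flag selection, the sketch below classifies a manifest as small-file, large-file, or mixed and picks rclone flags accordingly. The thresholds, the 80% cutoff, the chosen flag values, and the name `suggest_rclone_flags` are all assumptions made for this example; the PR does not state the actual heuristics. Only the `--stats 600s --progress` defaults and the "user flags are appended" behaviour come from the commit message.

```python
from typing import List, Optional

SMALL_FILE = 1024 ** 2   # 1 MiB, assumed threshold
LARGE_FILE = 1024 ** 3   # 1 GiB, assumed threshold
DEFAULT_FLAGS = ["--stats", "600s", "--progress"]  # defaults named in the commit


def suggest_rclone_flags(sizes_bytes: List[int],
                         user_flags: Optional[List[str]] = None) -> List[str]:
    """Pick rclone flags from the file size distribution (illustrative only)."""
    flags = list(DEFAULT_FLAGS)
    if sizes_bytes:
        n = len(sizes_bytes)
        small = sum(1 for s in sizes_bytes if s < SMALL_FILE) / n
        large = sum(1 for s in sizes_bytes if s >= LARGE_FILE) / n
        if small >= 0.8:
            # Many small files: favour parallel transfers and checkers.
            flags += ["--transfers", "32", "--checkers", "64"]
        elif large >= 0.8:
            # Few large files: favour multi-threaded streams per file.
            flags += ["--transfers", "4", "--multi-thread-streams", "8"]
        else:
            # Mixed workload: take a middle setting.
            flags += ["--transfers", "16", "--checkers", "32"]
    # User-specified flags are appended after the computed defaults.
    return flags + list(user_flags or [])
```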

- Update system prompt to instruct Claude to use Slack's mrkdwn format
- Add markdown_to_slack() converter as safety net for any standard markdown that slips through
- Apply converter to all bot responses before sending

Slack mrkdwn differs from standard markdown:
- Bold: `*text*` (not `**text**`)
- Links: `<url|text>` (not `[text](url)`)
- No header support (converted to bold)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
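The three conversions listed above can be sketched with regular expressions. This is not the PR's `markdown_to_slack()` implementation, which is not shown on this page; it is a minimal stand-in that covers only bold, links, and headers, and it does not attempt to skip code spans or fenced blocks.

```python
import re


def markdown_to_slack(text: str) -> str:
    """Convert a small subset of standard markdown to Slack mrkdwn."""
    # Bold: **text** -> *text*
    text = re.sub(r"\*\*(.+?)\*\*", r"*\1*", text)
    # Links: [text](url) -> <url|text>
    text = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)", r"<\2|\1>", text)
    # Headers: "## Title" -> "*Title*" (mrkdwn has no header syntax).
    # Strip asterisks so an already-bold heading is not double-wrapped.
    text = re.sub(
        r"^#{1,6}[ \t]+(.+?)[ \t]*$",
        lambda m: "*" + m.group(1).strip("*") + "*",
        text,
        flags=re.MULTILINE,
    )
    return text
```

Because mrkdwn-formatted input contains none of these patterns, running the converter on a response that is already in mrkdwn should leave it unchanged, which is what makes it usable as a safety net.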

New features:
- Add check_path_exists tool to verify bucket/path accessibility
- If path doesn't exist, automatically notifies support channel
- Detects common errors: NoSuchBucket, AccessDenied, NoSuchKey

Bug fixes:
- Add uv sync to prepare script to ensure environment is up to date
- Use 'uv run xfer' instead of bare 'xfer' commands
- Add XFER_INSTALL_DIR config option for xfer repo location

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add array_concurrency parameter to submit_transfer tool
- Cap maximum at 64 to prevent overloading storage systems
- Users can request lower values (e.g., 32, 16) for gentler transfers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
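A concurrency cap like this maps naturally onto Slurm's `--array=<range>%<limit>` syntax, where the value after `%` bounds the number of simultaneously running array tasks. The sketch below shows one way to clamp a requested value; the cap of 64 is from the commit message, while the helper names and the choice to floor at 1 are assumptions.

```python
from typing import Optional

MAX_ARRAY_CONCURRENCY = 64  # cap stated in the commit message


def clamp_array_concurrency(requested: Optional[int] = None) -> int:
    """Clamp a user-requested concurrency into the range [1, 64]."""
    if requested is None:
        return MAX_ARRAY_CONCURRENCY
    return max(1, min(int(requested), MAX_ARRAY_CONCURRENCY))


def array_spec(num_shards: int, concurrency: int) -> str:
    """Format the value for sbatch --array, e.g. '0-99%32'."""
    return f"0-{num_shards - 1}%{concurrency}"
```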

Add detailed diagnostics when srun/rclone fails:
- Log the full command that was executed
- Capture all SLURM environment variables
- Print SLURM memory vars to stderr for quick debugging
- Include full stdout and stderr in error log

This helps diagnose environment-related failures like SLURM_MEM_* variable conflicts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Remove colons from timestamp format to avoid 'Invalid argument' errors when creating log files. Changes from ISO format (2026-02-05T00:21:57Z) to compact format (20260205T002157Z). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
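The compact format described above corresponds to the `strftime` pattern `%Y%m%dT%H%M%SZ`. A minimal sketch, assuming the run ID is derived from the current UTC time (the function name `make_run_id` is hypothetical):

```python
from datetime import datetime, timezone
from typing import Optional


def make_run_id(now: Optional[datetime] = None) -> str:
    """Return a colon-free UTC timestamp such as 20260205T002157Z."""
    now = now or datetime.now(timezone.utc)
    # Normalise to UTC so the trailing 'Z' is accurate for aware datetimes.
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
```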

Use sbatch --export=NONE to avoid inheriting SLURM_* environment variables from the bot's parent job. This fixes the SLURM_MEM_PER_CPU vs SLURM_MEM_PER_NODE conflict when the bot runs as a Slurm job. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
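For illustration, a submission helper that always passes `--export=NONE` might look like the sketch below. `--export=NONE`, `--comment`, and `--parsable` are standard sbatch options; the function names and the decision to use `--parsable` to read back the job ID are assumptions, not details from the PR.

```python
import subprocess
from typing import List, Optional


def build_sbatch_command(script_path: str, comment: Optional[str] = None) -> List[str]:
    """Build an sbatch command that does not propagate the caller's environment."""
    cmd = ["sbatch", "--export=NONE", "--parsable"]
    if comment:
        cmd.append(f"--comment={comment}")
    cmd.append(script_path)
    return cmd


def submit(script_path: str, comment: Optional[str] = None) -> str:
    """Submit the script and return the job ID printed by --parsable."""
    result = subprocess.run(
        build_sbatch_command(script_path, comment),
        check=True, capture_output=True, text=True,
    )
    # --parsable prints "jobid" or "jobid;cluster".
    return result.stdout.strip().split(";")[0]
```

With `--export=NONE`, the batch script has to set every variable it needs itself, since nothing is inherited from the submitting process.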

Ensure clean environment for all submitted jobs:
- submit.sh template (submits array job from prepare.sh)
- slurm submit CLI command (direct submission)

All batch scripts already define their required environment variables via export statements, so they are self-contained.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Slurm sets both SLURM_MEM_PER_NODE (from --mem=16G) and SLURM_MEM_PER_CPU (from cluster DefMemPerCPU), causing srun to fail. Unset these at the start of the script before running any srun commands. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Large transfers may need more time for manifest building and sharding. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The bot uses in-memory conversation history to track threads it has participated in. When the bot restarts, this history is lost and it stops responding to thread messages from other users asking for updates. This adds a fallback that fetches thread history via the Slack API when in-memory history is missing, reconstructing the conversation context so the Claude agent can still identify the relevant transfer job. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
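The fallback can be sketched around Slack's `conversations.replies` method, which returns the messages of a thread. In the example below, `client` is expected to behave like a `slack_sdk` `WebClient`; the function name, the `limit` default, and the role-mapping rule (messages from the bot's own user ID become assistant turns) are assumptions for illustration.

```python
from typing import Any, Dict, List


def rebuild_history(client: Any, channel: str, thread_ts: str,
                    bot_user_id: str, limit: int = 50) -> List[Dict[str, str]]:
    """Reconstruct conversation turns for a thread from the Slack API."""
    resp = client.conversations_replies(channel=channel, ts=thread_ts, limit=limit)
    history: List[Dict[str, str]] = []
    for msg in resp.get("messages", []):
        text = msg.get("text", "")
        if not text:
            continue  # skip empty messages (e.g. file-only posts)
        role = "assistant" if msg.get("user") == bot_user_id else "user"
        history.append({"role": role, "content": text})
    return history
```

A handler would call this only when no in-memory history exists for the thread, then cache the result as usual.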

Adds a new tool that allows users to inspect running/completed jobs:
- File size distribution histogram from analysis.json
- Suggested rclone flags determined during manifest analysis
- Tail of prepare job stdout/stderr logs
- Extracted rclone commands that were run

This helps users debug issues and understand what the transfer is doing without needing direct access to the cluster filesystem.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add table of all available bot tools
- Document read_job_logs capability for viewing histograms and logs
- Document thread-based conversation behavior
- Explain restart resilience with automatic context recovery
- Add troubleshooting entry for thread message issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Allows users to ask the bot what buckets are available at a given backend, making it easier to discover data sources before transfers. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Adds an early check using check_path_exists before creating the run directory or submitting to Slurm, so users get immediate feedback when the source path is invalid. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The read_job_logs tool previously only returned prepare job logs, missing the actual transfer errors in shard logs. Now reads failed shard state files and their attempt logs, and includes recent shard logs for non-failed jobs. Also clarifies tool response keys (prepare_stdout, prepare_stderr) and provides explicit notes when analysis or logs are missing so the agent can explain the situation to users. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Thread replies now go through a fast triage call before responding, so the bot stays silent when users are talking to each other rather than addressing the bot. @mentions and DMs bypass triage entirely. Skipped messages are still stored for conversation context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Record the Slack user ID in request.json when submitting transfer jobs and enforce ownership checks in cancel_job so only the submitting user can cancel their own jobs. Legacy jobs without a submitted_by field remain cancellable by anyone for backwards compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
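The ownership rule reduces to a small predicate over the contents of request.json. The `submitted_by` field name is from the commit message; the function name and the choice to pass the parsed JSON as a dict are assumptions for this sketch.

```python
from typing import Any, Dict


def can_cancel(request: Dict[str, Any], requesting_user: str) -> bool:
    """Return True if requesting_user may cancel the job described by request.

    `request` is the parsed content of the job's request.json.
    """
    owner = request.get("submitted_by")
    if owner is None:
        # Legacy job without an owner: anyone may cancel.
        return True
    return owner == requesting_user
```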

Split manifest generation into up to 4 concurrent rclone lsjson workers to reduce listing time for large datasets. Add `manifest combine` CLI command, prepare job phase tracking (progress.json), manifest build progress reporting, and bump prepare resources (250G mem, 4-day limit). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
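One way to run a bounded number of listing workers is a thread pool over a set of sub-paths. The sketch below is an assumption about the general shape, not the PR's code: how the source is split into `prefixes` is not described in the commit message, and the helper names are hypothetical. `rclone lsjson --recursive --files-only` is a real rclone invocation.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

MAX_WORKERS = 4  # "up to 4" per the commit message


def rclone_lsjson(path: str) -> List[Dict]:
    """List files under one remote path using rclone lsjson."""
    out = subprocess.run(
        ["rclone", "lsjson", "--recursive", "--files-only", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def build_manifest(prefixes: List[str],
                   lister: Callable[[str], List[Dict]] = rclone_lsjson) -> List[Dict]:
    """List each prefix concurrently and concatenate the results in order."""
    if not prefixes:
        return []
    workers = min(MAX_WORKERS, len(prefixes))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lister, prefixes))
    return [entry for part in parts for entry in part]
```

Note that `lsjson` reports each `Path` relative to the listed path, so a real combine step would need to re-add the prefix before merging the partial manifests.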

…er job resources

Fix triage model ID (claude-haiku-4-20250414 → claude-haiku-4-5-20251001) that was causing 404 errors and fail-open duplicate responses. Deduplicate @mention messages in the thread handler since handle_mention already processes them. Append S3-specific upload flags when destination is an S3 endpoint. Increase transfer job defaults to 4-day wall clock, 16 CPUs, and 100G memory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ripts

The slurm render command received sbatch_extras with literal \n separators but never converted them to actual newlines. This put all #SBATCH directives on one line, causing Slurm to misparse the --comment value. Transfer array jobs ended up with wrong/missing comments, so the bot's thread ownership check failed with "does not belong to this thread".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
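The fix amounts to turning the two-character sequence backslash + `n` into a real newline before the directives are written into the script. A minimal sketch, with a hypothetical helper name:

```python
from typing import List


def expand_sbatch_extras(extras: str) -> List[str]:
    """Split an sbatch_extras string into one #SBATCH directive per line.

    Handles both literal backslash-n separators and real newlines.
    """
    lines = extras.replace("\\n", "\n").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Each returned element can then be written on its own line of the rendered batch script, so Slurm parses `--comment` separately from the other directives.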

Derive transfer phase from sacct job states instead of relying purely on file existence in the run directory. Correlates prepare and transfer jobs by name convention ({name}-prepare / {name}) and enriches with file-based shard progress when available. Handles edge cases like shards finishing before the Slurm job exits, and completed jobs with partial shard failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
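A simplified version of this mapping is sketched below. The Slurm state names (PENDING, RUNNING, COMPLETED, and so on) are real `sacct` values; the phase labels, the function name, and the precedence rule (the transfer job's state wins once it exists) are assumptions, and the shard-level enrichment described above is left out.

```python
from typing import Optional


def _normalize(state: Optional[str]) -> Optional[str]:
    # sacct can report e.g. "CANCELLED by 1234"; keep only the first word.
    return state.split()[0].upper() if state else None


def derive_phase(prepare_state: Optional[str],
                 transfer_state: Optional[str]) -> str:
    """Map sacct states of the {name}-prepare and {name} jobs to a phase."""
    prepare, transfer = _normalize(prepare_state), _normalize(transfer_state)
    if transfer is not None:
        if transfer == "PENDING":
            return "transfer_pending"
        if transfer == "RUNNING":
            return "transferring"
        if transfer == "COMPLETED":
            return "completed"
        return "transfer_failed"
    if prepare is not None:
        if prepare == "PENDING":
            return "prepare_pending"
        if prepare == "RUNNING":
            return "preparing"
        if prepare == "COMPLETED":
            return "prepared"
        return "prepare_failed"
    return "unknown"
```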

fluidnumerics-joe and others added 25 commits February 4, 2026 14:58

Fix run_id format to be filename-safe

7ef4a9b

Remove colons from timestamp format to avoid 'Invalid argument' errors when creating log files. Changes from ISO format (2026-02-05T00:21:57Z) to compact format (20260205T002157Z). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Increase prepare job time limit to 2 days

091ff87

Large transfers may need more time for manifest building and sharding. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add list_buckets tool to enumerate buckets at endpoints

cb8c3c4

Allows users to ask the bot what buckets are available at a given backend, making it easier to discover data sources before transfers. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Document list_buckets tool in Slack bot setup guide

ce76a1d

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Verify source path exists before submitting transfer job

460a690

Adds an early check using check_path_exists before creating the run directory or submitting to Slurm, so users get immediate feedback when the source path is invalid. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reduce thread history limits to mitigate Slack rate limits

c7187b6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fluidnumerics-joe wants to merge 25 commits into main from feature/slack-claude
