Add Claude-powered Slack bot for data transfer requests #5
Open
fluidnumerics-joe wants to merge 25 commits into main from
Introduces a Slack bot that uses Claude to interpret user requests and manage data transfers via xfer/Slurm.
Features:
- Natural language transfer requests via Slack mentions
- Two-phase job submission (prepare → transfer array job)
- Job tracking via Slurm comments (slack:channel/thread)
- Progress reporting from state files
- Backend validation with support channel escalation
- Configurable defaults via environment variables
Components:
- src/xfer/slackbot/ - Bot implementation
  - app.py: Slack Bolt event handlers
  - claude_agent.py: Claude API with tool definitions
  - slurm_tools.py: Slurm/xfer operations
  - config.py: Configuration dataclasses
- docs/slack-app-setup.md - Setup guide
- examples/allowed_backends.yaml - Backend config example
- tests/test_slackbot_dryrun.py - Dry-run tests
Also adds --sbatch-extras to xfer run command for comment injection.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
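To illustrate the job-tracking idea, here is a minimal sketch of encoding and decoding a Slurm comment of the form slack:channel/thread. The function names `make_job_comment` and `parse_job_comment` are hypothetical and not taken from the PR; only the `slack:channel/thread` layout comes from the commit message.

```python
from typing import Optional, Tuple

COMMENT_PREFIX = "slack:"  # prefix taken from the commit message


def make_job_comment(channel: str, thread_ts: str) -> str:
    """Build the comment string attached to a Slurm job (hypothetical helper)."""
    return f"{COMMENT_PREFIX}{channel}/{thread_ts}"


def parse_job_comment(comment: str) -> Optional[Tuple[str, str]]:
    """Return (channel, thread_ts), or None if the comment is not in the expected form."""
    if not comment.startswith(COMMENT_PREFIX):
        return None
    channel, sep, thread_ts = comment[len(COMMENT_PREFIX):].partition("/")
    if not sep or not channel or not thread_ts:
        return None
    return channel, thread_ts
```

A bot could compare the parsed pair against the thread it is replying in before reporting on or cancelling a job.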
- Add automatic file size analysis to select optimal rclone flags based on file distribution (small files, large files, or mixed workloads)
- Add default --stats 600s --progress flags for ETA tracking in logs
- Add user-specified rclone flags support (appended to intelligent defaults)
- Add get_manifest_stats tool for previewing data before transfer
- Add xfer manifest analyze CLI command
- Update documentation with new capabilities
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update system prompt to instruct Claude to use Slack's mrkdwn format
- Add markdown_to_slack() converter as safety net for any standard markdown that slips through
- Apply converter to all bot responses before sending
Slack mrkdwn differs from standard markdown:
- Bold: *text* (not **text**)
- Links: <url|text> (not [text](url))
- No header support (converted to bold)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
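The three conversions listed in this commit can be sketched with regular expressions. This is not the PR's actual `markdown_to_slack()` implementation, only an assumed minimal version covering bold, links, and headers; it does not handle code blocks or nested formatting.

```python
import re


def markdown_to_slack(text: str) -> str:
    """Convert a few standard-markdown constructs to Slack mrkdwn (sketch only)."""
    # Links: [text](url) -> <url|text>
    text = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)", r"<\2|\1>", text)
    # Bold: **text** -> *text*
    text = re.sub(r"\*\*(.+?)\*\*", r"*\1*", text)
    # Headers: "## Title" -> "*Title*", since mrkdwn has no header syntax
    text = re.sub(r"^#{1,6}[ \t]+(.+?)[ \t]*$", r"*\1*", text, flags=re.MULTILINE)
    return text
```

Running the converter on every outgoing message is cheap, and text that is already valid mrkdwn passes through unchanged.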
New features:
- Add check_path_exists tool to verify bucket/path accessibility
- If path doesn't exist, automatically notifies support channel
- Detects common errors: NoSuchBucket, AccessDenied, NoSuchKey
Bug fixes:
- Add uv sync to prepare script to ensure environment is up to date
- Use 'uv run xfer' instead of bare 'xfer' commands
- Add XFER_INSTALL_DIR config option for xfer repo location
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
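As a rough illustration of the error detection, the sketch below scans captured stderr for the three error codes named in the commit. The function name `classify_path_error` and the explanatory messages are assumptions; the PR's check_path_exists tool may work differently.

```python
from typing import Optional

# Error codes named in the commit message; the descriptions are illustrative.
KNOWN_ERRORS = {
    "NoSuchBucket": "the bucket does not exist",
    "AccessDenied": "the credentials do not grant access",
    "NoSuchKey": "the path does not exist in the bucket",
}


def classify_path_error(stderr: str) -> Optional[str]:
    """Return the first known error code found in stderr, or None (hypothetical helper)."""
    for code in KNOWN_ERRORS:
        if code in stderr:
            return code
    return None
```

A caller could use the returned code to choose between replying to the user and notifying the support channel.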
- Add array_concurrency parameter to submit_transfer tool
- Cap maximum at 64 to prevent overloading storage systems
- Users can request lower values (e.g., 32, 16) for gentler transfers
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
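The cap can be pictured as a simple clamp. In this sketch the helper names are hypothetical, and falling back to the maximum when no value is requested is an assumption. The `%` suffix is standard sbatch `--array` syntax for limiting simultaneously running tasks, though whether xfer builds the array specification this way is not stated in the PR.

```python
from typing import Optional

MAX_ARRAY_CONCURRENCY = 64  # cap stated in the commit message


def clamp_array_concurrency(requested: Optional[int] = None) -> int:
    """Clamp a requested concurrency to the range 1..64 (defaulting to the cap is an assumption)."""
    if requested is None:
        return MAX_ARRAY_CONCURRENCY
    return max(1, min(int(requested), MAX_ARRAY_CONCURRENCY))


def array_spec(num_shards: int, requested: Optional[int] = None) -> str:
    """Build an sbatch --array value such as '0-99%32' (hypothetical helper)."""
    return f"0-{num_shards - 1}%{clamp_array_concurrency(requested)}"
```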
Add detailed diagnostics when srun/rclone fails:
- Log the full command that was executed
- Capture all SLURM environment variables
- Print SLURM memory vars to stderr for quick debugging
- Include full stdout and stderr in error log
This helps diagnose environment-related failures like SLURM_MEM_* variable conflicts.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove colons from the timestamp format to avoid 'Invalid argument' errors when creating log files. The format changes from ISO (2026-02-05T00:21:57Z) to compact (20260205T002157Z).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
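The compact format corresponds to the strftime pattern `%Y%m%dT%H%M%SZ`. A minimal sketch follows; the helper name `compact_timestamp` is hypothetical, and the input is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from typing import Optional


def compact_timestamp(now: Optional[datetime] = None) -> str:
    """Return a colon-free UTC timestamp that is safe to use in file names."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Convert to UTC so the trailing "Z" is accurate.
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
```

For the instant quoted in the commit message this yields 20260205T002157Z.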
Use sbatch --export=NONE to avoid inheriting SLURM_* environment variables from the bot's parent job. This fixes the SLURM_MEM_PER_CPU vs SLURM_MEM_PER_NODE conflict when the bot runs as a Slurm job.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
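A sketch of how a submission command with a clean environment might be assembled. `build_sbatch_command` is a hypothetical helper, not the PR's code; `--export=NONE` is the sbatch option named in the commit.

```python
from typing import List, Sequence


def build_sbatch_command(script_path: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Assemble an sbatch invocation that does not propagate the caller's environment."""
    # --export=NONE keeps SLURM_* variables of the bot's own job out of the new job.
    return ["sbatch", "--export=NONE", *extra_args, script_path]
```

The resulting list can be passed to `subprocess.run(cmd, check=True, capture_output=True, text=True)`. With this option the batch script has to export everything it needs itself.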
Ensure clean environment for all submitted jobs:
- submit.sh template (submits array job from prepare.sh)
- slurm submit CLI command (direct submission)
All batch scripts already define their required environment variables via export statements, so they are self-contained.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Slurm sets both SLURM_MEM_PER_NODE (from --mem=16G) and SLURM_MEM_PER_CPU (from cluster DefMemPerCPU), causing srun to fail. Unset these at the start of the script before running any srun commands.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Large transfers may need more time for manifest building and sharding.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The bot uses in-memory conversation history to track threads it has participated in. When the bot restarts, this history is lost and it stops responding to thread messages from other users asking for updates. This adds a fallback that fetches thread history via the Slack API when in-memory history is missing, reconstructing the conversation context so the Claude agent can still identify the relevant transfer job.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a new tool that allows users to inspect running/completed jobs:
- File size distribution histogram from analysis.json
- Suggested rclone flags determined during manifest analysis
- Tail of prepare job stdout/stderr logs
- Extracted rclone commands that were run
This helps users debug issues and understand what the transfer is doing without needing direct access to the cluster filesystem.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add table of all available bot tools
- Document read_job_logs capability for viewing histograms and logs
- Document thread-based conversation behavior
- Explain restart resilience with automatic context recovery
- Add troubleshooting entry for thread message issues
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Allows users to ask the bot what buckets are available at a given backend, making it easier to discover data sources before transfers.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds an early check using check_path_exists before creating the run directory or submitting to Slurm, so users get immediate feedback when the source path is invalid.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The read_job_logs tool previously only returned prepare job logs, missing the actual transfer errors in shard logs. Now reads failed shard state files and their attempt logs, and includes recent shard logs for non-failed jobs. Also clarifies tool response keys (prepare_stdout, prepare_stderr) and provides explicit notes when analysis or logs are missing so the agent can explain the situation to users.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thread replies now go through a fast triage call before responding, so the bot stays silent when users are talking to each other rather than addressing the bot. @mentions and DMs bypass triage entirely. Skipped messages are still stored for conversation context.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Record the Slack user ID in request.json when submitting transfer jobs and enforce ownership checks in cancel_job so only the submitting user can cancel their own jobs. Legacy jobs without a submitted_by field remain cancellable by anyone for backwards compatibility.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
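The ownership rule can be expressed as a small predicate over the parsed request.json contents. This is a sketch under the assumption that the file is loaded into a dict; `can_cancel` is a hypothetical name, while the `submitted_by` field comes from the commit message.

```python
from typing import Any, Mapping


def can_cancel(request: Mapping[str, Any], requesting_user: str) -> bool:
    """Decide whether a Slack user may cancel the job described by request.json."""
    owner = request.get("submitted_by")
    if owner is None:
        # Legacy job recorded before ownership tracking: anyone may cancel.
        return True
    return owner == requesting_user
```

A caller would typically load the dict with `json.loads(path.read_text())` and refuse the cancellation with an explanatory message when the predicate returns False.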
Split manifest generation into up to 4 concurrent rclone lsjson workers to reduce listing time for large datasets. Add `manifest combine` CLI command, prepare job phase tracking (progress.json), manifest build progress reporting, and bump prepare resources (250G mem, 4-day limit).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…er job resources
Fix triage model ID (claude-haiku-4-20250414 → claude-haiku-4-5-20251001) that was causing 404 errors and fail-open duplicate responses. Deduplicate @mention messages in the thread handler since handle_mention already processes them. Append S3-specific upload flags when destination is an S3 endpoint. Increase transfer job defaults to 4-day wall clock, 16 CPUs, and 100G memory.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ripts
The slurm render command received sbatch_extras with literal \n separators but never converted them to actual newlines. This put all #SBATCH directives on one line, causing Slurm to misparse the --comment value. Transfer array jobs ended up with wrong/missing comments, so the bot's thread ownership check failed with "does not belong to this thread".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
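The fix amounts to turning the two-character sequence backslash + n into a real newline before the directives are written into the script. A minimal sketch, with `expand_sbatch_extras` as a hypothetical helper name:

```python
def expand_sbatch_extras(sbatch_extras: str) -> str:
    """Replace literal '\\n' separators with real newlines so each #SBATCH directive gets its own line."""
    return sbatch_extras.replace("\\n", "\n")
```

Without the conversion, both directives in a value such as `#SBATCH --comment=slack:C1/1.2\n#SBATCH --mem=16G` would land on a single line of the rendered script.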
Derive transfer phase from sacct job states instead of relying purely on
file existence in the run directory. Correlates prepare and transfer jobs
by name convention ({name}-prepare / {name}) and enriches with file-based
shard progress when available. Handles edge cases like shards finishing
before the Slurm job exits, and completed jobs with partial shard failures.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
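One way to picture the state-based derivation is sketched below. The phase names, the function `derive_phase`, and its parameters are all hypothetical; only the idea of combining sacct job states with file-based shard counts comes from the commit message. The state strings are standard Slurm job states.

```python
from typing import Optional

# Slurm job states that indicate an unsuccessful end.
FAILED_STATES = {"FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}


def _base_state(state: Optional[str]) -> Optional[str]:
    # sacct can report values like "CANCELLED by 1234"; keep only the first word.
    return state.split()[0] if state and state.strip() else None


def derive_phase(prepare_state: Optional[str], transfer_state: Optional[str],
                 shards_done: int = 0, shards_failed: int = 0, shards_total: int = 0) -> str:
    """Combine Slurm states for the {name}-prepare and {name} jobs with shard counts."""
    prepare, transfer = _base_state(prepare_state), _base_state(transfer_state)
    if transfer is not None:
        # Shards may all finish before the Slurm job itself exits.
        all_shards_finished = shards_total > 0 and shards_done + shards_failed >= shards_total
        if all_shards_finished or transfer == "COMPLETED":
            return "completed_with_failures" if shards_failed else "completed"
        if transfer in FAILED_STATES:
            return "failed"
        return "transfer_pending" if transfer == "PENDING" else "transferring"
    if prepare is not None:
        if prepare in FAILED_STATES:
            return "prepare_failed"
        if prepare == "COMPLETED":
            return "prepared"
        return "prepare_pending" if prepare == "PENDING" else "preparing"
    return "unknown"
```

Checking the transfer job first means a finished prepare job never masks the later phase, and the shard counts cover the two edge cases the commit mentions.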
Summary
New Components
- src/xfer/slackbot/
- docs/slack-app-setup.md
- examples/allowed_backends.yaml
- tests/test_slackbot_dryrun.py
Features
- @xfer-bot
Test plan
- … (python tests/test_slackbot_dryrun.py)
🤖 Generated with Claude Code