Open
25 commits
37bea61
Add Claude-powered Slack bot for data transfer requests
fluidnumerics-joe Feb 4, 2026
612e476
Add intelligent rclone flag selection and manifest stats
fluidnumerics-joe Feb 4, 2026
55446de
Fix markdown rendering in Slack responses
fluidnumerics-joe Feb 4, 2026
1c01ddb
Add check_path_exists tool and fix uv environment setup
fluidnumerics-joe Feb 4, 2026
b8f4335
Allow users to set lower array concurrency (max 64)
fluidnumerics-joe Feb 4, 2026
82ef039
Improve error logging for manifest build failures
fluidnumerics-joe Feb 5, 2026
7ef4a9b
Fix run_id format to be filename-safe
fluidnumerics-joe Feb 5, 2026
d3e7582
Prevent environment propagation when submitting prepare job
fluidnumerics-joe Feb 5, 2026
94214d0
Add --export=NONE to all sbatch calls
fluidnumerics-joe Feb 5, 2026
36546a5
Unset conflicting SLURM_MEM_* variables in prepare.sh
fluidnumerics-joe Feb 5, 2026
091ff87
Increase prepare job time limit to 2 days
fluidnumerics-joe Feb 5, 2026
5d0928a
Restore thread context from Slack API after bot restart
fluidnumerics-joe Feb 5, 2026
ef6687b
Add read_job_logs tool to access job analysis and logs
fluidnumerics-joe Feb 5, 2026
b9e76d7
Update Slack bot documentation with new features
fluidnumerics-joe Feb 5, 2026
cb8c3c4
Add list_buckets tool to enumerate buckets at endpoints
fluidnumerics-joe Feb 5, 2026
ce76a1d
Document list_buckets tool in Slack bot setup guide
fluidnumerics-joe Feb 5, 2026
460a690
Verify source path exists before submitting transfer job
fluidnumerics-joe Feb 7, 2026
c7187b6
Reduce thread history limits to mitigate Slack rate limits
fluidnumerics-joe Feb 7, 2026
b92f071
Read shard logs and improve log diagnostics in read_job_logs tool
fluidnumerics-joe Feb 7, 2026
9da7243
Add lightweight Haiku triage to filter thread messages
fluidnumerics-joe Feb 7, 2026
ef742db
Add per-user job ownership for cancel authorization
fluidnumerics-joe Feb 7, 2026
1fc7976
Parallelize manifest build and add progress tracking
fluidnumerics-joe Feb 7, 2026
9813bd8
Fix duplicate bot responses, add S3 rclone flags, and increase transf…
fluidnumerics-joe Feb 7, 2026
d994cd3
Fix sbatch_extras literal \n not converted to newlines in rendered sc…
fluidnumerics-joe Feb 7, 2026
4da5042
Add Slurm job state-based phase detection for transfer status
fluidnumerics-joe Feb 8, 2026
39 changes: 39 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,39 @@
# Changelog

## Unreleased (feature/slack-claude)

### Parallel Manifest Build

- Parallelize manifest generation across up to 4 concurrent `rclone lsjson` workers, reducing listing time for large datasets with many top-level subdirectories (`slurm_tools.py`)
- Add `manifest combine` CLI command to merge parallel lsjson part files (with `.prefix` sidecars) into a unified `manifest.jsonl` (`cli.py`)
- Bump prepare job memory from 16 GB to 250 GB to accommodate large listings (`slurm_tools.py`)
- Add `--max-backlog=1000000` to `rclone lsjson` calls to prevent the walker from stalling on large buckets (`cli.py`, `slurm_tools.py`)
- Report manifest build progress (files listed, bytes listed) via `manifest.jsonl.progress` sidecar file (`cli.py`)
- Track prepare job phases (`listing_source`, `combining_manifest`, `analyzing`, `sharding`, `rendering`, `submitting`) in `progress.json` (`slurm_tools.py`)
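
To make the combine step above concrete, here is a minimal sketch of merging part files into one manifest. It is not the `manifest combine` implementation from `cli.py`. It assumes each part file holds one JSON object per line with an rclone-style `Path` field, and that a sidecar named `<part>.prefix` holds the directory prefix to prepend. The function name `combine_parts` and the file layout are hypothetical.

```python
import json
from pathlib import Path


def combine_parts(parts_dir: Path, out_path: Path) -> int:
    """Merge part files into one JSONL manifest and return the entry count.

    Assumptions (not taken from the real tool): parts are named *.jsonl,
    each line is one JSON object, and an optional "<part>.prefix" sidecar
    contains the prefix that was stripped when the part was listed.
    out_path should live outside parts_dir so it is not picked up as a part.
    """
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for part in sorted(parts_dir.glob("*.jsonl")):
            sidecar = part.with_name(part.name + ".prefix")
            prefix = ""
            if sidecar.exists():
                prefix = sidecar.read_text(encoding="utf-8").strip()
            with part.open(encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue  # tolerate blank lines between entries
                    entry = json.loads(line)
                    if prefix:
                        # Restore the full path relative to the listing root
                        entry["Path"] = f"{prefix.rstrip('/')}/{entry['Path']}"
                    out.write(json.dumps(entry) + "\n")
                    count += 1
    return count
```

Sorting the part files keeps the merged manifest deterministic regardless of which worker finished first.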

### Claude-Powered Slack Bot

- Add Claude-powered Slack bot for interactive data transfer requests via Slack threads
- Add intelligent rclone flag selection based on file size distribution analysis
- Add `check_path_exists` tool to validate source paths before submitting jobs
- Add `list_buckets` tool to enumerate buckets at remote endpoints
- Add `read_job_logs` tool to access job analysis data, prepare logs, and shard transfer logs
- Add lightweight Haiku triage to filter thread messages and skip unrelated chatter
- Add per-user job ownership so only the submitting user can cancel their jobs
- Restore thread context from Slack API after bot restarts
- Report manifest listing progress (`files_listed`, `bytes_listed`) in job status during `building_manifest` phase
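
The per-user ownership rule above could be modelled roughly as follows. This is an illustrative sketch only: the `JobRegistry` class, its method names, and the in-memory storage are assumptions, and the bot may well persist ownership differently.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class JobRegistry:
    """Hypothetical in-memory map from job ID to the submitting Slack user."""

    owners: Dict[str, str] = field(default_factory=dict)

    def record(self, job_id: str, user_id: str) -> None:
        """Remember who submitted a job."""
        self.owners[job_id] = user_id

    def can_cancel(self, job_id: str, user_id: str) -> bool:
        """Only the recorded submitter may cancel; unknown jobs are refused."""
        owner = self.owners.get(job_id)
        return owner is not None and owner == user_id
```

Refusing cancellation for unknown job IDs is a deliberately conservative choice in this sketch: a job submitted outside the bot has no recorded owner, so nobody can cancel it through the bot.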

### Slurm Robustness

- Increase prepare job time limit to 4 days for very large datasets
- Unset conflicting `SLURM_MEM_*` environment variables in prepare.sh
- Add `--export=NONE` to all `sbatch` calls to prevent environment leakage
- Allow users to set lower array concurrency (max 64)
- Verify source path exists before creating run directory and submitting jobs
- Reduce thread history limits to mitigate Slack rate limits
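
As a rough illustration of the two environment-related items above, the sketch below builds an `sbatch` command line with `--export=NONE` and filters `SLURM_MEM_*` variables out of an environment mapping. The changelog says the actual unsetting happens in `prepare.sh`, so the Python filter is only an analogue, and both function names are hypothetical.

```python
from typing import Dict, List, Mapping, Optional


def build_sbatch_command(script: str, extra_args: Optional[List[str]] = None) -> List[str]:
    """Return an sbatch argv that does not export the caller's environment."""
    return ["sbatch", "--export=NONE", *(extra_args or []), script]


def scrub_slurm_mem_vars(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of env without any SLURM_MEM_* variables.

    Inherited memory settings such as SLURM_MEM_PER_NODE can conflict
    with the memory request of a job submitted from inside another job.
    """
    return {k: v for k, v in env.items() if not k.startswith("SLURM_MEM_")}
```

A caller could pass the result to `subprocess.run(build_sbatch_command("run/sbatch_array.sh"), env=scrub_slurm_mem_vars(os.environ), check=True)`.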

### Bug Fixes

- Fix `run_id` format to be filename-safe (no colons)
- Fix markdown rendering in Slack responses
- Improve error logging for manifest build failures (write to `xfer-err/` with full context)
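
For context on the `run_id` fix, one common way to get a filename-safe identifier is a compact UTC timestamp with no colons. The format below is an assumption chosen for illustration, not necessarily the one the tool uses, and `make_run_id` is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import Optional


def make_run_id(now: Optional[datetime] = None) -> str:
    """Return a compact UTC timestamp such as 20260205T143000Z.

    The string contains only digits, 'T' and 'Z', so it is safe to use
    in file and directory names on common filesystems.
    """
    # Normalize to UTC so the trailing 'Z' is accurate
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return moment.strftime("%Y%m%dT%H%M%SZ")
```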
65 changes: 58 additions & 7 deletions README.md
@@ -150,7 +150,35 @@ run/

---

### 2. Shard the manifest
### 2. Analyze file size distribution (optional)

Analyze the manifest to see file size statistics and get suggested rclone flags:

```bash
uv run xfer manifest analyze \
--in run/manifest.jsonl \
--out run/analysis.json
```

Output (`analysis.json`):
```json
{
"status": "ok",
"total_files": 125000,
"total_bytes": 5368709120,
"total_bytes_human": "5.00 GiB",
"profile": "small_files",
"profile_explanation": "Optimized for small files (82% < 1MB, median 256 KiB)",
"suggested_flags": "--transfers 64 --checkers 128 --fast-list --retries 10 ...",
"histogram": [...]
}
```

The Slack bot runs this analysis automatically on every transfer to select optimal flags.
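
As a rough illustration of how a profile could be chosen from the size distribution, the sketch below applies the thresholds this README lists for the Slack bot's auto-selection (more than 70% of files under 1 MB, more than 50% of files over 100 MB, otherwise mixed). It is not the tool's code: the function name `choose_profile`, the profile names `large_files` and `mixed`, and the use of decimal megabytes are all assumptions.

```python
from typing import Iterable, Tuple

MB = 1_000_000  # assumption: decimal megabytes; the real tool may use MiB

# Flags copied from the README's auto-selection table; keys other than
# "small_files" are assumed names.
PROFILE_FLAGS = {
    "small_files": "--transfers 64 --checkers 128 --fast-list",
    "large_files": "--transfers 16 --checkers 32 --buffer-size 256M",
    "mixed": "--transfers 32 --checkers 64 --fast-list",
}


def choose_profile(sizes: Iterable[int]) -> Tuple[str, str]:
    """Return (profile, flags) for a collection of file sizes in bytes."""
    sizes = list(sizes)
    if not sizes:
        # Nothing to analyze: fall back to the default profile
        return "mixed", PROFILE_FLAGS["mixed"]
    small_share = sum(1 for s in sizes if s < 1 * MB) / len(sizes)
    large_share = sum(1 for s in sizes if s > 100 * MB) / len(sizes)
    if small_share > 0.70:
        profile = "small_files"
    elif large_share > 0.50:
        profile = "large_files"
    else:
        profile = "mixed"
    return profile, PROFILE_FLAGS[profile]
```

The small-file check runs first in this sketch. The two conditions cannot both hold for the same dataset, so the order only matters for readability.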

---

### 3. Shard the manifest

Splits the manifest into balanced shards (by total bytes):

@@ -174,7 +202,7 @@ run/

---

### 3. Render Slurm scripts
### 4. Render Slurm scripts

Creates:

@@ -199,7 +227,7 @@ uv run xfer slurm render \

---

### 4. Submit the job
### 5. Submit the job

```bash
uv run xfer slurm submit --run-dir run
@@ -264,7 +292,21 @@ sbatch run/sbatch_array.sh

## Recommended rclone flags (starting points)

### High-throughput S3↔S3
### Intelligent Auto-Selection (Slack Bot)

When using the Slack bot, rclone flags are **automatically selected** based on file size distribution analysis:

| Profile | Condition | Flags |
|---------|-----------|-------|
| **Small files** | >70% files < 1MB | `--transfers 64 --checkers 128 --fast-list` |
| **Large files** | >50% files > 100MB | `--transfers 16 --checkers 32 --buffer-size 256M` |
| **Mixed** | Default | `--transfers 32 --checkers 64 --fast-list` |

All profiles include `--retries 10 --low-level-retries 20 --stats 600s --progress` for reliability and logging.

### Manual Flag Selection (CLI)

#### High-throughput S3↔S3

```text
--transfers 32
@@ -275,15 +317,23 @@ sbatch run/sbatch_array.sh
--stats 30s
```

### Small objects (metadata heavy)
#### Small objects (metadata heavy)

```text
--transfers 16
--transfers 64
--checkers 128
--fast-list
```

### Track progress
#### Large objects

```text
--transfers 16
--checkers 32
--buffer-size 256M
```

#### Track progress (always recommended)

```text
--progress --stats 600s
@@ -296,6 +346,7 @@ sbatch run/sbatch_array.sh
```
run/
manifest.jsonl
analysis.json # File size analysis and suggested flags
shards/
shard_000123.jsonl
shards.meta.json