Plan by Claude:
Plan: Make braindump work seamlessly inside existing repos
Context
Braindump currently operates as a standalone tool: data lives inside its own data/ directory, config is CLI-only, and generated AGENTS.md files go into 6-generate/ rather than the target repo. To make it a tool you install and run inside your repo (like a linter), we need .braindump/ directory support, auto-detection, config files, AGENTS.md merging, and a GitHub Actions story.
Changes
1. .braindump/ directory layout in target repos
```
.braindump/
  config.toml            # Repo-specific settings (committed)
  rule_overrides.jsonl   # Manual rule corrections (committed)
  data/                  # Pipeline stage data (gitignored)
    1-download/
    2-extract/
    3-synthesize/
    4-place/
    5-group/
    6-generate/          # Staging area, NOT final output
  last_run.json          # Tracks last successful run timestamp (gitignored)
```
2. Config file: .braindump/config.toml
```toml
[braindump]
# Defaults applied when CLI args are not provided
authors = "all"       # or ["user1", "user2"]
since = "2025-01-01"  # initial start date; auto-advances after each run
min_score = 0.5       # group-stage threshold (place stays at 0.3)
max_rules = 100       # cap on total rules

[output]
# Where to write AGENTS.md files, relative to the repo root
root = "."               # default: repo root
filename = "AGENTS.md"   # could be CLAUDE.md, CURSOR.md, etc.
```

Keep it minimal. CLI args override the config file, which overrides built-in defaults.
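The precedence rule can be illustrated with plain dict merging (a sketch; keys and values here are made up, and the real merge lives in the `BraindumpConfig.merge_cli` method described in section 5):

```python
# Sketch of the precedence: built-in defaults < config.toml < CLI args.
defaults = {"min_score": 0.5, "max_rules": 100, "authors": "all"}
file_cfg = {"min_score": 0.7}                  # from .braindump/config.toml
cli_args = {"max_rules": 50, "authors": None}  # None = flag not passed

effective = {**defaults, **file_cfg,
             **{k: v for k, v in cli_args.items() if v is not None}}
# effective == {"min_score": 0.7, "max_rules": 50, "authors": "all"}
```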
3. Auto-detect repo (no --repo needed)
When --repo is not provided:
- Check whether the CWD is inside a git repo (`git rev-parse --show-toplevel`)
- Parse the origin remote (`git remote get-url origin`) to extract `owner/repo`
- Look for `.braindump/` in the repo root
- Set the data dir to `.braindump/data/` and the output to the repo root

Two modes based on detection:
- Repo mode (CWD is in a git repo): data in `.braindump/data/`, output goes to the repo root, config is loaded from `.braindump/config.toml`. If `.braindump/` doesn't exist yet, auto-create it after a confirmation prompt.
- Standalone mode (`--repo` provided explicitly, or not in a git repo): current behavior, data in braindump's own `data/` dir
4. braindump init command
Creates the .braindump/ directory in the current repo:
- Writes a starter `config.toml` with commented defaults
- Creates an empty `rule_overrides.jsonl`
- Creates the `data/` dir
- Appends `.braindump/data/` and `.braindump/last_run.json` to `.gitignore`
- Prints next-steps instructions
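A minimal sketch of that scaffolding (the function name and starter-file contents are assumptions for illustration, not the actual implementation):

```python
from pathlib import Path

def init_braindump(repo_root: Path) -> None:
    """Scaffold .braindump/ as described above (illustrative sketch)."""
    bd = repo_root / ".braindump"
    (bd / "data").mkdir(parents=True, exist_ok=True)
    config = bd / "config.toml"
    if not config.exists():
        config.write_text('# [braindump]\n# since = "2025-01-01"\n')
    (bd / "rule_overrides.jsonl").touch()
    # Append gitignore entries only if they are not already present
    gitignore = repo_root / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    for entry in (".braindump/data/", ".braindump/last_run.json"):
        if entry not in lines:
            lines.append(entry)
    gitignore.write_text("\n".join(lines) + "\n")
```

Because every step checks for existing files first, re-running it (or calling it from auto-init) is safe.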
5. RepoConfig refactor
File: src/braindump/config.py
Add repo_root parameter to RepoConfig:
```python
from pathlib import Path


class RepoConfig:
    def __init__(self, repo: str, repo_root: Path | None = None) -> None:
        self.repo = repo
        self.repo_root = repo_root  # None = standalone mode
        if repo_root:
            # Repo mode: data lives inside .braindump/
            self.braindump_dir = repo_root / ".braindump"
            self.data_dir = self.braindump_dir / "data"
        else:
            # Standalone mode: data lives in the braindump project
            self._project_root = Path(__file__).resolve().parent.parent.parent
            self.data_dir = self._project_root / "data" / repo

    @property
    def overrides_path(self) -> Path:
        if self.repo_root:
            return self.braindump_dir / "rule_overrides.jsonl"
        return self.data_dir / "rule_overrides.jsonl"

    @property
    def output_root(self) -> Path:
        """Where AGENTS.md files are written in the actual repo."""
        if self.repo_root:
            return self.repo_root  # directly into the repo
        return self.stage_dir("6-generate")  # staging area

    @property
    def config_path(self) -> Path | None:
        if self.repo_root:
            return self.braindump_dir / "config.toml"
        return None
```

Add a BraindumpConfig dataclass to load/merge config.toml with CLI args:
```python
@dataclass
class BraindumpConfig:
    authors: str = "all"
    since: str | None = None
    min_score: float | None = None
    max_rules: int | None = None
    output_root: str = "."
    output_filename: str = "AGENTS.md"

    @classmethod
    def from_toml(cls, path: Path) -> BraindumpConfig: ...

    def merge_cli(self, **cli_args) -> BraindumpConfig:
        """CLI args override config file values (if not None/default)."""
```

6. CLI changes (cli.py)
Update the main callback:
```python
@app.callback()
def main(ctx, repo: str = ...) -> None:
    if not repo:
        # Try auto-detect from git
        detected = detect_repo_from_git()
        if detected:
            repo, repo_root = detected
            config = RepoConfig(repo, repo_root=repo_root)
        else:
            repo = typer.prompt("GitHub repository (owner/repo)")
            config = RepoConfig(repo)
    else:
        config = RepoConfig(repo)

    # Load .braindump/config.toml if it exists
    braindump_config = BraindumpConfig()
    if config.config_path and config.config_path.exists():
        braindump_config = BraindumpConfig.from_toml(config.config_path)

    ctx.obj["config"] = config
    ctx.obj["braindump_config"] = braindump_config
```

Add detect_repo_from_git():
```python
import re
import subprocess


def detect_repo_from_git() -> tuple[str, Path] | None:
    """Detect owner/repo from the git remote; return (repo, repo_root) or None."""
    try:
        # git rev-parse --show-toplevel → repo_root
        top = subprocess.check_output(
            ["git", "rev-parse", "--show-toplevel"], text=True).strip()
        # git remote get-url origin → parse owner/repo from the URL
        url = subprocess.check_output(
            ["git", "remote", "get-url", "origin"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    # Support both SSH (git@github.com:owner/repo.git) and HTTPS formats
    m = re.search(r"github\.com[:/]([^/]+?)/([^/]+?)(?:\.git)?/?$", url)
    return (f"{m.group(1)}/{m.group(2)}", Path(top)) if m else None
```

7. AGENTS.md merge logic
File: src/braindump/commands/generate.py
When writing to an actual repo (repo mode), merge with existing files:
```python
def merge_with_existing(existing_content: str, braindump_content: str) -> str:
    """Replace content between <!-- braindump --> and <!-- /braindump --> markers."""
    START = "<!-- braindump:"  # matches "<!-- braindump: rules extracted..."
    END = "<!-- /braindump -->"
    start_idx = existing_content.find(START)
    end_idx = existing_content.find(END)
    if start_idx != -1 and end_idx != -1:
        # Replace the existing braindump section
        before = existing_content[:start_idx]
        after = existing_content[end_idx + len(END):]
        return before + braindump_content + after
    # No existing section: append
    return existing_content.rstrip() + "\n\n" + braindump_content + "\n"
```

Update write_files() to use the merge when repo_root is set:
- If target file exists and has fencing markers → replace fenced section
- If target file exists without markers → append with blank line
- If target file doesn't exist → write directly
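To make the marker behavior concrete, here is a self-contained re-statement of the helper with the replace and append cases exercised (illustrative; the file contents are made up, and the generated section is assumed to carry both markers itself):

```python
def merge_with_existing(existing: str, section: str) -> str:
    # Mirrors the helper above: replace between markers, else append.
    START, END = "<!-- braindump:", "<!-- /braindump -->"
    i, j = existing.find(START), existing.find(END)
    if i != -1 and j != -1:
        return existing[:i] + section + existing[j + len(END):]
    return existing.rstrip() + "\n\n" + section + "\n"

# The generated section carries both markers itself:
section = ("<!-- braindump: rules extracted from PR reviews -->\n"
           "- new rule\n<!-- /braindump -->")

# Case 1: existing file with markers → only the fenced part is replaced
old = ("# Manual notes\n\n"
       "<!-- braindump: rules extracted from PR reviews -->\n"
       "- old rule\n<!-- /braindump -->\n")
merged = merge_with_existing(old, section)

# Case 2: existing file without markers → section is appended
appended = merge_with_existing("# Manual notes\n", section)
```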
8. Auto --since tracking
File: .braindump/last_run.json
```json
{"last_download": "2025-06-15T10:30:00Z", "last_since": "2025-06-01"}
```

In the run command, when --since is not provided:
- Check last_run.json for a `last_since` date
- If found, use it as the `--since` value
- After a successful download, update `last_since` to today
- Fall back to the config.toml `since` value, then to prompting
9. GitHub Actions workflow
Document a reference workflow in the README (not auto-generated). Key design:
- Uses `actions/cache` to persist `.braindump/data/` between runs
- Creates a PR instead of pushing directly (reviewable)
- `GH_TOKEN` from Actions provides `gh` CLI auth automatically
- Triggered weekly + manual dispatch
10. Output filename configurability
Support AGENTS.md, CLAUDE.md, CURSOR.md, etc. via config:
```toml
[output]
filename = "AGENTS.md"
```

This makes the fencing markers essential: they identify braindump's section regardless of what else is in the file.
11. Incremental processing strategy
Download + Extract are already incremental:
- Download skips PRs where all 5 files exist
- Extract uses checkpoint system to skip processed comment IDs, appends new extractions
Synthesize+ re-runs fully but uses caches:
- When `extractions.jsonl` changes (new extractions appended), the input hash changes → full re-cluster
- The embedding cache (`embeddings_keys.json`) avoids re-computing embeddings when generalization texts match
- The rephrase cache in the generate stage avoids re-calling the LLM for unchanged rules
- This is acceptable for weekly CI runs: the main cost is LLM calls for cluster synthesis, which is bounded by the cluster count
For GHA, actions/cache persists .braindump/data/ between runs so:
- Download: only fetches new PRs since last run
- Extract: only processes new comments
- Synthesize+: re-runs but benefits from embedding caches
- Generate: rephrase cache avoids redundant LLM calls for stable rules
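The caches described above all follow the same shape: key by a content hash, recompute only on a miss. A minimal illustration (the hash scheme and signatures are assumptions, not braindump's actual cache code):

```python
import hashlib

_cache: dict[str, list[float]] = {}

def cached_embedding(text: str, compute) -> list[float]:
    """Hash-keyed cache: an unchanged text is never re-embedded,
    so stable rules cost nothing on weekly re-runs."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = compute(text)
    return _cache[key]
```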
12. Provenance index for lookup
File: .braindump/provenance.json (committed to repo)
Generated during the generate stage, alongside AGENTS.md files:
```json
{
  "repo": "owner/repo",
  "generated_at": "2025-06-15T10:30:00Z",
  "rules": {
    "42": {
      "text": "Use keyword-only args for public APIs",
      "category": "api_design",
      "score": 0.85,
      "source_prs": [123, 456],
      "source_comments": [
        {"id": 789, "pr": 123, "author": "reviewer1", "url": "https://github.com/owner/repo/pull/123#discussion_r789"},
        {"id": 101, "pr": 456, "author": "reviewer2", "url": "https://github.com/owner/repo/pull/456#discussion_r101"}
      ]
    }
  }
}
```

This index is small (a few KB) and lets:
- `braindump lookup` work without full pipeline data: it shows PR links + comment URLs
- Contributors click through to the original review comment on GitHub
- Lookup falls back gracefully: full verbose mode when pipeline data is available, URL-only mode when only provenance.json exists
Lookup two-mode behavior:
- Full mode (pipeline data available): current behavior — shows comment body, diff hunk, etc.
- Index mode (only provenance.json): shows rule text, score, PR numbers, clickable GitHub URLs
To build the URLs, we need comment IDs + PR numbers (already on rule objects as source_comments and source_prs). We construct https://github.com/{owner}/{repo}/pull/{pr}#discussion_r{comment_id}. Author info can be extracted from the downloaded comment data during generate.
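The URL construction is plain string formatting; a sketch, with the expected shape checked against the provenance example above:

```python
def comment_url(repo: str, pr: int, comment_id: int) -> str:
    """Build the GitHub review-comment permalink stored in provenance.json."""
    return f"https://github.com/{repo}/pull/{pr}#discussion_r{comment_id}"
```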
Files to modify
| File | Change |
|---|---|
| `src/braindump/config.py` | Add `repo_root`, `BraindumpConfig`, `detect_repo_from_git()`, `provenance_path` |
| `src/braindump/cli.py` | Auto-detect repo, load config.toml, auto-init, register `init` command |
| `src/braindump/commands/init.py` | New: `braindump init` command |
| `src/braindump/commands/generate.py` | Add merge logic, write to repo root in repo mode, generate provenance.json |
| `src/braindump/commands/run.py` | Auto-`since` tracking, pass `braindump_config` |
| `src/braindump/commands/download.py` | Accept `since` from auto-tracking |
| `src/braindump/commands/lookup.py` | Two modes: full (pipeline data) vs index (provenance.json only) |
| `pyproject.toml` | Add `tomli` dependency (for TOML parsing on Python < 3.11) |
Implementation order
1. RepoConfig refactor: add `repo_root` support, keep backward compat
2. BraindumpConfig: TOML loading + CLI merge
3. Auto-detect repo: git remote parsing in cli.py, auto-init with confirmation
4. `braindump init`: scaffold the `.braindump/` directory (also called by auto-init)
5. AGENTS.md merging: fenced section replacement in generate.py
6. Provenance index: generate provenance.json during the generate stage, update lookup for two-mode behavior
7. Auto-since tracking: last_run.json read/write
8. Output filename config: support CLAUDE.md etc.
9. README: document the GitHub Actions workflow
Verification
- Run `braindump init` in an existing repo → creates `.braindump/` with the correct structure
- Run `braindump status` without `--repo` → auto-detects from the git remote
- Run `braindump run` → the pipeline writes AGENTS.md directly into the repo root
- Add manual content above `<!-- braindump -->` in AGENTS.md → re-run → manual content preserved
- Run again → `--since` auto-advances, only new PRs processed
- Check `.gitignore` → `.braindump/data/` and `last_run.json` are excluded
- `braindump lookup 42` with only `.braindump/provenance.json` → shows rule text + GitHub URLs
- `braindump lookup 42 -v` with full pipeline data → shows full comment bodies
- `--repo owner/repo` still works in standalone mode (backward compat)