Make work inside repo #2

@DouweM

Description

Plan by Claude:

Plan: Make braindump work seamlessly inside existing repos

Context

Braindump currently operates as a standalone tool: data lives inside its own data/ directory, config is CLI-only, and generated AGENTS.md files go into 6-generate/ rather than the target repo. To make it a tool you install and run inside your repo (like a linter), we need .braindump/ directory support, auto-detection, config files, AGENTS.md merging, and a GitHub Actions story.

Changes

1. .braindump/ directory layout in target repos

.braindump/
  config.toml            # Repo-specific settings (committed)
  rule_overrides.jsonl    # Manual rule corrections (committed)
  data/                   # Pipeline stage data (gitignored)
    1-download/
    2-extract/
    3-synthesize/
    4-place/
    5-group/
    6-generate/           # Staging area, NOT final output
  last_run.json           # Tracks last successful run timestamp (gitignored)

2. Config file: .braindump/config.toml

[braindump]
# Defaults applied when CLI args are not provided
authors = "all"               # or ["user1", "user2"]
since = "2025-01-01"          # initial start date; auto-advances after each run
min_score = 0.5               # group-stage threshold (place stays at 0.3)
max_rules = 100               # cap on total rules

[output]
# Where to write AGENTS.md files, relative to repo root
root = "."                    # default: repo root
filename = "AGENTS.md"        # could be CLAUDE.md, CURSOR.md, etc.

Keep it minimal. CLI args override config file, which overrides defaults.

3. Auto-detect repo (no --repo needed)

When --repo is not provided:

  1. Check if CWD is inside a git repo (git rev-parse --show-toplevel)
  2. Parse origin remote (git remote get-url origin) to extract owner/repo
  3. Look for .braindump/ in the repo root
  4. Set data dir to .braindump/data/ and output to repo root

Two modes based on detection:

  • Repo mode (CWD is in a git repo): data in .braindump/data/, output goes to repo root, config loaded from .braindump/config.toml. If .braindump/ doesn't exist yet, auto-create it with a confirmation prompt.
  • Standalone mode (--repo provided explicitly, or not in a git repo): current behavior, data in braindump's data/ dir

4. braindump init command

Creates the .braindump/ directory in the current repo:

  • Writes a starter config.toml with commented defaults
  • Creates empty rule_overrides.jsonl
  • Creates data/ dir
  • Appends .braindump/data/ and .braindump/last_run.json to .gitignore
  • Prints next-steps instructions
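
The scaffolding steps above can be sketched as follows. `init_braindump`, `STARTER_CONFIG`, and `GITIGNORE_ENTRIES` are illustrative names; the real command would live in src/braindump/commands/init.py and print the next-steps text at the end:

```python
from pathlib import Path

# Starter config mirrors the section-2 layout, with every key commented out.
STARTER_CONFIG = """\
[braindump]
# authors = "all"
# since = "2025-01-01"
# min_score = 0.5
# max_rules = 100

[output]
# root = "."
# filename = "AGENTS.md"
"""

GITIGNORE_ENTRIES = [".braindump/data/", ".braindump/last_run.json"]


def init_braindump(repo_root: Path) -> Path:
    """Scaffold .braindump/; idempotent so auto-init can reuse it."""
    bd = repo_root / ".braindump"
    (bd / "data").mkdir(parents=True, exist_ok=True)
    if not (bd / "config.toml").exists():
        (bd / "config.toml").write_text(STARTER_CONFIG)
    (bd / "rule_overrides.jsonl").touch()

    # Append gitignore entries only if they are not already present.
    gitignore = repo_root / ".gitignore"
    existing = gitignore.read_text() if gitignore.exists() else ""
    missing = [e for e in GITIGNORE_ENTRIES if e not in existing.splitlines()]
    if missing:
        with gitignore.open("a") as fh:
            if existing and not existing.endswith("\n"):
                fh.write("\n")
            fh.write("\n".join(missing) + "\n")
    return bd
```

Making it idempotent means the auto-init path in section 3 can call the same function without guarding against a half-created directory.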

5. RepoConfig refactor

File: src/braindump/config.py

Add repo_root parameter to RepoConfig:

from pathlib import Path


class RepoConfig:
    def __init__(self, repo: str, repo_root: Path | None = None) -> None:
        self.repo = repo
        self.repo_root = repo_root  # None = standalone mode

        if repo_root:
            # Repo mode: data inside .braindump/
            self.braindump_dir = repo_root / ".braindump"
            self.data_dir = self.braindump_dir / "data"
        else:
            # Standalone mode: data in braindump project
            self._project_root = Path(__file__).resolve().parent.parent.parent
            self.data_dir = self._project_root / "data" / repo

    @property
    def overrides_path(self) -> Path:
        if self.repo_root:
            return self.braindump_dir / "rule_overrides.jsonl"
        return self.data_dir / "rule_overrides.jsonl"

    @property
    def output_root(self) -> Path:
        """Where AGENTS.md files are written to in the actual repo."""
        if self.repo_root:
            return self.repo_root  # directly into repo
        return self.stage_dir("6-generate")  # staging area

    @property
    def config_path(self) -> Path | None:
        if self.repo_root:
            return self.braindump_dir / "config.toml"
        return None

Add a BraindumpConfig dataclass to load/merge config.toml with CLI args:

@dataclass
class BraindumpConfig:
    authors: str = "all"
    since: str | None = None
    min_score: float | None = None
    max_rules: int | None = None
    output_root: str = "."
    output_filename: str = "AGENTS.md"

    @classmethod
    def from_toml(cls, path: Path) -> BraindumpConfig: ...

    def merge_cli(self, **cli_args) -> BraindumpConfig:
        """CLI args override config file values (if not None/default)."""
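
A minimal sketch of the two stubs, assuming the `[braindump]`/`[output]` table layout from section 2 and the tomli fallback listed under "Files to modify":

```python
from __future__ import annotations

from dataclasses import dataclass, fields, replace
from pathlib import Path

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # the pyproject.toml fallback for older Pythons


@dataclass
class BraindumpConfig:
    authors: str = "all"
    since: str | None = None
    min_score: float | None = None
    max_rules: int | None = None
    output_root: str = "."
    output_filename: str = "AGENTS.md"

    @classmethod
    def from_toml(cls, path: Path) -> BraindumpConfig:
        data = tomllib.loads(path.read_text())
        flat = dict(data.get("braindump", {}))
        out = data.get("output", {})
        if "root" in out:
            flat["output_root"] = out["root"]
        if "filename" in out:
            flat["output_filename"] = out["filename"]
        # Ignore unknown keys so an old braindump tolerates a newer config.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in flat.items() if k in known})

    def merge_cli(self, **cli_args) -> BraindumpConfig:
        # CLI wins only when a value was explicitly provided (not None)
        return replace(self, **{k: v for k, v in cli_args.items() if v is not None})
```

Filtering `None` in `merge_cli` is what gives the section-2 precedence order: CLI args override config file values, which override the dataclass defaults.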

6. CLI changes (cli.py)

Update the main callback:

@app.callback()
def main(ctx: typer.Context, repo: str = ...) -> None:
    ctx.ensure_object(dict)
    if not repo:
        # Try auto-detect from git
        detected = detect_repo_from_git()
        if detected:
            repo, repo_root = detected
            config = RepoConfig(repo, repo_root=repo_root)
        else:
            repo = typer.prompt("GitHub repository (owner/repo)")
            config = RepoConfig(repo)
    else:
        config = RepoConfig(repo)

    # Load .braindump/config.toml if it exists
    braindump_config = BraindumpConfig()
    if config.config_path and config.config_path.exists():
        braindump_config = BraindumpConfig.from_toml(config.config_path)

    ctx.obj["config"] = config
    ctx.obj["braindump_config"] = braindump_config

Add detect_repo_from_git():

def detect_repo_from_git() -> tuple[str, Path] | None:
    """Detect owner/repo from git remote, return (repo, repo_root) or None."""
    # git rev-parse --show-toplevel → repo_root
    # git remote get-url origin → parse owner/repo from URL
    # Support both SSH and HTTPS formats
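
The URL parsing is worth isolating as a pure helper so the subprocess calls stay trivial to wrap. A sketch (the helper name and regex are illustrative, not part of the plan):

```python
from __future__ import annotations

import re

# Matches SSH (git@github.com:owner/repo.git) and HTTPS
# (https://github.com/owner/repo[.git][/]) remote formats.
_REMOTE_RE = re.compile(
    r"(?:git@github\.com:|https://github\.com/)"
    r"(?P<owner>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$"
)


def parse_owner_repo(url: str) -> str | None:
    """Extract "owner/repo" from a git remote URL, or None if unrecognized."""
    match = _REMOTE_RE.match(url.strip())
    return f"{match['owner']}/{match['repo']}" if match else None
```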

7. AGENTS.md merge logic

File: src/braindump/commands/generate.py

When writing to an actual repo (repo mode), merge with existing files:

def merge_with_existing(existing_content: str, braindump_content: str) -> str:
    """Replace content between <!-- braindump: ... --> and <!-- /braindump --> markers."""
    START = "<!-- braindump:"  # prefix match: the opening marker carries extra text
    END = "<!-- /braindump -->"

    start_idx = existing_content.find(START)
    end_idx = existing_content.find(END)

    if start_idx != -1 and end_idx != -1:
        # Replace existing braindump section
        before = existing_content[:start_idx]
        after = existing_content[end_idx + len(END):]
        return before + braindump_content + after
    else:
        # No existing section: append
        return existing_content.rstrip() + "\n\n" + braindump_content + "\n"

Update write_files() to use merge when repo_root is set:

  • If target file exists and has fencing markers → replace fenced section
  • If target file exists without markers → append with blank line
  • If target file doesn't exist → write directly
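
The three cases can be sketched together (function name hypothetical; `braindump_content` is assumed to carry its own start/end markers, as in merge_with_existing above):

```python
from pathlib import Path

START = "<!-- braindump:"
END = "<!-- /braindump -->"


def write_agents_file(target: Path, braindump_content: str) -> None:
    """Write one AGENTS.md in repo mode, preserving manual content."""
    if not target.exists():
        target.write_text(braindump_content)  # no file yet: write directly
        return
    existing = target.read_text()
    start_idx = existing.find(START)
    end_idx = existing.find(END)
    if start_idx != -1 and end_idx != -1:
        # markers present: swap out only the fenced section
        merged = (existing[:start_idx] + braindump_content
                  + existing[end_idx + len(END):])
    else:
        # no markers: append after a blank line
        merged = existing.rstrip() + "\n\n" + braindump_content + "\n"
    target.write_text(merged)
```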

8. Auto --since tracking

File: .braindump/last_run.json

{"last_download": "2025-06-15T10:30:00Z", "last_since": "2025-06-01"}

In the run command, when --since is not provided:

  1. Check last_run.json for last_since date
  2. If found, use it as the --since value
  3. After successful download, update last_since to today
  4. Fall back to config.toml since value, then to prompting
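
The read/write side of this can be sketched as two small helpers (names illustrative); the precedence is CLI flag, then last_run.json, then config.toml, with `None` meaning "prompt":

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path


def resolve_since(last_run_path: Path, config_since: str | None,
                  cli_since: str | None) -> str | None:
    """Pick the effective --since; None means the caller should prompt."""
    if cli_since:
        return cli_since
    if last_run_path.exists():
        last = json.loads(last_run_path.read_text()).get("last_since")
        if last:
            return last
    return config_since


def record_run(last_run_path: Path) -> None:
    """After a successful download, advance last_since to today (UTC)."""
    now = datetime.now(timezone.utc)
    last_run_path.write_text(json.dumps({
        "last_download": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "last_since": now.date().isoformat(),
    }))
```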

9. GitHub Actions workflow

Document a reference workflow in the README (not auto-generated). Key design:

  • Uses actions/cache to persist .braindump/data/ between runs
  • Creates a PR instead of pushing directly (reviewable)
  • GH_TOKEN from Actions provides gh CLI auth automatically
  • Triggered weekly + manual dispatch
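
A reference workflow along these lines might look as follows; the action versions, the `pip install` step, and the PR-creation action are assumptions to be pinned down when writing the README:

```yaml
name: braindump
on:
  schedule:
    - cron: "0 6 * * 1"   # weekly, Monday 06:00 UTC
  workflow_dispatch: {}

permissions:
  contents: write
  pull-requests: write

jobs:
  braindump:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: .braindump/data
          # unique key per run so the cache is re-saved with new data;
          # restore-keys pulls the most recent previous cache
          key: braindump-data-${{ github.run_id }}
          restore-keys: braindump-data-
      - run: pip install braindump   # assumed install method
      - run: braindump run
        env:
          GH_TOKEN: ${{ github.token }}   # gh CLI auth
      - uses: peter-evans/create-pull-request@v6
        with:
          title: "chore: update AGENTS.md from braindump"
          branch: braindump/update
```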

10. Output filename configurability

Support AGENTS.md, CLAUDE.md, CURSOR.md, etc. via config:

[output]
filename = "AGENTS.md"

This means the fencing markers become essential — they identify braindump's section regardless of what else is in the file.

11. Incremental processing strategy

Download + Extract are already incremental:

  • Download skips PRs where all 5 files exist
  • Extract uses checkpoint system to skip processed comment IDs, appends new extractions

Synthesize+ re-runs fully but uses caches:

  • When extractions.jsonl changes (new extractions appended), input hash changes → full re-cluster
  • Embedding cache (embeddings_keys.json) avoids re-computing embeddings when generalization texts match
  • Rephrase cache in generate stage avoids re-calling LLM for unchanged rules
  • This is acceptable for weekly CI runs — the main cost is LLM calls for cluster synthesis, which is bounded by cluster count

For GHA, actions/cache persists .braindump/data/ between runs so:

  • Download: only fetches new PRs since last run
  • Extract: only processes new comments
  • Synthesize+: re-runs but benefits from embedding caches
  • Generate: rephrase cache avoids redundant LLM calls for stable rules

12. Provenance index for lookup

File: .braindump/provenance.json (committed to repo)

Generated during the generate stage, alongside AGENTS.md files:

{
  "repo": "owner/repo",
  "generated_at": "2025-06-15T10:30:00Z",
  "rules": {
    "42": {
      "text": "Use keyword-only args for public APIs",
      "category": "api_design",
      "score": 0.85,
      "source_prs": [123, 456],
      "source_comments": [
        {"id": 789, "pr": 123, "author": "reviewer1", "url": "https://github.com/owner/repo/pull/123#discussion_r789"},
        {"id": 101, "pr": 456, "author": "reviewer2", "url": "https://github.com/owner/repo/pull/456#discussion_r101"}
      ]
    }
  }
}

This is small (a few KB) and lets:

  • braindump lookup work without full pipeline data — shows PR links + comment URLs
  • Contributors click through to the original review comment on GitHub
  • Lookup falls back gracefully: full verbose mode when pipeline data is available, URL-only mode when only provenance.json exists

Lookup two-mode behavior:

  • Full mode (pipeline data available): current behavior — shows comment body, diff hunk, etc.
  • Index mode (only provenance.json): shows rule text, score, PR numbers, clickable GitHub URLs

To build the URLs, we need comment IDs + PR numbers (already on rule objects as source_comments and source_prs). We construct https://github.com/{owner}/{repo}/pull/{pr}#discussion_r{comment_id}. Author info can be extracted from the downloaded comment data during generate.
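
Index-mode rendering is then only a few lines; the function name is hypothetical and the JSON shape is the one shown above:

```python
import json
from pathlib import Path


def lookup_from_index(provenance_path: Path, rule_id: str) -> list[str]:
    """Render rule text, score, and comment URLs from provenance.json
    alone, with no pipeline data required."""
    index = json.loads(provenance_path.read_text())
    rule = index["rules"].get(rule_id)
    if rule is None:
        return [f"rule {rule_id} not found"]
    lines = [f"[{rule['score']:.2f}] {rule['text']} ({rule['category']})"]
    lines += [f"  {c['author']}: {c['url']}" for c in rule["source_comments"]]
    return lines
```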

Files to modify

| File | Change |
| --- | --- |
| src/braindump/config.py | Add repo_root, BraindumpConfig, detect_repo_from_git(), provenance_path |
| src/braindump/cli.py | Auto-detect repo, load config.toml, auto-init, register init command |
| src/braindump/commands/init.py | New: braindump init command |
| src/braindump/commands/generate.py | Add merge logic, write to repo root in repo mode, generate provenance.json |
| src/braindump/commands/run.py | Auto-since tracking, pass braindump_config |
| src/braindump/commands/download.py | Accept since from auto-tracking |
| src/braindump/commands/lookup.py | Two-mode: full (pipeline data) vs index (provenance.json only) |
| pyproject.toml | Add tomli dependency (for TOML parsing on Python < 3.11) |

Implementation order

  1. RepoConfig refactor — add repo_root support, keep backward compat
  2. BraindumpConfig — TOML loading + CLI merge
  3. Auto-detect repo — git remote parsing in cli.py, auto-init with confirmation
  4. braindump init — scaffold .braindump/ directory (also called by auto-init)
  5. AGENTS.md merging — fenced section replacement in generate.py
  6. Provenance index — generate provenance.json during generate stage, update lookup for two-mode
  7. Auto-since tracking — last_run.json read/write
  8. Output filename config — support CLAUDE.md etc.
  9. README — document GitHub Actions workflow

Verification

  1. Run braindump init in an existing repo → creates .braindump/ with correct structure
  2. Run braindump status without --repo → auto-detects from git remote
  3. Run braindump run → pipeline writes AGENTS.md directly into repo root
  4. Add manual content above <!-- braindump --> in AGENTS.md → re-run → manual content preserved
  5. Run again → --since auto-advances, only new PRs processed
  6. Check .gitignore → .braindump/data/ and last_run.json are excluded
  7. braindump lookup 42 with only .braindump/provenance.json → shows rule text + GitHub URLs
  8. braindump lookup 42 -v with full pipeline data → shows full comment bodies
  9. --repo owner/repo still works in standalone mode (backward compat)
