Make work inside repo #2

@DouweM

Description

Plan by Claude:

Plan: Make braindump work seamlessly inside existing repos

Context

Braindump currently operates as a standalone tool: data lives inside its own data/ directory, config is CLI-only, and generated AGENTS.md files go into 6-generate/ rather than the target repo. To make it a tool you install and run inside your repo (like a linter), we need .braindump/ directory support, auto-detection, config files, AGENTS.md merging, and a GitHub Actions story.

Changes

1. .braindump/ directory layout in target repos

.braindump/
  config.toml            # Repo-specific settings (committed)
  rule_overrides.jsonl    # Manual rule corrections (committed)
  data/                   # Pipeline stage data (gitignored)
    1-download/
    2-extract/
    3-synthesize/
    4-place/
    5-group/
    6-generate/           # Staging area, NOT final output
  last_run.json           # Tracks last successful run timestamp (gitignored)

2. Config file: .braindump/config.toml

[braindump]
# Defaults applied when CLI args are not provided
authors = "all"               # or ["user1", "user2"]
since = "2025-01-01"          # initial start date; auto-advances after each run
min_score = 0.5               # group-stage threshold (place stays at 0.3)
max_rules = 100               # cap on total rules

[output]
# Where to write AGENTS.md files, relative to repo root
root = "."                    # default: repo root
filename = "AGENTS.md"        # could be CLAUDE.md, CURSOR.md, etc.

Keep it minimal. CLI args override config file, which overrides defaults.

3. Auto-detect repo (no --repo needed)

When --repo is not provided:

  1. Check if CWD is inside a git repo (git rev-parse --show-toplevel)
  2. Parse origin remote (git remote get-url origin) to extract owner/repo
  3. Look for .braindump/ in the repo root
  4. Set data dir to .braindump/data/ and output to repo root

Two modes based on detection:

  • Repo mode (CWD is in a git repo): data in .braindump/data/, output goes to repo root, config loaded from .braindump/config.toml. If .braindump/ doesn't exist yet, auto-create it with a confirmation prompt.
  • Standalone mode (--repo provided explicitly, or not in a git repo): current behavior, data in braindump's data/ dir

4. braindump init command

Creates the .braindump/ directory in the current repo:

  • Writes a starter config.toml with commented defaults
  • Creates empty rule_overrides.jsonl
  • Creates data/ dir
  • Appends .braindump/data/ and .braindump/last_run.json to .gitignore
  • Prints next-steps instructions
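
The scaffolding steps above can be sketched as follows. `init_braindump`, `STARTER_CONFIG`, and `GITIGNORE_ENTRIES` are illustrative names; the real command would live in src/braindump/commands/init.py and print the next-steps text at the end:

```python
from pathlib import Path

# Starter config mirrors the section-2 layout, with every key commented out.
STARTER_CONFIG = """\
[braindump]
# authors = "all"
# since = "2025-01-01"
# min_score = 0.5
# max_rules = 100

[output]
# root = "."
# filename = "AGENTS.md"
"""

GITIGNORE_ENTRIES = [".braindump/data/", ".braindump/last_run.json"]


def init_braindump(repo_root: Path) -> Path:
    """Scaffold .braindump/; idempotent so auto-init can reuse it."""
    bd = repo_root / ".braindump"
    (bd / "data").mkdir(parents=True, exist_ok=True)
    if not (bd / "config.toml").exists():
        (bd / "config.toml").write_text(STARTER_CONFIG)
    (bd / "rule_overrides.jsonl").touch()

    # Append gitignore entries only if they are not already present.
    gitignore = repo_root / ".gitignore"
    existing = gitignore.read_text() if gitignore.exists() else ""
    missing = [e for e in GITIGNORE_ENTRIES if e not in existing.splitlines()]
    if missing:
        with gitignore.open("a") as fh:
            if existing and not existing.endswith("\n"):
                fh.write("\n")
            fh.write("\n".join(missing) + "\n")
    return bd
```

Making it idempotent means the auto-init path in section 3 can call the same function without guarding against a half-created directory.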

5. RepoConfig refactor

File: src/braindump/config.py

Add repo_root parameter to RepoConfig:

from pathlib import Path


class RepoConfig:
    def __init__(self, repo: str, repo_root: Path | None = None) -> None:
        self.repo = repo
        self.repo_root = repo_root  # None = standalone mode

        if repo_root:
            # Repo mode: data inside .braindump/
            self.braindump_dir = repo_root / ".braindump"
            self.data_dir = self.braindump_dir / "data"
        else:
            # Standalone mode: data in braindump project
            self._project_root = Path(__file__).resolve().parent.parent.parent
            self.data_dir = self._project_root / "data" / repo

    @property
    def overrides_path(self) -> Path:
        if self.repo_root:
            return self.braindump_dir / "rule_overrides.jsonl"
        return self.data_dir / "rule_overrides.jsonl"

    @property
    def output_root(self) -> Path:
        """Where AGENTS.md files are written to in the actual repo."""
        if self.repo_root:
            return self.repo_root  # directly into repo
        return self.stage_dir("6-generate")  # staging area

    @property
    def config_path(self) -> Path | None:
        if self.repo_root:
            return self.braindump_dir / "config.toml"
        return None

Add a BraindumpConfig dataclass to load/merge config.toml with CLI args:

@dataclass
class BraindumpConfig:
    authors: str = "all"
    since: str | None = None
    min_score: float | None = None
    max_rules: int | None = None
    output_root: str = "."
    output_filename: str = "AGENTS.md"

    @classmethod
    def from_toml(cls, path: Path) -> BraindumpConfig: ...

    def merge_cli(self, **cli_args) -> BraindumpConfig:
        """CLI args override config file values (if not None/default)."""
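
A minimal sketch of the two stubs, assuming the `[braindump]`/`[output]` table layout from section 2 and the tomli fallback listed under "Files to modify":

```python
from __future__ import annotations

from dataclasses import dataclass, fields, replace
from pathlib import Path

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # the pyproject.toml fallback for older Pythons


@dataclass
class BraindumpConfig:
    authors: str = "all"
    since: str | None = None
    min_score: float | None = None
    max_rules: int | None = None
    output_root: str = "."
    output_filename: str = "AGENTS.md"

    @classmethod
    def from_toml(cls, path: Path) -> BraindumpConfig:
        data = tomllib.loads(path.read_text())
        flat = dict(data.get("braindump", {}))
        out = data.get("output", {})
        if "root" in out:
            flat["output_root"] = out["root"]
        if "filename" in out:
            flat["output_filename"] = out["filename"]
        # Ignore unknown keys so an old braindump tolerates a newer config.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in flat.items() if k in known})

    def merge_cli(self, **cli_args) -> BraindumpConfig:
        # CLI wins only when a value was explicitly provided (not None)
        return replace(self, **{k: v for k, v in cli_args.items() if v is not None})
```

Filtering `None` in `merge_cli` is what gives the section-2 precedence order: CLI args override config file values, which override the dataclass defaults.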

6. CLI changes (cli.py)

Update the main callback:

@app.callback()
def main(ctx: typer.Context, repo: str = ...) -> None:
    ctx.ensure_object(dict)
    if not repo:
        # Try auto-detect from git
        detected = detect_repo_from_git()
        if detected:
            repo, repo_root = detected
            config = RepoConfig(repo, repo_root=repo_root)
        else:
            repo = typer.prompt("GitHub repository (owner/repo)")
            config = RepoConfig(repo)
    else:
        config = RepoConfig(repo)

    # Load .braindump/config.toml if it exists
    braindump_config = BraindumpConfig()
    if config.config_path and config.config_path.exists():
        braindump_config = BraindumpConfig.from_toml(config.config_path)

    ctx.obj["config"] = config
    ctx.obj["braindump_config"] = braindump_config

Add detect_repo_from_git():

def detect_repo_from_git() -> tuple[str, Path] | None:
    """Detect owner/repo from git remote, return (repo, repo_root) or None."""
    # git rev-parse --show-toplevel → repo_root
    # git remote get-url origin → parse owner/repo from URL
    # Support both SSH and HTTPS formats
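
The URL parsing is worth isolating as a pure helper so the subprocess calls stay trivial to wrap. A sketch (the helper name and regex are illustrative, not part of the plan):

```python
from __future__ import annotations

import re

# Matches SSH (git@github.com:owner/repo.git) and HTTPS
# (https://github.com/owner/repo[.git][/]) remote formats.
_REMOTE_RE = re.compile(
    r"(?:git@github\.com:|https://github\.com/)"
    r"(?P<owner>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$"
)


def parse_owner_repo(url: str) -> str | None:
    """Extract "owner/repo" from a git remote URL, or None if unrecognized."""
    match = _REMOTE_RE.match(url.strip())
    return f"{match['owner']}/{match['repo']}" if match else None
```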

7. AGENTS.md merge logic

File: src/braindump/commands/generate.py

When writing to an actual repo (repo mode), merge with existing files:

def merge_with_existing(existing_content: str, braindump_content: str) -> str:
    """Replace content between <!-- braindump: ... --> and <!-- /braindump --> markers."""
    START = "<!-- braindump:"  # prefix match: the opening marker carries extra text
    END = "<!-- /braindump -->"

    start_idx = existing_content.find(START)
    end_idx = existing_content.find(END)

    if start_idx != -1 and end_idx != -1:
        # Replace existing braindump section
        before = existing_content[:start_idx]
        after = existing_content[end_idx + len(END):]
        return before + braindump_content + after
    else:
        # No existing section: append
        return existing_content.rstrip() + "\n\n" + braindump_content + "\n"

Update write_files() to use merge when repo_root is set:

  • If target file exists and has fencing markers → replace fenced section
  • If target file exists without markers → append with blank line
  • If target file doesn't exist → write directly
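
The three cases can be sketched together (function name hypothetical; `braindump_content` is assumed to carry its own start/end markers, as in merge_with_existing above):

```python
from pathlib import Path

START = "<!-- braindump:"
END = "<!-- /braindump -->"


def write_agents_file(target: Path, braindump_content: str) -> None:
    """Write one AGENTS.md in repo mode, preserving manual content."""
    if not target.exists():
        target.write_text(braindump_content)  # no file yet: write directly
        return
    existing = target.read_text()
    start_idx = existing.find(START)
    end_idx = existing.find(END)
    if start_idx != -1 and end_idx != -1:
        # markers present: swap out only the fenced section
        merged = (existing[:start_idx] + braindump_content
                  + existing[end_idx + len(END):])
    else:
        # no markers: append after a blank line
        merged = existing.rstrip() + "\n\n" + braindump_content + "\n"
    target.write_text(merged)
```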

8. Auto --since tracking

File: .braindump/last_run.json

{"last_download": "2025-06-15T10:30:00Z", "last_since": "2025-06-01"}

In the run command, when --since is not provided:

  1. Check last_run.json for last_since date
  2. If found, use it as the --since value
  3. After successful download, update last_since to today
  4. Fall back to config.toml since value, then to prompting
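
The read/write side of this can be sketched as two small helpers (names illustrative); the precedence is CLI flag, then last_run.json, then config.toml, with `None` meaning "prompt":

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path


def resolve_since(last_run_path: Path, config_since: str | None,
                  cli_since: str | None) -> str | None:
    """Pick the effective --since; None means the caller should prompt."""
    if cli_since:
        return cli_since
    if last_run_path.exists():
        last = json.loads(last_run_path.read_text()).get("last_since")
        if last:
            return last
    return config_since


def record_run(last_run_path: Path) -> None:
    """After a successful download, advance last_since to today (UTC)."""
    now = datetime.now(timezone.utc)
    last_run_path.write_text(json.dumps({
        "last_download": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "last_since": now.date().isoformat(),
    }))
```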

9. GitHub Actions workflow

Document a reference workflow in the README (not auto-generated). Key design:

  • Uses actions/cache to persist .braindump/data/ between runs
  • Creates a PR instead of pushing directly (reviewable)
  • GH_TOKEN from Actions provides gh CLI auth automatically
  • Triggered weekly + manual dispatch
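
A reference workflow along these lines might look as follows; the action versions, the `pip install` step, and the PR-creation action are assumptions to be pinned down when writing the README:

```yaml
name: braindump
on:
  schedule:
    - cron: "0 6 * * 1"   # weekly, Monday 06:00 UTC
  workflow_dispatch: {}

permissions:
  contents: write
  pull-requests: write

jobs:
  braindump:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: .braindump/data
          # unique key per run so the cache is re-saved with new data;
          # restore-keys pulls the most recent previous cache
          key: braindump-data-${{ github.run_id }}
          restore-keys: braindump-data-
      - run: pip install braindump   # assumed install method
      - run: braindump run
        env:
          GH_TOKEN: ${{ github.token }}   # gh CLI auth
      - uses: peter-evans/create-pull-request@v6
        with:
          title: "chore: update AGENTS.md from braindump"
          branch: braindump/update
```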

10. Output filename configurability

Support AGENTS.md, CLAUDE.md, CURSOR.md, etc. via config:

[output]
filename = "AGENTS.md"

This means the fencing markers become essential — they identify braindump's section regardless of what else is in the file.

11. Incremental processing strategy

Download + Extract are already incremental:

  • Download skips PRs where all 5 files exist
  • Extract uses checkpoint system to skip processed comment IDs, appends new extractions

Synthesize+ re-runs fully but uses caches:

  • When extractions.jsonl changes (new extractions appended), input hash changes → full re-cluster
  • Embedding cache (embeddings_keys.json) avoids re-computing embeddings when generalization texts match
  • Rephrase cache in generate stage avoids re-calling LLM for unchanged rules
  • This is acceptable for weekly CI runs — the main cost is LLM calls for cluster synthesis, which is bounded by cluster count

For GHA, actions/cache persists .braindump/data/ between runs so:

  • Download: only fetches new PRs since last run
  • Extract: only processes new comments
  • Synthesize+: re-runs but benefits from embedding caches
  • Generate: rephrase cache avoids redundant LLM calls for stable rules

12. Provenance index for lookup

File: .braindump/provenance.json (committed to repo)

Generated during the generate stage, alongside AGENTS.md files:

{
  "repo": "owner/repo",
  "generated_at": "2025-06-15T10:30:00Z",
  "rules": {
    "42": {
      "text": "Use keyword-only args for public APIs",
      "category": "api_design",
      "score": 0.85,
      "source_prs": [123, 456],
      "source_comments": [
        {"id": 789, "pr": 123, "author": "reviewer1", "url": "https://github.com/owner/repo/pull/123#discussion_r789"},
        {"id": 101, "pr": 456, "author": "reviewer2", "url": "https://github.com/owner/repo/pull/456#discussion_r101"}
      ]
    }
  }
}

This is small (a few KB) and lets:

  • braindump lookup work without full pipeline data — shows PR links + comment URLs
  • Contributors click through to the original review comment on GitHub
  • Lookup falls back gracefully: full verbose mode when pipeline data is available, URL-only mode when only provenance.json exists

Lookup two-mode behavior:

  • Full mode (pipeline data available): current behavior — shows comment body, diff hunk, etc.
  • Index mode (only provenance.json): shows rule text, score, PR numbers, clickable GitHub URLs

To build the URLs, we need comment IDs + PR numbers (already on rule objects as source_comments and source_prs). We construct https://github.com/{owner}/{repo}/pull/{pr}#discussion_r{comment_id}. Author info can be extracted from the downloaded comment data during generate.
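
Index-mode rendering is then only a few lines; the function name is hypothetical and the JSON shape is the one shown above:

```python
import json
from pathlib import Path


def lookup_from_index(provenance_path: Path, rule_id: str) -> list[str]:
    """Render rule text, score, and comment URLs from provenance.json
    alone, with no pipeline data required."""
    index = json.loads(provenance_path.read_text())
    rule = index["rules"].get(rule_id)
    if rule is None:
        return [f"rule {rule_id} not found"]
    lines = [f"[{rule['score']:.2f}] {rule['text']} ({rule['category']})"]
    lines += [f"  {c['author']}: {c['url']}" for c in rule["source_comments"]]
    return lines
```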

Files to modify

| File | Change |
| --- | --- |
| src/braindump/config.py | Add repo_root, BraindumpConfig, detect_repo_from_git(), provenance_path |
| src/braindump/cli.py | Auto-detect repo, load config.toml, auto-init, register init command |
| src/braindump/commands/init.py | New: braindump init command |
| src/braindump/commands/generate.py | Add merge logic, write to repo root in repo mode, generate provenance.json |
| src/braindump/commands/run.py | Auto-since tracking, pass braindump_config |
| src/braindump/commands/download.py | Accept since from auto-tracking |
| src/braindump/commands/lookup.py | Two-mode: full (pipeline data) vs index (provenance.json only) |
| pyproject.toml | Add tomli dependency (for TOML parsing on Python < 3.11) |

Implementation order

  1. RepoConfig refactor — add repo_root support, keep backward compat
  2. BraindumpConfig — TOML loading + CLI merge
  3. Auto-detect repo — git remote parsing in cli.py, auto-init with confirmation
  4. braindump init — scaffold .braindump/ directory (also called by auto-init)
  5. AGENTS.md merging — fenced section replacement in generate.py
  6. Provenance index — generate provenance.json during generate stage, update lookup for two-mode
  7. Auto-since tracking — last_run.json read/write
  8. Output filename config — support CLAUDE.md etc.
  9. README — document GitHub Actions workflow

Verification

  1. Run braindump init in an existing repo → creates .braindump/ with correct structure
  2. Run braindump status without --repo → auto-detects from git remote
  3. Run braindump run → pipeline writes AGENTS.md directly into repo root
  4. Add manual content above <!-- braindump --> in AGENTS.md → re-run → manual content preserved
  5. Run again → --since auto-advances, only new PRs processed
  6. Check .gitignore → .braindump/data/ and last_run.json are excluded
  7. braindump lookup 42 with only .braindump/provenance.json → shows rule text + GitHub URLs
  8. braindump lookup 42 -v with full pipeline data → shows full comment bodies
  9. --repo owner/repo still works in standalone mode (backward compat)
