| 1 | +--- |
| 2 | +title: 'SPEC-14: Cloud Git Versioning & GitHub Backup' |
| 3 | +type: spec |
| 4 | +permalink: specs/spec-14-cloud-git-versioning |
| 5 | +tags: |
| 6 | +- git |
| 7 | +- github |
| 8 | +- backup |
| 9 | +- versioning |
| 10 | +- cloud |
| 11 | +related: |
| 12 | +- specs/spec-9-multi-project-bisync |
| 13 | +- specs/spec-9-follow-ups-conflict-sync-and-observability |
| 14 | +status: deferred |
| 15 | +--- |
| 16 | + |
| 17 | +# SPEC-14: Cloud Git Versioning & GitHub Backup |
| 18 | + |
| 19 | +**Status: DEFERRED** - Postponed until multi-user/teams feature development. Using S3 versioning (SPEC-9.1) for v1 instead. |
| 20 | + |
| 21 | +## Why Deferred |
| 22 | + |
| 23 | +**Original goals can be met with simpler solutions:** |
| 24 | +- Version history → **S3 bucket versioning** (automatic, zero config) |
| 25 | +- Offsite backup → **Tigris global replication** (built-in) |
| 26 | +- Restore capability → **S3 version restore** (`bm cloud restore --version-id`) |
| 27 | +- Collaboration → **Deferred to teams/multi-user feature** (not v1 requirement) |
| 28 | + |
| 29 | +**Complexity vs value trade-off:** |
| 30 | +- Git integration adds: committer service, puller service, webhooks, LFS, merge conflicts |
| 31 | +- Risk: Loop detection between Git ↔ rclone bisync ↔ local edits |
| 32 | +- S3 versioning gives an estimated 80% of the value with about 5% of the complexity |
| 33 | + |
| 34 | +**When to revisit:** |
| 35 | +- Teams/multi-user features (PR-based collaboration workflow) |
| 36 | +- User requests for commit messages and branch-based workflows |
| 37 | +- Need for fine-grained audit trail beyond S3 object metadata |
| 38 | + |
| 39 | +--- |
| 40 | + |
| 41 | +## Original Specification (for reference) |
| 42 | + |
| 43 | +## Why |
| 44 | +Early access users want **transparent version history**, easy **offsite backup**, and a familiar **restore/branching** workflow. Git/GitHub integration would provide: |
| 45 | +- Auditable history of every change (who/when/why) |
| 46 | +- Branches/PRs for review and collaboration |
| 47 | +- Offsite private backup under the user's control |
| 48 | +- Escape hatch: users can always `git clone` their knowledge base |
| 49 | + |
| 50 | +**Note:** These goals are now addressed via S3 versioning (SPEC-9.1) for the single-user use case. |
| 51 | + |
| 52 | +## Goals |
| 53 | +- **Transparent**: Users keep using Basic Memory; Git runs behind the scenes. |
| 54 | +- **Private**: Push to a **private GitHub repo** that the user owns (or tenant org). |
| 55 | +- **Reliable**: No data loss, deterministic mapping of filesystem ↔ Git. |
| 56 | +- **Composable**: Plays nicely with SPEC‑9 bisync and upcoming conflict features (SPEC‑9 Follow‑Ups). |
| 57 | + |
| 58 | +**Non‑Goals (for v1):** |
| 59 | +- Fine‑grained per‑file encryption in Git history (can be layered later). |
| 60 | +- Large media optimization beyond Git LFS defaults. |
| 61 | + |
| 62 | +## User Stories |
| 63 | +1. *As a user*, I connect my GitHub and choose a private backup repo. |
| 64 | +2. *As a user*, every change I make in cloud (or via bisync) is **committed** and **pushed** automatically. |
| 65 | +3. *As a user*, I can **restore** a file/folder/project to a prior version. |
| 66 | +4. *As a power user*, I can **git pull/push** directly to collaborate outside the app. |
| 67 | +5. *As an admin*, I can enforce repo ownership (tenant org) and least‑privilege scopes. |
| 68 | + |
| 69 | +## Scope |
| 70 | +- **In scope:** Full repo backup of `/app/data/` (all projects) with optional selective subpaths. |
| 71 | +- **Out of scope (v1):** Partial shallow mirrors; encrypted Git; cross‑provider SCM (GitLab/Bitbucket). |
| 72 | + |
| 73 | +## Architecture |
| 74 | +### Topology |
| 75 | +- **Authoritative working tree**: `/app/data/` (bucket mount) remains the source of truth (SPEC‑9). |
| 76 | +- **Bare repo** lives alongside: `/app/git/${tenant}/knowledge.git` (server‑side). |
| 77 | +- **Mirror remote**: `github.com/<owner>/<repo>.git` (private). |
| 78 | + |
| 79 | +```mermaid |
| 80 | +flowchart LR |
| 81 | + A[/Users & Agents/] -->|writes/edits| B[/app/data/] |
| 82 | + B -->|file events| C[Committer Service] |
| 83 | + C -->|git commit| D[(Bare Repo)] |
| 84 | + D -->|push| E[(GitHub Private Repo)] |
| 85 | + E -->|webhook (push)| F[Puller Service] |
| 86 | + F -->|git pull/merge| D |
| 87 | + D -->|checkout/merge| B |
| 88 | +``` |
| 89 | + |
| 90 | +### Services |
| 91 | +- **Committer Service** (daemon): |
| 92 | + - Watches `/app/data/` for changes (inotify/poll) |
| 93 | + - Batches changes (debounce e.g. 2–5s) |
| 94 | + - Writes `.bmmeta` (if present) into the commit message trailer (see Follow‑Ups) |
| 95 | + - `git add -A && git commit -m "chore(sync): <summary>" -m "BM-Meta: <json>"` (each `-m` becomes its own paragraph, so the trailer ends the message) |
| 98 | + - Periodic `git push` to GitHub mirror (configurable interval) |
| 99 | +- **Puller Service** (webhook target): |
| 100 | + - Receives GitHub webhook (push) → `git fetch` |
| 101 | + - **Fast‑forward** merges to `main` only; reject non‑FF unless policy allows |
| 102 | + - Applies changes back to `/app/data/` via clean checkout |
| 103 | + - Emits sync events for Basic Memory indexers |
| 104 | + |
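
The committer's debounce-and-commit behaviour could look roughly like the sketch below. It is illustrative only: `ChangeBatcher`, `build_commit_message`, and the summary wording are hypothetical names introduced here, and the clock is passed in explicitly so the logic can be exercised without real file events.

```python
import json
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ChangeBatcher:
    """Collects changed paths and releases them once the tree has been quiet.

    Hypothetical helper: the spec only says "debounce e.g. 2-5s".
    """
    quiet_seconds: float = 2.0
    _pending: Set[str] = field(default_factory=set)
    _last_event: Optional[float] = None

    def record(self, path: str, now: float) -> None:
        # Every new event restarts the quiet window.
        self._pending.add(path)
        self._last_event = now

    def flush_if_quiet(self, now: float) -> list:
        # Return the batch only when no event arrived for quiet_seconds.
        if not self._pending or self._last_event is None:
            return []
        if now - self._last_event < self.quiet_seconds:
            return []
        batch = sorted(self._pending)
        self._pending.clear()
        self._last_event = None
        return batch


def build_commit_message(paths: list, meta: dict) -> str:
    # Summary line, blank line, then the BM-Meta trailer as compact JSON.
    summary = f"chore(sync): {len(paths)} file(s) changed"
    trailer = "BM-Meta: " + json.dumps(meta, sort_keys=True, separators=(",", ":"))
    return f"{summary}\n\n{trailer}"
```

A daemon would call `record` from its file watcher and poll `flush_if_quiet` on a timer, committing whenever a non-empty batch comes back.
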
| 105 | +### Auth & Security |
| 106 | +- **GitHub App** (recommended): minimal scopes: `contents:read/write`, `metadata:read`, webhook. |
| 107 | +- Tenant‑scoped installation; repo created in user account or tenant org. |
| 108 | +- Tokens stored in KMS/secret manager; rotated automatically. |
| 109 | +- Optional policy: allow only **FF merges** on `main`; non‑FF requires PR. |
| 110 | + |
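
The FF-only policy reduces to an ancestry check: an incoming tip is a fast-forward of `main` exactly when the current `main` tip is one of its ancestors. The sketch below shows that check on an in-memory parent map. The function names and the `parents` structure are hypothetical; a real service would more likely ask Git directly, for example with `git merge-base --is-ancestor <main-tip> <incoming-tip>`.

```python
def is_ancestor(parents: dict, ancestor: str, descendant: str) -> bool:
    """Return True if `ancestor` is reachable from `descendant` via parent links."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def accept_update(parents: dict, main_tip: str, incoming_tip: str,
                  allow_non_ff: bool = False) -> bool:
    # Mirrors the policy above: fast-forwards always pass,
    # anything else only when non-FF merges are explicitly allowed.
    return allow_non_ff or is_ancestor(parents, main_tip, incoming_tip)
```

For a history where `c` descends from `b` but `d` branched off earlier at `a`, `accept_update(parents, "b", "c")` passes while `accept_update(parents, "b", "d")` is rejected unless `allow_non_ff` is set.
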
| 111 | +### Repo Layout |
| 112 | +- **Monorepo** (default): one repo per tenant mirrors `/app/data/` with subfolders per project. |
| 113 | +- Optional multi‑repo mode (later): one repo per project. |
| 114 | + |
| 115 | +### File Handling |
| 116 | +- Honor `.gitignore` generated from `.bmignore.rclone` + BM defaults (cache, temp, state). |
| 117 | +- **Git LFS** for large binaries (images, media) — auto track by extension/size threshold. |
| 118 | +- Normalize newline + Unicode (aligns with Follow‑Ups). |
| 119 | + |
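
As a rough illustration of "auto track by extension/size threshold", the decision could be a small predicate like the one below. The extension set and the function name are assumptions, not part of the spec. Note also that Git LFS itself selects files by path patterns in `.gitattributes`, so a size-based rule would need the committer to add a pattern for each oversized file.

```python
from pathlib import PurePosixPath

# Hypothetical default set; the spec only says "images, media".
LFS_EXTENSIONS = frozenset({".png", ".jpg", ".jpeg", ".gif", ".mp4", ".mov"})


def should_track_with_lfs(path: str, size_bytes: int,
                          size_threshold: int = 10 * 1024 * 1024,
                          extensions: frozenset = LFS_EXTENSIONS) -> bool:
    """Return True when a file should go to LFS by extension or by size."""
    suffix = PurePosixPath(path).suffix.lower()
    return suffix in extensions or size_bytes >= size_threshold
```
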
| 120 | +### Conflict Model |
| 121 | +- **Primary concurrency**: SPEC‑9 Follow‑Ups (`.bmmeta`, conflict copies) stays the first line of defense. |
| 122 | +- **Git merges** are a **secondary** mechanism: |
| 123 | + - Server only auto‑merges **text** conflicts when trivial (FF or clean 3‑way). |
| 124 | + - Otherwise, create `name (conflict from <branch>, <ts>).md` and surface via events. |
| 125 | + |
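
The conflict-copy name `name (conflict from <branch>, <ts>).md` could be derived as in this sketch. The timestamp format (compact UTC) and the replacement of `/` in branch names are assumptions made here so the result stays a single valid file name; the spec does not fix either.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def conflict_copy_path(path: str, branch: str, when: datetime) -> str:
    """Build 'name (conflict from <branch>, <ts>).ext' next to the original file."""
    p = PurePosixPath(path)
    # Assumed format: compact UTC timestamp, e.g. 20250102T030405Z.
    ts = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Branch names may contain '/', which is not allowed inside a file name.
    safe_branch = branch.replace("/", "-")
    return str(p.with_name(f"{p.stem} (conflict from {safe_branch}, {ts}){p.suffix}"))
```
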
| 126 | +### Data Flow vs Bisync |
| 127 | +- Bisync (rclone) continues between local sync dir ↔ bucket. |
| 128 | +- Git sits **cloud‑side** between bucket and GitHub. |
| 129 | +- On **pull** from GitHub → files written to `/app/data/` → picked up by indexers & eventually by bisync back to users. |
| 130 | + |
| 131 | +## CLI & UX |
| 132 | +New commands (cloud mode): |
| 133 | +- `bm cloud git connect` — Launch GitHub App installation; create private repo; store installation id. |
| 134 | +- `bm cloud git status` — Show connected repo, last push time, last webhook delivery, pending commits. |
| 135 | +- `bm cloud git push` — Manual push (rarely needed). |
| 136 | +- `bm cloud git pull` — Manual pull/FF (admin only by default). |
| 137 | +- `bm cloud snapshot -m "message"` — Create a tagged point‑in‑time snapshot (git tag). |
| 138 | +- `bm restore <path> --to <commit|tag>` — Restore file/folder/project to prior version. |
| 139 | + |
| 140 | +Settings: |
| 141 | +- `bm config set git.autoPushInterval=5s` |
| 142 | +- `bm config set git.lfs.sizeThreshold=10MB` |
| 143 | +- `bm config set git.allowNonFF=false` |
| 144 | + |
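
The settings above use human-readable values (`5s`, `10MB`), while the sample config in the appendix stores the LFS threshold as `10485760` bytes. One way to reconcile the two is sketched below, assuming `MB` means 1024 × 1024 bytes (which matches the appendix value). The parser names and the accepted unit lists are hypothetical.

```python
import re

_SIZE_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
_DURATION_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_size(text: str) -> int:
    """Convert a value such as '10MB' into bytes (binary units assumed)."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _SIZE_UNITS[match.group(2).upper()]


def parse_duration(text: str) -> float:
    """Convert a value such as '5s' into seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s|m|h)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return float(match.group(1)) * _DURATION_UNITS[match.group(2)]
```
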
| 145 | +## Migration & Backfill |
| 146 | +- On connect, if the repo is empty: make an initial commit of the entire `/app/data/`. |
| 147 | +- If the repo already has content: require a **one‑time import** path (clone to staging, reconcile, choose direction). |
| 148 | + |
| 149 | +## Edge Cases |
| 150 | +- Massive deletes: gated by SPEC‑9 `max_delete` **and** Git pre‑push hook checks. |
| 151 | +- Case changes and rename detection: rely on git rename heuristics + Follow‑Ups move hints. |
| 152 | +- Secrets: default ignore common secret patterns; allow custom deny list. |
| 153 | + |
| 154 | +## Telemetry & Observability |
| 155 | +- Emit `git_commit`, `git_push`, `git_pull`, `git_conflict` events with correlation IDs. |
| 156 | +- `bm sync --report` extended with Git stats (commit count, delta bytes, push latency). |
| 157 | + |
| 158 | +## Phased Plan |
| 159 | +### Phase 0 — Prototype (1 sprint) |
| 160 | +- Server: bare repo init + simple committer (batch every 10s) + manual GitHub token. |
| 161 | +- CLI: `bm cloud git connect --token <PAT>` (dev‑only) |
| 162 | +- Success: edits in `/app/data/` appear in GitHub within 30s. |
| 163 | + |
| 164 | +### Phase 1 — GitHub App & Webhooks (1–2 sprints) |
| 165 | +- Switch to GitHub App installs; create private repo; store installation id. |
| 166 | +- Committer hardened (debounce 2–5s, backoff, retries). |
| 167 | +- Puller service with webhook → FF merge → checkout to `/app/data/`. |
| 168 | +- LFS auto‑track + `.gitignore` generation. |
| 169 | +- CLI surfaces status + logs. |
| 170 | + |
| 171 | +### Phase 2 — Restore & Snapshots (1 sprint) |
| 172 | +- `bm restore` for file/folder/project with dry‑run. |
| 173 | +- `bm cloud snapshot` tags + list/inspect. |
| 174 | +- Policy: PR‑only non‑FF, admin override. |
| 175 | + |
| 176 | +### Phase 3 — Selective & Multi‑Repo (nice‑to‑have) |
| 177 | +- Include/exclude projects; optional per‑project repos. |
| 178 | +- Advanced policies (branch protections, required reviews). |
| 179 | + |
| 180 | +## Acceptance Criteria |
| 181 | +- Changes to `/app/data/` are committed and pushed automatically within a configurable interval (default ≤5s). |
| 182 | +- GitHub webhook pull results in updated files in `/app/data/` (FF‑only by default). |
| 183 | +- LFS configured and functioning; large files don't bloat history. |
| 184 | +- `bm cloud git status` shows connected repo and last push/pull times. |
| 185 | +- `bm restore` restores a file/folder to a prior commit with a clear audit trail. |
| 186 | +- End‑to‑end works alongside SPEC‑9 bisync without loops or data loss. |
| 187 | + |
| 188 | +## Risks & Mitigations |
| 189 | +- **Loop risk (Git ↔ Bisync)**: Writes to `/app/data/` → bisync → local → user edits → back again. *Mitigation*: Debounce, commit squashing, idempotent `.bmmeta` versioning, and watch exclusion windows during pull. |
| 190 | +- **Repo bloat**: Lots of binary churn. *Mitigation*: default LFS, size threshold, optional media‑only repo later. |
| 191 | +- **Security**: Token leakage. *Mitigation*: GitHub App with short‑lived tokens, KMS storage, scoped permissions. |
| 192 | +- **Merge complexity**: Non‑trivial conflicts. *Mitigation*: prefer FF; otherwise conflict copies + events; require PR for non‑FF. |
| 193 | + |
| 194 | +## Open Questions |
| 195 | +- Do we default to **monorepo** per tenant, or offer project‑per‑repo at connect time? |
| 196 | +- Should `restore` write to a branch and open a PR, or directly modify `main`? |
| 197 | +- How do we expose Git history in UI (timeline view) without users dropping to CLI? |
| 198 | + |
| 199 | +## Appendix: Sample Config |
| 200 | +```json |
| 201 | +{ |
| 202 | + "git": { |
| 203 | + "enabled": true, |
| 204 | + "repo": "https://github.com/<owner>/<repo>.git", |
| 205 | + "autoPushInterval": "5s", |
| 206 | + "allowNonFF": false, |
| 207 | + "lfs": { "sizeThreshold": 10485760 } |
| 208 | + } |
| 209 | +} |
| 210 | +``` |