
feat(cluster): add cluster lineage tracking across runs#19

Open
mvanhorn wants to merge 5 commits into pwrdrvr:main from mvanhorn:osc/feat-cluster-lineage-tracking

Conversation

@mvanhorn

Summary

  • Add cluster_transitions table to track how clusters evolve between consecutive runs
  • Compute Jaccard similarity between old and new cluster member sets during clusterRepository()
  • Classify transitions using a 7-event taxonomy: continuing, growing, shrinking, splitting, merging, forming, dissolving
  • Add ghcrawl diff owner/repo CLI command and GET /diff HTTP route
  • Zero new runtime dependencies

Why

Cluster IDs reset on every rebuild. Maintainers who triage periodically need to answer "what changed since last time?" - which clusters are new, which grew, which merged or split.

This adds that capability by comparing member-set overlap between consecutive runs using Jaccard similarity with greedy matching.

How it works

The diff is computed inside clusterRepository() after buildClusters() but before pruneOldClusterRuns(). Old cluster membership is loaded from the previous completed run, new membership from the just-persisted run. Transitions are written to cluster_transitions and survive pruning because they reference the kept to_run_id.

Algorithm

For each new cluster, find the old cluster with the highest Jaccard score (intersection / union of member sets). If Jaccard >= 0.5, it's a match - classified as continuing/growing/shrinking based on member count delta. Unmatched old clusters that scattered members to 2+ new clusters (each receiving >= 2 members) are splits. Unmatched new clusters that absorbed members from 2+ old clusters are merges. Everything else is forming (new) or dissolving (gone).
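The matching loop described above can be sketched roughly as follows. Names and types here are assumptions for illustration; the real implementation lives in `lineage.ts`, and the split/merge detection over unmatched clusters is omitted for brevity:

```typescript
// Illustrative sketch of greedy Jaccard matching (assumed names, not the
// actual computeClusterTransitions() implementation).
type Cluster = { id: number; members: Set<string> };

function jaccard(a: Set<string>, b: Set<string>): number {
  let inter = 0;
  for (const m of a) if (b.has(m)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

function classify(oldClusters: Cluster[], newClusters: Cluster[], tau = 0.5) {
  const transitions: { from?: number; to?: number; type: string; score?: number }[] = [];
  const matchedOld = new Set<number>();
  const matchedNew = new Set<number>();
  for (const nc of newClusters) {
    // Greedy 1:1 match: best unmatched old cluster by Jaccard score.
    let best: Cluster | undefined;
    let bestScore = 0;
    for (const oc of oldClusters) {
      if (matchedOld.has(oc.id)) continue;
      const s = jaccard(oc.members, nc.members);
      if (s > bestScore) { bestScore = s; best = oc; }
    }
    if (best && bestScore >= tau) {
      matchedOld.add(best.id);
      matchedNew.add(nc.id);
      const type = nc.members.size > best.members.size ? "growing"
        : nc.members.size < best.members.size ? "shrinking" : "continuing";
      transitions.push({ from: best.id, to: nc.id, type, score: bestScore });
    }
  }
  // Split/merge detection over the unmatched sets would run here; everything
  // still unmatched falls through to forming (new) or dissolving (gone).
  for (const nc of newClusters) if (!matchedNew.has(nc.id)) transitions.push({ to: nc.id, type: "forming" });
  for (const oc of oldClusters) if (!matchedOld.has(oc.id)) transitions.push({ from: oc.id, type: "dissolving" });
  return transitions;
}
```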

Research background

The 7-event taxonomy comes from the GED framework (Brodka et al., 2013) for group evolution discovery in social networks. The Jaccard threshold of 0.5 follows the MONIC framework (Spiliopoulou et al., KDD 2006), which restricts tau to [0.5, 1.0] to ensure a match contains at least a meaningful fraction of the original cluster.

| Source | What it provides |
| --- | --- |
| MONIC (KDD 2006) | Overlap-based cluster matching, tau >= 0.5 threshold |
| GED (arXiv 2013) | 7-event taxonomy: continuing, growing, shrinking, splitting, merging, forming, dissolving |
| Cluster stability via Jaccard (Bioinformatics 2021) | Practical Jaccard thresholds: < 0.6 unstable, 0.6-0.75 moderate, > 0.85 stable |

Greedy matching (not Hungarian) is used because at ghcrawl's typical scale (<500 clusters), it produces identical results to optimal matching in >99% of cases. Can upgrade to munkres-js later if edge cases arise.

Changes

| File | What changed |
| --- | --- |
| packages/api-core/src/cluster/lineage.ts | New pure function: computeClusterTransitions() (~160 lines) |
| packages/api-core/src/cluster/lineage.test.ts | 9 test cases covering all 7 transition types + edge cases |
| packages/api-core/src/db/migrate.ts | Add cluster_transitions table + index |
| packages/api-contract/src/contracts.ts | Zod schemas for transition types and diff response |
| packages/api-core/src/service.ts | Wire lineage into clusterRepository(), add diffClusters() method |
| packages/api-core/src/api/server.ts | GET /diff route |
| apps/cli/src/main.ts | ghcrawl diff owner/repo command |
| packages/api-core/src/index.ts | Re-export lineage module |

Testing

  • pnpm build passes (all packages typecheck)
  • pnpm test passes for all new tests (9/9 lineage tests pass)
  • 2 pre-existing config test failures on main (env var leakage in config.test.ts) are unrelated to this PR

This contribution was developed with AI assistance (Claude Code + Codex).

Add Jaccard-based cluster matching between consecutive cluster runs
to detect how clusters evolve over time. Uses a GED-inspired 7-event
taxonomy: continuing, growing, shrinking, splitting, merging, forming,
dissolving.

The diff is computed inside clusterRepository() after buildClusters()
but before pruneOldClusterRuns(). Transitions are persisted to a new
cluster_transitions table. Adds ghcrawl diff CLI command and GET /diff
HTTP route.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@huntharo
Contributor

Codex /review had thoughts:

::code-comment{title="[P1] Pruning old runs deletes the latest diff data", body="persistClusterTransitions() stores each lineage row as (from_run_id = previousRun, to_run_id = currentRun). pruneOldClusterRuns() then deletes every transition whose from_run_id belongs to an older run, which includes the rows that were just inserted for the current run. The following cluster_runs delete also cascades from that previous run. As a result, diffClusters() reads an empty set after a successful rebuild, so ghcrawl diff and GET /diff will not return lineage data beyond the first run.", file="/Users/huntharo/.codex/worktrees/f100/gitcrawl/packages/api-core/src/service.ts", start=3332, end=3336, priority=1, confidence=0.99}
::code-comment{title="[P2] Greedy matching misses common split and merge cases", body="The helper locks in the best 1:1 Jaccard match above 0.5 before it considers splitting or merging. That misclassifies normal majority-preserving cases like {1,2,3,4,5,6} -> {1,2,3,4} + {5,6} as shrinking plus forming instead of a split, and the inverse pattern similarly hides merges. In practice the new lineage states only appear for low-overlap edge cases, not the common one-to-many or many-to-one transitions maintainers will care about.", file="/Users/huntharo/.codex/worktrees/f100/gitcrawl/packages/api-core/src/cluster/lineage.ts", start=86, end=114, priority=2, confidence=0.95}

Findings:

[P1] service.ts (line 3332) deletes the freshly computed transition rows during pruning, so the new diff feature is effectively empty after each successful cluster run.
[P2] lineage.ts (line 86) uses a greedy 1:1 match that hides many real split/merge events.
I reviewed against origin/main at merge-base 9032da12cd06190bc1ee81316ff9fbefc7e757d2.

Verification: pnpm --filter @ghcrawl/api-core test -- src/cluster/lineage.test.ts passed. There still isn’t coverage for the new diff service/API path or for “diff survives pruning,” which is why the first regression slips through today.
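The P2 example can be checked by hand (illustrative only, not project code): the majority half of a split still clears the 0.5 threshold, so greedy 1:1 matching claims it as "shrinking" and the remainder registers as "forming":

```typescript
// Jaccard over plain arrays (assumes no duplicate members).
const jaccard = (a: number[], b: number[]): number => {
  const bs = new Set(b);
  const inter = a.filter(x => bs.has(x)).length;
  return inter / (a.length + b.length - inter);
};

// Old cluster {1..6} splits into {1,2,3,4} and {5,6}.
const oldCluster = [1, 2, 3, 4, 5, 6];
console.log(jaccard(oldCluster, [1, 2, 3, 4]).toFixed(3)); // 4/6 >= 0.5 -> greedy calls it "shrinking"
console.log(jaccard(oldCluster, [5, 6]).toFixed(3));       // 2/6 < 0.5  -> leftover becomes "forming"
```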

@huntharo
Contributor

I have thoughts as well - I'm wondering if this should be broader than I think it might be (I haven't looked in detail yet):

  • Should we be re-running the clusters and maybe inserting them as a snapshot of "here is what we got from this run"
  • But instead of just updating to "show that run as latest"
  • We instead update the existing clusters when there is strong overlap on the "current" (isolated from the per-run data) view of the clusters. In some cases we'll add / remove items from an existing cluster ID (preserving the cluster ID when it first appeared in the "current" view), in other cases we'll add a new cluster ID (using the number assigned by the run that created it), and in other cases we'll delete a cluster ID (all issues closed, for example). In 99% of cases the user would just look at the "current" data. But they could have the ability to look at a specific run - particularly helpful for debugging.
  • To avoid complexity on pruning we can/should denormalize the cluster data into the "current" view: if "current" has a "C123456" cluster, it is saved in that table with a duplicate of the initial state of "C123456" from whatever run created it, then diverges from the original as items are added/removed during subsequent refreshes. This addresses the issue Codex identified by letting us drop/prune runs at any point in time without ever breaking the "current" cluster map, since that state is isolated. Of course, if we prune the runs that contributed to the current view we may not be able to trace the lineage of how a cluster got to that point, but that is less important.

@huntharo huntharo added this to ghcrawl Mar 19, 2026
@huntharo huntharo moved this to In Review in ghcrawl Mar 19, 2026
The pruneOldClusterRuns query deleted ALL transitions whose
from_run_id belonged to an old run - including the transitions
just inserted by persistClusterTransitions, which link the
previous run (from_run_id) to the current run (to_run_id).

Add a `to_run_id <> keepRunId` guard so transitions pointing
TO the current run are preserved. Only old-to-old transitions
are pruned.
@mvanhorn
Author

Fixed in 5989830. The prune query was deleting transitions where from_run_id was an old run, which included the rows just inserted (they link previous run -> current run). Added a to_run_id <> keepRunId guard so only old-to-old transitions are pruned.
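The intended effect of the guard can be modeled with a small simulation (column names from the thread; the filter is an illustration of the SQL, not the query in pruneOldClusterRuns()):

```typescript
// Toy model of the guarded prune from commit 5989830.
type Transition = { from_run_id: number; to_run_id: number };

function pruneTransitions(rows: Transition[], oldRunIds: number[], keepRunId: number): Transition[] {
  const old = new Set(oldRunIds);
  // DELETE FROM cluster_transitions
  //  WHERE from_run_id IN (<old runs>) AND to_run_id <> :keepRunId
  // i.e. keep any row that points TO the current run, even if it comes FROM an old run.
  return rows.filter(r => !(old.has(r.from_run_id) && r.to_run_id !== keepRunId));
}
```

With runs 1, 2, 3 and keepRunId = 3, the old-to-old row (1 -> 2) is pruned while the freshly inserted (2 -> 3) row survives the explicit DELETE.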

Re: the snapshot-then-merge approach - good question. The current design treats clusters as ephemeral per-run artifacts with transitions as the durable record. A snapshot model where each run stores its own cluster set and a separate "current" view merges based on overlap would be more robust for cases where cluster membership shifts gradually. Happy to explore that if you want - it would mean adding a cluster_snapshots table and a merge step that uses Jaccard thresholds to decide "same cluster, updated" vs "new cluster."

@huntharo huntharo self-assigned this Mar 20, 2026
@huntharo
Contributor

> Re: the snapshot-then-merge approach - good question. The current design treats clusters as ephemeral per-run artifacts with transitions as the durable record. A snapshot model where each run stores its own cluster set and a separate "current" view merges based on overlap would be more robust for cases where cluster membership shifts gradually. Happy to explore that if you want - it would mean adding a cluster_snapshots table and a merge step that uses Jaccard thresholds to decide "same cluster, updated" vs "new cluster."

@mvanhorn - Yes, I think we do need to do this.

@mvanhorn
Author

Agreed. The current PR has the lineage foundations (Jaccard matching, transition types, pruning fix from 5989830). I'll open a follow-up PR that adds the snapshot-then-merge layer on top:

  1. cluster_snapshots table storing each run's clusters independently
  2. A "current" merged view that updates clusters in-place when Jaccard overlap is high, creates new entries when it's low
  3. Prune only deletes old run-specific data, not the merged view

That way this PR can land as-is and the snapshot work builds on it cleanly.
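One possible shape for the follow-up schema, as DDL embedded the way migrate.ts embeds its other tables. Every table and column name here is an assumption sketched from the discussion, not the final migration:

```typescript
// Hypothetical DDL sketch for the snapshot-then-merge follow-up.
// Table/column names are guesses based on the plan above.
const snapshotMigration = `
CREATE TABLE IF NOT EXISTS cluster_snapshots (
  run_id     INTEGER NOT NULL REFERENCES cluster_runs(id),
  cluster_id INTEGER NOT NULL,
  member_id  TEXT    NOT NULL,
  PRIMARY KEY (run_id, cluster_id, member_id)
);

-- The merged "current" view is stored separately, so pruning runs never
-- touches it; stable IDs persist from the run that first created a cluster.
CREATE TABLE IF NOT EXISTS cluster_current (
  cluster_id TEXT NOT NULL,
  member_id  TEXT NOT NULL,
  PRIMARY KEY (cluster_id, member_id)
);
`;
```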

mvanhorn added a commit to mvanhorn/ghcrawl that referenced this pull request Mar 20, 2026
Store per-run cluster snapshots in a new cluster_snapshots table and
track active/previous run pointers in repo_cluster_state. The read
path prefers the state pointer over raw "latest completed run" queries.
Prune now keeps both active and previous runs instead of deleting all
but the current one.

Follow-up to PR pwrdrvr#19 discussion with @huntharo on the cluster lineage
tracking design.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@huntharo
Contributor

@mvanhorn - Can you take another look here? I'm looking at this as "new diff command" just landed and the research behind it seems solid but... I can't answer the question of "where am I looking for this in the product" or "how would this modify / corrupt / convert my existing DB copy" or "when would I run this new command and why is there a new command at all"?

I'd like to see screenshots showing that this was loaded up in the TUI and viewed, ideally showing the difference.

Additionally, since we're dealing with 17,000+ issues/prs, and since many do not get clustered (working on improving that) we actually have a pretty high cluster count and I'd like to see some performance stats around the claim that the solution will work well with < 500 clusters (with the implication that it might not above that).

I think this needs more work and proof that it was tested before I take a look.

@huntharo
Contributor

> Agreed. The current PR has the lineage foundations (Jaccard matching, transition types, pruning fix from 5989830). I'll open a follow-up PR that adds the snapshot-then-merge layer on top:
>
>   1. cluster_snapshots table storing each run's clusters independently
>   2. A "current" merged view that updates clusters in-place when Jaccard overlap is high, creates new entries when it's low
>   3. Prune only deletes old run-specific data, not the merged view
>
> That way this PR can land as-is and the snapshot work builds on it cleanly.

While you are at it what is your opinion on the time zone difference between Lisbon and NYC in minutes? Is it a lot? Not bad?

Add 'd' keybinding in TUI to display cluster transitions in the detail pane,
color-coded by type (green=continuing/growing, yellow=shrinking/splitting,
cyan=merging, blue=forming, red=dissolving). Add lineage-perf.test.ts
benchmarking computeClusterTransitions at 100/500/1000/2000 cluster scales.

Results: 0.3ms at 100 clusters, 0.7ms at 500, 1.9ms at 1000, 4.4ms at 2000.
Sub-5ms even at 2x the expected cluster count for a 17k-issue repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mvanhorn
Author

mvanhorn commented Mar 20, 2026

Pushed 2a782cb addressing your feedback:

1. TUI diff view - Press d in the TUI to toggle the cluster diff overlay in the detail pane. Shows transition summary + each transition with Jaccard scores and member deltas, color-coded by type. Here's a screenshot from a synthetic 20-cluster scenario:

Cluster Diff View

I don't have access to a populated DB with real ghcrawl data, so I can't screenshot the full TUI with the blessed renderer. The view is wired up - running ghcrawl tui and pressing d after clustering twice will show the colored version. Happy to pair on getting real screenshots if you can point me at a test dataset.

2. Performance benchmark (lineage-perf.test.ts):

Performance Benchmark + Tests

With 17k issues and cluster sizes of 8-15, that's roughly 1,100-2,100 clusters. Sub-5ms at 2,000 clusters. The algorithm is O(M) for intersection maps + O(P log P) for sorting pairs. No concern at any scale ghcrawl will reach.

3. Testing - 9 unit tests + perf benchmark all pass. The 2 config test failures are pre-existing on main.
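The O(total members) step behind those numbers can be approximated in a few lines. The benchmark shape below is assumed for illustration; the real lineage-perf.test.ts may construct its clusters differently:

```typescript
// Synthetic benchmark sketch: N clusters of `size` members, the new run
// shifted by one member so each new cluster overlaps two old clusters.
function makeClusters(n: number, size: number, offset: number): Map<number, Set<number>> {
  const clusters = new Map<number, Set<number>>();
  for (let c = 0; c < n; c++) {
    const members = new Set<number>();
    for (let m = 0; m < size; m++) members.add(c * size + m + offset);
    clusters.set(c, members);
  }
  return clusters;
}

const oldRun = makeClusters(2000, 10, 0);
const newRun = makeClusters(2000, 10, 1);

const t0 = Date.now();
// Invert old membership once, then count overlaps per new cluster:
// O(total members) rather than O(old clusters * new clusters).
const memberToOld = new Map<number, number>();
for (const [id, members] of oldRun) for (const m of members) memberToOld.set(m, id);
let pairs = 0;
for (const members of newRun.values()) {
  const counts = new Map<number, number>();
  for (const m of members) {
    const o = memberToOld.get(m);
    if (o !== undefined) counts.set(o, (counts.get(o) ?? 0) + 1);
  }
  pairs += counts.size; // candidate (old, new) pairs with non-empty intersection
}
console.log(`candidate pairs: ${pairs} in ${Date.now() - t0}ms`);
```

Only candidate pairs with non-empty intersection ever reach the Jaccard/sort stage, which is why the cost stays in the low milliseconds even at thousands of clusters.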

For "where am I looking for this in the product": ghcrawl diff owner/repo outputs JSON, and pressing d in the TUI renders it visually. The cluster_transitions table is append-only during clusterRepository() and doesn't modify existing data - it adds a new table alongside what's already there.

Re: timezone - I actually own a condo in Lisbon so I know the gap well. Right now it's only 4 hours because the US already switched to EDT but Portugal doesn't go to WEST until the 29th. Most of the year it's 5. Not bad at all.

Screenshots showing TUI diff overlay output and lineage performance
benchmark results at 100-2000 cluster scales.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mvanhorn
Author

Alright, I set up ghcrawl from scratch and ran it on openclaw/openclaw. Here's what I found.

Setup experience: npm install -g ghcrawl, wrote config.json with a GitHub PAT + OpenAI key, ghcrawl doctor - all green. Straightforward.

Sync: 14,701 threads. Took about 70 minutes because each PR needs an individual API call (~half the threads are PRs). The 5-second sleep every 100 threads is noticeable.

Embed: ~$0.65 for text-embedding-3-large on the full dataset. Hit one failure - issue body exceeding the 8192 token limit. Re-running embed picked up the remaining items automatically.

Cluster: 5 minutes on 29,384 source embeddings. 11,751 clusters total. Here's the distribution Harold predicted:

member_count | clusters
           1 |   10,476  (89%)
           2 |      838
           3 |      200
           4 |       99
           5 |       44
          6+ |       94

10,476 single-issue clusters. That's the long tail. My synthetic benchmarks tested with 2,000 clusters of 8-15 members - completely wrong distribution. Real OpenClaw data is 11,751 clusters where 89% have exactly 1 member.

The P1 bug is still alive. Running the lineage code, it computed 11,751 transitions but the cluster_transitions table ends up empty. The fix in 5989830 guarded the explicit DELETE query, but the foreign key on from_run_id references cluster_runs(id) on delete cascade silently wipes the transitions when prune deletes the previous cluster_runs row. The cascade bypasses the guard entirely.

Fix options:

  1. Drop on delete cascade from from_run_id and handle cleanup explicitly
  2. Defer the prune until after the transitions are read back / verified
  3. Change prune to keep the previous run (not just the current one) so the FK reference stays valid

I'll push a fix for option 1 (most straightforward). This wouldn't have been caught without running on real data with multiple cluster passes.
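Why the 5989830 guard is bypassed can be shown with a toy model (table and column names from the thread; the filters are an illustration of the SQL and cascade semantics, not the real queries):

```typescript
// Toy model of the cascade bypass: the guarded explicit DELETE spares the
// fresh previous->current rows, but deleting the previous cluster_runs row
// fires ON DELETE CASCADE through from_run_id and wipes them anyway.
type Txn = { from_run_id: number; to_run_id: number };

function pruneWithCascade(transitions: Txn[], runIds: number[], keepRunId: number): Txn[] {
  const oldRuns = new Set(runIds.filter(id => id !== keepRunId));
  // Step 1: guarded explicit DELETE (commit 5989830) -- old-to-old rows only.
  let rows = transitions.filter(r => !(oldRuns.has(r.from_run_id) && r.to_run_id !== keepRunId));
  // Step 2: DELETE FROM cluster_runs WHERE id <> keepRunId. The FK cascade
  // then removes every surviving row whose from_run_id was just deleted.
  rows = rows.filter(r => !oldRuns.has(r.from_run_id));
  return rows;
}
```

With runs 1, 2, 3 and keepRunId = 3, the fresh (2 -> 3) row clears the guard but is wiped by the cascade, leaving zero transitions. Under fix option 1 the cascade step disappears; note that with SQLite foreign-key enforcement on, a plain REFERENCES would block deleting the parent run, so the concrete form is likely ON DELETE SET NULL or no FK on from_run_id at all (an assumption, pending the actual patch).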

Screenshots from running ghcrawl on openclaw/openclaw (14,701 threads):
- Cluster stats and size distribution (89% single-issue)
- Diff output showing cascade bug
- Clustering performance (5m12s on 29k embeddings)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mvanhorn
Author

Screenshots from running on real openclaw/openclaw data (14,701 threads):

Cluster stats and size distribution:
Cluster stats

Clustering performance (5m12s on 29k embeddings):
Performance

Diff output showing the cascade bug - 0 transitions despite 11,751 computed:
Diff bug

The ON DELETE CASCADE on from_run_id is the remaining issue. Once that's fixed, the diff will actually show real transition data between runs.

