ghcrawl is a local-first GitHub issue and pull request crawler for maintainers. It ingests repository discussion state into local storage, enriches it with LLM summaries and embeddings, and surfaces similarity clusters so maintainers can see which PRs and issues are really about the same problem area.
The target user is a maintainer running the tool locally. V1 does not need hosted deployment, multi-user auth, or cloud infrastructure.
Use discrawl as the main product pattern:
- local-first
- deterministic CLI entry points
- explicit `init`/`doctor`/`sync`-style commands
- SQLite as the canonical local store
- optional higher-level search on top of the local store
Use jeerreview for the JavaScript/TypeScript app pattern:
- `.env.local` loaded explicitly with `dotenv`
- `GITHUB_TOKEN` for GitHub API auth
- `OPENAI_API_KEY` for OpenAI auth
- small local HTTP API
- small React UI for browsing results
Use dupcanon selectively for:
- persisted run history
- auditable similarity edges
- deterministic connected-component clustering
Do not copy dupcanon's Postgres-first runtime, close-planning workflow, or approval flow.
- Replicate the operational feel of discrawl: local setup, local data, clear subcommands, no hosted dependency.
- Support GitHub API ingestion for issues, PRs, comments, reviews, review comments, labels, assignees, and timeline metadata.
- Support OpenAI-backed summarization and embeddings.
- Evaluate and support local vector search, with a clean path to Dockerized OpenSearch 3.3.
- Produce useful clusters of similar issues and PRs, even when they are not exact duplicates.
- Stay project-agnostic. OpenClaw is the first target, not the only target.
- No write-back to GitHub in V1.
- No SaaS deployment story in V1.
- No dependency on OpenSearch for first boot.
- No requirement that clusters be mathematically perfect. They only need to be operationally useful.
- Package manager: `pnpm`
- Monorepo layout: `packages/api-core`, `packages/api-contract`, `apps/cli`, and `apps/web` as a deferred placeholder
- Runtime: Node.js + TypeScript
- CLI: single `ghcrawl` command with subcommands, following the discrawl UX pattern
- Local DB: SQLite
- API server: local HTTP server mounted in-process by the CLI
- UI: future React + Vite app using `shadcn/ui` primitives with a custom visual system
- LLM provider: OpenAI
- Vector backends:
- baseline: exact cosine search in-process over vectors stored in SQLite
- optional: OpenSearch 3.3 in Docker for ANN / filtered kNN
For the current corpus size, ghcrawl should use exact local similarity only.
- store embeddings in SQLite
- load embeddings for the active repository into process memory
- compute cosine similarity directly in Node
- do not require Docker, OpenSearch, Lucene, or Faiss for normal local use
Reasoning:
- a few thousand summarized threads is small enough for exact search
- this avoids JVM or native vector-service operational overhead on modest machines
- the TypeScript/Node stack stays simpler and easier to debug
- we can defer service-backed ANN until there is real evidence that latency or filtering needs it
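The exact in-process path described above can be sketched in a few lines of TypeScript. `Doc` and `topK` are illustrative names for this sketch, not an existing module:

```typescript
// Exact cosine kNN over in-memory vectors, as loaded from SQLite.
// Doc and topK are illustrative names, not a finalized API.
type Doc = { id: number; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Brute-force scan: fine for a few thousand 1536-dimension vectors.
function topK(
  query: number[],
  docs: Doc[],
  k: number
): { id: number; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosine(query, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A full scan over a few thousand vectors of this size completes in milliseconds in Node, which is the concrete reason service-backed ANN can be deferred.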
discrawl is the operational pattern, not the language mandate. For this project:
- jeerreview already provides working GitHub and OpenAI access patterns in TypeScript.
- OpenAI SDK support is straightforward in TypeScript.
- React UI integration is simpler in a single Node/TS workspace.
- The corpus size is small enough that Go is not required for performance in V1.
Primary interface should feel like discrawl:
ghcrawl init
ghcrawl doctor
ghcrawl sync --owner openclaw --repo openclaw
ghcrawl summarize --since 30d
ghcrawl embed --since 30d
ghcrawl cluster --open-only
ghcrawl search "download stalls on large files"
ghcrawl serve

Recommended initial commands:
- `init`: write config and local paths
- `doctor`: verify env, GitHub auth, OpenAI auth, DB, and optional OpenSearch reachability
- `sync`: fetch repository data into SQLite
- `summarize`: generate or refresh thread summaries
- `embed`: generate embeddings for summary documents
- `cluster`: build or refresh similarity clusters
- `search`: keyword + semantic search over local data
- `serve`: start the local HTTP API for inspection and future UI consumption
V1 does not run a permanent daemon.
- `apps/cli` is the only supported runtime entrypoint
- the CLI calls `packages/api-core` directly for command execution
- `ghcrawl serve` mounts the same core services behind a local HTTP API
- future web code must talk to that HTTP API through `packages/api-contract`
- browser code must never access SQLite, GitHub, or OpenAI directly
Use explicit local config plus env vars.
Environment variables:
- `GITHUB_TOKEN`
- `OPENAI_API_KEY`
- `GHCRAWL_DB_PATH` with default `data/ghcrawl.db`
- `GHCRAWL_API_PORT` with default `5179`
- `GHCRAWL_SUMMARY_MODEL` with default `gpt-5-mini`
- `GHCRAWL_EMBED_MODEL` with default `text-embedding-3-small`
- `GHCRAWL_OPENSEARCH_URL` (optional)
- `GHCRAWL_OPENSEARCH_INDEX` (optional)
Local config file:
- repo-local `.env.local` for secrets and dev defaults
- optional user config later if we want a discrawl-style persisted runtime config
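The env-plus-defaults resolution above can be sketched as a single loader, assuming `.env.local` has already been loaded into `process.env` (e.g. by `dotenv`, as in jeerreview). `loadConfig` and the interface are hypothetical names for illustration:

```typescript
// Resolve ghcrawl config from environment variables with documented defaults.
// In the real app, dotenv would populate process.env from .env.local first.
interface GhcrawlConfig {
  githubToken: string | undefined;
  openaiApiKey: string | undefined;
  dbPath: string;
  apiPort: number;
  summaryModel: string;
  embedModel: string;
  opensearchUrl?: string;
  opensearchIndex?: string;
}

function loadConfig(
  env: Record<string, string | undefined> = process.env
): GhcrawlConfig {
  return {
    githubToken: env.GITHUB_TOKEN,
    openaiApiKey: env.OPENAI_API_KEY,
    dbPath: env.GHCRAWL_DB_PATH ?? "data/ghcrawl.db",
    apiPort: Number(env.GHCRAWL_API_PORT ?? 5179),
    summaryModel: env.GHCRAWL_SUMMARY_MODEL ?? "gpt-5-mini",
    embedModel: env.GHCRAWL_EMBED_MODEL ?? "text-embedding-3-small",
    opensearchUrl: env.GHCRAWL_OPENSEARCH_URL,
    opensearchIndex: env.GHCRAWL_OPENSEARCH_INDEX,
  };
}
```

Taking the env record as a parameter keeps the loader testable without mutating real process state.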
Owns:
- SQLite access
- GitHub API access
- OpenAI access
- crawl / summarize / embed / search / cluster services
- HTTP route handlers
Owns:
- request/response schemas
- shared DTOs
- typed HTTP client
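One way the DTO-plus-typed-client pairing could look in `packages/api-contract`; the field names and route are assumptions, not a finalized contract:

```typescript
// Shared DTO: what the local API returns for a cluster.
// Field names here are illustrative, not a finalized contract.
interface ClusterDto {
  id: number;
  representativeTitle: string;
  memberCount: number;
}

// Thin typed client over fetch. Browser code goes through this client
// only; it never touches SQLite, GitHub, or OpenAI directly.
class GhcrawlClient {
  constructor(private baseUrl: string) {}

  async listClusters(): Promise<ClusterDto[]> {
    const res = await fetch(`${this.baseUrl}/clusters`);
    if (!res.ok) throw new Error(`GET /clusters failed: ${res.status}`);
    return (await res.json()) as ClusterDto[];
  }
}
```

In the real package both names would be exported so `apps/cli` and the future web app share one contract.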
Owns:
- CLI command parsing
- local process lifecycle
- optional in-process HTTP hosting
Owns later:
- Vite frontend
- `shadcn/ui`-based component layer
- HTTP-only integration through `api-contract`
The system must support multiple target repositories over time, even if V1 is usually run against one.
Core entities:
- `repositories`
- `issues`
- `pull_requests`
- `issue_comments`
- `reviews`
- `review_comments`
- `timeline_events`
- `documents`
- `document_summaries`
- `document_embeddings`
- `similarity_edges`
- `clusters`
- `cluster_members`
- `sync_runs`
- `embedding_runs`
- `clustering_runs`
Do not embed raw GitHub payloads directly. Build one canonical search document per issue or PR thread.
Document inputs:
- title
- body
- non-bot issue comments
- non-bot review summaries
- non-bot review comments
- selected timeline facts like closed / reopened / merged if useful
- selected metadata like labels and affected paths for PRs
Normalization rules:
- skip bot-authored review comments and routine automation chatter
- preserve author, timestamps, labels, state, and links in structured columns
- keep raw JSON separately for traceability
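The document-building and bot-filtering rules above might be sketched like this; the `Comment` shape and helper names are assumptions, and the login-suffix heuristic follows GitHub's common `[bot]` naming convention:

```typescript
// Build one canonical search document per thread from normalized pieces.
// Types and the bot heuristic are illustrative assumptions.
interface Comment {
  authorLogin: string;
  authorType: string; // GitHub marks app-authored comments with type "Bot"
  body: string;
}

function isBot(c: Comment): boolean {
  // App accounts typically have logins like "renovate[bot]".
  return c.authorType === "Bot" || c.authorLogin.endsWith("[bot]");
}

function buildDocumentText(
  title: string,
  body: string,
  comments: Comment[]
): string {
  const humanComments = comments.filter((c) => !isBot(c)).map((c) => c.body);
  return [title, body, ...humanComments]
    .filter((s) => s.trim().length > 0)
    .join("\n\n");
}
```

Author, timestamps, and labels stay in structured columns per the rules above; only the prose goes into the canonical document.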
The proposed flow is correct: summarize first, embed second.
Recommended summary artifacts per thread:
- `problem_summary`: what the author says is wrong or needed
- `solution_summary`: what the PR changes, if applicable
- `maintainer_signal_summary`: what reviewers or commenters are worried about
- `dedupe_summary`: a compact, embedding-oriented summary optimized for semantic similarity
Why this split helps:
- cluster quality improves when embeddings are fed stable, compressed language
- token cost stays bounded
- later search and UI can still show a human-readable explanation
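The four artifacts map naturally onto one stored record per thread. This interface is a sketch, not a fixed schema, and the length guard is an illustrative bound:

```typescript
// One summary record per issue/PR thread; persisted in document_summaries.
// Field names mirror the artifact list above; the limit is illustrative.
interface ThreadSummary {
  documentId: number;
  problemSummary: string;
  solutionSummary: string | null; // null for issues without an attached PR
  maintainerSignalSummary: string;
  dedupeSummary: string; // the only field that gets embedded
}

// Guard against over-long dedupe summaries so embedding cost stays bounded.
function isEmbeddable(s: ThreadSummary, maxChars = 2000): boolean {
  return s.dedupeSummary.trim().length > 0 && s.dedupeSummary.length <= maxChars;
}
```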
Use the GitHub REST API first, reusing the jeerreview auth/header pattern:
- bearer token from `GITHUB_TOKEN`
- `Accept: application/vnd.github+json`
- `X-GitHub-Api-Version: 2022-11-28`
- explicit user agent
Fetch in pages and store cursors/checkpoints locally.
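Combined with page-based fetching, the header pattern might look like this; `listIssuePage` is a hypothetical helper (real code would also honor rate-limit headers and persist a checkpoint per endpoint):

```typescript
// Paged GitHub REST fetch using the documented auth and version headers.
// listIssuePage is an illustrative helper, not an existing jeerreview export.
const GITHUB_API = "https://api.github.com";

function githubHeaders(token: string): Record<string, string> {
  return {
    Authorization: `Bearer ${token}`,
    Accept: "application/vnd.github+json",
    "X-GitHub-Api-Version": "2022-11-28",
    "User-Agent": "ghcrawl",
  };
}

async function listIssuePage(
  owner: string,
  repo: string,
  page: number,
  token: string
): Promise<unknown[]> {
  const url = `${GITHUB_API}/repos/${owner}/${repo}/issues?state=open&per_page=100&page=${page}`;
  const res = await fetch(url, { headers: githubHeaders(token) });
  if (!res.ok) throw new Error(`GitHub ${res.status} for ${url}`);
  return (await res.json()) as unknown[];
}
```

An empty page signals the end of the listing, which is the natural point to write the checkpoint.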
Initial sync scope:
- repository metadata
- open issues
- open PRs
- recent closed issues and PRs
- comments, reviews, review comments
- timeline metadata where available
Recommended sync behavior:
- `sync --full`: backfill everything practical
- `sync --since`: incremental refresh
- idempotent upserts
- per-endpoint rate limit handling and retry with backoff
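The retry-with-backoff behavior could be a small generic wrapper around each endpoint call; the attempt count and delay schedule below are illustrative defaults, not tuned values:

```typescript
// Retry an async request with exponential backoff.
// Attempts and base delay are illustrative defaults, not tuned values.
async function withBackoff<T>(
  fn: () => Promise<T>,
  attempts = 4,
  baseMs = 500
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        // 500ms, 1s, 2s, ... doubling each retry
        await new Promise((r) => setTimeout(r, baseMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}
```

A real implementation would also read GitHub's rate-limit response headers and sleep until the reset time rather than retrying blindly.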
Use OpenAI for two distinct jobs:
- summarization
- embeddings
Default models:
- summarization: `gpt-5-mini`
- embeddings: `text-embedding-3-small`
Relevant official constraints to design around:
- OpenAI embeddings support batching and a `dimensions` parameter, with `text-embedding-3-small` defaulting to 1536 dimensions.
- The embeddings API enforces per-input and per-request token limits, so batching should be token-aware rather than count-only.
Implementation:
- store vectors in SQLite
- load vectors for the working repo into memory
- compute cosine similarity directly in process
Pros:
- simplest
- no extra service
- exact results
- enough for a few thousand documents
Cons:
- slower if corpus grows substantially
- fewer advanced filtering / ranking options
Recommendation:
- start here first
- keep this as the default until measured performance proves otherwise
Implementation:
- run local Docker OpenSearch
- index one document per issue/PR thread with metadata filters
- use `knn_vector` fields
- use Lucene HNSW as the first ANN backend
Pros:
- good fit for smaller deployments
- filtering during search is strong
- easier operational story than Faiss for this scale
Cons:
- adds Docker dependency
- approximate rather than exact unless configured otherwise
Recommendation:
- first optional vector backend
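If this optional backend is ever enabled, the index body would follow OpenSearch's documented `knn_vector` mapping with the Lucene engine. This object is a sketch of that shape; the field names are assumptions, and 1536 matches `text-embedding-3-small`'s default dimensions:

```typescript
// Candidate OpenSearch index body for thread documents (sketch only).
// Field names are illustrative; 1536 matches text-embedding-3-small defaults.
const threadIndexBody = {
  settings: { index: { knn: true } },
  mappings: {
    properties: {
      dedupe_embedding: {
        type: "knn_vector",
        dimension: 1536,
        method: { name: "hnsw", engine: "lucene", space_type: "cosinesimil" },
      },
      repo: { type: "keyword" },
      state: { type: "keyword" },
      labels: { type: "keyword" },
      title: { type: "text" },
    },
  },
};
```

The `keyword` fields are what make filtered kNN (e.g. open items in one repo) work during search rather than after it.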
Implementation:
- same indexing model, but use Faiss-backed HNSW or IVF
Pros:
- better indexing throughput
- better scale path if the corpus or chunk count grows sharply
Cons:
- more tuning surface
- IVF requires training
- benefits are unlikely to matter at current scale
Recommendation:
- defer until Lucene is proven insufficient
Phase recommendation:
- exact cosine similarity over SQLite-backed vectors
- optional OpenSearch 3.3 Lucene/HNSW backend
- evaluate Faiss only if query latency, filtering, or scale justify it
This is the right trade for the expected corpus size. A few thousand summarized threads is small enough that exact local similarity is cheap and easier to debug.
Current execution decision:
- exact local kNN is the only planned default path right now
- OpenSearch is explicitly deferred
- Lucene and Faiss are not implementation targets unless the local exact path proves insufficient
The clustering problem is operational, not academic. We need clusters that help a maintainer say, "these all belong to the same problem area."
Recommended first-pass algorithm:
- Build one dedupe summary and one embedding per issue/PR thread.
- For each active thread, fetch the top `k` nearest neighbors.
- Keep edges above a similarity threshold.
- Add metadata boosts:
- same labels
- overlapping touched paths for PRs
- shared title keywords
- same error strings or stack fragments
- Build connected components or union-find groups from accepted edges.
- Compute cluster centroid and representative thread.
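The connected-component step above reduces to union-find over accepted edges. This is a standard sketch with illustrative types; edge acceptance (thresholds, metadata boosts) happens before this point:

```typescript
// Deterministic connected-component clustering from accepted similarity edges.
// Edge acceptance (thresholds, boosts) is assumed to have happened already.
type Edge = { a: number; b: number };

function clusterByUnionFind(
  ids: number[],
  edges: Edge[]
): Map<number, number[]> {
  const parent = new Map<number, number>(ids.map((id) => [id, id]));

  const find = (x: number): number => {
    let root = x;
    while (parent.get(root)! !== root) root = parent.get(root)!;
    parent.set(x, root); // partial path compression
    return root;
  };

  for (const { a, b } of edges) {
    const ra = find(a);
    const rb = find(b);
    // Smaller id becomes the root: makes cluster identity deterministic.
    if (ra !== rb) parent.set(Math.max(ra, rb), Math.min(ra, rb));
  }

  const clusters = new Map<number, number[]>();
  for (const id of ids) {
    const root = find(id);
    if (!clusters.has(root)) clusters.set(root, []);
    clusters.get(root)!.push(id);
  }
  return clusters;
}
```

Keying each cluster by its smallest member id keeps reruns stable, which matters for the persisted `clusters` and `cluster_members` tables.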
Recommended defaults:
- compare within the same repository first
- support issue-to-PR and PR-to-PR edges
- use stricter thresholds for cross-type matches if needed
- keep edge explanations so users can see why two items matched
Search should be hybrid:
- keyword search over SQLite FTS
- semantic search over embeddings
- cluster-aware result grouping
This lets maintainers find either:
- exact phrases and stack traces
- semantically similar discussions
- broader groups of related work
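One simple way to merge the keyword and semantic result lists is reciprocal rank fusion; `rrfMerge` is an illustrative name, and `k = 60` is the conventionally used RRF smoothing constant:

```typescript
// Merge keyword (FTS) and semantic (embedding) result lists with
// reciprocal rank fusion. k = 60 is the conventional smoothing constant.
function rrfMerge(
  keywordIds: number[],
  semanticIds: number[],
  k = 60
): number[] {
  const scores = new Map<number, number>();
  for (const list of [keywordIds, semanticIds]) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Items appearing in both lists float to the top, which matches the intent: an exact phrase hit that is also semantically close is the strongest signal.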
Use a small local API plus React UI, following the jeerreview pattern.
Primary UI views:
- repository overview
- sync / health status
- issue/PR list with filters
- document detail
- cluster list
- cluster detail with issues and PRs mixed together
- search results with keyword and semantic tabs
The first UI can be intentionally plain, but it is explicitly deferred until the last phase. The important part of the current design is inspectability and future compatibility:
- show raw source excerpts
- show summaries
- show nearest neighbors with scores
- show cluster membership and rationale
Recommended initial layout:
ghcrawl/
  packages/
    api-core/
      src/
        api/
        cluster/
        db/
        documents/
        github/
        openai/
        search/
    api-contract/
      src/
  apps/
    cli/
      src/
    web/
      src/
Testing must prove the local pipeline works end to end.
Unit tests:
- GitHub payload normalization
- bot-comment filtering
- summary prompt output parsing
- cosine similarity scoring
- cluster graph construction
Integration tests:
- SQLite migrations
- GitHub pagination and checkpoint resume
- summarization and embedding job orchestration with mocked providers
- OpenSearch indexing and query behavior behind an interface
Smoke tests:
- GitHub auth with real token
- OpenAI auth with real key
- optional OpenSearch local connectivity
Golden tests:
- clustering on a fixed fixture corpus
- hybrid search ranking on known examples
- GitHub timeline data can be uneven. Mitigation: treat raw issue/PR bodies and comments as the primary truth.
- Bot noise can drown similarity. Mitigation: aggressive author filtering and normalization.
- Summaries can over-compress. Mitigation: keep raw source excerpts and allow re-embedding from adjusted prompts.
- OpenSearch can add unnecessary complexity. Mitigation: make it optional and keep SQLite exact search as the baseline.
- JVM or native vector backends can overcomplicate local setup on low-memory machines. Mitigation: keep exact local search as the primary path and postpone service-backed ANN.
- Cluster thresholds will need tuning. Mitigation: persist neighbor edges and inspect false positives directly in the UI.
Build V1 in this order:
- TypeScript workspace scaffold
- GitHub sync into SQLite
- summary generation
- exact vector search in process
- clustering
- API + UI
- optional OpenSearch backend
That gets to useful maintainer value fastest while keeping the architecture clean enough to scale later.