@prosdev prosdev commented Nov 24, 2025

Fixes #36

Problem

The GitHub indexer kept its data in an in-memory Map, so everything it indexed was lost as soon as the CLI process exited. Users had to re-index before every search, which broke the dogfooding workflow.

# Before (BUG)
$ dev gh index
✔ GitHub data indexed!

$ dev gh search "MCP"
⚠ GitHub data not indexed  # Data lost!

Solution

Migrated GitHub indexer to use persistent storage:

  1. Vector Storage - Store documents in LanceDB for semantic search
  2. State File - Persist metadata in .dev-agent/github-state.json
  3. Auto-Update - Background updates when data is stale (default: 15 min)
  4. Incremental Updates - Support for since parameter (future enhancement)
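
The staleness check behind the auto-update step could look roughly like the sketch below. It is an illustration only, not the indexer's actual code: the function signature and the constant name are assumptions, and only the 15-minute default and the ISO-8601 lastIndexed timestamp come from this PR.

```typescript
// Hypothetical sketch of a staleness check; names are illustrative.
const DEFAULT_STALE_AFTER_MS = 15 * 60 * 1000; // 15 minutes, the default stated in this PR

function isStale(
  lastIndexed: string | undefined,
  now: Date = new Date(),
  staleAfterMs: number = DEFAULT_STALE_AFTER_MS,
): boolean {
  // No timestamp means the data was never indexed.
  if (!lastIndexed) return true;

  const indexedAt = Date.parse(lastIndexed);
  // An unparseable timestamp is treated as stale so a re-index can repair it.
  if (isNaN(indexedAt)) return true;

  return now.getTime() - indexedAt > staleAfterMs;
}
```

Passing `now` as a parameter keeps the check deterministic in tests; a caller would normally rely on the default.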

Changes

Core Indexer (packages/subagents/src/github/indexer.ts)

  • Added VectorStorage integration for persistent document storage
  • Added state file persistence (version, lastIndexed, stats)
  • Implemented isStale() check for auto-updates
  • Updated search to use vector storage semantic search
  • Added initialize() and close() lifecycle methods

Configuration (packages/subagents/src/github/types.ts)

  • Added GitHubIndexerConfig interface
  • Added GitHubIndexerState interface for persistence
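
For orientation, the two interfaces might be shaped roughly as follows. The state fields mirror the state file shown under Verification; the config field names (vectorStorePath, statePath, autoUpdate, staleThresholdMs) and the countsAreConsistent helper are assumptions made for illustration, and the real definitions in types.ts may differ.

```typescript
// Hypothetical shapes inferred from the PR description, not copied from types.ts.
interface GitHubIndexerState {
  version: string;
  repository: string;
  lastIndexed: string; // ISO-8601 timestamp
  totalDocuments: number;
  byType: Record<string, number>;
  byState: Record<string, number>;
}

interface GitHubIndexerConfig {
  vectorStorePath: string; // assumed name: where the vector data lives
  statePath: string; // assumed name: where the state file lives
  autoUpdate?: boolean; // assumed name
  staleThresholdMs?: number; // assumed name
}

// Illustrative helper: both breakdowns should add up to the document total.
function countsAreConsistent(state: GitHubIndexerState): boolean {
  const sum = (counts: Record<string, number>): number =>
    Object.keys(counts).reduce((total, key) => total + (counts[key] ?? 0), 0);
  return sum(state.byType) === state.totalDocuments && sum(state.byState) === state.totalDocuments;
}

// Values taken from the state file in this PR.
const exampleState: GitHubIndexerState = {
  version: "1.0.0",
  repository: "lytics/dev-agent",
  lastIndexed: "2025-11-24T13:35:09.975Z",
  totalDocuments: 36,
  byType: { issue: 23, pull_request: 13 },
  byState: { open: 11, closed: 12, merged: 13 },
};

// Illustrative config; the vector path is a placeholder, not the real location.
const exampleConfig: GitHubIndexerConfig = {
  vectorStorePath: ".vectors-github",
  statePath: ".dev-agent/github-state.json",
  autoUpdate: true,
  staleThresholdMs: 15 * 60 * 1000,
};
```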

Agent (packages/subagents/src/github/agent.ts)

  • Updated GitHubAgentConfig to pass vector storage paths
  • Removed dependency on RepositoryIndexer (was unused)
  • Called initialize() during agent startup

CLI (packages/cli/src/commands/gh.ts)

  • Passed vector storage config to GitHubIndexer
  • Used a separate storage path for GitHub data (.vectors-github)

Tests

  • Added comprehensive persistence tests (indexer.test.ts)
  • Tested state save/load, vector storage, auto-update, statistics
  • Fixed integration tests to use new config format
  • All 646 tests passing ✅

Verification

# After (FIXED)
$ dev gh index
✔ GitHub data indexed!  # 36 documents

$ dev gh search "MCP"
✔ Found 10 results  # Persistence works!

State file created:

.dev-agent/github-state.json:
{
  "version": "1.0.0",
  "repository": "lytics/dev-agent",
  "lastIndexed": "2025-11-24T13:35:09.975Z",
  "totalDocuments": 36,
  "byType": { "issue": 23, "pull_request": 13 },
  "byState": { "open": 11, "closed": 12, "merged": 13 }
}
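
A state file like this is what lets a second CLI invocation see what the first one indexed. The sketch below shows one way to write and read it defensively with Node's fs module. The saveState and loadState helpers and the PersistedState type are hypothetical names, not the indexer's API, and the example writes to a temporary directory rather than a real repository.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "fs";
import { tmpdir } from "os";
import { dirname, join } from "path";

// Hypothetical minimal shape: only the fields this sketch validates are typed.
interface PersistedState {
  version: string;
  lastIndexed: string;
  [key: string]: unknown;
}

function saveState(statePath: string, state: PersistedState): void {
  // Create the parent directory (e.g. .dev-agent) on first use.
  mkdirSync(dirname(statePath), { recursive: true });
  writeFileSync(statePath, JSON.stringify(state, null, 2), "utf-8");
}

function loadState(statePath: string): PersistedState | null {
  if (!existsSync(statePath)) return null; // nothing indexed yet
  try {
    const parsed: unknown = JSON.parse(readFileSync(statePath, "utf-8"));
    if (typeof parsed !== "object" || parsed === null) return null;
    const candidate = parsed as Record<string, unknown>;
    if (typeof candidate["version"] !== "string" || typeof candidate["lastIndexed"] !== "string") {
      return null; // unexpected shape: treat as "not indexed"
    }
    return candidate as PersistedState;
  } catch {
    return null; // corrupt JSON: treat as "not indexed"
  }
}

// Round trip in a throwaway directory, standing in for two separate CLI runs.
const workDir = mkdtempSync(join(tmpdir(), "gh-state-"));
const statePath = join(workDir, ".dev-agent", "github-state.json");
saveState(statePath, {
  version: "1.0.0",
  lastIndexed: "2025-11-24T13:35:09.975Z",
  totalDocuments: 36,
});
const restored = loadState(statePath);
```

Returning null for a missing or unreadable file lets the caller fall back to the "not indexed" path instead of crashing.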

Impact

  • ✅ Data persists between CLI invocations
  • ✅ Semantic search with embeddings (better results)
  • ✅ Auto-updates keep data fresh
  • ✅ Dogfooding workflow unblocked
  • ✅ Consistent with RepositoryIndexer architecture

- Cast through 'unknown' to access private properties in tests
- Fix template literal usage in CLI (auto-fixed by biome)
- Remove biome-ignore comments in favor of proper type safety
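
The first item refers to a common TypeScript testing pattern, sketched below with a stand-in class. ExampleIndexer and ExampleIndexerInternals are hypothetical names; the actual tests target GitHubIndexer and may be structured differently.

```typescript
// Stand-in for a class with private state that a test wants to inspect.
class ExampleIndexer {
  private state: { totalDocuments: number } | null = null;

  load(): void {
    this.state = { totalDocuments: 36 };
  }

  isLoaded(): boolean {
    return this.state !== null;
  }
}

// Test-only view of the private fields the test needs to read.
type ExampleIndexerInternals = {
  state: { totalDocuments: number } | null;
};

const indexer = new ExampleIndexer();
indexer.load();

// Casting through `unknown` first gives typed access to the private field
// without resorting to `any` or a lint-suppression comment.
const internals = indexer as unknown as ExampleIndexerInternals;
const documentCount = internals.state?.totalDocuments;
```

The cast bypasses the compiler's privacy check only; the field is still an ordinary property at runtime, which is why the test can read it.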

All tests passing, lint clean.
@prosdev prosdev merged commit ff16990 into main Nov 24, 2025
1 check passed
Linked issue: bug(github): GitHub indexer doesn't persist data between CLI invocations