Commit 2a81310

feat(core): implement detailed statistics aggregation infrastructure
Implement data collection infrastructure for dashboard visualizations with language breakdown, component type tracking, and monorepo support.

**New Features:**
- StatsAggregator class with streaming O(1) aggregation
- Language-specific stats (TypeScript, JavaScript, Go, Markdown)
- Component type counting (function, class, interface, type, variable)
- Package/monorepo detection infrastructure
- TypeScript vs JavaScript distinction based on file extension

**Types Added:**
- SupportedLanguage type
- LanguageStats interface (files, components, lines)
- PackageStats interface (name, path, languages)
- DetailedIndexStats extending IndexStats

**Changes:**
- Extended IndexStats with optional byLanguage, byComponentType, byPackage
- Fixed TypeScript scanner to detect .ts vs .js based on extension
- Integrated StatsAggregator into index() and update() methods
- All stats backward compatible (optional fields)

**Testing:**
- 14 unit tests for StatsAggregator (all passing)
- 6 integration tests for detailed stats (all passing)
- Performance: <5% overhead (10k documents in <100ms)
- 508 total core tests passing

Related: #146 (Data Collection Infrastructure)
Part of Epic #145 (Dashboard & Visualization System)
1 parent b86fe9c commit 2a81310
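
To make the "streaming O(1) aggregation" claim in the commit message concrete, here is a minimal sketch of how such an aggregator could work. It is not the committed code. Only the names `StatsAggregator`, `SupportedLanguage`, and `LanguageStats` (with its `files`, `components`, `lines` fields) come from the message. The `IndexedComponent` input shape, the `detectLanguage` helper, the `add`/`getStats` method names, and the choice to sum component line counts into `lines` are assumptions for illustration.

```typescript
// Hypothetical sketch: type names follow the commit message, but every shape
// and method signature below is an assumption, not the committed implementation.
type SupportedLanguage = 'typescript' | 'javascript' | 'go' | 'markdown';

interface LanguageStats {
  files: number;
  components: number;
  lines: number;
}

// Assumed input: one record per indexed component.
interface IndexedComponent {
  file: string;      // path of the source file
  type: string;      // e.g. 'function', 'class', 'interface'
  lineCount: number; // lines spanned by the component
}

interface AggregatedStats {
  byLanguage: Partial<Record<SupportedLanguage, LanguageStats>>;
  byComponentType: Record<string, number>;
}

// Classify by file extension so .ts and .js files are counted separately.
function detectLanguage(file: string): SupportedLanguage | undefined {
  const dot = file.lastIndexOf('.');
  const ext = dot === -1 ? '' : file.slice(dot).toLowerCase();
  switch (ext) {
    case '.ts':
    case '.tsx':
      return 'typescript';
    case '.js':
    case '.jsx':
      return 'javascript';
    case '.go':
      return 'go';
    case '.md':
      return 'markdown';
    default:
      return undefined;
  }
}

class StatsAggregator {
  private byLanguage = new Map<SupportedLanguage, LanguageStats>();
  private byComponentType = new Map<string, number>();
  private seenFiles = new Set<string>();

  // Constant work per component: a few hash lookups and increments.
  // Memory grows with the number of distinct files, not with components.
  add(component: IndexedComponent): void {
    const language = detectLanguage(component.file);
    if (language !== undefined) {
      const stats = this.byLanguage.get(language) ?? { files: 0, components: 0, lines: 0 };
      if (!this.seenFiles.has(component.file)) {
        this.seenFiles.add(component.file);
        stats.files += 1;
      }
      stats.components += 1;
      stats.lines += component.lineCount;
      this.byLanguage.set(language, stats);
    }
    const previous = this.byComponentType.get(component.type) ?? 0;
    this.byComponentType.set(component.type, previous + 1);
  }

  // Snapshot the running totals as plain objects.
  getStats(): AggregatedStats {
    const byLanguage: Partial<Record<SupportedLanguage, LanguageStats>> = {};
    this.byLanguage.forEach((stats, language) => {
      byLanguage[language] = { ...stats };
    });
    const byComponentType: Record<string, number> = {};
    this.byComponentType.forEach((count, type) => {
      byComponentType[type] = count;
    });
    return { byLanguage, byComponentType };
  }
}

// Usage: feed components as they are scanned, then read the totals once.
const demo = new StatsAggregator();
demo.add({ file: 'src/app.ts', type: 'function', lineCount: 12 });
demo.add({ file: 'src/legacy.js', type: 'class', lineCount: 40 });
const demoStats = demo.getStats();
```

Because totals are updated as each component arrives, nothing needs a second pass over the indexed documents, which is consistent with the low overhead the message reports.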

8 files changed: 1253 additions & 16 deletions


PLAN.md

Lines changed: 223 additions & 1 deletion
@@ -240,6 +240,40 @@ Git history is valuable context that LLMs can't easily access. We add intelligen
 
 ---
 
+## Current: Performance & Reliability (v0.6.x - v0.7.x)
+
+> Critical high-impact improvements for production readiness and user experience.
+
+**Epic:** #104 (Progress: 6/9 complete)
+
+### Completed Improvements ✅
+
+| Feature | Status | Version | Impact |
+|---------|--------|---------|--------|
+| Index size reporting | ✅ Done | v0.4.3 | Track disk usage growth |
+| Adaptive concurrency | ✅ Done | v0.6.0 | Auto-detect optimal batch size by CPU/memory |
+| Incremental indexing | ✅ Done | v0.5.1 | <30s updates for single file changes (#122) |
+| Progress indicators | ✅ Done | v0.1.0 | Real-time feedback for long operations |
+| Error handling | ✅ Done | v0.3.0 | Graceful degradation |
+| Basic validation | ✅ Done | v0.2.0 | Git repo and path checks |
+
+### Remaining Work 🔄
+
+| Issue | Priority | Impact | Status |
+|-------|----------|--------|--------|
+| #152 - MCP lazy initialization | P0 | Reduce startup from 2-5s to <500ms | 🔲 Todo |
+| #153 - GitHub history in planner | P0 | Add commit context to AI plans | 🔲 Todo |
+| #154 - Memory monitoring | P1 | Prevent leaks, maintain <500MB usage | 🔲 Todo |
+
+**Success Metrics:**
+- ✅ Large repo indexing: <5min for 50k files
+- ✅ Incremental updates: <30s for single file changes
+- 🔲 MCP server startup: <500ms (currently 2-5s)
+- 🔲 Memory usage: <500MB steady state
+- 🔲 Planner quality: Include git history context
+
+---
+
 ## Next: Extended Git Intelligence (v0.5.0)
 
 > Building on git history with deeper insights.
@@ -277,7 +311,195 @@ Git history is valuable context that LLMs can't easily access. We add intelligen
 
 ---
 
-## Future: Extended Intelligence (v0.6+)
+## Next: Dashboard & Visualization (v0.7.1)
+
+> Making codebase insights visible and accessible.
+
+**Epic:** #145
+
+### Philosophy
+
+Dev-agent provides rich context about codebases, but it's currently text-only. A dashboard makes insights:
+- **Visible** - See language breakdown, component types, health status at a glance
+- **Interactive** - Explore relationships, drill into packages
+- **Actionable** - Identify areas needing attention
+
+### Goals
+
+1. **Enhanced CLI** (`dev dashboard`) - Terminal-based stats with rich formatting
+2. **Web Dashboard** - Next.js app with real-time insights
+3. **Data Infrastructure** - Aggregate stats during indexing for efficient display
+
+### Components
+
+| Component | Status | Priority |
+|-----------|--------|----------|
+| **CLI Enhancements** | | |
+| Language breakdown display | 🔲 Todo | 🔴 High |
+| Component type statistics | 🔲 Todo | 🔴 High |
+| Package-level stats (monorepo) | 🔲 Todo | 🔴 High |
+| Rich formatting (tables, colors) | 🔲 Todo | 🔴 High |
+| **Core Data Collection** | | |
+| Track language metrics in indexer | 🔲 Todo | 🔴 High |
+| Aggregate component type counts | 🔲 Todo | 🔴 High |
+| Package-level aggregation | 🔲 Todo | 🟡 Medium |
+| Change frequency tracking | 🔲 Todo | 🟡 Medium |
+| **Web Dashboard** | | |
+| Next.js app setup (`apps/dashboard/`) | 🔲 Todo | 🔴 High |
+| Tremor component library | 🔲 Todo | 🔴 High |
+| API routes (stats, health) | 🔲 Todo | 🔴 High |
+| Real-time stats display | 🔲 Todo | 🔴 High |
+| Language distribution charts | 🔲 Todo | 🟡 Medium |
+| Component type visualizations | 🔲 Todo | 🟡 Medium |
+| Health status indicators | 🔲 Todo | 🟡 Medium |
+| Vector index metrics (simple) | 🔲 Todo | 🟡 Medium |
+| Basic package list (monorepo) | 🔲 Todo | 🟡 Medium |
+
+### Architecture
+
+```
+apps/
+└── dashboard/              # Next.js 16 + React 19 + Tremor
+    ├── app/
+    │   ├── page.tsx        # Main dashboard
+    │   └── api/
+    │       └── stats/      # Next.js API routes
+    └── components/
+        └── tremor/         # Tremor dashboard components
+
+packages/core/
+└── src/
+    └── indexer/
+        └── stats-aggregator.ts  # New: Collect detailed stats
+```
+
+### Implementation Plan
+
+**Implementation Phases:**
+
+**Phase 1: Data Foundation**
+- Enhance IndexStats with language/component breakdowns
+- Aggregate stats during indexing (minimal overhead)
+- Foundation for all visualizations
+
+**Phase 2: CLI Enhancements**
+- Rich terminal output with tables and colors
+- Package-level breakdown for monorepos
+- Immediate user value
+
+**Phase 3: Web Dashboard**
+- Next.js 16 app in `apps/dashboard/`
+- Tremor component setup
+- Basic stats display with charts
+
+**Phase 4: Advanced Features**
+- Interactive exploration
+- Package explorer (monorepo support)
+- Real-time updates
+
+---
+
+## Next: Advanced LanceDB Visualizations (v0.7.2)
+
+> Making vector embeddings visible and explorable.
+
+### Philosophy
+
+LanceDB stores 384-dimensional embeddings for semantic search, but these are invisible to users. Advanced visualizations reveal:
+- **Where code lives** in semantic space (2D projections)
+- **What's related** beyond imports (similarity networks)
+- **How embeddings evolve** over time (drift tracking)
+- **Search quality** insights (what works, what doesn't)
+
+### Goals
+
+1. **Semantic Code Map** - 2D/3D projection of vector space
+2. **Similarity Explorer** - Interactive component relationship graph
+3. **Search Quality Dashboard** - Analyze search performance
+4. **Embedding Health** - Coverage and quality metrics per directory
+
+### Components
+
+| Component | Description | Priority |
+|-----------|-------------|----------|
+| **Semantic Code Map** | | |
+| t-SNE/UMAP projection to 2D | Visualize embedding space | 🔴 High |
+| Interactive scatter plot | Click to see code snippet | 🔴 High |
+| Color by language/type | Visual code categorization | 🟡 Medium |
+| Cluster detection | Auto-identify code groups | 🟡 Medium |
+| **Similarity Network** | | |
+| Component relationship graph | Force-directed layout | 🔴 High |
+| Semantic similarity edges | Show hidden relationships | 🔴 High |
+| Interactive exploration | Zoom, pan, filter | 🟡 Medium |
+| Duplication detection | High similarity alerts | 🟡 Medium |
+| **Search Quality** | | |
+| Search metrics dashboard | Track performance over time | 🔴 High |
+| Query similarity heatmap | Understand search patterns | 🟡 Medium |
+| "Dead zone" detection | Queries with poor results | 🟡 Medium |
+| Recommendation engine | Suggest better queries | 🟢 Low |
+| **Embedding Health** | | |
+| Coverage heatmap by directory | Identify blind spots | 🔴 High |
+| Quality scoring per file | Flag low-quality embeddings | 🟡 Medium |
+| Drift tracking over time | Monitor embedding changes | 🟡 Medium |
+| Re-index recommendations | Suggest what needs updating | 🟢 Low |
+
+### Architecture
+
+```
+Dashboard UI
+
+Advanced Viz Components (D3.js, Plotly, or similar)
+
+New API Routes
+├─ GET /api/embeddings/projection (t-SNE/UMAP data)
+├─ GET /api/embeddings/similarity (network graph)
+├─ GET /api/embeddings/quality (coverage metrics)
+└─ GET /api/embeddings/search-history (query analysis)
+
+LanceDB + Vector Analysis
+└─ Dimensionality reduction, similarity queries, metrics
+```
+
+### Dependencies
+
+**New:**
+- `umap-js` or `tsne-js` - Dimensionality reduction
+- `d3` or `@visx/visx` - Advanced visualizations
+- `react-force-graph` - Network graphs (or `sigma.js`)
+- `@tensorflow/tfjs` (optional) - Advanced vector operations
+
+### Implementation Phases
+
+**Phase 1: Semantic Code Map**
+- Implement t-SNE/UMAP projection
+- Create 2D scatter plot visualization
+- Add basic interactivity (hover, click)
+
+**Phase 2: Similarity Network**
+- Build component similarity graph
+- Implement force-directed layout
+- Add filtering and exploration
+
+**Phase 3: Search Quality**
+- Track search queries and results
+- Build metrics dashboard
+- Implement quality scoring
+
+**Phase 4: Embedding Health**
+- Coverage analysis by directory
+- Quality scoring per file
+- Drift detection system
+
+### Success Metrics
+
+- Developers can visually explore codebase semantics
+- Identify code duplication without running analysis tools
+- Understand which areas need re-indexing
+- Improve search query formulation based on insights
+
+---
+
+## Future: Extended Intelligence (v0.8+)
 
 ### Multi-Language Support
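
The "Semantic similarity edges" and "Duplication detection" rows in the PLAN.md diff above both reduce to comparing embedding vectors. As a rough illustration of what that comparison could look like, the sketch below computes cosine similarity between component embeddings and keeps pairs above a threshold as graph edges. The plan does not specify a similarity measure, so cosine similarity is a swapped-in assumption, and `EmbeddedComponent`, `SimilarityEdge`, `cosineSimilarity`, and `buildSimilarityEdges` are hypothetical names.

```typescript
// Hypothetical sketch: shapes, names, and the threshold are illustrative
// assumptions, not part of the plan or of any LanceDB API.
interface EmbeddedComponent {
  id: string;
  vector: number[];
}

interface SimilarityEdge {
  source: string;
  target: string;
  similarity: number;
}

// Cosine similarity: dot product divided by the product of vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force pairwise comparison, O(n^2) in the number of components.
function buildSimilarityEdges(
  components: EmbeddedComponent[],
  threshold: number,
): SimilarityEdge[] {
  const edges: SimilarityEdge[] = [];
  for (let i = 0; i < components.length; i++) {
    for (let j = i + 1; j < components.length; j++) {
      const similarity = cosineSimilarity(components[i].vector, components[j].vector);
      if (similarity >= threshold) {
        edges.push({ source: components[i].id, target: components[j].id, similarity });
      }
    }
  }
  return edges;
}
```

A very high threshold would surface likely duplicates, while a lower one would feed a force-directed graph. The quadratic loop is only reasonable for small inputs, so a real implementation would more likely rely on the vector store's nearest-neighbor queries.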

packages/core/src/indexer.ts

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+/**
+ * Repository Indexer module exports
+ */
+
+export { RepositoryIndexer } from './indexer/index';
+export { StatsAggregator } from './indexer/stats-aggregator';
+export * from './indexer/types';
+export * from './indexer/utils';
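
Since the commit message says the new stats fields are optional for backward compatibility, code that consumes the re-exported types may still receive stats without them. The sketch below shows one defensive way to read such a value. The `byLanguage` field and the `LanguageStats` members come from the commit message. The `totalFiles` field, the `Record<string, …>` typing, and the `summarizeLanguages` helper are assumptions made for illustration.

```typescript
// Hypothetical sketch: only byLanguage and the LanguageStats fields are named
// in the commit message; the rest of these shapes are assumed.
interface LanguageStats {
  files: number;
  components: number;
  lines: number;
}

interface IndexStats {
  totalFiles: number; // assumed pre-existing field
  byLanguage?: Record<string, LanguageStats>; // optional, so older stats still type-check
}

// Produce one summary line per language, or nothing for stats without a breakdown.
function summarizeLanguages(stats: IndexStats): string[] {
  const byLanguage = stats.byLanguage ?? {};
  return Object.keys(byLanguage)
    .sort()
    .map((language) => {
      const s = byLanguage[language];
      return `${language}: files=${s.files}, components=${s.components}, lines=${s.lines}`;
    });
}
```

Falling back to an empty object keeps callers free of explicit `undefined` checks while still handling stats produced before this change.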
