Hierarchical Documentation System for Large Codebases #1820

mindplay-dk · 2025-08-07T12:12:38Z

mindplay-dk
Aug 7, 2025

I've been working on an idea: an approach to handling arbitrary size codebases in a more reliable way.

After working on just the idea for about a week, I decided I'm in over my head, haha - the idea is fairly well described at this point, but I don't think I have the time or energy to commit to building something like this on my own...

But I also don't think anything like this really exists in the open-source community? So, please steal my idea! 😅

This will be a very long post, so here are the two key features very briefly summarized:

TL;DR

Documentation Hierarchy: Build a navigable pyramid of documentation from bottom (individual files) to top (system overview), where each level references tagged source documents. A navigation agent traverses this hierarchy intelligently, drilling down from high-level concepts to specific implementation details to assemble coherent, task-relevant context for development work.

Cross-Vector Similarity Analysis: Generate separate vector embeddings describing "what each module does" and "what each module depends on" for every file. The key insight: modules with similar dependency patterns often serve similar architectural roles, even if they don't directly interact. By comparing a module's "purpose" vector against other modules' "dependency" vectors, the system discovers semantic relationships that pure code analysis would miss - identifying modules that fill similar roles in the system architecture or could be logically grouped together.

Why Build this?

I don't trust agents in large codebases - just mining for a few related files does not seem to work well, even in smaller codebases, and I think I might have come up with a reasonably general solution to this problem.

I don't know if a single agent's discussion board is actually the right context for something like this - there are so many agents now, and this idea might be better implemented as an MCP that could plug into multiple agents... but I figured this community might a good place to air the idea, and I honestly haven't tried another agent I've liked. (Kilo's multi-agent approach seems like The Way!)

A friend described this idea as "Google Maps for code", which I think is really spot on.

If the planet was 4x larger there would just be another zoom level. Similarly, if the codebase is larger, there will be another level of documentation.

A 10,000-file codebase might need 4-5 documentation levels, while a 100,000-file system could require 6-7 levels. The navigation agent can still traverse from overview to implementation details in the same number of hops, regardless of total system size.

Like Google Maps showing different information at different zoom levels (country borders vs street names vs building numbers), each documentation level reveals appropriate information density. High-level documents focus on system architecture and major components, while lower levels provide implementation specifics.

Anyhow, here goes, just a full brain dump of all my documents and notes so far.

Hierarchical Documentation System for Large Codebases

Summary

This system addresses a fundamental limitation of current LLM-based development tools: the lack of persistent, structured knowledge about complex codebases. Traditional approaches force LLMs to "start over" with each task, relying on RAG systems that provide fragmented context without comprehensive understanding.

The solution is a hierarchical documentation pyramid that emerges bottom-up from source code, creating multiple levels of abstraction from detailed component descriptions to high-level system overviews. A specialized navigation agent traverses this hierarchy intelligently, assembling coherent, task-relevant context that tells a complete story from system architecture down to specific implementation details.

This approach transforms the traditional "brilliant developer with daily amnesia" problem into a system with persistent, navigable memory that can provide both big-picture understanding and precise technical details as needed.

Requirements

Core Functionality

Generate hierarchical documentation pyramid from existing codebases
Maintain documentation currency through automated updates on code changes
Enable intelligent navigation through documentation levels based on task context
Support cross-cutting concerns and shared utilities through explicit categorization
Assemble coherent context packages for downstream development agents

Documentation Structure

Bottom Level: Detailed descriptions of individual code files/modules with source file references
Middle Levels: Summaries of related bottom-level documents with tagged source references
Top Level: Single high-level system overview with categorized cross-cutting components
All Levels: Tagged source document lists enabling informed navigation decisions

Agent Architecture

Documentation Maintenance Agent: Creates and updates documentation across all hierarchy levels
Documentation Locator Agent: Navigates hierarchy to assemble task-relevant context
Orchestrator Agent: Coordinates documentation generation and maintenance tasks

Cross-Cutting Component Support

Top-level awareness of utilities, shared components, and common data structures
Dedicated second-tier documents for cross-cutting concerns
Tag-based categorization enabling multiple discovery paths
Dynamic reverse lookup capabilities without hard-coded bidirectional references

Maintenance Characteristics

Documents maintain only forward references to source materials
Tag placement on all documents below top level
Agent-managed reverse indexing for impact analysis
Automated propagation of changes up the hierarchy
Minimal manual maintenance requirements

Loose Specification

Documentation Format

Each document contains:

# Document Title
## Summary
[Document content describing the component/system]

## Source Documents
- document-name.md [tag1, tag2, tag3]
- another-doc.md [tag1, tag4]
- source-file.py [implementation]

Navigation Algorithm

Start: Documentation Locator Agent reads top-level system overview
Path Selection: Based on task context, selects 1-2 relevant source documents using content and tags
Traversal: Reads selected documents, evaluates their source lists with tags
Depth Decision: Continues drilling down or stops based on information sufficiency
Context Assembly: Compiles selected documents into coherent narrative from high-level to specific
Handoff: Provides assembled context to development agents

Initial Generation Process

Code Analysis: Process source files into detailed bottom-level documents
Grouping: Cluster related bottom-level documents by functionality/module
Mid-Level Creation: Generate summary documents for each cluster with tagged references
Hierarchy Building: Continue summarization process until single top-level document
Cross-Cutting Identification: Identify and properly categorize shared utilities and components

Update Process

Change Detection: Identify modified/added/removed source files
Bottom-Up Propagation: Update affected bottom-level documents
Impact Analysis: Use reverse indexing to identify affected higher-level documents
Selective Updates: Regenerate only documents affected by changes
Quality Verification: Ensure cross-references remain valid

Tag Categories

Functional: authentication, validation, data-processing, ui-rendering
Architectural: utilities, middleware, services, components
Domain: forms, reporting, user-management, api-integration
Technical: security, performance, configuration, testing

Implementation Considerations

Dynamic reverse lookup generation (not stored in documents)
Configurable hierarchy depth based on codebase complexity
Error propagation prevention through validation checkpoints
Resource optimization for large-scale documentation maintenance
Integration patterns for MCP server deployment

What is the Louvain Algorithm?

The Louvain algorithm uses a hierarchical approach to detect communities. Initially, each node starts as its own community. The algorithm then iteratively merges nodes and communities to maximize modularity. Once no further improvements can be made, it aggregates the nodes in each community into a single node and repeats the process. This hierarchical method allows the Louvain algorithm to efficiently handle large networks, making it a popular choice for community detection.

More here: https://hypermode.com/blog/community-detection-algorithms

A first draft for a potential codebase decomposition strategy:

Step 1: Multi-View Embeddings Per File

For each file/module, generate multiple vector embeddings, such as:

Vector Type	Content
Description Vector	"This module validates user input for dynamic forms."
Dependency Vector	"This module imports `utils`, `validators`, and uses `FormSchema`."
Structural Role Vector	"This module is a controller that handles API requests."
Call Graph Snippet (optional)	"This module calls `validateInput`, `logEvent`..."

These are not embeddings of code, but embeddings of LLM-generated natural language descriptions of code roles and relationships — greatly compressing complexity while maintaining semantic expressiveness.

Step 2: Similarity & Clustering

Use two axes of similarity:

Semantic clustering: group files with similar description vectors (e.g., all related to "form handling" or "user auth")
Dependency proximity: files whose dependency vectors resemble a given file’s description vector, suggesting potential coupling

Then use graph-based clustering like the Louvain algorithm to discover communities/modules:

Nodes = files/modules
Edges = similarity (semantic + dependency overlap)
Weights = composite score from both vectors

This allows discovery of clusters that are:

Semantically coherent (they serve a related purpose)
Structurally coupled (they depend on similar or overlapping modules)

This also helps detect cross-cutting modules — like logging, auth, utils — because their dependency vectors will be broadly distributed across many semantic clusters.

Step 3: Hierarchical Summary Generation

Once clusters are formed:

Use an LLM to summarize each cluster into a middle-layer document
Cluster summaries to create higher-level abstractions
Add cross-cutting modules as tagged special clusters

This naturally produces a documentation pyramid that mirrors logical rather than folder structure.

Potential Advantages

Semantic Compression: Using LLM-generated natural language descriptions instead of raw code embeddings captures conceptual relationships while reducing computational overhead and noise from irrelevant implementation details.

Multi-Dimensional Analysis: The four vector types provide different perspectives on module relationships - functional purpose, structural dependencies, architectural role, and runtime behavior - enabling more comprehensive similarity assessment.

Language Agnostic: Once descriptions are generated, the clustering approach works uniformly across programming languages without language-specific parsing or analysis.

Interpretability: Clustering decisions become traceable since they're based on readable descriptions rather than high-dimensional code representations.

Logical Organization: The approach should produce groupings that reflect actual system relationships rather than filesystem organization, potentially improving the utility of the resulting documentation hierarchy.

Dependency vector concept

The dependency vector representation enables detection of modules with similar roles in the system architecture without requiring direct interaction between those modules. (Works across layers.)
The dependency vector approach can represent "what this module depends on" as embeddings, allows detection of modules that serve similar roles in the dependency ecosystem, even if they don't directly interact.
The structural role vector could help identify architectural patterns automatically - grouping controllers, services, utilities, etc. without explicit categorization rules.
This approach should produce hierarchical documentation that reflects actual system logic rather than accidental organizational structures, making it far more valuable for navigation and understanding.

Minimal Viable Update Approach

Change Detection: Track which files were modified/added/deleted since last documentation generation.

Affected Document Identification:

For each changed file, identify its current bottom-level document
Traverse up the hierarchy to find all documents that reference it (using the reverse lookup index)
Mark these documents as "stale"

Localized Re-clustering:

For any cluster containing changed files, re-run the multi-view embedding + clustering process on just that cluster and its immediate neighbors
If new files were added, temporarily assign them to the "best fit" existing cluster, then re-cluster that enlarged group (if needed, e.g. if beyond a certain group size threshold.)

Bottom-Up Regeneration:

Regenerate all stale documents from bottom to top
Use the same LLM summarization process as initial generation
Update reference lists and tags as documents are regenerated

Acceptable Tradeoffs (hopefully)

Some Redundant Work: Entire clusters get re-processed even if only one file changed, but this ensures consistency and is computationally manageable for reasonably-sized clusters.

Conservative Approach: When in doubt, re-cluster and regenerate rather than trying to surgically update. This reduces complexity while maintaining correctness.

This approach should scale reasonably well, since most changes affect only a few clusters, and the computational cost is proportional to change scope, rather than to the total codebase size.

(You could imagine something far more complex and probably more optimal, accounting for every type of change, special handling after additions, removals, feature changes, bug fixes, refactoring tasks, etc. - what is probably wanted for a first version of this is something that doesn't rely on 4 or 5 different workflows for different types of changes, as this would require building and evaluating way too many component functions. Get something that works before thinking about optimizations/improvements. If the documentation were to erode over time in a big system, or after a very big change, you could of course always just rebuild your documentation then.)

That's all I've got 💁‍♂️

There is clearly more design to be done before you could jump in and build something like this.

But I think it's a pretty cool idea? There are products claiming to work for "large codebases", but they're all proprietary, as far as I know? I haven't seen any of these products really explaining what it is they do or why it works. There are of course simpler open-source tools that claim to do something with RAG search and documentation, etc. - I just haven't seen anything that sounds like it's really going to work reliably in a large codebase, so my brain kept racking at it.

I just don't have the time or energy myself, so I hope maybe someone else will pick up the idea! 😄

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Hierarchical Documentation System for Large Codebases #1820

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Hierarchical Documentation System for Large Codebases #1820

Uh oh!

mindplay-dk Aug 7, 2025

TL;DR

Why Build this?

Hierarchical Documentation System for Large Codebases

Summary

Requirements

Core Functionality

Documentation Structure

Agent Architecture

Cross-Cutting Component Support

Maintenance Characteristics

Loose Specification

Documentation Format

Navigation Algorithm

Initial Generation Process

Update Process

Tag Categories

Implementation Considerations

What is the Louvain Algorithm?

Step 1: Multi-View Embeddings Per File

Step 2: Similarity & Clustering

Step 3: Hierarchical Summary Generation

Potential Advantages

Dependency vector concept

Minimal Viable Update Approach

Acceptable Tradeoffs (hopefully)

That's all I've got 💁‍♂️

Replies: 0 comments

mindplay-dk
Aug 7, 2025