Hierarchical Documentation System for Large Codebases #1820
mindplay-dk started this conversation in 1. Feature requests
I've been working on an idea: an approach to handling codebases of arbitrary size in a more reliable way.
After working on just the idea for about a week, I decided I'm in over my head, haha - the idea is fairly well described at this point, but I don't think I have the time or energy to commit to building something like this on my own...
But I also don't think anything like this really exists in the open-source community? So, please steal my idea! 😅
This will be a very long post, so here are the two key features very briefly summarized:
TL;DR
Documentation Hierarchy: Build a navigable pyramid of documentation from bottom (individual files) to top (system overview), where each level references tagged source documents. A navigation agent traverses this hierarchy intelligently, drilling down from high-level concepts to specific implementation details to assemble coherent, task-relevant context for development work.
Cross-Vector Similarity Analysis: Generate separate vector embeddings describing "what each module does" and "what each module depends on" for every file. The key insight: modules with similar dependency patterns often serve similar architectural roles, even if they don't directly interact. By comparing a module's "purpose" vector against other modules' "dependency" vectors, the system discovers semantic relationships that pure code analysis would miss - identifying modules that fill similar roles in the system architecture or could be logically grouped together.
Why Build this?
I don't trust agents in large codebases - just mining for a few related files does not seem to work well, even in smaller codebases, and I think I might have come up with a reasonably general solution to this problem.
I don't know if a single agent's discussion board is actually the right context for something like this - there are so many agents now, and this idea might be better implemented as an MCP server that could plug into multiple agents... but I figured this community might be a good place to air the idea, and I honestly haven't tried another agent I've liked. (Kilo's multi-agent approach seems like The Way!)
A friend described this idea as "Google Maps for code", which I think is really spot on.
If the planet were 4x larger, there would just be another zoom level. Similarly, if the codebase is larger, there will be another level of documentation.
A 10,000-file codebase might need 4-5 documentation levels, while a 100,000-file system could require 6-7 levels: with a fan-out of k documents per level, depth grows as log-base-k of the file count, so a 10x larger codebase only adds about one level. The navigation agent can still traverse from overview to implementation details in roughly the same number of hops, regardless of total system size.
Like Google Maps showing different information at different zoom levels (country borders vs street names vs building numbers), each documentation level reveals appropriate information density. High-level documents focus on system architecture and major components, while lower levels provide implementation specifics.
Anyhow, here goes, just a full brain dump of all my documents and notes so far.
Hierarchical Documentation System for Large Codebases
Summary
This system addresses a fundamental limitation of current LLM-based development tools: the lack of persistent, structured knowledge about complex codebases. Traditional approaches force LLMs to "start over" with each task, relying on RAG systems that provide fragmented context without comprehensive understanding.
The solution is a hierarchical documentation pyramid that emerges bottom-up from source code, creating multiple levels of abstraction from detailed component descriptions to high-level system overviews. A specialized navigation agent traverses this hierarchy intelligently, assembling coherent, task-relevant context that tells a complete story from system architecture down to specific implementation details.
This approach transforms the traditional "brilliant developer with daily amnesia" problem into a system with persistent, navigable memory that can provide both big-picture understanding and precise technical details as needed.
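To make the navigation idea a bit more concrete, here's a rough Python sketch of what drill-down traversal could look like - the types, the embedding scheme, and the threshold are all hypothetical, just to illustrate the idea of keeping high-level summaries and descending only into relevant branches:

```python
from dataclasses import dataclass, field

@dataclass
class DocNode:
    """One document in the pyramid: a summary plus links to the level below."""
    summary: str
    embedding: list[float]                        # embedding of the summary text
    children: list["DocNode"] = field(default_factory=list)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def drill_down(node: DocNode, task_embedding: list[float],
               threshold: float = 0.6) -> list[str]:
    """Collect summaries along every branch that looks relevant to the task,
    so the assembled context reads from system overview down to details."""
    context = [node.summary]
    for child in node.children:
        if cosine(child.embedding, task_embedding) >= threshold:
            context.extend(drill_down(child, task_embedding, threshold))
    return context
```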
Requirements
Core Functionality
Documentation Structure
Agent Architecture
Cross-Cutting Component Support
Maintenance Characteristics
Loose Specification
Documentation Format
Each document contains:
Navigation Algorithm
Initial Generation Process
Update Process
Tag Categories
Implementation Considerations
What is the Louvain Algorithm?
The Louvain algorithm uses a hierarchical approach to detect communities. Initially, each node starts as its own community. The algorithm then iteratively merges nodes and communities to maximize modularity. Once no further improvements can be made, it aggregates the nodes in each community into a single node and repeats the process. This hierarchical method allows the Louvain algorithm to efficiently handle large networks, making it a popular choice for community detection.
More here: https://hypermode.com/blog/community-detection-algorithms
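For illustration, here's Louvain on a toy weighted graph using networkx (the file names and weights are made up; requires networkx >= 2.8):

```python
import networkx as nx

G = nx.Graph()
# Two loosely connected groups of files; edge weights are similarity scores.
G.add_weighted_edges_from([
    ("form.ts", "validators.ts", 0.9),
    ("form.ts", "schema.ts", 0.8),
    ("validators.ts", "schema.ts", 0.7),
    ("auth.ts", "session.ts", 0.9),
    ("auth.ts", "tokens.ts", 0.8),
    ("schema.ts", "session.ts", 0.1),   # weak cross-group link
])

communities = nx.community.louvain_communities(G, weight="weight", seed=42)
print(communities)
# e.g. [{'form.ts', 'validators.ts', 'schema.ts'}, {'auth.ts', 'session.ts', 'tokens.ts'}]
```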
A first draft for a potential codebase decomposition strategy:
Step 1: Multi-View Embeddings Per File
For each file/module, generate multiple vector embeddings, such as:
- A "what it does" description, e.g. "Handles form state and validation; imports `utils`, `validators`, and `usesFormSchema`."
- A "what it depends on" description, e.g. "Uses `validateInput`, `logEvent`, ..."

These are not embeddings of code, but embeddings of LLM-generated natural language descriptions of code roles and relationships — greatly compressing complexity while maintaining semantic expressiveness.
Step 2: Similarity & Clustering
Use two axes of similarity:
- Semantic clustering: group files with similar description vectors (e.g., all related to "form handling" or "user auth")
- Dependency proximity: files whose dependency vectors resemble a given file's description vector, suggesting potential coupling
Then use graph-based clustering like the Louvain algorithm to discover communities/modules:
- Nodes = files/modules
- Edges = similarity (semantic + dependency overlap)
- Weights = composite score from both vectors
This allows discovery of clusters that are:
- Semantically coherent (they serve a related purpose)
- Structurally coupled (they depend on similar or overlapping modules)
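Sketching Step 2 in Python, building on the hypothetical FileVectors from the Step 1 sketch - the axis weights and edge threshold are arbitrary starting points, not tuned values:

```python
import networkx as nx
import numpy as np

def cos(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_communities(files, w_semantic=0.6, w_dependency=0.4, min_edge=0.5):
    G = nx.Graph()
    G.add_nodes_from(f.path for f in files)
    for i, a in enumerate(files):
        for b in files[i + 1:]:
            # composite score across both similarity axes
            score = (w_semantic * cos(a.purpose_vec, b.purpose_vec)
                     + w_dependency * cos(a.dependency_vec, b.dependency_vec))
            if score >= min_edge:                 # keep only meaningful edges
                G.add_edge(a.path, b.path, weight=score)
    return nx.community.louvain_communities(G, weight="weight", seed=0)
```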
Step 3: Hierarchical Summary Generation
Once clusters are formed:
- Use an LLM to summarize each cluster into a middle-layer document
- Cluster summaries to create higher-level abstractions
- Add cross-cutting modules as tagged special clusters
This naturally produces a documentation pyramid that mirrors logical rather than folder structure.
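A rough sketch of the bottom-up folding - again `llm` is a placeholder, and the fan-out of 10 and the naive chunked grouping are just illustrative (a real version would re-cluster at each level):

```python
def summarize_cluster(llm, member_docs: list[str]) -> str:
    joined = "\n---\n".join(member_docs)
    return llm(f"Summarize these related module descriptions as one document:\n{joined}")

def build_pyramid(llm, leaf_docs: list[str], fan_out: int = 10) -> list[list[str]]:
    levels = [leaf_docs]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # group ~fan_out documents per parent; a real version would re-cluster
        groups = [current[i:i + fan_out] for i in range(0, len(current), fan_out)]
        levels.append([summarize_cluster(llm, g) for g in groups])
    return levels  # levels[0] = per-file docs, levels[-1] = system overview
```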
Potential Advantages
Semantic Compression: Using LLM-generated natural language descriptions instead of raw code embeddings captures conceptual relationships while reducing computational overhead and noise from irrelevant implementation details.
Multi-Dimensional Analysis: The four vector types provide different perspectives on module relationships - functional purpose, structural dependencies, architectural role, and runtime behavior - enabling more comprehensive similarity assessment.
Language Agnostic: Once descriptions are generated, the clustering approach works uniformly across programming languages without language-specific parsing or analysis.
Interpretability: Clustering decisions become traceable since they're based on readable descriptions rather than high-dimensional code representations.
Logical Organization: The approach should produce groupings that reflect actual system relationships rather than filesystem organization, potentially improving the utility of the resulting documentation hierarchy.
Dependency vector concept
The dependency vector representation enables detection of modules with similar roles in the system architecture without requiring direct interaction between those modules. (It works across layers.)
By representing "what each module depends on" as an embedding, the system can detect modules that serve similar roles in the dependency ecosystem, even if they never directly interact.
The structural role vector could help identify architectural patterns automatically - grouping controllers, services, utilities, etc. without explicit categorization rules.
This approach should produce hierarchical documentation that reflects actual system logic rather than accidental organizational structures, making it far more valuable for navigation and understanding.
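To illustrate, again using the hypothetical FileVectors from the Step 1 sketch: rank other modules by cosine similarity between dependency vectors to surface files that play a similar role, whether or not they ever touch each other:

```python
import numpy as np

def similar_roles(files, target_path: str, top_k: int = 5):
    """Find modules whose dependency patterns resemble the target's."""
    target = next(f for f in files if f.path == target_path)
    t = np.asarray(target.dependency_vec)
    scores = []
    for f in files:
        if f.path == target_path:
            continue
        v = np.asarray(f.dependency_vec)
        scores.append((float(t @ v / (np.linalg.norm(t) * np.linalg.norm(v))), f.path))
    return sorted(scores, reverse=True)[:top_k]
```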
Minimal Viable Update Approach
Change Detection: Track which files were modified/added/deleted since last documentation generation.
Affected Document Identification: Map changed files to the clusters and documents that reference them.
Localized Re-clustering: Re-run clustering only within the affected clusters, leaving the rest of the graph untouched.
Bottom-Up Regeneration: Regenerate the affected cluster documents, then each ancestor document up the pyramid.
Acceptable Tradeoffs (hopefully)
Some Redundant Work: Entire clusters get re-processed even if only one file changed, but this ensures consistency and is computationally manageable for reasonably-sized clusters.
Conservative Approach: When in doubt, re-cluster and regenerate rather than trying to surgically update. This reduces complexity while maintaining correctness.
This approach should scale reasonably well, since most changes affect only a few clusters, and the computational cost is proportional to change scope, rather than to the total codebase size.
(You could imagine something far more complex and probably more optimal, accounting for every type of change: special handling for additions, removals, feature changes, bug fixes, refactoring tasks, etc. But a first version probably shouldn't rely on 4 or 5 different workflows for different types of changes, as that would mean building and evaluating far too many component functions. Get something that works before thinking about optimizations/improvements. And if the documentation erodes over time in a big system, or after a very big change, you can always just rebuild it from scratch.)
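A rough sketch of what this minimal update loop could look like - `recluster`, `regenerate_upward`, and the file-to-cluster index are placeholders standing in for whatever Steps 1-3 produced:

```python
import hashlib
import pathlib

def file_hash(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_changes(root: pathlib.Path, old_hashes: dict[str, str]) -> set[str]:
    """Return paths that were added, modified, or deleted since the last run."""
    changed, seen = set(), set()
    for path in root.rglob("*.py"):            # file filter is illustrative
        key = str(path)
        seen.add(key)
        if old_hashes.get(key) != file_hash(path):
            changed.add(key)                   # added or modified
    changed |= set(old_hashes) - seen          # deleted
    return changed

def update_docs(changed: set[str], file_to_cluster: dict[str, str],
                recluster, regenerate_upward):
    # Re-process only the clusters touched by the change, then walk each
    # affected document's ancestor chain, regenerating summaries bottom-up.
    for cluster_id in {file_to_cluster[p] for p in changed if p in file_to_cluster}:
        recluster(cluster_id)
        regenerate_upward(cluster_id)
```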
That's all I've got 💁‍♂️
There is clearly more design to be done before you could jump in and build something like this.
But I think it's a pretty cool idea? There are products claiming to work for "large codebases", but they're all proprietary, as far as I know - and I haven't seen any of them really explain what they do or why it works. There are of course simpler open-source tools that claim to do something with RAG search and documentation, etc. - I just haven't seen anything that sounds like it would work reliably in a large codebase, so I kept racking my brain over it.
I just don't have the time or energy myself, so I hope maybe someone else will pick up the idea! 😄