A minimal Retrieval Augmented Generation (RAG) system for answering questions about the Apache Commons Text Java library, with architecture analysis capabilities.
This repository contains the Apache Commons Text library source code, licensed under the Apache License 2.0.
The rag/ directory contains original code for RAG-based code analysis, developed as part of this code analysis project.
This RAG pipeline:
- Ingests Java source code, tests, and documentation from the repository
- Chunks documents using a file-aware strategy optimized for code
- Embeds chunks using sentence-transformers (local, no API key needed)
- Retrieves relevant context using cosine similarity search
- Generates answers using an LLM with grounded prompts
- Analyzes architecture issues and suggests improvements
rag/
├── main.py # CLI entry point
├── config.py # Configuration settings
├── ingest.py # Document ingestion, extraction, chunking
├── vector_store.py # Embedding and vector search
├── langchain_integration.py # LangChain compatibility (optional)
├── architecture_agent.py # Architecture analysis facade
├── requirements.txt # Python dependencies
├── pipeline/ # RAG pipeline implementations
│ ├── __init__.py # Base classes and interfaces
│ ├── native_pipeline.py # Native RAG with OpenAI/Ollama
│ └── langchain_pipeline.py # LangChain-based RAG
├── analysis/ # Architecture analysis modules
│ ├── models.py # Data classes (DependencyInfo, etc.)
│ ├── dependency_analyzer.py # Static dependency analysis
│ ├── issue_detector.py # Architecture issue detection
│ └── refactoring_advisor.py # Refactoring recommendations
├── utils/ # Shared utilities
│ ├── lazy_imports.py # Lazy loading utilities
│ ├── logging_config.py # Consistent logging setup
│ └── document_formatter.py # Document formatting for RAG
├── tests/ # Test suite
│ └── test_rag_queries.py # RAG query tests
└── index/ # Saved embeddings and documents
# 1. Setup (first-time)
cd rag
pip install -r requirements.txt
ollama serve & # Start local LLM (in background)
ollama pull tinyllama # Download model (default, fast)
python main.py --build # Build vector index
# 2. Query the codebase
python main.py --query "How does StringSubstitutor work?"
# 3. Run architecture analysis
python main.py --analyzeThis section documents the key architectural choices made in building this RAG system.
We use a file-level chunking strategy with class/method awareness:
| Strategy | Why We Chose It |
|---|---|
| File-level primary | Java files = natural semantic units (one class per file) |
| Preserve imports | Essential for understanding dependencies |
| Keep Javadocs | Documentation stays with code |
| Split large files | Maintain embedding model limits while preserving context |
For files under 8000 chars (~2000 tokens): kept as single chunk For larger files: split at method boundaries with class header preserved
We use hybrid search combining semantic and lexical matching:
| Component | Approach | Rationale |
|---|---|---|
| Base retrieval | Cosine similarity on embeddings | Captures semantic meaning |
| Keyword boosting | Exact class/file name matches | Important for code (CamelCase names) |
| Source prioritization | src/main/ weighted higher than changelogs |
Prefer actual implementation |
| Folder detection | Boosts package-info.java for folder queries |
Package docs are most informative |
Top-k (default 5) retrieval with diversity: for folder queries, we limit chunks from the same file to ensure coverage across the package.
The prompts are designed to ground answers in retrieved context and minimize hallucination:
| Technique | Implementation |
|---|---|
| Context-first prompting | Retrieved code snippets placed before the question |
| Query type detection | Different prompts for class queries vs. folder overview |
| Source attribution | LLM instructed to reference specific files |
| Uncertainty handling | Explicit instruction to say "I don't know" if context insufficient |
| Post-hoc verification | Check if mentioned classes exist in retrieved sources |
Prompt structure:
System: You are a Java expert answering about Apache Commons Text.
Base your answer ONLY on the provided code snippets.
If the context doesn't contain the answer, say so.
Context: [Retrieved code chunks with file paths]
Question: {user_query}
Static analysis using AST-free import parsing with percentile-based thresholds:
| Aspect | Approach | Rationale |
|---|---|---|
| Parsing | Regex on import statements | Fast, sufficient for Java; no external parser needed |
| Graph building | Build adjacency list of class→dependencies | Enables cycle detection, coupling metrics |
| Issue detection | Percentile-based thresholds | Adapts to codebase size (see Percentile-Based Thresholds) |
| Recommendations | Severity-sorted with concrete examples | Critical issues first, actionable suggestions |
We chose regex over full AST parsing because:
- Java imports are line-based and predictable
- No external dependencies (no need for javalang, ANTLR)
- Fast enough for interactive use
- Python 3.9+ (uses modern type hints)
- Ollama (recommended) or OpenAI API key for LLM inference
- ~2GB disk space for embeddings model (downloaded on first run)
cd rag
pip install -r requirements.txtpython main.py --buildThis creates embeddings for all source files and saves them to rag/index/.
Single query:
python main.py --query "How does StringSubstitutor handle variable substitution?"Interactive mode:
python main.py --interactiveRun architecture analysis to detect issues and get recommendations:
python main.py --analyzeThis will:
- Analyze dependencies using static analysis (imports, package references)
- Detect architecture issues (dependency magnets, cycles, oversized modules)
- Generate concrete refactoring recommendations with rationale
from main import create_rag_pipeline
from architecture_agent import analyze_architecture
# RAG Q&A
pipeline = create_rag_pipeline()
result = pipeline.query("What text similarity algorithms are available?")
print(result.answer) # Note: result is now a RAGResult dataclass
print(result.sources)
# Architecture analysis
analysis = analyze_architecture()
for issue in analysis["issues"]:
print(f"[{issue.severity}] {issue.title}")cd rag
python -m pytest tests/ -vThe test suite includes:
- Query tests for specific classes (e.g., StringSubstitutor)
- Folder/package query tests
- General repository questions
- Tests for queries that should fail (irretrievable info)
The pipeline supports both local LLMs (Ollama) and OpenAI.
No API key required! Uses Ollama for local inference.
1. Install Ollama:
# macOS
brew install ollama
# Or download from https://ollama.ai2. Start the Ollama server:
ollama serve3. Pull a model:
ollama pull llama3.2. # or use tinyllama4. Run queries:
python main.py --query "How does StringSubstitutor work?"Recommended models:
| Model | Size | Description |
|---|---|---|
tinyllama |
600MB | Very fast, basic quality (default) |
llama3.2 |
2GB | Good balance of speed and quality |
codellama |
4GB | Optimized for code understanding |
mistral |
4GB | Fast and capable |
phi3 |
2GB | Smaller, runs on limited hardware |
Note: tinyllama is the default for fast responses, but larger models like llama3.2 or codellama give significantly better answers, especially for package overview queries.
Change the model:
export OLLAMA_MODEL=llama3.2
python main.py --query "..."Set your API key to use OpenAI instead:
export OPENAI_API_KEY="your-key-here"When OPENAI_API_KEY is set, the pipeline automatically uses OpenAI.
| Variable | Description | Default |
|---|---|---|
OLLAMA_MODEL |
Ollama model to use | tinyllama |
OLLAMA_URL |
Ollama API URL | http://localhost:11434/v1 |
OPENAI_API_KEY |
OpenAI API key (uses OpenAI if set) | (none) |
OPENAI_MODEL |
OpenAI model to use | gpt-4o-mini |
The sources shown after each answer include relevance scores and line numbers:
- src/main/java/.../StringSubstitutor.java (L45-120) [relevance: 0.65]
- src/test/java/.../StringSubstitutorTest.java (L1-50) [relevance: 0.52]
| Score | Meaning |
|---|---|
| > 0.6 | Highly relevant - confident match |
| 0.4 - 0.6 | Moderately relevant - likely useful |
| 0.3 - 0.4 | Low relevance - may not be helpful |
| < 0.3 | Very low - probably not what you need |
Note: Scores are cosine similarity enhanced with keyword matching. Scores above 1.0 are possible due to additive boosts.
The search uses a combination of:
- Semantic similarity - embedding-based cosine similarity
- Keyword matching - boosts for exact class/file name matches
- Source prioritization - prefers actual code over changelogs
Automatic boosts applied:
| Match Type | Boost |
|---|---|
Exact class file match (e.g., "StringSubstitutor" → StringSubstitutor.java) |
+0.5 |
Folder/package match (e.g., "lookup folder" → files in /lookup/) |
+0.4 |
package-info.java for folder queries |
+0.8 |
Main source code (src/main/java) |
×1.3 |
| Test files | ×1.1 |
| Changelogs and release notes | ×0.5 (penalty) |
The RAG pipeline automatically detects and optimizes for different query types:
Ask about specific classes, methods, or functionality:
python main.py --query "How does StringSubstitutor work?"
python main.py --query "What methods does TextStringBuilder have?"The search prioritizes exact file matches and provides detailed code context.
Ask about entire packages or folders to get an overview:
python main.py --query "What functionality is in the lookup folder?"
python main.py --query "What does the translate package contain?"
python main.py --query "Show me what's in the similarity directory"Keywords like "folder", "package", "directory", or "module" trigger package-overview mode:
- Prioritizes
package-info.java(package documentation) - Returns diverse files from the folder (not just chunks of one file)
- LLM prompt asks for package overview, not class deep-dive
The pipeline includes basic hallucination detection to warn when the LLM may have invented details:
⚠️ **Warning: Possible inaccuracies detected**
The following references could not be verified in the source code:
- Class 'FakeClass' not found in codebase
This checks whether class names mentioned in the answer actually exist in the retrieved sources.
The pipeline includes optional LangChain integration for advanced features like chains, memory, and agents.
pip install langchain langchain-core langchain-community langchain-ollamaOr uncomment the LangChain lines in requirements.txt.
The HybridCodeRetriever wraps our hybrid search logic in a LangChain-compatible interface:
from vector_store import SimpleVectorStore
from langchain_integration import get_langchain_retriever
# Load vector store
vs = SimpleVectorStore()
vs.load_index()
# Create LangChain retriever (preserves all custom boosting logic)
retriever = get_langchain_retriever(vs, top_k=5)
# Use with any LangChain chain
docs = retriever.invoke("How does StringSubstitutor work?")
for doc in docs:
print(f"{doc.metadata['path']}: {doc.metadata['relevance_score']:.2f}")Create a complete RAG chain with one function:
from vector_store import SimpleVectorStore
from langchain_integration import create_rag_chain
vs = SimpleVectorStore()
vs.load_index()
# Create chain (uses Ollama by default)
chain = create_rag_chain(vs, model="llama3.2")
# Query
answer = chain.invoke("What string similarity algorithms are available?")
print(answer)For interactive sessions with follow-up questions:
from langchain_integration import create_conversational_chain
vs = SimpleVectorStore()
vs.load_index()
chain, memory = create_conversational_chain(vs, model="llama3.2")
# First question
result1 = chain.invoke(
{"input": "What is StringSubstitutor?"},
config={"configurable": {"session_id": "user1"}}
)
# Follow-up (remembers context)
result2 = chain.invoke(
{"input": "How do I use variable prefixes with it?"},
config={"configurable": {"session_id": "user1"}}
)For simple usage:
from langchain_integration import quick_query
answer = quick_query("What escape utilities are available?", model="tinyllama")| Feature | Without LangChain | With LangChain |
|---|---|---|
| Basic RAG | ✅ Built-in | ✅ create_rag_chain() |
| Conversation memory | ❌ | ✅ create_conversational_chain() |
| Agent tools | ❌ | ✅ Use retriever with agents |
| Streaming responses | ❌ | ✅ Built-in |
| Model switching | Manual | Automatic via LangChain |
The custom retriever preserves all our hybrid search features (keyword boosting, folder detection, source prioritization) while making the pipeline compatible with the LangChain ecosystem.
For systems with limited RAM, use safe mode:
python main.py --build --safeThis uses:
- Sequential processing (1 worker)
- Small batch sizes for embeddings
- Lower memory footprint
The architecture agent performs static analysis and issue detection using percentile-based thresholds that adapt to your codebase's distribution.
- Extracts imports and package dependencies
- Builds dependency graph between classes
- Computes metrics (LOC, method count, coupling)
- Calculates distribution statistics (min, median, mean, max)
Instead of hardcoded thresholds, issues are flagged based on where classes fall in the distribution:
| Metric | Percentile | Meaning |
|---|---|---|
| Lines of Code | 90th | Top 10% largest classes flagged |
| Method Count | 90th | Top 10% by number of methods |
| Dependencies | 85th | Top 15% by outgoing dependencies |
| Dependents | 85th | Top 15% by incoming dependencies |
Why percentile-based? A class with 500 LOC might be "oversized" in a microservices codebase but perfectly normal in a utility library. Percentile thresholds automatically adapt.
To adjust sensitivity, edit analysis/issue_detector.py:
LOC_PERCENTILE = 90 # Flag top 10% (lower = more strict)
METHOD_PERCENTILE = 90 # Flag top 10%
DEPENDENCY_PERCENTILE = 85 # Flag top 15%
DEPENDENT_PERCENTILE = 85 # Flag top 15%When you run analysis, you'll see codebase statistics:
CODEBASE DISTRIBUTION STATISTICS:
loc: min=22, median=110, mean=284.9, max=3200
→ threshold (p90): 711.7
methods: min=0, median=2, mean=9.7, max=179
→ threshold (p90): 20.7
dependencies: min=0, median=2, mean=3.4, max=17
→ threshold (p85): 6.5
This helps you understand what "normal" looks like in your codebase and why certain classes were flagged.
| Issue Type | Description |
|---|---|
| Dependency Magnet | Classes imported by many others (top percentile), hard to change |
| Cyclic Dependency | Circular dependencies between classes |
| Oversized Module | Classes with too many lines/methods (top percentile) |
| God Class | Classes with too many dependencies (top percentile) |
| Unclear Separation | High coupling between packages |
Each issue includes detailed evidence with both the value and codebase context:
evidence = {
"lines_of_code": 1582,
"method_count": 56,
"loc_percentile_threshold": 711.7,
"codebase_median_loc": 110,
"codebase_max_loc": 3200
}Recommendations are organized by priority:
These appear first and should be addressed to resolve major architecture issues:
- Cyclic dependencies - Break circular imports between classes
- Decompose oversized classes - Split large classes (top 10% by LOC/methods)
- Reduce high coupling - Address classes with too many dependencies
After critical issues, you'll see a separator:
✅ Critical issues addressed above.
The following are GOOD PRACTICES for further improvement:
These are optional improvements for better code quality:
- Extract interfaces for dependency magnets
- Clarify package boundaries for cross-package coupling
Each recommendation includes:
- Description: What to do
- Rationale: Why it helps (grounded in evidence)
- Quality Impact: Effect on maintainability, testability, evolvability
- Effort: Low/Medium/High
- Concrete Examples: Specific files and suggested names
Every answer includes:
- Answer: LLM-generated response grounded in retrieved context
- Sources: List of files used with relevance scores
- Uncertainty flag: Warns when retrieved context has low relevance
The pipeline explicitly handles uncertainty:
- If retrieval scores are below 0.3, it returns a clear "insufficient information" message
- LLM is instructed to admit when context doesn't contain the answer
- Sources and relevance scores are always shown for transparency
- "How do I use StringSubstitutor to replace variables in a string?"
- "What string similarity metrics are available?"
- "How does the diff algorithm work?"
- "What escape utilities does StringEscapeUtils provide?"
- "How do I generate random strings?"
- "What functionality is in the lookup folder?"
- "What does the translate package do?"
- "Show me the similarity classes"
- "What's in the matcher package?"
If you're getting low relevance scores (< 0.3), try:
- Use exact class names (CamelCase): "StringSubstitutor" not "string substitutor"
- For packages, use keywords like "folder" or "package": "lookup folder"
- Be specific: "StringSubstitutor variable replacement" instead of "how to replace text"
Error connecting to Ollama
Make sure Ollama is running:
ollama serveUse safe mode:
python main.py --build --safeIf you see "Possible inaccuracies detected", the LLM mentioned classes or concepts not found in the retrieved sources. Verify the answer against the listed source files.
This project was developed with the assistance of AI tools:
- ChatGPT: Used for initial drafting and organizing the steps to solve this task
- GitHub Copilot: Used for coding assistance including:
- Initial scaffolding and basic structure
- Refactoring and debugging help
- LangChain integration
- Type hinting coverage, method signatures, code commenting, and README creation
- Microsoft Copilot: Used for grammar/spell checking and writing suggestions in the final report
All core functionality (excluding documentation, logging, licensing, and type checking) is original work stemming from my own intentions for the project. Initial drafts were refined and debugged with AI assistance and suggestions.