RAG Pipeline for Apache Commons Text

A minimal Retrieval Augmented Generation (RAG) system for answering questions about the Apache Commons Text Java library, with architecture analysis capabilities.

Attribution

This repository contains the Apache Commons Text library source code, licensed under the Apache License 2.0.

The rag/ directory contains original code for RAG-based code analysis, developed as part of this code analysis project.

Overview

This RAG pipeline:

Ingests Java source code, tests, and documentation from the repository
Chunks documents using a file-aware strategy optimized for code
Embeds chunks using sentence-transformers (local, no API key needed)
Retrieves relevant context using cosine similarity search
Generates answers using an LLM with grounded prompts
Analyzes architecture issues and suggests improvements

Project Structure

rag/
├── main.py                    # CLI entry point
├── config.py                  # Configuration settings  
├── ingest.py                  # Document ingestion, extraction, chunking
├── vector_store.py            # Embedding and vector search
├── langchain_integration.py   # LangChain compatibility (optional)
├── architecture_agent.py      # Architecture analysis facade
├── requirements.txt           # Python dependencies
├── pipeline/                  # RAG pipeline implementations
│   ├── __init__.py           # Base classes and interfaces
│   ├── native_pipeline.py    # Native RAG with OpenAI/Ollama
│   └── langchain_pipeline.py # LangChain-based RAG
├── analysis/                  # Architecture analysis modules
│   ├── models.py             # Data classes (DependencyInfo, etc.)
│   ├── dependency_analyzer.py # Static dependency analysis
│   ├── issue_detector.py     # Architecture issue detection
│   └── refactoring_advisor.py # Refactoring recommendations
├── utils/                     # Shared utilities
│   ├── lazy_imports.py       # Lazy loading utilities
│   ├── logging_config.py     # Consistent logging setup
│   └── document_formatter.py # Document formatting for RAG
├── tests/                     # Test suite
│   └── test_rag_queries.py   # RAG query tests
└── index/                     # Saved embeddings and documents

TL;DR - Quick Start

# 1. Setup (first-time)
cd rag
pip install -r requirements.txt
ollama serve &              # Start local LLM (in background)
ollama pull tinyllama       # Download model (default, fast)
python main.py --build      # Build vector index

# 2. Query the codebase
python main.py --query "How does StringSubstitutor work?"

# 3. Run architecture analysis
python main.py --analyze

Design Decisions

This section documents the key architectural choices made in building this RAG system.

Chunking Strategy

We use a file-level chunking strategy with class/method awareness:

Strategy	Why We Chose It
File-level primary	Java files are natural semantic units (typically one top-level class per file)
Preserve imports	Essential for understanding dependencies
Keep Javadocs	Documentation stays with code
Split large files	Maintain embedding model limits while preserving context

For files under 8000 chars (~2000 tokens): kept as a single chunk
For larger files: split at method boundaries with the class header preserved
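
The size rule above can be sketched roughly as follows. This is a minimal illustration, not the code in ingest.py: the function name `chunk_java_file`, the regex heuristic for method starts, and the greedy packing of methods into chunks are all assumptions made for the example.

```python
import re

MAX_CHUNK_CHARS = 8000  # size limit stated above (~2000 tokens)

# Heuristic (assumption): a method or constructor starts on an indented line
# that begins with an access modifier, has a parameter list, and ends in "{".
METHOD_START = re.compile(
    r"^\s+(?:public|protected|private)\b[^;=]*\([^)]*\)[^;]*\{\s*$"
)
COMMENT_PREFIXES = ("/**", "/*", "*", "@", "//")


def _segment_start(lines, idx):
    """Walk back from a method line so its Javadoc and annotations stay with it."""
    while idx > 0 and lines[idx - 1].strip().startswith(COMMENT_PREFIXES):
        idx -= 1
    return idx


def chunk_java_file(source: str, max_chars: int = MAX_CHUNK_CHARS) -> list[str]:
    """Return the file as one chunk, or split it at method boundaries."""
    if len(source) < max_chars:
        return [source]
    lines = source.splitlines(keepends=True)
    starts = [
        _segment_start(lines, i)
        for i, line in enumerate(lines)
        if METHOD_START.match(line)
    ]
    if not starts:
        return [source]
    # Everything before the first method: package, imports, class declaration.
    header = "".join(lines[: starts[0]])
    bounds = starts + [len(lines)]
    chunks, body = [], ""
    for begin, end in zip(bounds, bounds[1:]):
        segment = "".join(lines[begin:end])
        # Start a new chunk when the next method would exceed the limit.
        if body and len(header) + len(body) + len(segment) > max_chars:
            chunks.append(header + body)
            body = ""
        body += segment
    if body:
        chunks.append(header + body)
    return chunks
```

Every chunk repeats the header, so imports and the class declaration travel with each group of methods. A single method longer than the limit still ends up in one oversized chunk in this sketch.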

Retrieval Strategy

We use hybrid search combining semantic and lexical matching:

Component	Approach	Rationale
Base retrieval	Cosine similarity on embeddings	Captures semantic meaning
Keyword boosting	Exact class/file name matches	Important for code (CamelCase names)
Source prioritization	`src/main/` weighted higher than changelogs	Prefer actual implementation
Folder detection	Boosts `package-info.java` for folder queries	Package docs are most informative

Top-k (default 5) retrieval with diversity: for folder queries, we limit chunks from the same file to ensure coverage across the package.
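
A per-file cap is one simple way to get this kind of diversity. The sketch below assumes retrieval hits are `(path, score)` pairs; `select_top_k` and `max_per_file` are hypothetical names, and the actual cap used by the pipeline is not stated here.

```python
from collections import Counter


def select_top_k(scored_chunks, k=5, max_per_file=None):
    """Pick the k best-scoring chunks, optionally capping chunks per file.

    scored_chunks: iterable of (path, score) pairs.
    max_per_file: None disables the cap (class queries); a small integer
    spreads results across files (folder queries).
    """
    ranked = sorted(scored_chunks, key=lambda chunk: chunk[1], reverse=True)
    taken = Counter()
    selected = []
    for path, score in ranked:
        if max_per_file is not None and taken[path] >= max_per_file:
            continue  # this file already contributed its share
        selected.append((path, score))
        taken[path] += 1
        if len(selected) == k:
            break
    return selected
```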

Prompting Strategy

The prompts are designed to ground answers in retrieved context and minimize hallucination:

Technique	Implementation
Context-first prompting	Retrieved code snippets placed before the question
Query type detection	Different prompts for class queries vs. folder overview
Source attribution	LLM instructed to reference specific files
Uncertainty handling	Explicit instruction to say "I don't know" if context insufficient
Post-hoc verification	Check if mentioned classes exist in retrieved sources

Prompt structure:

System: You are a Java expert answering about Apache Commons Text.
        Base your answer ONLY on the provided code snippets.
        If the context doesn't contain the answer, say so.
        
Context: [Retrieved code chunks with file paths]

Question: {user_query}
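
Assembling this structure in code might look like the sketch below. The helper `build_prompt` and the `// File:` marker used to label each chunk are assumptions for illustration; only the system text and the context-before-question ordering come from the structure above.

```python
SYSTEM_PROMPT = (
    "You are a Java expert answering about Apache Commons Text.\n"
    "Base your answer ONLY on the provided code snippets.\n"
    "If the context doesn't contain the answer, say so."
)


def build_prompt(user_query, chunks):
    """Return (system, user) message strings.

    chunks: list of (path, text) pairs from retrieval. Each chunk is
    labelled with its file path so the model can attribute sources.
    """
    context = "\n\n".join(f"// File: {path}\n{text}" for path, text in chunks)
    # Context first, question last.
    user = f"Context:\n{context}\n\nQuestion: {user_query}"
    return SYSTEM_PROMPT, user
```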

Dependency Analysis Approach

Static analysis parses import statements with regular expressions (no AST) and flags issues using percentile-based thresholds:

Aspect	Approach	Rationale
Parsing	Regex on import statements	Fast, sufficient for Java; no external parser needed
Graph building	Build adjacency list of class→dependencies	Enables cycle detection, coupling metrics
Issue detection	Percentile-based thresholds	Adapts to codebase size (see Percentile-Based Thresholds)
Recommendations	Severity-sorted with concrete examples	Critical issues first, actionable suggestions

We chose regex over full AST parsing because:

Java imports are line-based and predictable
No external dependencies (no need for javalang, ANTLR)
Fast enough for interactive use
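
As a rough sketch of the approach, the snippet below parses imports with a regex, builds an adjacency list restricted to classes in the codebase, and finds one cycle with a depth-first search. The names `parse_imports`, `build_graph`, and `find_cycle` are hypothetical and are not taken from analysis/dependency_analyzer.py.

```python
import re

# Matches "import a.b.C;", "import static a.b.C.member;" and "import a.b.*;".
IMPORT_RE = re.compile(
    r"^\s*import\s+(?:static\s+)?([\w.]+?)(?:\.\*)?\s*;", re.MULTILINE
)


def parse_imports(java_source):
    """Return the imported names in the order they appear."""
    return IMPORT_RE.findall(java_source)


def build_graph(sources):
    """sources: dict of fully qualified class name -> Java source text.

    Returns class -> sorted list of imported classes that are also in sources.
    """
    return {
        cls: sorted({imp for imp in parse_imports(text) if imp in sources and imp != cls})
        for cls, text in sources.items()
    }


def find_cycle(graph):
    """Return one dependency cycle as a list of class names, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        state[node] = GREY
        stack.append(node)
        for dep in graph[node]:
            if state[dep] == GREY:  # back edge: dep is on the current path
                return stack[stack.index(dep):] + [dep]
            if state[dep] == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        state[node] = BLACK
        return None

    for node in graph:
        if state[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

One limitation of import-only parsing: Java classes in the same package reference each other without an import statement, so such dependencies are invisible to a sketch like this one.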

Installation

Requirements

Python 3.9+ (uses modern type hints)
Ollama (recommended) or OpenAI API key for LLM inference
~2GB disk space for embeddings model (downloaded on first run)

Setup

cd rag
pip install -r requirements.txt

Usage

1. Build the Index (First Time)

python main.py --build

This creates embeddings for all source files and saves them to rag/index/.

2. Ask Questions

Single query:

python main.py --query "How does StringSubstitutor handle variable substitution?"

Interactive mode:

python main.py --interactive

3. Architecture Analysis

Run architecture analysis to detect issues and get recommendations:

python main.py --analyze

This will:

Analyze dependencies using static analysis (imports, package references)
Detect architecture issues (dependency magnets, cycles, oversized modules)
Generate concrete refactoring recommendations with rationale

4. Programmatic Usage

from main import create_rag_pipeline
from architecture_agent import analyze_architecture

# RAG Q&A
pipeline = create_rag_pipeline()
result = pipeline.query("What text similarity algorithms are available?")
print(result.answer)  # result is a RAGResult dataclass
print(result.sources)

# Architecture analysis
analysis = analyze_architecture()
for issue in analysis["issues"]:
    print(f"[{issue.severity}] {issue.title}")

5. Running Tests

cd rag
python -m pytest tests/ -v

The test suite includes:

Query tests for specific classes (e.g., StringSubstitutor)
Folder/package query tests
General repository questions
Tests for queries that should fail (irretrievable info)

Configuration

LLM Backend

The pipeline supports both local LLMs (Ollama) and OpenAI.

Option 1: Local LLM with Ollama (Recommended)

No API key required! Uses Ollama for local inference.

1. Install Ollama:

# macOS
brew install ollama

# Or download from https://ollama.ai

2. Start the Ollama server:

ollama serve

3. Pull a model:

ollama pull llama3.2  # or use tinyllama

4. Run queries:

python main.py --query "How does StringSubstitutor work?"

Recommended models:

Model	Size	Description
`tinyllama`	600MB	Very fast, basic quality (default)
`llama3.2`	2GB	Good balance of speed and quality
`codellama`	4GB	Optimized for code understanding
`mistral`	4GB	Fast and capable
`phi3`	2GB	Smaller, runs on limited hardware

Note: tinyllama is the default for fast responses, but larger models like llama3.2 or codellama give significantly better answers, especially for package overview queries.

Change the model:

export OLLAMA_MODEL=llama3.2
python main.py --query "..."

Option 2: OpenAI

Set your API key to use OpenAI instead:

export OPENAI_API_KEY="your-key-here"

When OPENAI_API_KEY is set, the pipeline automatically uses OpenAI.

Environment Variables

Variable	Description	Default
`OLLAMA_MODEL`	Ollama model to use	`tinyllama`
`OLLAMA_URL`	Ollama API URL	`http://localhost:11434/v1`
`OPENAI_API_KEY`	OpenAI API key (uses OpenAI if set)	(none)
`OPENAI_MODEL`	OpenAI model to use	`gpt-4o-mini`
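
The selection rule implied by this table can be sketched as below. The function `resolve_llm_config` and the shape of the returned dictionary are assumptions for illustration; the variable names and defaults are the ones listed above.

```python
import os


def resolve_llm_config(env=None):
    """Pick the LLM backend from environment variables.

    OpenAI is used when OPENAI_API_KEY is set; otherwise Ollama with its
    documented defaults. Pass a dict as env to test without touching os.environ.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if api_key:
        return {
            "backend": "openai",
            "model": env.get("OPENAI_MODEL", "gpt-4o-mini"),
            "api_key": api_key,
        }
    return {
        "backend": "ollama",
        "model": env.get("OLLAMA_MODEL", "tinyllama"),
        "base_url": env.get("OLLAMA_URL", "http://localhost:11434/v1"),
    }
```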

Understanding Relevance Scores

The sources shown after each answer include relevance scores and line numbers:

- src/main/java/.../StringSubstitutor.java (L45-120) [relevance: 0.65]
- src/test/java/.../StringSubstitutorTest.java (L1-50) [relevance: 0.52]

How to interpret scores:

Score	Meaning
> 0.6	Highly relevant - confident match
0.4 - 0.6	Moderately relevant - likely useful
0.3 - 0.4	Low relevance - may not be helpful
< 0.3	Very low - probably not what you need

Note: Scores are cosine similarity enhanced with keyword matching. Scores above 1.0 are possible due to additive boosts.

Hybrid Search

The search uses a combination of:

Semantic similarity - embedding-based cosine similarity
Keyword matching - boosts for exact class/file name matches
Source prioritization - prefers actual code over changelogs

Automatic boosts applied:

Match Type	Boost
Exact class file match (e.g., "StringSubstitutor" → `StringSubstitutor.java`)	+0.5
Folder/package match (e.g., "lookup folder" → files in `/lookup/`)	+0.4
`package-info.java` for folder queries	+0.8
Main source code (`src/main/java`)	×1.3
Test files	×1.1
Changelogs and release notes	×0.5 (penalty)
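
To show how these boosts might combine, here is a minimal sketch. The order of operations (additive boosts first, then one multiplier), the case-sensitive class-name match, and the file-name test for changelogs are assumptions; the real logic in vector_store.py may differ.

```python
import re
from pathlib import PurePosixPath


def boosted_score(query, path, cosine, folder_query=False):
    """Apply the boosts from the table to a raw cosine similarity."""
    words = set(re.findall(r"[A-Za-z_]\w*", query))
    lowered = {word.lower() for word in words}
    p = PurePosixPath(path)
    score = cosine

    # Additive boosts.
    if p.suffix == ".java" and p.stem in words:
        score += 0.5  # exact class file match
    if folder_query:
        if any(part.lower() in lowered for part in p.parts[:-1]):
            score += 0.4  # a directory name appears in the query
        if p.name == "package-info.java":
            score += 0.8

    # Multiplicative source prioritization.
    if "src/main/java" in path:
        score *= 1.3
    elif "src/test/" in path:
        score *= 1.1
    elif "changes" in p.name.lower() or "release-notes" in p.name.lower():
        score *= 0.5  # changelog penalty (file-name test is an assumption)
    return score
```

With these rules, a cosine similarity of 0.5 for `StringSubstitutor.java` under `src/main/java` becomes (0.5 + 0.5) × 1.3 = 1.3, which illustrates why final scores can exceed 1.0.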

Query Types

The RAG pipeline automatically detects and optimizes for different query types:

Class/Code Queries (Default)

Ask about specific classes, methods, or functionality:

python main.py --query "How does StringSubstitutor work?"
python main.py --query "What methods does TextStringBuilder have?"

The search prioritizes exact file matches and provides detailed code context.

Folder/Package Queries

Ask about entire packages or folders to get an overview:

python main.py --query "What functionality is in the lookup folder?"
python main.py --query "What does the translate package contain?"
python main.py --query "Show me what's in the similarity directory"

Keywords like "folder", "package", "directory", or "module" trigger package-overview mode:

Prioritizes package-info.java (package documentation)
Returns diverse files from the folder (not just chunks of one file)
LLM prompt asks for package overview, not class deep-dive
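
Keyword-based detection of this kind can be as small as the sketch below. The function name `detect_query_type` and the `"folder"` / `"class"` return values are hypothetical; the keyword list is the one given above.

```python
import re

FOLDER_KEYWORDS = {"folder", "package", "directory", "module"}


def detect_query_type(query):
    """Return "folder" if the query contains a package-overview keyword, else "class"."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return "folder" if words & FOLDER_KEYWORDS else "class"
```

Matching on whole words keeps a query such as "How is text packaged?" in class mode, at the cost of missing plurals like "packages".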

Hallucination Detection

The pipeline includes basic hallucination detection to warn when the LLM may have invented details:

⚠️ **Warning: Possible inaccuracies detected**
The following references could not be verified in the source code:
  - Class 'FakeClass' not found in codebase

This checks whether class names mentioned in the answer actually exist in the retrieved sources.
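
A check along these lines can be approximated with a CamelCase regex, as in this sketch. The pattern, the function name `unverified_classes`, and the rule that a name counts as verified if it is a source file name or appears in the retrieved text are assumptions for the example.

```python
import re
from pathlib import PurePosixPath

# Heuristic: two or more capitalized word parts, e.g. StringSubstitutor.
CAMEL_CASE = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")


def unverified_classes(answer, source_paths, source_texts=()):
    """Return class-like names in the answer that the sources do not mention."""
    known_files = {PurePosixPath(path).stem for path in source_paths}
    haystack = "\n".join(source_texts)
    mentioned = set(CAMEL_CASE.findall(answer))
    return sorted(
        name
        for name in mentioned
        if name not in known_files and name not in haystack
    )
```

Usage, producing a warning in the style shown above:

```python
missing = unverified_classes(
    "StringSubstitutor delegates to FakeClass.",
    ["src/main/java/org/apache/commons/text/StringSubstitutor.java"],
)
for name in missing:
    print(f"  - Class '{name}' not found in codebase")
```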

LangChain Integration (Optional)

The pipeline includes optional LangChain integration for advanced features like chains, memory, and agents.

Installation

pip install langchain langchain-core langchain-community langchain-ollama

Or uncomment the LangChain lines in requirements.txt.

Using the Custom Retriever

The HybridCodeRetriever wraps our hybrid search logic in a LangChain-compatible interface:

from vector_store import SimpleVectorStore
from langchain_integration import get_langchain_retriever

# Load vector store
vs = SimpleVectorStore()
vs.load_index()

# Create LangChain retriever (preserves all custom boosting logic)
retriever = get_langchain_retriever(vs, top_k=5)

# Use with any LangChain chain
docs = retriever.invoke("How does StringSubstitutor work?")
for doc in docs:
    print(f"{doc.metadata['path']}: {doc.metadata['relevance_score']:.2f}")

Using RAG Chains

Create a complete RAG chain with one function:

from vector_store import SimpleVectorStore
from langchain_integration import create_rag_chain

vs = SimpleVectorStore()
vs.load_index()

# Create chain (uses Ollama by default)
chain = create_rag_chain(vs, model="llama3.2")

# Query
answer = chain.invoke("What string similarity algorithms are available?")
print(answer)

Conversational Memory

For interactive sessions with follow-up questions:

from vector_store import SimpleVectorStore
from langchain_integration import create_conversational_chain

vs = SimpleVectorStore()
vs.load_index()

chain, memory = create_conversational_chain(vs, model="llama3.2")

# First question
result1 = chain.invoke(
    {"input": "What is StringSubstitutor?"},
    config={"configurable": {"session_id": "user1"}}
)

# Follow-up (remembers context)
result2 = chain.invoke(
    {"input": "How do I use variable prefixes with it?"},
    config={"configurable": {"session_id": "user1"}}
)

Quick One-Shot Query

For simple usage:

from langchain_integration import quick_query

answer = quick_query("What escape utilities are available?", model="tinyllama")

Why Use LangChain?

Feature	Without LangChain	With LangChain
Basic RAG	✅ Built-in	✅ `create_rag_chain()`
Conversation memory	❌	✅ `create_conversational_chain()`
Agent tools	❌	✅ Use retriever with agents
Streaming responses	❌	✅ Built-in
Model switching	Manual	Automatic via LangChain

The custom retriever preserves all our hybrid search features (keyword boosting, folder detection, source prioritization) while making the pipeline compatible with the LangChain ecosystem.

Safe Mode

For systems with limited RAM, use safe mode:

python main.py --build --safe

This uses:

Sequential processing (1 worker)
Small batch sizes for embeddings
Lower memory footprint

Architecture Analysis

The architecture agent performs static analysis and issue detection using percentile-based thresholds that adapt to your codebase's distribution.

Static Analysis

Extracts imports and package dependencies
Builds dependency graph between classes
Computes metrics (LOC, method count, coupling)
Calculates distribution statistics (min, median, mean, max)

Percentile-Based Thresholds

Instead of hardcoded thresholds, issues are flagged based on where classes fall in the distribution:

Metric	Percentile	Meaning
Lines of Code	90th	Top 10% largest classes flagged
Method Count	90th	Top 10% by number of methods
Dependencies	85th	Top 15% by outgoing dependencies
Dependents	85th	Top 15% by incoming dependencies

Why percentile-based? A class with 500 LOC might be "oversized" in a microservices codebase but perfectly normal in a utility library. Percentile thresholds automatically adapt.

To adjust sensitivity, edit analysis/issue_detector.py:

LOC_PERCENTILE = 90        # Flag top 10% (lower = more strict)
METHOD_PERCENTILE = 90     # Flag top 10%
DEPENDENCY_PERCENTILE = 85 # Flag top 15%
DEPENDENT_PERCENTILE = 85  # Flag top 15%
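
For intuition, a percentile threshold can be computed as in the sketch below. It uses linear interpolation between ranks, which is an assumption (it would explain fractional thresholds, but the interpolation used by issue_detector.py is not stated here); `percentile` and `flag_outliers` are hypothetical helpers.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def flag_outliers(metric_by_class, pct):
    """Return (threshold, class names whose metric is strictly above it)."""
    threshold = percentile(metric_by_class.values(), pct)
    flagged = sorted(
        name for name, value in metric_by_class.items() if value > threshold
    )
    return threshold, flagged
```

Lowering `pct` lowers the threshold, so more classes are flagged, which matches the "lower = more strict" comment in the settings above.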

Viewing Distribution Statistics

When you run analysis, you'll see codebase statistics:

CODEBASE DISTRIBUTION STATISTICS:
  loc: min=22, median=110, mean=284.9, max=3200
    → threshold (p90): 711.7
  methods: min=0, median=2, mean=9.7, max=179
    → threshold (p90): 20.7
  dependencies: min=0, median=2, mean=3.4, max=17
    → threshold (p85): 6.5

This helps you understand what "normal" looks like in your codebase and why certain classes were flagged.

Issue Detection

Issue Type	Description
Dependency Magnet	Classes imported by many others (top percentile), hard to change
Cyclic Dependency	Circular dependencies between classes
Oversized Module	Classes with too many lines/methods (top percentile)
God Class	Classes with too many dependencies (top percentile)
Unclear Separation	High coupling between packages

Evidence in Issues

Each issue includes detailed evidence with both the value and codebase context:

evidence = {
    "lines_of_code": 1582,
    "method_count": 56,
    "loc_percentile_threshold": 711.7,
    "codebase_median_loc": 110,
    "codebase_max_loc": 3200
}

Recommendations

Recommendations are organized by priority:

🔴 CRITICAL (High Severity)

These appear first and should be addressed to resolve major architecture issues:

Cyclic dependencies - Break circular imports between classes
Decompose oversized classes - Split large classes (top 10% by LOC/methods)
Reduce high coupling - Address classes with too many dependencies

✅ GOOD PRACTICES (Medium/Low Severity)

After critical issues, you'll see a separator:

✅ Critical issues addressed above.
   The following are GOOD PRACTICES for further improvement:

These are optional improvements for better code quality:

Extract interfaces for dependency magnets
Clarify package boundaries for cross-package coupling

Recommendation Details

Each recommendation includes:

Description: What to do
Rationale: Why it helps (grounded in evidence)
Quality Impact: Effect on maintainability, testability, evolvability
Effort: Low/Medium/High
Concrete Examples: Specific files and suggested names

Output Format

Every answer includes:

Answer: LLM-generated response grounded in retrieved context
Sources: List of files used with relevance scores
Uncertainty flag: Warns when retrieved context has low relevance

Uncertainty Handling

The pipeline explicitly handles uncertainty:

If retrieval scores are below 0.3, it returns a clear "insufficient information" message
LLM is instructed to admit when context doesn't contain the answer
Sources and relevance scores are always shown for transparency
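
The first rule can be sketched as a gate in front of the LLM call. This is an illustration under assumptions: `answer_or_abstain` is a hypothetical name, the result is a plain dictionary rather than the pipeline's RAGResult, and the gate uses the best retrieval score (the text above does not say whether the best or all scores are compared).

```python
MIN_RELEVANCE = 0.3  # cutoff stated above


def answer_or_abstain(sources, generate):
    """Skip generation when retrieval found nothing relevant enough.

    sources: list of (path, score) pairs, best first or in any order.
    generate: callable taking the sources and returning the LLM answer.
    """
    best = max((score for _, score in sources), default=0.0)
    if best < MIN_RELEVANCE:
        return {
            "answer": "Insufficient information in the indexed sources to answer this question.",
            "sources": sources,
            "uncertain": True,
        }
    return {"answer": generate(sources), "sources": sources, "uncertain": False}
```

Sources are returned in both branches so the relevance scores stay visible even when the pipeline declines to answer.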

Example Queries

Class/Code Questions

"How do I use StringSubstitutor to replace variables in a string?"
"What string similarity metrics are available?"
"How does the diff algorithm work?"
"What escape utilities does StringEscapeUtils provide?"
"How do I generate random strings?"

Package/Folder Overview

"What functionality is in the lookup folder?"
"What does the translate package do?"
"Show me the similarity classes"
"What's in the matcher package?"

Troubleshooting

Low relevance scores

If you're getting low relevance scores (< 0.3), try:

Use exact class names (CamelCase): "StringSubstitutor" not "string substitutor"
For packages, use keywords like "folder" or "package": "lookup folder"
Be specific: "StringSubstitutor variable replacement" instead of "how to replace text"

Ollama connection errors

Error connecting to Ollama

Make sure Ollama is running:

ollama serve

Out of memory during build

Use safe mode:

python main.py --build --safe

Hallucination warnings

If you see "Possible inaccuracies detected", the LLM mentioned classes or concepts not found in the retrieved sources. Verify the answer against the listed source files.

AI Usage Statement

This project was developed with the assistance of AI tools:

ChatGPT: Used for initial drafting and organizing the steps to solve this task
GitHub Copilot: Used for coding assistance including:
- Initial scaffolding and basic structure
- Refactoring and debugging help
- LangChain integration
- Type hinting coverage, method signatures, code commenting, and README creation
Microsoft Copilot: Used for grammar/spell checking and writing suggestions in the final report

All core functionality (excluding documentation, logging, licensing, and type checking) is original work stemming from my own intentions for the project. Initial drafts were refined and debugged with AI assistance and suggestions.
