A Retrieval-Augmented Generation (RAG) system for searching and answering questions about RFC (Request for Comments) documents using semantic search and Claude AI.
This project implements a complete RAG pipeline (a code sketch of the indexing stage follows this list) that:
- Downloads and processes RFC documents from XML sources
- Creates semantic embeddings using sentence-transformers
- Builds a FAISS vector index for efficient similarity search
- Provides semantic search and Q&A capabilities via CLI
- Integrates with Claude AI for intelligent answer generation
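The offline half of the pipeline (chunk → embed → index) can be pictured with a short sketch. This is illustrative only, not the project's code: the real logic lives in `rfc_rag/features.py` and `rfc_rag/modeling/train.py`, and the `"text"` field name is an assumption.

```python
# Minimal sketch of the offline indexing stage. Illustrative only: the
# actual implementation lives in rfc_rag/features.py and
# rfc_rag/modeling/train.py, and the "text" field name is an assumption.
import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load pre-chunked RFC text (one JSON object per line)
with open("data/processed/rfc_chunks.jsonl") as f:
    chunks = [json.loads(line) for line in f]
texts = [chunk["text"] for chunk in chunks]  # assumed field name

# Embed chunks; normalized vectors make inner product equal cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)

# Build a flat inner-product FAISS index and persist it for the predict commands
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings.astype(np.float32))
faiss.write_index(index, "models/rfc_faiss_index.bin")
```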
## Features

- **Semantic Search**: Find relevant RFC sections using natural-language queries
- **AI-Powered Q&A**: Get intelligent answers from Claude based on RFC context
- **Performance Metrics**: Detailed timing and relevance analysis
- **CLI Interface**: Easy-to-use command-line tools
- **CCDS Structure**: Follows Cookiecutter Data Science conventions
## Installation

```bash
# Clone the repository
git clone <your-repo-url>
cd rfc-rag

# Create and activate virtual environment (using uv - recommended)
uv venv
source .venv/bin/activate   # On macOS/Linux
# .venv\Scripts\activate    # On Windows

# Install dependencies
uv pip install -r requirements.txt

# Alternative: using standard Python venv
# python -m venv .venv
# source .venv/bin/activate
# pip install -r requirements.txt

# Set up environment variables
cp .env.template .env
# Edit .env with your actual API keys and settings
export ANTHROPIC_API_KEY="your_claude_api_key"
```

Download and process RFC data:
```bash
# Download RFC XML files
python data/download_bulk.py

# Process XML to chunks, generate embeddings, and build index
make data       # Process XML files to JSONL chunks
make features   # Generate embeddings
make train      # Build FAISS index

# Or run the complete pipeline:
make data features train
```

Search RFCs:
```bash
# Semantic search
python -m rfc_rag.modeling.predict search "TCP congestion control"

# Get AI-powered answers
python -m rfc_rag.modeling.predict answer "How does TCP handle packet loss?"
```

## Usage

### Pipeline commands

```bash
# 1. Process RFC XML files to chunks
python -m rfc_rag.dataset [OPTIONS]

# 2. Generate embeddings (requires the embed subcommand)
python -m rfc_rag.features embed [OPTIONS]

# 3. Build FAISS index
python -m rfc_rag.modeling.train [OPTIONS]
```

### Search

```bash
python -m rfc_rag.modeling.predict search "your query" [OPTIONS]
```

Options:

- `--k`: Number of results (default: 5)
- `--model-name`: Embedding model (default: `all-MiniLM-L6-v2`)

### Answer

```bash
python -m rfc_rag.modeling.predict answer "your question" [OPTIONS]
```

Options:

- `--k`: Number of context chunks (default: 5)
- `--claude-model`: Claude model (default: `claude-3-haiku-20240307`)
- `--model-name`: Embedding model (default: `all-MiniLM-L6-v2`)
## Requirements

- Python 3.10+
- FAISS (CPU version)
- sentence-transformers
- Anthropic Claude API access
- See `requirements.txt` for full dependencies
## Project Organization

```
├── LICENSE            <- Open-source license
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project
├── data
│   ├── external       <- Data from third-party sources (RFC XML files)
│   ├── interim        <- Intermediate data that has been transformed
│   ├── processed      <- The final, canonical data sets for modeling
│   │   ├── rfc_chunks.jsonl     <- Processed RFC text chunks
│   │   └── rfc_embeddings.npz   <- Precomputed embeddings
│   └── raw            <- The original, immutable data dump
│       └── xmlsource-all/       <- Downloaded RFC XML files
│
├── docs               <- Documentation using mkdocs
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│   ├── rfc_faiss_index.bin      <- FAISS vector index
│   └── rfc_metadata.pkl         <- RFC chunk metadata
│
├── notebooks          <- Jupyter notebooks for exploration and analysis
│
├── pyproject.toml     <- Project configuration file with package metadata
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment
│
├── tests              <- Unit tests
│
└── rfc_rag            <- Source code for use in this project
    │
    ├── __init__.py    <- Makes rfc_rag a Python module
    │
    ├── config.py      <- Store useful variables and configuration
    │
    ├── dataset.py     <- Scripts to download or generate data
    │
    ├── features.py    <- Code to create features for modeling
    │
    ├── modeling       <- Scripts to train models and make predictions
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    │
    └── plots.py       <- Code to create visualizations
```
## Architecture

The system follows a standard RAG architecture (a sketch of the query path follows the steps):

1. **Data Ingestion**: RFC XML files are downloaded and parsed
2. **Text Processing**: Documents are chunked into semantic segments
3. **Embedding Generation**: Text chunks are converted to vector embeddings
4. **Vector Indexing**: A FAISS index is built for efficient similarity search
5. **Query Processing**: User queries are embedded and matched against the index
6. **Context Retrieval**: The top-k most relevant chunks are retrieved
7. **Answer Generation**: Claude generates answers based on the retrieved context
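Steps 5 through 7 reduce to a few lines. The following is a hedged sketch, not the actual `rfc_rag/modeling/predict.py` implementation; the prompt wording and chunk field name are assumptions.

```python
# Hedged sketch of the query path (the real logic is in
# rfc_rag/modeling/predict.py). Expects ANTHROPIC_API_KEY in the environment.
import json

import anthropic
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.read_index("models/rfc_faiss_index.bin")
with open("data/processed/rfc_chunks.jsonl") as f:
    texts = [json.loads(line)["text"] for line in f]  # assumed field name

# Embed the query and retrieve the top-k most similar chunks
query = "How does TCP handle packet loss?"
query_vec = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(query_vec, k=5)
context = "\n\n".join(texts[i] for i in ids[0])

# Ask Claude to answer using only the retrieved context
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": f"Answer from this RFC context only:\n\n{context}\n\nQuestion: {query}",
    }],
)
print(message.content[0].text)
```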
## Performance Metrics

The system provides detailed metrics, including:
- End-to-end response times
- Individual component timing (embedding, search, API calls)
- Relevance scores and categorization
- Source attribution with similarity scores
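As an illustration of what per-component timing means here, the snippet below times the embedding and search stages separately. It is a sketch, not the project's actual instrumentation.

```python
# Illustrative sketch of per-component timing; the project's actual
# instrumentation (presumably in rfc_rag/modeling/predict.py) may differ.
import time

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.read_index("models/rfc_faiss_index.bin")

t0 = time.perf_counter()
query_vec = model.encode(["TCP congestion control"], normalize_embeddings=True)
t_embed = time.perf_counter() - t0

t0 = time.perf_counter()
scores, ids = index.search(query_vec, k=5)
t_search = time.perf_counter() - t0

# Inner-product scores over normalized vectors are cosine similarities
print(f"embed: {t_embed * 1000:.1f} ms | search: {t_search * 1000:.1f} ms | top score: {scores[0][0]:.3f}")
```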
## Development

This project follows Cookiecutter Data Science (CCDS) conventions:

- Use `make` commands for common tasks
- Keep data processing scripts in `data/`
- Model training in `rfc_rag/modeling/train.py`
- Inference in `rfc_rag/modeling/predict.py`
- Tests in `tests/`
- Documentation in `docs/`
## License

See the LICENSE file for details.