SEC Filings RAG Assistant

This project implements a Retrieval-Augmented Generation (RAG) system over U.S. SEC filings (10-K, 10-Q) to answer questions using source-grounded LLM responses.

The system ingests real-world financial disclosures, indexes them in a vector database, and generates answers strictly based on retrieved source documents, reducing hallucinations.

🔍 What This Project Does

Downloads public SEC filings for a given company
Parses and cleans raw HTML documents
Splits documents into semantic chunks
Generates embeddings using OpenAI
Indexes content in a persistent vector database (ChromaDB)
Answers user questions with citations to original filings

🧠 Architecture Overview

SEC Filings (HTML)
  -> Parsing & Cleaning
  -> Chunking
  -> Embeddings
  -> ChromaDB
  -> Retrieval
  -> LLM Answer + Citations

📦 Tech Stack

Python
OpenAI API (embeddings + chat completions)
ChromaDB (persistent vector store)
BeautifulSoup (HTML parsing)
SEC EDGAR public data

📁 Project Structure

src/
  rag/
    filters.py
    formatting.py
    mmr.py
  sec_download.py
  sec_ingest.py
  chroma_index_openai.py
  search_openai.py
  ask.py
tests/
  test_filters.py
  test_mmr.py

🚀 How to Run

1. Setup environment

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Configure environment variables

Create a .env file in the project root (see .env.example):

OPENAI_API_KEY=your_openai_key_here
OPENAI_EMBED_MODEL=text-embedding-3-small
OPENAI_CHAT_MODEL=gpt-4o-mini
SEC_USER_AGENT=YourName your@email.com

⚠️ Never commit .env. API keys must remain private.

3. Download SEC filings

Download recent SEC filings (10-K / 10-Q) for a company:

python src/sec_download.py --ticker TSLA --limit 10

4. Parse and chunk documents

Convert raw HTML filings into cleaned, chunked text:

python src/sec_ingest.py

5. Build vector index (OpenAI embeddings)

Generate embeddings and store them in a persistent ChromaDB index:

python src/chroma_index_openai.py

6. Ask questions (RAG)

Query the system using Retrieval-Augmented Generation:

python src/ask.py "What does Tesla say about Cybertruck production ramp?"

Filter by ticker and form

python src/ask.py \
  "What are the main risk factors Tesla lists?" \
  --ticker TSLA \
  --form 10-K \
  --k 5

Date range filtering

python src/ask.py \
  "How does Tesla describe liquidity risks?" \
  --ticker TSLA \
  --form 10-K \
  --min_date 2023-01-01 \
  --max_date 2024-12-31

MMR reranking (diversified results)

python src/ask.py \
"What are Tesla's key market risks?" \
--ticker TSLA \
--form 10-K \
--mmr \
--k 5

🛡️ Hallucination Control

The assistant is explicitly instructed to:

Answer only using retrieved sources
Return "Not found in provided sources" if information is missing
Cite every factual statement

📈 Possible Improvements

Section-aware chunking (e.g. Item 1A, Item 7)
Hybrid search (BM25 + vectors)
Reranking (cross-encoder)
Evaluation pipeline with benchmark questions
Web or API interface

📌 Use Cases

Financial research assistants
Compliance & regulatory analysis
Internal document search
Private knowledge-base chatbots

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SEC Filings RAG Assistant

🔍 What This Project Does

🧠 Architecture Overview

📦 Tech Stack

📁 Project Structure

🚀 How to Run

1. Setup environment

2. Configure environment variables

3. Download SEC filings

4. Parse and chunk documents

5. Build vector index (OpenAI embeddings)

6. Ask questions (RAG)

🛡️ Hallucination Control

📈 Possible Improvements

📌 Use Cases

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

SEC Filings RAG Assistant

🔍 What This Project Does

🧠 Architecture Overview

📦 Tech Stack

📁 Project Structure

🚀 How to Run

1. Setup environment

2. Configure environment variables

3. Download SEC filings

4. Parse and chunk documents

5. Build vector index (OpenAI embeddings)

6. Ask questions (RAG)

🛡️ Hallucination Control

📈 Possible Improvements

📌 Use Cases

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages