This project implements a Retrieval-Augmented Generation (RAG) system over U.S. SEC filings (10-K, 10-Q) to answer questions using source-grounded LLM responses.
The system ingests real-world financial disclosures, indexes them in a vector database, and generates answers strictly based on retrieved source documents, reducing hallucinations.
- Downloads public SEC filings for a given company
- Parses and cleans raw HTML documents
- Splits documents into semantic chunks
- Generates embeddings using OpenAI
- Indexes content in a persistent vector database (ChromaDB)
- Answers user questions with citations to original filings
SEC Filings (HTML)
-> Parsing & Cleaning
-> Chunking
-> Embeddings
-> ChromaDB
-> Retrieval
-> LLM Answer + Citations
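The chunking stage of this pipeline can be illustrated with a minimal sketch. It is not the project's `sec_ingest.py` logic: it swaps the semantic chunking described above for plain fixed-size word windows with overlap, and the function name `chunk_text` and its parameters are hypothetical.

```python
from typing import List


def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping windows of at most `max_words` words.

    Assumption: simple word windows stand in for semantic chunking here.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, max_words=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a window boundary at least partly present in both neighbouring chunks, at the cost of some duplicated text in the index.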
- Python
- OpenAI API (embeddings + chat completions)
- ChromaDB (persistent vector store)
- BeautifulSoup (HTML parsing)
- SEC EDGAR public data
src/
  rag/
    filters.py
    formatting.py
    mmr.py
  sec_download.py
  sec_ingest.py
  chroma_index_openai.py
  search_openai.py
  ask.py
tests/
  test_filters.py
  test_mmr.py
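The contents of `mmr.py` are not shown in this README. As a rough illustration of what maximal marginal relevance (MMR) reranking does, here is a self-contained sketch under stated assumptions: cosine similarity as the similarity measure, and hypothetical names `cosine`, `mmr`, and `lambda_mult`. The project's own implementation may differ.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec: Sequence[float], doc_vecs: List[Sequence[float]],
        k: int = 5, lambda_mult: float = 0.5) -> List[int]:
    """Greedily pick up to k document indices.

    Each step maximises
        lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, picked)
    so lambda_mult=1.0 is pure relevance and lower values favour diversity.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []

    while candidates and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy is the closest match among already selected docs.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a low `lambda_mult`, a near-duplicate of an already selected chunk is penalised heavily, so a less similar chunk can take its place in the top `k`.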
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a .env file in the project root (see .env.example):
OPENAI_API_KEY=your_openai_key_here
OPENAI_EMBED_MODEL=text-embedding-3-small
OPENAI_CHAT_MODEL=gpt-4o-mini
SEC_USER_AGENT=YourName your@email.com
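How the scripts read these values is not shown in this README. One plausible pattern is sketched below, assuming the variables are already present in the process environment (for example exported in the shell, or loaded from `.env` by a helper such as python-dotenv, which is an assumption). The `Settings` class and `load_settings` function are hypothetical names, and the model defaults simply mirror the example values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    embed_model: str
    chat_model: str
    sec_user_agent: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, failing early if any are missing."""
    env = os.environ if env is None else env

    # Assumption: this sketch treats the API key and SEC user agent as required.
    missing = [name for name in ("OPENAI_API_KEY", "SEC_USER_AGENT") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))

    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        embed_model=env.get("OPENAI_EMBED_MODEL", "text-embedding-3-small"),
        chat_model=env.get("OPENAI_CHAT_MODEL", "gpt-4o-mini"),
        sec_user_agent=env["SEC_USER_AGENT"],
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.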
Download recent SEC filings (10-K / 10-Q) for a company:
python src/sec_download.py --ticker TSLA --limit 10

Convert raw HTML filings into cleaned, chunked text:
python src/sec_ingest.py

Generate embeddings and store them in a persistent ChromaDB index:
python src/chroma_index_openai.py

Query the system using Retrieval-Augmented Generation:
python src/ask.py "What does Tesla say about Cybertruck production ramp?"

Filter by ticker and form:
python src/ask.py \
"What are the main risk factors Tesla lists?" \
--ticker TSLA \
--form 10-K \
--k 5

Date range filtering:
python src/ask.py \
"How does Tesla describe liquidity risks?" \
--ticker TSLA \
--form 10-K \
--min_date 2023-01-01 \
--max_date 2024-12-31

MMR reranking (diversified results):
python src/ask.py \
"What are Tesla's key market risks?" \
--ticker TSLA \
--form 10-K \
--mmr \
--k 5

The assistant is explicitly instructed to:
- Answer only using retrieved sources
- Return "Not found in provided sources" if information is missing
- Cite every factual statement
Section-aware chunking (e.g. Item 1A, Item 7)
-
Hybrid search (BM25 + vectors)
-
Reranking (cross-encoder)
-
Evaluation pipeline with benchmark questions
-
Web or API interface

Example use cases:
- Financial research assistants
- Compliance & regulatory analysis
- Internal document search
- Private knowledge-base chatbots