An end-to-end Big Data project that ingests unstructured SEC 10-K filings, processes them with Apache Spark, and powers an AI-driven financial analysis dashboard.
Status: ✅ Complete | Tech Stack: Docker, Spark, MinIO, DuckDB, Streamlit, Google Gemini
This project implements a modern "Lakehouse" architecture to solve the problem of analyzing messy legal documents.
- Ingest: A Python script fetches raw 10-K HTML reports from the SEC EDGAR Archives.
- Store: Raw files are saved to MinIO (S3-compatible object storage).
- Process: Apache Spark reads the HTML, cleans tags, redacts PII (emails/phones), and chunks text for AI.
- Analyze: DuckDB queries the processed Parquet files in milliseconds.
- Intelligence (RAG): Google Gemini acts as a reasoning engine, answering user questions based only on the retrieved financial data.
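The Process step above (strip HTML, redact PII, chunk for retrieval) can be sketched as plain Python functions of the kind the Spark job would apply to each document. The regex patterns, placeholder tokens, and chunk sizes here are illustrative assumptions, not the project's exact values:

```python
import re

# Illustrative patterns for the PII redaction pass (not the project's exact regexes).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")
TAG_RE = re.compile(r"<[^>]+>")

def clean_and_redact(html: str) -> str:
    """Strip HTML tags, then mask emails and phone numbers."""
    text = TAG_RE.sub(" ", html)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split cleaned text into overlapping character windows for retrieval."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

In the actual pipeline this logic would run inside a PySpark UDF or `mapPartitions` call over the raw HTML files pulled from MinIO.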
- Infrastructure: Docker & Docker Compose
- Storage: MinIO (Object Storage)
- Processing: Apache Spark (PySpark)
- Query Engine: DuckDB (OLAP)
- UI/Visualization: Streamlit
- GenAI Model: gemini-2.5-flash
1. Prerequisites
- Docker Desktop installed
- Python 3.9+ installed
2. Setup Infrastructure
```bash
git clone https://github.com/atharvaa45/Financial-Compliance-System-using-Evaluated-RAG.git
cd Financial-Compliance-System-using-Evaluated-RAG
docker-compose up -d
```
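The repository's own `docker-compose.yml` defines the actual services; as a rough sketch of what this command brings up, a MinIO service might look like the following (service name, ports, and credentials here are illustrative, not the project's real values):

```yaml
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API endpoint
      - "9001:9001"   # web console
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - minio-data:/data

volumes:
  minio-data:
```

With default ports, the MinIO console would then be reachable on localhost:9001 once the containers are healthy.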

