An end-to-end Big Data project that ingests unstructured SEC 10-K filings, processes them with Apache Spark, and powers an AI-driven financial analysis dashboard.
Status: ✅ Complete | Tech Stack: Docker, Spark, MinIO, DuckDB, Streamlit, Google Gemini
This project implements a modern "Lakehouse" architecture to solve the problem of analyzing messy legal documents.
- Ingest: A Python script fetches raw 10-K HTML reports from the SEC EDGAR Archives.
- Store: Raw files are saved to MinIO (S3-compatible object storage).
- Process: Apache Spark reads the HTML, cleans tags, redacts PII (emails/phones), and chunks text for AI.
- Analyze: DuckDB queries the processed Parquet files in milliseconds.
- Intelligence (RAG): Google Gemini acts as a reasoning engine, answering user questions based only on the retrieved financial data.
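The Process step above (strip HTML, redact PII, chunk for retrieval) can be sketched as plain Python functions of the kind the Spark job would apply to each document. The regex patterns, placeholder tokens, and chunk sizes here are illustrative assumptions, not the project's exact values:

```python
import re

# Illustrative patterns for the PII redaction pass (not the project's exact regexes).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")
TAG_RE = re.compile(r"<[^>]+>")

def clean_and_redact(html: str) -> str:
    """Strip HTML tags, then mask emails and phone numbers."""
    text = TAG_RE.sub(" ", html)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split cleaned text into overlapping character windows for retrieval."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

In the actual pipeline this logic would run inside a PySpark UDF or `mapPartitions` call over the raw HTML files pulled from MinIO.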
- Infrastructure: Docker & Docker Compose
- Storage: MinIO (Object Storage)
- Processing: Apache Spark (PySpark)
- Query Engine: DuckDB (OLAP)
- UI/Visualization: Streamlit
- GenAI Model: gemini-2.5-flash
1. Prerequisites
- Docker Desktop installed
- Python 3.9+ installed
2. Setup Infrastructure
```bash
git clone https://github.com/atharvaa45/Financial-Compliance-System-using-Evaluated-RAG.git
cd Financial-Compliance-System-using-Evaluated-RAG
docker-compose up -d
```
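The repository's own `docker-compose.yml` defines the actual services; as a rough sketch of what this command brings up, a MinIO service might look like the following (service name, ports, and credentials here are illustrative, not the project's real values):

```yaml
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API endpoint
      - "9001:9001"   # web console
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - minio-data:/data

volumes:
  minio-data:
```

With default ports, the MinIO console would then be reachable on localhost:9001 once the containers are healthy.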

