This project builds an Intelligent Document Search and Retrieval System on top of Airflow, FastAPI, Streamlit, and Pinecone, a vector database. Users can securely register, log in, and search a collection of structured documents: given a query or question, the system retrieves relevant documents via similarity search. The project is divided into two major parts: automating data acquisition and embedding creation (Part 1), and developing a client-facing application using Streamlit and FastAPI (Part 2).
- Aditya Kawale
- NUID 002766716
- Email [email protected]
- Nidhi Singh
- NUID 002925684
- Email [email protected]
- Uddhav Zambare
- NUID 002199488
- Email [email protected]
- Codelab Doc - link
- Demo Link - link
- Airflow - link
- Pinecone - link
- Pinecone-client - link
- SEC Forms - link
- Nougat Library - link
- PyPDF Documentation - link
- Open AI Cookbook - link
- Streamlit - link
- Designed for data acquisition and embedding generation.
- Incorporates parameters in YAML format for:
- Specifying a list of files for processing (at least 5 from the SEC website).
- Choosing the processing option, either "Nougat" or "PyPdf."
- Providing credentials in YAML format.
- Implements data validation checks to ensure accurate parsed data.
- Generates embeddings and metadata associated with chunked texts.
- Saves the file extracts in a CSV file.
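As a rough illustration, the parameter checks described above might look like the following Python sketch. The key names `files` and `processing_option` are assumptions, not the project's actual YAML schema; in practice the dict would come from `yaml.safe_load` on the submitted config file.

```python
# Hypothetical sketch of Pipeline 1's parameter validation.
# Key names are assumptions; the real dict would be produced by
# yaml.safe_load() on the user-supplied YAML file.
ALLOWED_PROCESSORS = {"Nougat", "PyPdf"}

def validate_params(params: dict) -> None:
    files = params.get("files", [])
    if len(files) < 5:
        raise ValueError("at least 5 SEC files are required")
    if params.get("processing_option") not in ALLOWED_PROCESSORS:
        raise ValueError("processing_option must be 'Nougat' or 'PyPdf'")

validate_params({
    "files": ["f1.pdf", "f2.pdf", "f3.pdf", "f4.pdf", "f5.pdf"],
    "processing_option": "PyPdf",
})
print("params OK")
```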
- Intended for inserting records into the Pinecone vector database.
- Parameterizes the source path of the CSV file for loading into the Pinecone database.
- Capable of creating/updating/deleting the index as needed when data is refreshed.
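A minimal sketch of this load step, assuming hypothetical CSV column names (`id`, `form`, `chunk_text`, and `embedding` stored as a JSON list). The Pinecone calls themselves are shown as comments, since they require live credentials:

```python
# Hypothetical sketch: read the CSV produced by Pipeline 1 and shape each
# row into a (id, vector, metadata) tuple suitable for Pinecone upserts.
# Column names are assumptions, not the project's actual CSV layout.
import csv
import io
import json

CSV_SAMPLE = """id,form,chunk_text,embedding
10k-0001,10-K,"Revenue increased 12%...","[0.12, -0.03, 0.88]"
"""

def rows_to_vectors(fh):
    for row in csv.DictReader(fh):
        yield (
            row["id"],
            json.loads(row["embedding"]),  # embedding stored as a JSON list
            {"form": row["form"], "text": row["chunk_text"]},
        )

vectors = list(rows_to_vectors(io.StringIO(CSV_SAMPLE)))
print(vectors[0][0])  # 10k-0001
# With pinecone-client v2 the batch would then be sent with roughly:
#   pinecone.init(api_key=..., environment=...)
#   pinecone.Index("sec-forms").upsert(vectors=vectors)
```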
- Implements user registration and login functionality.
- Utilizes JWT (JSON Web Token) authentication to secure API endpoints.
- Stores user login credentials and hashed passwords in a SQL database.
- Stores application logs in the database for auditing and troubleshooting.
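Conceptually, the token flow works like the stdlib sketch below: the server signs a payload, hands it to the client, and verifies the signature and expiry on each request. The real backend would use a JWT library (e.g. python-jose or PyJWT), and the secret would come from the `.env` file; this sketch only illustrates the HMAC-signing principle.

```python
# Conceptual stdlib sketch of the JWT idea (sign, then verify + expiry
# check). Not the project's implementation; a JWT library handles this
# in the real app.
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # assumption: loaded from .env in the real app

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_token(username: str, ttl: int = 3600) -> str:
    payload = _b64(json.dumps({"sub": username, "exp": time.time() + ttl}).encode())
    sig = _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())
    return f"{payload}.{sig}"

def verify_token(token: str) -> dict:
    payload, sig = token.split(".")
    expected = _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims

print(verify_token(make_token("aditya"))["sub"])  # aditya
```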
- Provides a user registration and login page for creating accounts and logging in securely.
- Offers a Question Answering interface for querying the indexed documents.
- Allows users to select from various preprocessed forms (at least 5) such as documents, templates, or other structured data sources.
- Enables users to input queries or questions and perform searches in the Pinecone vector database using a Similarity search technique.
- Filters searches based on the selected form.
- Performs comprehensive searches across all items in the index if no specific form is selected by the user.
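The filtered/unfiltered search behaviour can be sketched in plain Python (hypothetical records and toy 2-d vectors); in the deployed app this is a single Pinecone call, roughly `index.query(vector=q, top_k=5, filter={"form": {"$eq": form}})`, with the filter omitted when no form is selected.

```python
# Illustrative sketch of the search logic: cosine similarity over stored
# vectors, optionally restricted to one form. Records and vectors are
# made up for the example; Pinecone performs this server-side.
import math

INDEX = [  # hypothetical records: (id, vector, metadata)
    ("a", [1.0, 0.0], {"form": "10-K"}),
    ("b", [0.6, 0.8], {"form": "10-Q"}),
    ("c", [0.0, 1.0], {"form": "10-K"}),
]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def search(query, form=None, top_k=5):
    hits = [(cosine(query, vec), rid) for rid, vec, meta in INDEX
            if form is None or meta["form"] == form]  # no form -> whole index
    return [rid for _, rid in sorted(hits, reverse=True)[:top_k]]

print(search([1.0, 0.1], form="10-K"))  # ['a', 'c']
```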
This project combines the capabilities of various technologies and tools to create a robust and user-friendly system for intelligent document search and retrieval. It enhances data processing, search functionality, and user security to provide a comprehensive solution for information retrieval from structured documents.
The app can be accessed directly from Streamlit Cloud via link
OR
- Clone the repository to your local machine:
  `git clone <repository_url>`
Backend
- Navigate to the backend directory:
  `cd backend`
- Create a `.env` file with the necessary environment variables and API keys. Reference: `example.env`
- Install dependencies. In a terminal, run:
  `pipenv install --dev`
- Activate the virtual environment:
  `pipenv shell`
- Start the backend server with Uvicorn:
  `uvicorn fastapiservice.src.app:app --reload`
  The backend API will be accessible at http://127.0.0.1:8000.
Frontend
- Navigate to the frontend directory:
  `cd frontend`
- Create and activate a virtual environment:
  `python -m venv venv`
  `source venv/bin/activate` (on Windows, use: `venv\Scripts\activate`)
- Install the required Python packages:
  `pip install -r requirements.txt`
- Create a `.streamlit/secrets.toml` file with the necessary environment variables. Reference: `example_secrets.toml`
- Start the frontend application:
  `streamlit run main.py`
Airflow setup reference: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html
- You must have Docker Desktop installed -> https://docs.docker.com/get-docker/
- You must clone this repository
- You must have a MySQL database installed -> https://dev.mysql.com/downloads/mysql/
- Log in as the MySQL root user and run azure-mysql-database/1_application_user_db_setup.sql (change the passwords as per your choice)
- Log in as the application_dba user and run azure-mysql-database/2_application_table_setup.sql (change the passwords as per your choice)
- cd into airflow-pipeline/
- Open a terminal and run:
export AIRFLOW_HOME=$PWD
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.7.2/docker-compose.yaml'
mkdir -p ./logs ./plugins ./config
echo -e "AIRFLOW_UID=$(id -u)" > .env
docker compose up airflow-init
docker compose up
- Aditya: 33%
- Nidhi: 33%
- Uddhav: 34%
| Developer | Deliverables |
|---|---|
| Aditya | Airflow pipelines 1 and 2 |
| Aditya | Google Cloud Composer: Airflow deployment |
| Aditya | GitHub code integration |
| Uddhav | Pinecone DB manipulation functions |
| Uddhav | Q&A chatbot using Pinecone embeddings |
| Uddhav | Streamlit application and deployment |
| Uddhav | Architecture diagrams |
| Nidhi | JWT authentication to secure API endpoints |
| Nidhi | FastAPI access points |
| Nidhi | Data research and documentation |
WE ATTEST THAT WE HAVEN’T USED ANY OTHER STUDENTS’ WORK IN OUR ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK.