The PDF Document Summarization Q&A Chatbot helps analysts load PDF documents from the U.S. Securities and Exchange Commission (SEC) website, extract their text, and obtain summaries. The team is evaluating two libraries for text extraction, Nougat and PyPDF. The question and answer (Q&A) system draws inspiration from the OpenAI Cookbook and is based on books 2 and 3.
- Aditya Kawale
  - NUID 002766716
  - Email [email protected]
- Nidhi Singh
  - NUID 002925684
  - Email [email protected]
- Uddhav Zambare
  - NUID 002199488
  - Email [email protected]
backend
├── Pipfile
├── Pipfile.lock
├── README.md
├── example.env
├── fastapiservice
│   ├── __init__.py
│   ├── filecache
│   │   └── README.md
│   ├── postman
│   │   └── damg7245-assgn2-chatbot-private-files.postman_collection.json
│   ├── src
│   │   ├── __init__.py
│   │   ├── app.py
│   │   ├── chatanswer.py
│   │   ├── processpdf.py
│   │   └── utilities
│   │       ├── __init__.py
│   │       └── customexception.py
│   └── test
│       ├── __init__.py
│       └── test_app.py
└── fine-tune model
    └── create-model.ipynb
frontend
├── README.md
├── diagram
│   ├── architecture-assgn1-generator.py
│   └── architecture-assgn2-generator.py
├── images
│   ├── pdf_processing_flow.png
│   ├── qa_chatbot_for_pdfs_architecture.png
│   ├── streamlit.png
│   └── user.png
├── main.py
├── pages
│   ├── architecture-assign1.py
│   └── architecture-assign2.py
└── requirements.txt
- Codelab Doc - link
- Nougat Library - link
- SEC Forms - link
- PyPDF Documentation - link
- Open AI Cookbook - link
- Streamlit - link
- Users initiate the process by providing web links to PDF documents.
- The system validates and downloads the PDFs, saving them in cache memory for further processing.
- Users choose between PyPDF and Nougat for text extraction from the PDFs.
- Text is segmented into coherent sections of approximately 1000 tokens.
- Sections are grouped and embedded for convenient access.
- The "gpt-3.5-turbo-instruct" model is selected for AI-based Q&A.
- Question-answer pairs are created for model training.
- The data is formatted for fine-tuning and submitted through the OpenAI fine-tuning job API.
- Users can interact with the fine-tuned model:
  - The user provides a question.
  - The chatbot searches for relevant context.
  - Embeddings are generated and the contexts are ranked.
  - The top contexts are sent to the model, and its answer is returned to the user.
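
The segmentation and ranking steps in the flow above can be illustrated with a minimal Python sketch. It is not the project's implementation. It assumes that a whitespace word count stands in for a real tokenizer, that embeddings arrive as plain lists of floats from whatever embedding service is used, and that contexts are ranked by cosine similarity. The function names `chunk_text`, `cosine_similarity`, and `rank_contexts` are hypothetical. The sketch also splits on a fixed word count, so it does not try to keep sections coherent the way the real pipeline is described as doing.

```python
import math


def chunk_text(text, max_tokens=1000):
    """Split text into sections of at most max_tokens words.

    Assumption: whitespace-separated words approximate tokens.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_contexts(question_embedding, context_embeddings, top_k=3):
    """Return the indices of the top_k contexts, most similar first."""
    scored = [
        (cosine_similarity(question_embedding, emb), idx)
        for idx, emb in enumerate(context_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]
```

In use, the indices returned by `rank_contexts` would select the text sections that are passed to the model alongside the user's question.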
The app can be accessed directly from Streamlit Cloud via link, or run locally with the steps below.
- Clone the Repository
  Clone the repository to your local machine:
  `git clone <repository_url>`
Backend
- Open the Backend folder
  Navigate to the module directory:
  `cd backend`
- Create a .env file
  Create a `.env` file with the necessary environment variables and API keys. Reference: example.env
- Install Dependencies
  Open the terminal in VSCode and run the following command:
  `pipenv install --dev`
- Activate Virtual Environment
  To activate the virtual environment, run:
  `pipenv shell`
- Run the Backend Server
  Start the backend server using Uvicorn:
  `uvicorn fastapiservice.src.app:app --reload`
  Your backend API will be accessible at http://127.0.0.1:8000.
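
To confirm that the server started, a short script can request the interactive documentation page. This is a sketch under one assumption: FastAPI serves its docs at `/docs` by default, and this project's app may have changed or disabled that route. The helper names `docs_url` and `backend_is_up` are hypothetical.

```python
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"


def docs_url(base_url=BASE_URL):
    """Build the URL of the docs page (assumed to be at /docs)."""
    return urljoin(base_url.rstrip("/") + "/", "docs")


def backend_is_up(base_url=BASE_URL, timeout=3.0):
    """Return True if the docs page answers with HTTP 200, False otherwise."""
    try:
        with urlopen(docs_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Calling `backend_is_up()` from a Python shell while Uvicorn is running should return `True` if the docs route is enabled.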
Frontend
- Open the Frontend folder
  From the repository root, navigate to the module directory:
  `cd frontend`
- Create a Virtual Environment
  Create a virtual environment:
  `python -m venv venv`
  Then activate it:
  `source venv/bin/activate` (on Windows, use `venv\Scripts\activate`)
- Install Frontend Dependencies
  In the activated virtual environment, install the required Python packages:
  `pip install -r requirements.txt`
- Upgrade Pip (Optional)
  You can upgrade Pip to the latest version if needed:
  `pip install --upgrade pip`
- Run the Frontend Application
  To start the frontend application, run:
  `streamlit run main.py`
- Save a CSV file with 'context' and 'tokens' columns in the 'backend/fine-tune model' directory.
- Open 'create-model.ipynb' in your Jupyter Notebook environment.
- Modify the filepath in the first code snippet to point to your CSV file.
- Follow the instructions in the notebook to fine-tune your model.
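
Before opening the notebook, it can help to check that the CSV has the expected shape. The sketch below assumes that 'tokens' holds an integer token count for each context, which the steps above do not state explicitly. The function name `load_contexts` is hypothetical and is not part of the notebook.

```python
import csv

REQUIRED_COLUMNS = {"context", "tokens"}


def load_contexts(csv_path):
    """Read the CSV and return its rows after checking the required columns."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"CSV is missing columns: {sorted(missing)}")
        rows = []
        for row in reader:
            # Assumption: 'tokens' is an integer count per context.
            rows.append({"context": row["context"], "tokens": int(row["tokens"])})
        return rows
```

A `ValueError` here points to a missing column or a non-integer token count, which is easier to diagnose than a failure partway through the notebook.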
This project expansion seeks to develop an organization-specific QA chatbot. This chatbot will allow businesses and institutions to effortlessly access and utilize internal documents by providing answers to company-specific questions. The key objectives include custom document integration, business-specific query responses, contextual understanding, knowledge retrieval automation, and user-friendly interactions. The workflow covers document integration, custom document processing, model training, user interactions, and continuous learning. This initiative promises efficient knowledge retrieval, consistent responses, increased productivity, scalability, and robust data security.
- Aditya: 34%
- Nidhi: 33%
- Uddhav: 33%
| Developer | Deliverables |
|---|---|
| Aditya | FastAPI setup and application deployment |
| Aditya | PDF text segmentation and context creation |
| Uddhav | Q&A dataset and fine-tuning the OpenAI model |
| Uddhav | Answering questions by fetching related context |
| Nidhi | Streamlit front end and API access point |
| Nidhi | Data research and documentation |
WE ATTEST THAT WE HAVEN’T USED ANY OTHER STUDENTS’ WORK IN OUR ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK.