Assignment 2 : PDF Document Q&A Chatbot

Abstract

The PDF Document Summarization Q&A Chatbot is a specialized tool that helps users extract and summarize information from PDF documents. It lets analysts load PDF documents from the U.S. Securities and Exchange Commission (SEC) website and obtain summaries efficiently. The team is evaluating two libraries for text extraction, Nougat and PyPDF. The question-and-answer (Q&A) system draws on the OpenAI Cookbook, based on books 2 and 3.

Team Members 👥

Aditya Kawale
- NUID 002766716
- Email [email protected]
Nidhi Singh
- NUID 002925684
- Email [email protected]
Uddhav Zambare
- NUID 002199488
- Email [email protected]

Project Structure

backend
├── Pipfile
├── Pipfile.lock
├── README.md
├── example.env
├── fastapiservice
│   ├── __init__.py
│   ├── filecache
│   │   └── README.md
│   ├── postman
│   │   └── damg7245-assgn2-chatbot-private-files.postman_collection.json
│   ├── src
│   │   ├── __init__.py
│   │   ├── app.py
│   │   ├── chatanswer.py
│   │   ├── processpdf.py
│   │   └── utilities
│   │       ├── __init__.py
│   │       └── customexception.py
│   └── test
│       ├── __init__.py
│       └── test_app.py
└── fine-tune model
    └── create-model.ipynb

frontend
├── README.md
├── diagram
│   ├── architecture-assgn1-generator.py
│   └── architecture-assgn2-generator.py
├── images
│   ├── pdf_processing_flow.png
│   ├── qa_chatbot_for_pdfs_architecture.png
│   ├── streamlit.png
│   └── user.png
├── main.py
├── pages
│   ├── architecture-assign1.py
│   └── architecture-assign2.py
└── requirements.txt

Links 📎

Codelab Doc - link
Nougat Library - link
SEC Forms - link
PyPDF Documentation - link
OpenAI Cookbook - link
Streamlit - link

Architecture 👷🏻‍♂️

Project Workflow

Step 1: Input PDF URLs

Users initiate the process by providing web links to PDF documents.
The system validates and downloads the PDFs, saving them in cache memory for further processing.
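
The validate-then-cache step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the `.pdf` suffix check, and the on-disk `filecache` directory are assumptions made for the example.

```python
import hashlib
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# Assumed cache location, named after the repo's filecache folder
CACHE_DIR = Path("filecache")


def is_valid_pdf_url(url: str) -> bool:
    """Accept only http(s) URLs whose path ends in .pdf (a simplifying assumption)."""
    parsed = urlparse(url)
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(".pdf")
    )


def cache_path_for(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Derive a stable file name from the URL so repeated requests reuse one file."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.pdf"


def download_pdf(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Validate the URL, then download it unless a cached copy already exists."""
    if not is_valid_pdf_url(url):
        raise ValueError(f"Not a valid PDF URL: {url}")
    target = cache_path_for(url, cache_dir)
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url, timeout=30) as response:
            data = response.read()
        # PDF files start with the magic bytes "%PDF"
        if not data.startswith(b"%PDF"):
            raise ValueError("Downloaded content does not look like a PDF")
        target.write_bytes(data)
    return target
```

Hashing the URL keeps cache file names short and free of characters that are unsafe in paths.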

Step 2: Choose PDF Processor

Users choose between PyPDF and Nougat for text extraction from the PDFs.
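
One way to route the user's choice to an extractor is a small lookup table. The sketch below is illustrative only: the function names and the `PROCESSORS` table are hypothetical. The PyPDF branch uses the third-party `pypdf` package (`PdfReader` and `extract_text`), and the Nougat branch is left as a placeholder because this README does not show how Nougat is invoked.

```python
from typing import Callable, Dict


def extract_with_pypdf(pdf_path: str) -> str:
    """Extract text page by page with pypdf (third-party: pip install pypdf)."""
    from pypdf import PdfReader  # imported lazily so this module loads without pypdf

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages, so fall back to ""
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_nougat(pdf_path: str) -> str:
    """Placeholder: the Nougat invocation is not shown in this README."""
    raise NotImplementedError("Wire in Nougat here")


# Hypothetical registry mapping the user's choice to an extraction function
PROCESSORS: Dict[str, Callable[[str], str]] = {
    "pypdf": extract_with_pypdf,
    "nougat": extract_with_nougat,
}


def get_processor(name: str) -> Callable[[str], str]:
    """Look up the extraction function for the user's choice (case-insensitive)."""
    try:
        return PROCESSORS[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown PDF processor: {name!r}") from None
```

A call such as `get_processor("PyPDF")("report.pdf")` would then return the extracted text, where `report.pdf` stands for any local PDF file.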

Step 3: Create Sections of Processed PDF

Text is segmented into coherent sections of approximately 1000 tokens.
Sections are grouped and embedded for convenient access.
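
The segmentation step can be approximated as below. This sketch assumes that tokens are whitespace-separated words and that sections are packed greedily from paragraphs. The project presumably uses a real tokenizer, so actual section boundaries will differ.

```python
from typing import List


def split_into_sections(text: str, max_tokens: int = 1000) -> List[str]:
    """Greedily pack paragraphs into sections of at most max_tokens tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    sections: List[str] = []
    current: List[str] = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # Break a paragraph longer than the limit into limit-sized pieces;
        # an empty paragraph produces no pieces and is skipped
        pieces = [words[i:i + max_tokens] for i in range(0, len(words), max_tokens)]
        for piece in pieces:
            # Close the current section if this piece would overflow it
            if current and len(current) + len(piece) > max_tokens:
                sections.append(" ".join(current))
                current = []
            current.extend(piece)
    if current:
        sections.append(" ".join(current))
    return sections
```

Packing whole paragraphs where possible keeps related sentences together, which tends to give each section a coherent topic.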

Step 4: Generate a Model

Select the "gpt-3.5-turbo-instruct" model for AI-based Q&A.
Create question-answer pairs for model training.
Format the data for fine-tuning and use the OpenAI Fine-tuning Job API.
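
As an illustration of the formatting step, question-answer pairs can be serialized as JSON Lines. The sketch assumes the prompt/completion layout used for completion-style training data. Chat models expect a different `messages` layout, so the right format depends on the model being tuned. The function name and the prompt template are hypothetical.

```python
import json
from typing import Iterable, Tuple


def to_finetune_jsonl(examples: Iterable[Tuple[str, str, str]]) -> str:
    """Serialize (context, question, answer) triples as one JSON object per line."""
    lines = []
    for context, question, answer in examples:
        record = {
            # The prompt template is an illustrative choice, not the project's
            "prompt": f"{context}\nQuestion: {question}\nAnswer:",
            # A leading space before the completion is a common convention
            "completion": f" {answer}",
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    # One record per line, each terminated by a newline
    return "".join(line + "\n" for line in lines)
```

The resulting text can be written to a `.jsonl` file and uploaded as the training file for a fine-tuning job.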

Step 5: Chat with Your Personal Chatbot

Users can interact with the fine-tuned model:
- Provide a question.
- The chatbot searches for relevant context.
- Embeddings are generated and contexts ranked.
- The top contexts are sent to the model for an answer, which is returned to the user.
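
The retrieval steps above can be sketched with plain cosine similarity. The vectors here stand in for real embeddings, and the function names and prompt wording are assumptions. In the actual application the embeddings would come from an embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_contexts(
    question_vec: Sequence[float],
    contexts: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Rank (text, embedding) pairs by similarity to the question; keep the top k."""
    ranked = sorted(
        contexts,
        key=lambda item: cosine_similarity(question_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, selected: List[str]) -> str:
    """Combine the selected contexts and the question into one prompt string."""
    context_block = "\n\n".join(selected)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"
    )
```

The string returned by `build_prompt` is what would be sent to the model, and the model's reply is then shown to the user.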

Steps to Execute

The app can be accessed directly on Streamlit Cloud via link

OR

Clone the Repository

Clone the repository to your local machine:
```
git clone <repository_url>
```

Backend

Open the Backend Folder:

Navigate to the module directory:
```
cd backend
```
Create a .env File:

Create a .env file with the necessary environment variables and API keys. Reference: example.env
Install Dependencies:

Open the terminal in VSCode and run the following commands:
```
pipenv install --dev
```
Activate Virtual Environment:

To activate the virtual environment, run:
```
pipenv shell
```
Run the Backend Server:

Start the backend server using Uvicorn. Run the following command:
```
uvicorn fastapiservice.src.app:app --reload
```
Your backend API will be accessible at http://127.0.0.1:8000.

Frontend

Open the Frontend Folder:

Navigate to the module directory:
```
cd frontend
```

Create a Virtual Environment:

Create a virtual environment and activate it:
```
python -m venv venv
source venv/bin/activate  # On Windows, use: venv\Scripts\activate
```

Install Frontend Dependencies:

In the activated virtual environment, install the required Python packages:
```
pip install -r requirements.txt
```
Upgrade Pip (Optional):

You can upgrade Pip to the latest version if needed:
```
pip install --upgrade pip
```
Run the Frontend Application:

To start the frontend application, run the following command:
```
streamlit run main.py
```

Creating Your Own Fine-Tuned Language Model

Save a CSV file with 'context' and 'tokens' columns in the 'backend/fine-tune model' directory.
Open 'create-model.ipynb' in your Jupyter Notebook environment.
Modify the filepath in the first code snippet to point to your CSV file.
Follow the instructions in the notebook to fine-tune your model.
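
A CSV in the expected shape could be produced and checked as sketched below. The README names only the 'context' and 'tokens' columns. Treating 'tokens' as a per-row token count, and approximating it with a word count, are assumptions of this example, as are the function names.

```python
import csv
from pathlib import Path
from typing import List


def write_sections_csv(sections: List[str], csv_path: Path) -> None:
    """Write one row per section with 'context' and 'tokens' columns."""
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["context", "tokens"])
        writer.writeheader()
        for section in sections:
            # Word count stands in for a real token count (assumption)
            writer.writerow({"context": section, "tokens": len(section.split())})


def has_required_columns(csv_path: Path) -> bool:
    """Check that the header row contains both columns the notebook expects."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return {"context", "tokens"}.issubset(header)
```

Running `has_required_columns` on the file before opening the notebook gives an early warning if a column is missing or misnamed.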

Scope

As an expansion of this project, the team plans to develop an organization-specific Q&A chatbot that lets businesses and institutions query their internal documents and get answers to company-specific questions. The key objectives are custom document integration, business-specific query responses, contextual understanding, automated knowledge retrieval, and user-friendly interactions. The workflow covers document integration, custom document processing, model training, user interactions, and continuous learning. The expected benefits are efficient knowledge retrieval, consistent responses, increased productivity, scalability, and robust data security.

Contribution 🤝

Aditya : 34%
Nidhi : 33%
Uddhav : 33%

Individual Distribution ⚖️

| Developer | Deliverables |
| --- | --- |
| Aditya | FastAPI setup and application deployment |
| Aditya | PDF text segmentation and context creation |
| Uddhav | Q&A dataset and fine-tuning the OpenAI model |
| Uddhav | Answering questions by fetching related context |
| Nidhi | Streamlit front end and API access point |
| Nidhi | Data research and documentation |

WE ATTEST THAT WE HAVEN’T USED ANY OTHER STUDENTS’ WORK IN OUR ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK.
