Assignment1

Abstract 📝

Two Streamlit applications have been developed. The first processes PDFs with either the Nougat or the PyPDF Python library; its objective is to analyze and compare Nougat and PyPDF across various use cases and input PDFs. The second allows users to evaluate the quality of the Freddie Mac Single Family dataset. Users can upload a CSV or XLS file containing either Origination or Monthly performance data and assess whether it adheres to the published schema. The tool uses ydata-profiling to summarize the data and display the results to the user. It also runs Great Expectations to perform data quality checks, including schema validation, data validity, absence of missing data, and other custom tests.

Team Members 👥

Aditya Kawale
- NUID 002766716
- Email [email protected]
Nidhi Singh
- NUID 002925684
- Email [email protected]
Uddhav Zambare
- NUID 002199488
- Email [email protected]

Project Structure

Part1
├── NougatAPIServer.ipynb
├── README.md
├── architecture_diagram_generator.py
├── images
│   ├── colab.png
│   ├── ngrok.png
│   ├── pypdf2.png
│   ├── streamlit.png
│   └── user.png
├── main.py
├── pages
│   └── architecture.py
├── pdf_processing_flow.png
└── requirements.txt

Part2
├── Home.py
├── README.md
├── arch_diagram.py
├── architecture_diagram.png
├── aws_config.py
├── data
│   ├── file_layout.xlsx
│   ├── monthly
│   └── origination
├── example.env
├── gx
│   ├── expectations
│   ├── great_expectations.yml
│   └── uncommitted
├── gx_monthly_data.py
├── gx_origination_data.py
├── images
│   ├── great_expectations.png
│   ├── streamlit.png
│   └── ydata-profiling.png
├── pages
│   ├── Architecture.py
│   ├── Great_Expectation.py
│   └── Test.py
└── requirements.txt

Links 📎

📕 Codelab Doc - link
1️⃣ Streamlit Application for PDF Processing via NougatOCR and PyPDF2 - link
❷ Streamlit Application for Data Profiling / Validation using ydata-profiling and great-expectations - link
📕 Colab Notebook - link
📊 Part 1 Datasets (SEC.gov) - link
📊 Part 2 Datasets (Freddie Mac) - link
🗑️ Part 1 Outputs (processed outputs) - link

Tools 🧰

🔧 Streamlit - link
🔧 NougatOCR - link
🔧 PyPDF2 - link
🔧 Google Colab - link
🔧 AWS S3 - link
🔧 ydata-profiling - link
🔧 great-expectations - link

Part 1 - PDF Processing

Architecture 👷🏻‍♂️

Flow for Part 1

The user provides an HTTP/HTTPS PDF link to the Streamlit app hosted on Streamlit Cloud.
The Streamlit Cloud app downloads the PDF to its own storage and validates that it is truly a PDF file.
If the check passes, the app checks which PDF processor was selected.
If the PDF processor is PyPDF, the app processes the PDF on Streamlit Cloud itself.
If the PDF processor is Nougat, the app sends the downloaded PDF to the ngrok agent, which is accessible via the public internet.
Once the ngrok agent receives the request, it forwards it to the ngrok service on Google Colab.
The ngrok service forwards the request to the Nougat API running on port 8503.
The Nougat API processes the PDF and returns the MMD file via HTTP to the Streamlit application.
The user downloads the MMD file from Streamlit.
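
The link and file checks at the start of this flow could look roughly like the sketch below. It is not the application's actual code: the function names `is_http_url` and `looks_like_pdf` are hypothetical, and it assumes that checking for the `%PDF-` signature at the start of the file is enough to treat a download as a PDF.

```python
from urllib.parse import urlparse

# PDF files begin with this header, followed by the version number.
PDF_SIGNATURE = b"%PDF-"


def is_http_url(url: str) -> bool:
    """Return True if the URL uses http or https and names a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the downloaded bytes start with the PDF signature."""
    return data.startswith(PDF_SIGNATURE)


# Usage: reject the link or the download before choosing a processor.
print(is_http_url("https://www.sec.gov/files/example.pdf"))
print(looks_like_pdf(b"%PDF-1.7\n"))
print(looks_like_pdf(b"<html><body>Not found</body></html>"))
```

A signature check is cheap and catches the common case where a link returns an HTML error page instead of a PDF; it does not prove the file is well-formed.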

Source Code References 💻

(Front-end) streamlit-pdf-processing - link
(Back-end) colab-nougat-api-server - link

Steps to execute Part1 Streamlit application locally

Clone the repository to your local machine:
```
git clone <repository_url>
```
Navigate to the module directory:
```
cd Part1
```

Create a virtual environment and activate it:

```
python -m venv venv
source venv/bin/activate  # On Windows, use: venv\Scripts\activate
```

Install the required dependencies from the requirements.txt file:
```
pip install -r requirements.txt
```
Run the Streamlit application:
```
streamlit run main.py
```
Access the tool through your web browser at http://localhost:8501.

Steps to execute Part1 NougatAPIServer on Google Colab

1. Download the Google Colab notebook (link).
2. Create a nested folder 'DAMG7245_Fall2023/Assignment1' on the Google Drive to which your Colab notebook is connected.
3. Create an account on ngrok to get an authtoken (link).
4. Copy the token to a file named 'ngrok_authtoken.txt' and upload it to the Assignment1 folder created in step 2.
5. Select 'T4 GPU' as the Colab runtime and run the cells, following the instructions in the cell comments.
6. The Colab notebook generates a public URL for the Nougat API server. Use it for experimentation with either the hosted Streamlit app or a locally hosted app.

Observations and Challenges

NougatOCR has two offerings: a CLI and an API. The CLI version is mature and has many configuration options, unlike the API version. We used the API version because we are building a cloud application with a decoupled front end and back end.
As per our experiments,
- NougatOCR takes 6.30 minutes to generate an MMD file for a single-page PDF when used with Colab CPU
- NougatOCR takes 1.30 minutes to generate an MMD file for a single-page PDF when used with Colab TPU
- The NougatOCR library caches results at the individual page level, so if output is requested for the same file again, the responses are quick.
- The NougatOCR API returns the processed output as a string, which needs preprocessing before it can be saved as an MMD file
  - Example:
  - Remove the first and last double quotation marks
  - Replace \n with a literal newline character
  - Replace \\ with \
PyPDF is faster than Nougat (~30 seconds for single page) at processing PDFs but loses the style and presentation information.
The ngrok public URL becomes invalid after ~10 minutes, so you will have to rerun particular cells in the NougatAPIServer Colab notebook to generate a new one. Because Colab does not support running multiple cells at once, you will have to stop the running API server, get a new ngrok URL, and then start the API server again.
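
The three preprocessing steps listed above for the NougatOCR API response can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name `nougat_response_to_mmd` is hypothetical, and the sample string only imitates the shape of a response as described in the observations.

```python
def nougat_response_to_mmd(raw: str) -> str:
    """Apply the three cleanup steps to a raw API response string."""
    text = raw.strip()
    # 1. Remove the first and last double quotation marks, if present.
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        text = text[1:-1]
    # 2. Replace the two-character sequence backslash + n with a real newline.
    text = text.replace("\\n", "\n")
    # 3. Collapse each escaped backslash (two backslashes) into one.
    text = text.replace("\\\\", "\\")
    return text


# Hypothetical response: quoted, with escaped newlines and backslashes.
raw_response = '"# Title\\n\\nE = mc^2 \\\\alpha"'
print(nougat_response_to_mmd(raw_response))
```

Applying the steps in this order is simple but can misread an escaped backslash that is directly followed by the letter n (for example an escaped `\nu`). If the response turns out to be a JSON-encoded string, which is an assumption about the API, decoding it with `json.loads` would avoid that edge case.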

Pros and Cons

Equations: Nougat is better than PyPDF. Nougat represents equations with correct symbols and presentation (subscript/superscript). PyPDF fails to capture symbols and does not preserve presentation elements.
Tables: Nougat is better at recognizing tables than PyPDF.
Header, Footer, Font styles: PyPDF is better at capturing header and footer information but fails to capture font styles.
Text inside images: Because Nougat performs OCR, it can extract textual elements from images. PyPDF fails here.
Hyperlinks: Nougat needs https or http in front of URLs to treat them as hyperlinks in MMD files.

Part 2

Objective

The objective of this module is to build a tool that allows users to evaluate the quality of the Freddie Mac Single Family dataset. Users can upload a CSV or XLS file containing either Origination or Monthly performance data and assess whether it adheres to the published schema. The tool uses ydata-profiling (formerly Pandas Profiling) to summarize the data and display the results to the user. It also runs Great Expectations to perform data quality checks, including schema validation, data validity, absence of missing data, and other custom tests.

Architecture

Streamlit: Streamlit provides an easy-to-use web interface for uploading files and displaying reports. It simplifies the development of the user interface.

ydata-profiling (formerly Pandas Profiling): ydata-profiling is a data profiling tool that generates summary statistics and visualizations of the dataset. It helps users quickly understand the data.

Great Expectations: Great Expectations is a robust framework for data validation. It allows us to define and enforce expectations about the dataset's schema and data quality.

AWS S3: AWS S3 is used for temporary file storage. Uploaded files are stored here before analysis. This ensures that the original dataset is not modified.
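
To make the kind of check performed in the validation step concrete, here is a minimal sketch in plain Python. It deliberately does not use the Great Expectations API; it only mirrors the idea behind expectations such as `expect_column_values_to_not_be_null` and `expect_column_values_to_be_between`. The helper names, the column names, and the 300 to 850 range are illustrative assumptions, not values taken from the project.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def expect_values_not_null(rows: List[Row], column: str) -> bool:
    """Pass only if every row has a non-empty value in the column."""
    return all(row.get(column) not in (None, "") for row in rows)


def expect_values_between(rows: List[Row], column: str, low: float, high: float) -> bool:
    """Pass only if every non-empty value in the column lies in [low, high]."""
    for row in rows:
        value = row.get(column)
        if value in (None, ""):
            continue  # missing values are left to the not-null check
        try:
            number = float(value)
        except ValueError:
            return False  # a non-numeric value fails the range check
        if not low <= number <= high:
            return False
    return True


# Hypothetical rows; the second one is missing its credit score.
rows = [
    {"credit_score": "720", "loan_id": "A1"},
    {"credit_score": "", "loan_id": "A2"},
]
results = {
    "loan_id not null": expect_values_not_null(rows, "loan_id"),
    "credit_score not null": expect_values_not_null(rows, "credit_score"),
    "credit_score in range": expect_values_between(rows, "credit_score", 300, 850),
}
print(results)
```

Keeping each check as a separate named result is what lets a report show which expectation failed rather than a single pass/fail flag.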

Input Data Format Constraints

The input data must adhere to the following constraints:

Data must follow the provided schema and must not include a header row of column names. The data must be in the same format as it is on the website.
The data columns must follow the same order as mentioned in the file_layout.xlsx file available on the Freddie Mac Loan Set website.
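
Because the files carry no header row, column names have to be attached from the layout, and a column-count mismatch is the first sign that a file does not follow the schema. The sketch below shows one way to do that with the standard library. The function name `read_headerless`, the three column names, and the comma delimiter are assumptions for illustration; the real layout comes from file_layout.xlsx.

```python
import csv
import io
from typing import Dict, List


def read_headerless(text: str, layout: List[str], delimiter: str = ",") -> List[Dict[str, str]]:
    """Parse rows that have no header line and label them with layout names."""
    records = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(layout):
            raise ValueError(
                f"line {line_no}: expected {len(layout)} columns, got {len(row)}"
            )
        # Order matters: the n-th value is labeled with the n-th layout name.
        records.append(dict(zip(layout, row)))
    return records


layout = ["credit_score", "first_payment_date", "loan_id"]  # hypothetical names
sample = "720,202301,A1\n680,202302,A2\n"
records = read_headerless(sample, layout)
print(records[0])
```

Since values are matched to names purely by position, a file whose columns are in a different order would parse without error but be mislabeled, which is why the column order constraint above matters.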

Steps to Execute

App can be directly accessed from Streamlit Cloud via link

OR

Clone the repository to your local machine:
```
git clone <repository_url>
```
Navigate to the module directory:
```
cd Part2
```

Create a virtual environment and activate it:

```
python -m venv venv
source venv/bin/activate  # On Windows, use: venv\Scripts\activate
```

Install the required dependencies from the requirements.txt file:
```
pip install -r requirements.txt
```
Create a .env file with the necessary environment variables, such as AWS credentials. Reference: example.env
Run the Streamlit application:
```
streamlit run Home.py
```
Access the tool through your web browser at http://localhost:8501.

Scope

This tool aims to simplify the process of evaluating the quality of Freddie Mac Single Family dataset files. It uses ydata-profiling (formerly Pandas Profiling) and Great Expectations to produce data analysis and validation reports that check data quality and adherence to the schema. This project will help data engineers and analysts assess and trust the data they work with, ultimately improving data-driven decision-making processes.

If you encounter any issues or have suggestions for improvements, please feel free to open an issue or contribute to this project. Your feedback is valuable in enhancing the tool's functionality and usability.

Contribution 🤝

Aditya : 34%
Nidhi : 33%
Uddhav : 33%

Individual Distribution ⚖️

| Developer | Deliverables |
| --- | --- |
| Aditya | Streamlit Part 1 - Nougat |
| Aditya | Git setup and integration |
| Uddhav | Streamlit Part 2 - Great Expectations |
| Uddhav | Streamlit Part 2 - ydata-profiling |
| Nidhi | Streamlit Part 1 - PyPDF |
| Nidhi | Architecture Diagrams |
| Nidhi | Documentation |

WE ATTEST THAT WE HAVEN’T USED ANY OTHER STUDENTS’ WORK IN OUR ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK.
