BigDataIA-Fall2023-Team7/Assignment1-PDF-Processing-Application

Assignment1

Abstract 📝

Two Streamlit applications have been developed. The first processes PDFs with either the Nougat or the PyPDF Python library; its objective is to analyze and compare Nougat and PyPDF across various use cases and different input PDFs. The second allows users to evaluate the quality of the Freddie Mac Single Family dataset. Users can upload a CSV or XLS file containing either Origination or Monthly performance data and assess whether it adheres to the published schema. The tool uses ydata-profiling to summarize the data and display the results to the user. Additionally, it runs Great Expectations to perform data quality checks, including schema validation, data validity, absence of missing data, and other custom tests.

Team Members 👥


Project Structure

Part1
├── NougatAPIServer.ipynb
├── README.md
├── architecture_diagram_generator.py
├── images
│   ├── colab.png
│   ├── ngrok.png
│   ├── pypdf2.png
│   ├── streamlit.png
│   └── user.png
├── main.py
├── pages
│   └── architecture.py
├── pdf_processing_flow.png
└── requirements.txt
Part2
├── Home.py
├── README.md
├── arch_diagram.py
├── architecture_diagram.png
├── aws_config.py
├── data
│   ├── file_layout.xlsx
│   ├── monthly
│   └── origination
├── example.env
├── gx
│   ├── expectations
│   ├── great_expectations.yml
│   └── uncommitted
├── gx_monthly_data.py
├── gx_origination_data.py
├── images
│   ├── great_expectations.png
│   ├── streamlit.png
│   └── ydata-profiling.png
├── pages
│   ├── Architecture.py
│   ├── Great_Expectation.py
│   └── Test.py
└── requirements.txt

Links 📎

  • 📕 Codelab Doc - link
  • 1️⃣ Streamlit Application for PDF Processing via NougatOCR and PyPDF2 - link
  • ❷ Streamlit Application for Data Profiling / Validation using ydata-profiling and great-expectations - link
  • 📕 Colab Notebook - link
  • 📊 Part 1 Datasets (SEC.gov) - link
  • 📊 Part 2 Datasets (Freddie Mac) - link
  • 🗑️ Part 1 Outputs (processed outputs) - link

Tools 🧰

  • 🔧 Streamlit - link
  • 🔧 NougatOCR - link
  • 🔧 PyPDF2 - link
  • 🔧 Google Colab - link
  • 🔧 AWS S3 - link
  • 🔧 ydata-profiling - link
  • 🔧 great-expectations - link

Part 1 - PDF Processing

Architecture 👷🏻‍♂️


Flow for Part 1

  1. The user gives an HTTP/HTTPS PDF link to the Streamlit app hosted on Streamlit Cloud.
  2. The Streamlit Cloud app downloads the PDF to its own storage and validates that it is truly a PDF file.
  3. If the check passes, the app checks which PDF processor the user selected.
  4. If the PDF processor is PyPDF, the app processes the PDF on Streamlit Cloud itself.
  5. If the PDF processor is Nougat, the app sends the downloaded PDF to the Ngrok Agent, which is accessible via the public internet.
  6. Once the Ngrok Agent gets the request, it forwards it to the ngrok service running on Google Colab.
  7. The ngrok service forwards the request to the Nougat API running on port 8503.
  8. The Nougat API processes the PDF and returns the MMD file via HTTP to the Streamlit application.
  9. The user downloads the MMD file from Streamlit.
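
Steps 1 to 5 of the flow can be illustrated with a minimal sketch. It is not the code in main.py: the function names (`is_http_url`, `looks_like_pdf`, `route_processor`) and the processor labels are hypothetical, and the PDF check shown here only inspects the `%PDF-` header that PDF files begin with, which may be stricter or looser than the validation the app actually performs.

```python
from urllib.parse import urlparse

PDF_MAGIC = b"%PDF-"  # PDF files begin with this header


def is_http_url(link: str) -> bool:
    """Accept only http:// or https:// links that include a host."""
    parsed = urlparse(link)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def looks_like_pdf(data: bytes) -> bool:
    """Cheap validity check on downloaded bytes: does the file start with the PDF header?"""
    return data.startswith(PDF_MAGIC)


def route_processor(processor: str) -> str:
    """Hypothetical routing: PyPDF runs in the Streamlit app, Nougat goes to the remote API."""
    name = processor.strip().lower()
    if name == "pypdf":
        return "local"
    if name == "nougat":
        return "remote"
    raise ValueError(f"unknown PDF processor: {processor!r}")
```

An HTML error page saved with a `.pdf` extension fails `looks_like_pdf`, which is the kind of case the validation in step 2 guards against.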

Source Code References 💻

  1. (Front-end) streamlit-pdf-processing - link
  2. (Back-end) colab-nougat-api-server - link

Steps to execute Part1 Streamlit application locally

  1. Clone the repository to your local machine:

    git clone <repository_url>
    
  2. Navigate to the module directory:

    cd Part1
    
  3. Create a virtual environment and activate it:

    python -m venv venv
    source venv/bin/activate  # On Windows, use: venv\Scripts\activate
    
  4. Install the required dependencies from the requirements.txt file:

    pip install -r requirements.txt
    
  5. Run the Streamlit application:

    streamlit run main.py
    
  6. Access the tool through your web browser at http://localhost:8501.


Steps to execute Part1 NougatAPIServer on Google Colab

  1. Download the Google Colab Notebook link

  2. Create a parent-child folder 'DAMG7245_Fall2023/Assignment1' on the Google Drive to which your Colab notebook is connected.

  3. Create an account on ngrok to get an authtoken link

  4. Copy the token to a file named 'ngrok_authtoken.txt' and upload it to the Assignment1 folder created in step 2.

  5. Select 'T4 GPU' as the Colab runtime and run the cells. Please follow the instructions in the cell comments.

  6. The Colab notebook will generate a public URL for the Nougat API server. Use that URL for your experimentation with either the hosted Streamlit app or a locally hosted app.
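
Once the notebook prints a public URL, a client could send a PDF to it roughly as sketched below. This is an assumption-laden illustration rather than the code used by the Streamlit app: the `/predict/` path and the `file` form field follow the usage shown in the upstream Nougat README, the notebook in this repository may expose something different, and `predict_url` and `send_pdf_to_nougat` are hypothetical helper names. The sketch needs the third-party `requests` package.

```python
import os


def predict_url(public_url: str) -> str:
    """Build the endpoint URL from the ngrok public URL (the /predict/ path is an assumption)."""
    return public_url.rstrip("/") + "/predict/"


def send_pdf_to_nougat(public_url: str, pdf_path: str, timeout_s: float = 600.0) -> str:
    """POST a PDF as multipart form data and return the raw response body."""
    import requests  # third-party; imported here so predict_url works without it

    with open(pdf_path, "rb") as fh:
        files = {"file": (os.path.basename(pdf_path), fh, "application/pdf")}
        response = requests.post(predict_url(public_url), files=files, timeout=timeout_s)
    response.raise_for_status()
    return response.text
```

The generous timeout reflects that a single page can take minutes to process, depending on the Colab runtime.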


Observations and Challenges

  • NougatOCR has two offerings: a CLI and an API. The CLI is mature and has many configuration options, whereas the API has fewer. We used the API because we are building a cloud application with a decoupled front end and back end.

  • As per our experiments,

    • NougatOCR takes 6.30 minutes to generate the MMD file for a single-page PDF when used with a Colab CPU
    • NougatOCR takes 1.30 minutes to generate the MMD file for a single-page PDF when used with a Colab TPU
    • The NougatOCR library caches results at the individual page level, so if output is requested for the same file in later iterations, the responses are quick.
    • The NougatOCR API returns the processed output as a string, which needs post-processing before it is written to an MMD file
      • Example:
      • Remove the first and last double quotation marks
      • Replace \n with a literal newline character
      • Replace \\ with \
  • PyPDF is faster than Nougat (~30 seconds for a single page) at processing PDFs but loses the style and presentation information.

  • The Ngrok public URL becomes invalid after ~10 minutes, so you will have to rerun particular cells in the NougatAPIServer Colab notebook to generate a new one. Since Colab does not support running multiple cells at the same time, you will have to stop the running API server, get a new ngrok URL, and then start the API server again.
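
The three post-processing steps listed for the NougatOCR API response can be sketched as follows. The function name `nougat_response_to_mmd` is hypothetical and this is not the repository's implementation. The sketch unescapes `\n` and `\\` in a single pass rather than in two sequential replacements, so that an escaped backslash followed by `n` (as in the LaTeX command `\nu`) is not mistaken for a newline.

```python
import re


def nougat_response_to_mmd(raw: str) -> str:
    """Turn the quoted, escaped string returned by the API into MMD text."""
    text = raw.strip()
    # Step 1: remove the first and last double quotation marks.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]

    # Steps 2 and 3 in one left-to-right pass:
    # backslash + n becomes a newline, backslash + backslash becomes one backslash.
    def unescape(match: re.Match) -> str:
        return "\n" if match.group(1) == "n" else "\\"

    return re.sub(r"\\(n|\\)", unescape, text)


if __name__ == "__main__":
    sample = '"# Title\\n\\nx = \\\\alpha + \\\\nu"'
    print(nougat_response_to_mmd(sample))
```

If the response body is a valid JSON string, decoding it with `json.loads` may be a simpler alternative; that is an assumption about the API's output format and has not been checked against the notebook.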


Pros and Cons

  • Equations: Nougat is better than PyPDF. Nougat properly represents equations with correct symbols and presentation (subscript/superscript). PyPDF fails at capturing symbols and does not maintain presentation elements.

  • Tables: Nougat is better at recognizing tables than PyPDF.

  • Header, Footer, Font styles: PyPDF is better at capturing header and footer information but fails at capturing font styles.

  • Text inside images: Since Nougat does OCR, it is able to extract textual elements from images. PyPDF fails here.

  • Hyperlinks: Nougat needs https or http in front of URLs to treat them as hyperlinks in MMD files.


Part 2

Objective

The objective of this module is to build a tool that allows users to evaluate the quality of the Freddie Mac Single Family dataset. Users can upload a CSV or XLS file containing either Origination or Monthly performance data and assess whether it adheres to the published schema. The tool uses ydata-profiling (formerly Pandas Profiling) to summarize the data and display the results to the user. Additionally, it runs Great Expectations to perform data quality checks, including schema validation, data validity, absence of missing data, and other custom tests.

Architecture


Streamlit: Streamlit provides an easy-to-use web interface for uploading files and displaying reports. It simplifies the development of the user interface.

ydata-profiling (formerly Pandas Profiling): ydata-profiling is a powerful tool for data profiling that generates summary statistics and visualizations of the dataset. It helps users quickly understand the data.

Great Expectations: Great Expectations is a robust framework for data validation. It allows us to define and enforce expectations about the dataset's schema and data quality.

AWS S3: AWS S3 is used for temporary file storage. Uploaded files are stored here before analysis. This ensures that the original dataset is not modified.

Input Data Format Constraints

The input data must adhere to the following constraints:

  • Data must follow the provided schema and must not include column names (no header row). The data must be in the same format as it is on the website.
  • The data columns must follow the same order as listed in the file_layout.xlsx file available on the Freddie Mac Loan Set website.
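
Because the input has no header row, column names have to be attached by position. The sketch below shows one way to do that with the standard library and to reject rows whose width does not match the layout. It is illustrative only: `EXAMPLE_COLUMNS` is a made-up three-column subset rather than the real layout from file_layout.xlsx, `load_headerless` is a hypothetical helper, and the delimiter is left as a parameter because it depends on the file being uploaded.

```python
import csv
import io

# Illustrative subset only; the real names and order come from file_layout.xlsx.
EXAMPLE_COLUMNS = ["credit_score", "first_payment_date", "loan_sequence_number"]


def load_headerless(text: str, columns: list, delimiter: str = ",") -> list:
    """Parse header-free delimited text and label each field by its position."""
    records = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, row in enumerate(reader, start=1):
        if not row:  # skip blank lines
            continue
        if len(row) != len(columns):
            raise ValueError(
                f"line {line_no}: expected {len(columns)} fields, got {len(row)}"
            )
        records.append(dict(zip(columns, row)))
    return records


if __name__ == "__main__":
    sample = "751,200502,F05Q10000001\n680,200503,F05Q10000002\n"
    for record in load_headerless(sample, EXAMPLE_COLUMNS):
        print(record)
```

A file whose columns are in a different order still parses without error here; catching that requires value-level checks such as the Great Expectations suites the tool runs.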

Steps to Execute

The app can be accessed directly on Streamlit Cloud via link

OR

  1. Clone the repository to your local machine:

    git clone <repository_url>
    
  2. Navigate to the module directory:

    cd Part2
    
  3. Create a virtual environment and activate it:

    python -m venv venv
    source venv/bin/activate  # On Windows, use: venv\Scripts\activate
    
  4. Install the required dependencies from the requirements.txt file:

    pip install -r requirements.txt
    
  5. Create a .env file with the necessary environment variables, such as AWS credentials. Reference: example.env

  6. Run the Streamlit application:

    streamlit run Home.py
    
  7. Access the tool through your web browser at http://localhost:8501.

Scope

With this tool, we aim to simplify the process of evaluating the quality of Freddie Mac Single Family dataset files. The tool uses ydata-profiling and Great Expectations to provide data analysis and validation reports that check data quality and adherence to the schema. This project will help data engineers and analysts assess and trust the data they work with, ultimately improving data-driven decision-making processes.

If you encounter any issues or have suggestions for improvements, please feel free to open an issue or contribute to this project. Your feedback is valuable in enhancing the tool's functionality and usability.

Contribution 🤝

  • Aditya : 34%
  • Nidhi : 33%
  • Uddhav : 33%

Individual Distribution ⚖️

| Developer | Deliverables |
|-----------|--------------|
| Aditya | Streamlit Part 1 - Nougat |
| Aditya | Git setup and integration |
| Uddhav | Streamlit Part 2 - Great Expectations |
| Uddhav | Streamlit Part 2 - ydata-profiling |
| Nidhi | Streamlit Part 1 - PyPDF |
| Nidhi | Architecture Diagrams |
| Nidhi | Documentation |


WE ATTEST THAT WE HAVEN'T USED ANY OTHER STUDENTS' WORK IN OUR ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK.
