Two Streamlit applications have been developed. The first processes PDFs via either the Nougat or the PyPDF Python library; the objective is to analyze and compare Nougat and PyPDF across various use cases and input PDFs. The second allows users to evaluate the quality of the Freddie Mac Single Family dataset: users can upload a CSV or XLS file containing either Origination or Monthly performance data and assess whether it adheres to the published schema. The tool uses ydata-profiling to summarize the data and display the results to the user. Additionally, it runs Great Expectations to perform data quality checks, including schema validation, data validity, absence of missing data, and other custom tests.
- Aditya Kawale
- NUID 002766716
- Email [email protected]
- Nidhi Singh
- NUID 002925684
- Email [email protected]
- Uddhav Zambare
- NUID 002199488
- Email [email protected]
Part1
├── NougatAPIServer.ipynb
├── README.md
├── architecture_diagram_generator.py
├── images
│   ├── colab.png
│   ├── ngrok.png
│   ├── pypdf2.png
│   ├── streamlit.png
│   └── user.png
├── main.py
├── pages
│   └── architecture.py
├── pdf_processing_flow.png
└── requirements.txt
Part2
├── Home.py
├── README.md
├── arch_diagram.py
├── architecture_diagram.png
├── aws_config.py
├── data
│   ├── file_layout.xlsx
│   ├── monthly
│   └── origination
├── example.env
├── gx
│   ├── expectations
│   ├── great_expectations.yml
│   └── uncommitted
├── gx_monthly_data.py
├── gx_origination_data.py
├── images
│   ├── great_expectations.png
│   ├── streamlit.png
│   └── ydata-profiling.png
├── pages
│   ├── Architecture.py
│   ├── Great_Expectation.py
│   └── Test.py
└── requirements.txt
- Codelab Doc - link
- Part 1: Streamlit Application for PDF Processing via NougatOCR and PyPDF2 - link
- Part 2: Streamlit Application for Data Profiling / Validation using ydata-profiling and great-expectations - link
- Colab Notebook - link
- Part 1 Datasets (SEC.gov) - link
- Part 2 Datasets (Freddie Mac) - link
- Part 1 Outputs (processed outputs) - link
- Streamlit - link
- NougatOCR - link
- PyPDF2 - link
- Google Colab - link
- AWS S3 - link
- ydata-profiling - link
- great-expectations - link
Flow for Part 1
- The user provides an HTTP/HTTPS link to a PDF in the Streamlit app hosted on Streamlit Cloud.
- The Streamlit Cloud app downloads the PDF to its own storage and validates that it is truly a PDF file.
- If the check passes, the app checks which PDF processor was selected (see the sketch after this list).
- If the processor is PyPDF, the PDF is processed on Streamlit Cloud itself.
- If the processor is Nougat, the downloaded PDF is sent to the ngrok agent, which is accessible via the public internet.
- Once the ngrok agent receives the request, it forwards it to the ngrok service running on Google Colab.
- The ngrok service forwards the request to the Nougat API running on port 8503.
- The Nougat API processes the PDF and returns the MMD file via HTTP to the Streamlit application.
- The user downloads the MMD files from Streamlit.
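A minimal sketch of the download-validate-dispatch logic in this flow is below. The `/predict/` endpoint path, upload field name, and function names are assumptions for illustration, not the app's actual code (see `main.py` for that):

```python
# Sketch of the Part 1 flow: download a PDF, validate it, then dispatch to
# either local PyPDF extraction or the remote Nougat API via the ngrok URL.
from io import BytesIO

import requests
from PyPDF2 import PdfReader

def fetch_pdf(url: str) -> bytes:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    if not resp.content.startswith(b"%PDF"):  # validate it is truly a PDF
        raise ValueError("Downloaded file is not a PDF")
    return resp.content

def process_pdf(data: bytes, processor: str, nougat_url: str) -> str:
    if processor == "PyPDF":
        # PyPDF runs locally on Streamlit Cloud
        reader = PdfReader(BytesIO(data))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    # Nougat: forward the PDF through the ngrok tunnel to the Colab server
    # (endpoint path and field name are assumptions about the API server)
    resp = requests.post(f"{nougat_url}/predict/",
                         files={"file": ("input.pdf", data, "application/pdf")})
    resp.raise_for_status()
    return resp.text  # raw string; needs the MMD cleanup described below
```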
- Clone the repository to your local machine: `git clone <repository_url>`
- Navigate to the module directory: `cd Part1`
- Create a virtual environment and activate it: `python -m venv venv && source venv/bin/activate` (on Windows, use `venv\Scripts\activate`)
- Install the required dependencies from the `requirements.txt` file: `pip install -r requirements.txt`
- Run the Streamlit application: `streamlit run main.py`
- Access the tool through your web browser at `http://localhost:8501`.
- Download the Google Colab notebook: link
- Create the parent-child folders 'DAMG7245_Fall2023/Assignment1' on the Google Drive that your Colab notebook is connected to.
- Create an account on ngrok to get an authtoken: link
- Copy the token into a file named 'ngrok_authtoken.txt' and upload it to the 'Assignment1' folder created above.
- Select the 'T4 GPU' runtime on Colab and run the cells, following the instructions in the cell comments.
- The Colab notebook will generate a public URL for the Nougat API server (sketched below); use it for experimentation with either the hosted Streamlit app or a locally hosted one.
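As a rough illustration (not the notebook's actual cells), the tunnel step might look like the following, assuming pyngrok, a standard Drive mount path, and a hypothetical `nougat_api` launch command:

```python
# Sketch of generating the public Nougat API URL from a Colab cell. The Drive
# mount path and the server launch command are assumptions -- follow the real
# cell comments in NougatAPIServer.ipynb.
import subprocess
from pyngrok import ngrok

# Read the authtoken uploaded to the Assignment1 folder on Google Drive
token_path = "/content/drive/MyDrive/DAMG7245_Fall2023/Assignment1/ngrok_authtoken.txt"
with open(token_path) as f:
    ngrok.set_auth_token(f.read().strip())

# Start the Nougat API server in the background (hypothetical command)
server = subprocess.Popen(["nougat_api"])

# Tunnel port 8503 (where the Nougat API listens) to a public URL
tunnel = ngrok.connect(8503, "http")
print("Nougat API public URL:", tunnel.public_url)
```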
- NougatOCR has two offerings, a CLI and an API. The CLI version is mature, with many configuration options, as opposed to the API version. We used the API version specifically because we are building a cloud application with a decoupled front end and back end.
- As per our experiments:
  - NougatOCR takes 6.30 minutes to generate an MMD file for a single-page PDF when used with a Colab CPU.
  - NougatOCR takes 1.30 minutes to generate an MMD file for a single-page PDF when used with a Colab GPU.
  - NougatOCR caches results at the individual page level, so if output is requested for the same file in later iterations, the responses are quick.
  - The NougatOCR API returns the processed output as a string, which needs preprocessing before it can be converted to an MMD file (see the sketch after this list). For example:
    - Remove the first and last double quotation marks.
    - Replace \n with a literal newline character.
    - Replace \\ with \.
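A minimal sketch of those cleanup steps, assuming the raw API response is held in a placeholder variable `raw_response`:

```python
# Sketch of the cleanup steps listed above, applied in the same order to the
# raw string returned by the Nougat API.
def to_mmd(raw: str) -> str:
    # Remove the first and last double quotation marks
    if raw.startswith('"') and raw.endswith('"'):
        raw = raw[1:-1]
    # Replace escaped \n sequences with literal newline characters
    raw = raw.replace("\\n", "\n")
    # Replace escaped backslashes \\ with single \
    raw = raw.replace("\\\\", "\\")
    return raw

# Write the cleaned text out as an .mmd file
with open("output.mmd", "w", encoding="utf-8") as f:
    f.write(to_mmd(raw_response))
```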
- PyPDF is faster than Nougat (~30 seconds for a single page) at processing PDFs but loses the style and presentation information.
- The ngrok public URL becomes invalid after ~10 minutes, so you will have to rerun particular cells in the NougatAPIServer Colab notebook to generate a new one. Since Colab does not support running multiple cells concurrently, you will have to stop the already running API server, get a new ngrok URL, and then start the API server again.
- Equations: Nougat is better than PyPDF. Nougat properly represents equations with correct symbols and presentation (subscript/superscript); PyPDF fails to capture symbols and does not maintain presentation elements.
- Tables: Nougat is better at recognizing tables than PyPDF.
- Headers, footers, font styles: PyPDF is better at capturing header and footer information but fails to capture font styles.
- Text inside images: Since Nougat performs OCR, it is able to extract textual elements from images; PyPDF fails here.
- Hyperlinks: Nougat needs https:// or http:// in front of URLs to treat them as hyperlinks in MMD files.
The objective of this module is to build a tool that allows users to evaluate the quality of the Freddie Mac Single Family dataset. Users can upload a CSV or XLS file containing either Origination or Monthly performance data and assess whether it adheres to the published schema. The tool uses ydata-profiling to summarize the data and display the results to the user. Additionally, it runs Great Expectations to perform data quality checks, including schema validation, data validity, absence of missing data, and other custom tests.
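As a rough illustration of these checks, the sketch below uses the legacy great_expectations Pandas API (assumed available in the pinned version); the actual expectation suites live under `gx/expectations`, and the column names, counts, and bounds are illustrative only:

```python
# Sketch of the kinds of data quality checks described above, using the
# legacy great_expectations Pandas API. All names and bounds are assumptions.
import great_expectations as ge

gdf = ge.from_pandas(df)  # df: the uploaded Origination or Monthly DataFrame

gdf.expect_table_column_count_to_equal(32)                        # schema validation (assumed count)
gdf.expect_column_values_to_not_be_null("Loan Sequence Number")   # absence of missing data
gdf.expect_column_values_to_be_between("Credit Score", 300, 850)  # data validity (assumed bounds)

results = gdf.validate()
print("All checks passed:", results.success)
```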
Streamlit: Streamlit provides an easy-to-use web interface for uploading files and displaying reports. It simplifies the development of the user interface.
ydata-profiling: ydata-profiling (formerly Pandas Profiling) is a powerful tool for data profiling, generating summary statistics and visualizations of the dataset. It helps users quickly understand the data.
Great Expectations: Great Expectations is a robust framework for data validation. It allows us to define and enforce expectations about the dataset's schema and data quality.
AWS S3: AWS S3 is used for temporary file storage. Uploaded files are stored here before analysis. This ensures that the original dataset is not modified.
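A minimal sketch of that temporary-upload step, assuming boto3 and python-dotenv with a hypothetical `AWS_BUCKET_NAME` variable in `.env`:

```python
# Sketch of the temporary-storage step: the uploaded file is copied to S3
# before analysis so the original dataset is never modified. Credentials are
# read by boto3 from the environment populated via the .env file.
import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # loads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from .env

s3 = boto3.client("s3")
with open("origination_sample.csv", "rb") as f:
    s3.upload_fileobj(f, os.environ["AWS_BUCKET_NAME"],
                      "uploads/origination_sample.csv")
```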
The input data must adhere to the following constraints:
- The data must follow the provided schema and must not have column headers; it must be in the same format as published on the website.
- The data columns must follow the same order as specified in the `file_layout.xlsx` file available on the Freddie Mac Loan Set website (see the loading sketch below).
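A minimal sketch of loading a headerless, pipe-delimited file with the column order taken from `file_layout.xlsx`; the workbook's internal layout (field names in the first column) is an assumption, so inspect the actual file before relying on this:

```python
# Sketch of enforcing the published column order on a headerless input file.
import pandas as pd

layout = pd.read_excel("data/file_layout.xlsx", sheet_name=0)
column_names = layout.iloc[:, 0].tolist()  # assumed: first column holds field names

df = pd.read_csv(
    "origination_sample.csv",
    sep="|",             # published Freddie Mac files are pipe-delimited
    header=None,         # the data must not contain a header row
    names=column_names,  # enforce the published column order
    dtype=str,
)
print(df.head())
```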
The app can be accessed directly on Streamlit Cloud via link, OR:
- Clone the repository to your local machine: `git clone <repository_url>`
- Navigate to the module directory: `cd Part2`
- Create a virtual environment and activate it: `python -m venv venv && source venv/bin/activate` (on Windows, use `venv\Scripts\activate`)
- Install the required dependencies from the `requirements.txt` file: `pip install -r requirements.txt`
- Create a `.env` file with the necessary environment variables, such as AWS credentials (reference: `example.env`).
- Run the Streamlit application: `streamlit run Home.py`
- Access the tool through your web browser at `http://localhost:8501`.
By using this tool, we aim to simplify the process of evaluating the quality of Freddie Mac Single Family dataset files. The tool leverages the power of ydata-profiling and Great Expectations to provide comprehensive data analysis and validation reports, ensuring data quality and adherence to the schema. This project will help data engineers and analysts assess and trust the data they work with, ultimately improving data-driven decision-making processes.
If you encounter any issues or have suggestions for improvements, please feel free to open an issue or contribute to this project. Your feedback is valuable in enhancing the tool's functionality and usability.
- Aditya: 34%
- Nidhi: 33%
- Uddhav: 33%
| Developer | Deliverables |
|---|---|
| Aditya | Streamlit Part 1 - Nougat |
| Aditya | Git setup and integration |
| Uddhav | Streamlit Part 2 - Great Expectations |
| Uddhav | Streamlit Part 2 - ydata-profiling |
| Nidhi | Streamlit Part 1 - PyPDF |
| Nidhi | Architecture Diagrams |
| Nidhi | Documentation |
WE ATTEST THAT WE HAVENβT USED ANY OTHER STUDENTSβ WORK IN OUR ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK.