
A Python-based data validation framework to validate data from the data lake to the data warehouse

This repository contains a Python script for validating healthcare data across AWS services, including Amazon S3, Amazon Athena, and PostgreSQL. The script collects record counts from each source and sends a summary report via email.

Prerequisites

Before running the script, ensure you have the following dependencies installed:

  • Python 3.x
  • boto3
  • pandas
  • psycopg2
  • environ
  • pandasql
  • pretty_html_table

You can install the dependencies via pip:

pip install boto3 pandas psycopg2 environ pandasql pretty_html_table

Configuration

Ensure you have configured your AWS credentials properly. You can set up your AWS credentials using AWS CLI or directly in the script.
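As a rough illustration (not necessarily how main.py handles it), boto3 picks up credentials from the AWS CLI configuration automatically when none are passed, or you can supply them explicitly; the keys and region below are placeholders:

import boto3

# Uses the default credential chain (AWS CLI profile, environment
# variables, or an instance role) when no keys are passed explicitly.
s3 = boto3.client("s3", region_name="us-east-1")

# Alternatively, pass credentials loaded from your own configuration
# (the key values and region here are placeholders, not values from this repo).
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    region_name="us-east-1",
)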

Usage

  1. Clone the repository:

git clone https://github.com/your-username/data-validation-framework-to-validate-data-from-Datalake-till-Datawarehouse.git
cd data-validation-framework-to-validate-data-from-Datalake-till-Datawarehouse

  2. Update the config.json file with your AWS credentials and the other necessary configuration (an illustrative example follows these steps).

  3. Run the Python script main.py:

python main.py
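The exact keys expected in config.json depend on the script, so the sketch below is only an assumption about the kind of settings the framework needs (S3 bucket, Athena database, PostgreSQL connection, SMTP details) and how they might be loaded:

import json

# Hypothetical config.json layout -- the real file in this repo may use
# different key names:
# {
#   "s3_bucket": "my-datalake-bucket",
#   "athena_database": "healthcare_db",
#   "athena_output": "s3://my-athena-results/",
#   "pg_host": "localhost",
#   "pg_dbname": "warehouse",
#   "pg_user": "etl_user",
#   "pg_password": "secret",
#   "smtp_host": "smtp.example.com",
#   "mail_to": "team@example.com"
# }

with open("config.json") as f:
    config = json.load(f)

print(config["s3_bucket"])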

Description

This script performs healthcare data analysis using the following steps:

  1. Amazon S3 Interaction: Connects to Amazon S3 using the boto3 library to retrieve data files.

  2. File Count: Counts the number of rows in each data file and prints the results.

  3. Athena Interaction: Utilizes Amazon Athena to query data snapshots and counts the records in each snapshot.

  4. PostgreSQL Interaction: Connects to a PostgreSQL database to count records in landing tables.

  5. Email Notification: Generates a summary report containing file counts, Athena counts, and PostgreSQL counts. Sends the report via email using SMTP.
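A minimal sketch of steps 3 and 4, assuming a snapshot table and a landing table named customer; the database names, output location, and connection details below are placeholders, not values taken from this repository:

import time

import boto3
import psycopg2

# --- Athena: count records in a snapshot table (step 3) ---
athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

run = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM customer",
    QueryExecutionContext={"Database": "healthcare_db"},                 # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # placeholder location
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
athena_count = int(rows[1]["Data"][0]["VarCharValue"])  # row 0 is the header row

# --- PostgreSQL: count records in the landing table (step 4) ---
conn = psycopg2.connect(
    host="localhost", dbname="warehouse", user="etl_user", password="secret"  # placeholders
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM landing.customer")  # placeholder table name
    pg_count = cur.fetchone()[0]

print(f"Athena count: {athena_count}, landing table count: {pg_count}")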

Input Data

The input data consists of CSV files stored in an Amazon S3 bucket and snapshots in Amazon Athena.

Output

The output is a summary report sent via email, containing the counts of records in data files, Athena snapshots, and PostgreSQL landing tables.
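A hedged sketch of how such a report could be built and mailed with pandas, pretty_html_table, and smtplib; the SMTP host, addresses, credentials, and counts are placeholders:

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

import pandas as pd
from pretty_html_table import build_table

# Assemble the summary (counts here are placeholders).
summary = pd.DataFrame(
    {
        "entity": ["customer", "orders"],
        "s3_file_count": [654, 60918],
        "athena_count": [654, 60918],
        "landing_table_count": [654, 60918],
    }
)

msg = MIMEMultipart()
msg["Subject"] = "Daily data validation summary"
msg["From"] = "etl-bot@example.com"      # placeholder sender
msg["To"] = "team@example.com"           # placeholder recipient
msg.attach(MIMEText(build_table(summary, "blue_light"), "html"))

with smtplib.SMTP("smtp.example.com", 587) as server:   # placeholder host
    server.starttls()
    server.login("etl-bot@example.com", "password")      # placeholder credentials
    server.send_message(msg)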

Contributing

Contributions are welcome! If you have suggestions, feature requests, or bug fixes, please feel free to open an issue or create a pull request.

Workflow

The manual workflow was as follows: first, we download all the daily files from S3 → then we count the rows in each file and record those counts in an Excel spreadsheet → then we take the Athena count for that day's snapshot and record it in the same spreadsheet → after the integration job completes, we take the landing table count and record it as well → finally, we send a mail with those details in tabular format.

This Python framework automates the manual workflow described above. The script:

  • Counts the records of the daily incoming files in the S3 buckets without downloading the files (see the sketch below).
  • Takes the Athena count of those files' snapshots without querying Athena manually.
  • Collects the landing table counts without querying the database manually.
  • Sends a mail with all the counts in tabular format automatically.
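One way to count rows without saving the files locally is to stream each object's body straight from S3; this is only a sketch of that idea, and the bucket and prefix names are placeholders:

import boto3

s3 = boto3.client("s3")

def count_rows(bucket: str, key: str) -> int:
    """Count lines of an S3 object by streaming it, without writing to disk."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    return sum(1 for _ in body.iter_lines())

# Placeholder bucket/prefix; subtract 1 per file if the files carry a header row.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-datalake-bucket", Prefix="customer/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], count_rows("my-datalake-bucket", obj["Key"]))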

For testing purposes, I have used dummy customer.txt and order.txt files. Please find below some screenshots for better understanding; let me know if a demo would help. S3 path details:

(screenshot)

Customer folder:

(screenshot)

Order folder:

(screenshot)

File Description:

Customer File:

(screenshot)

Order File:

(screenshot)

The customer file has 654 rows and the order file has 60,918 rows.

Athena Count:

Customer:

(screenshot)

Orders:

(screenshot)

Landing Table:

Customer:

(screenshot)

Orders:

(screenshot)

Script's output:

(screenshot)

Mail:

(screenshot)
