This repository contains a Python script for analyzing healthcare data using various AWS services including Amazon S3, Amazon Athena, and PostgreSQL. The script retrieves data from different sources, performs analysis, and sends a summary report via email.
Before running the script, ensure you have the following dependencies installed:
- Python 3.x
- boto3
- pandas
- psycopg2
- environ
- pandasql
- pretty_html_table
You can install the dependencies via pip:
```bash
pip install boto3 pandas psycopg2 environ pandasql pretty_html_table
```
Ensure you have configured your AWS credentials properly. You can set up your AWS credentials using AWS CLI or directly in the script.
- Clone the repository:

```bash
git clone https://github.com/your-username/healthcare-aws-analysis.git
cd data-validation-framework-to-validate-data-from-Datalake-till-Datawarehouse
```
- Update the `config.json` file with your AWS credentials and other necessary configurations.
- Run the Python script `main.py`:

```bash
python main.py
```
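The script does not prescribe a particular `config.json` layout; a structure such as the one below could work. Every field name and value here is illustrative, not taken from the repository:

```json
{
  "aws": {
    "region": "us-east-1",
    "s3_bucket": "my-healthcare-bucket",
    "athena_database": "healthcare_db",
    "athena_output_location": "s3://my-healthcare-bucket/athena-results/"
  },
  "postgres": {
    "host": "localhost",
    "port": 5432,
    "database": "warehouse",
    "user": "analyst"
  },
  "email": {
    "smtp_host": "smtp.example.com",
    "smtp_port": 587,
    "from": "reports@example.com",
    "to": ["team@example.com"]
  }
}
```

Prefer environment variables or an AWS CLI profile over storing credentials in this file.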
This script performs healthcare data analysis using the following steps:
- Amazon S3 Interaction: Connects to Amazon S3 using the `boto3` library to retrieve data files.
- File Count: Counts the number of rows in each data file and prints the results.
- Athena Interaction: Uses Amazon Athena to query data snapshots and counts the records in each snapshot.
- PostgreSQL Interaction: Connects to a PostgreSQL database to count records in landing tables.
- Email Notification: Generates a summary report containing the file, Athena, and PostgreSQL counts, and sends the report via email using SMTP.
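As a sketch of the final reporting step, the counts from the different sources can be merged into a single HTML table and mailed over SMTP. The function names, column labels, and SMTP settings below are illustrative, not taken from `main.py`:

```python
import smtplib
from email.mime.text import MIMEText

def build_summary_html(rows):
    """Render (source, entity, count) triples as a simple HTML table."""
    body = "".join(
        f"<tr><td>{src}</td><td>{entity}</td><td>{count}</td></tr>"
        for src, entity, count in rows
    )
    return (
        "<table border='1'>"
        "<tr><th>Source</th><th>Entity</th><th>Count</th></tr>"
        f"{body}</table>"
    )

def send_report(html, smtp_host, sender, recipients):
    """Send the HTML summary via SMTP (hypothetical settings)."""
    msg = MIMEText(html, "html")
    msg["Subject"] = "Daily data validation summary"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    with smtplib.SMTP(smtp_host, 587) as server:
        server.starttls()
        server.send_message(msg)

# Example counts, matching the dummy files used later in this README:
rows = [
    ("S3 file", "customer", 654),
    ("S3 file", "order", 60918),
]
html = build_summary_html(rows)
```

The actual script uses `pretty_html_table` for table styling; the hand-rolled table above just keeps the sketch dependency-free.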
The input data consists of CSV files stored in an Amazon S3 bucket and snapshots in Amazon Athena.
The output is a summary report sent via email, containing the counts of records in data files, Athena snapshots, and PostgreSQL landing tables.
Contributions are welcome! If you have suggestions, feature requests, or bug fixes, please feel free to open an issue or create a pull request.
- boto3 - AWS SDK for Python.
- pandas - Python data analysis library.
- psycopg2 - PostgreSQL adapter for Python.
- pretty-html-table - Python library for generating HTML tables.
The manual workflow looks like this:

1. Download all the daily files from S3.
2. Count the rows of each file and record the counts in an Excel spreadsheet.
3. Take the Athena count for that day's snapshot and record it in the same spreadsheet.
4. After the integration job completes, take the landing table count and record it as well.
5. Send a mail with all those details in tabular format.
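The Athena-count step of that workflow can be automated with `boto3`'s Athena client. This is a minimal sketch: the database, table names, and output location are placeholders, and error handling is omitted:

```python
import time

def build_count_query(table):
    """Build the COUNT query string (kept separate so it is easy to test)."""
    return f"SELECT COUNT(*) FROM {table}"

def athena_count(table, database, output_location):
    """Run SELECT COUNT(*) against an Athena table and return the count.

    boto3 is imported lazily; all identifiers here are illustrative.
    """
    import boto3
    client = boto3.client("athena")
    qid = client.start_query_execution(
        QueryString=build_count_query(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    # Poll until the query finishes.
    while True:
        status = client.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    rows = client.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    return int(rows[1]["Data"][0]["VarCharValue"])  # row 0 is the header
```

A production version would also check for the `FAILED` state and surface the error message instead of parsing the result unconditionally.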
This Python framework automates that manual workflow. The script can:

- count the records of the daily incoming files in the S3 buckets without downloading the files;
- take the Athena count of those files' snapshots without querying Athena manually;
- collect the landing table counts without querying manually;
- send a mail with all the counts in tabular format, automatically.
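Counting rows without downloading a file can be done by streaming the S3 object body in chunks and counting newlines. The bucket and key names below are placeholders:

```python
def count_newlines(chunks):
    """Count newline-terminated rows in an iterable of byte chunks."""
    total = 0
    for chunk in chunks:
        total += chunk.count(b"\n")
    return total

def count_s3_rows(bucket, key, chunk_size=1024 * 1024):
    """Stream an S3 object and count its rows without writing it to disk.

    boto3 is imported lazily; the bucket/key arguments are illustrative.
    """
    import boto3
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return count_newlines(body.iter_chunks(chunk_size))
```

Note that this counts every line, so if the files carry a header row you would subtract one from the result.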
For testing purposes, I have used dummy customer.txt and order.txt files. Please find below some screenshots for a better understanding, and let me know if a demo would help.

S3 path details:
- Customer folder (screenshot)
- Order folder (screenshot)

File description:

- Customer file (screenshot)
- Order file (screenshot)

The customer file has 654 rows and the order file has 60,918 rows.

Athena count:

- Customer (screenshot)
- Orders (screenshot)

Landing table count:

- Customer (screenshot)
- Orders (screenshot)

Script's output (screenshot)

Mail (screenshot)
