Commit 089615f
Update README.md
1 parent b98f499

1 file changed: README.md (63 additions, 17 deletions)
---

## 📌 1. Project Overview

This project demonstrates my ability to build a **scalable, production-grade data pipeline** using industry-standard tools. From raw data ingestion and transformation to CI/CD, it simulates the daily responsibilities of a Data Engineer.

> ⚙️ Tech stack: GCP + BigQuery + DBT Core + GitHub Actions + Python

---

## 🛠️ 2. Tools & Technologies Used

| Tool | Purpose |
|---------------------|----------------------------------------------|
---

## 🧱 3. Architecture Diagram

This project follows a modular and automated data engineering architecture on Google Cloud.
Raw synthetic healthcare data is generated and stored in GCS, exposed to BigQuery as external tables, transformed via DBT models, and deployed through CI/CD using GitHub Actions.
---

## 🔁 4. Step-by-Step Workflow

### 4.1 GCP Setup

- Created a new Google Cloud project (`root-matrix-457217-p5`)
- Enabled **BigQuery**, **Cloud Storage**, and **IAM** APIs
- Created service accounts with proper IAM roles (`BigQuery Admin`, `Storage Admin`, etc.)

<p align="center">
  <img src="./images/gcp-project-setup.png" alt="GCP Project Setup" width="700"/>
</p>

### 4.2 BigQuery Datasets & GCS Buckets

- Created datasets:
  - `dev_healthcare_data`
  - `prod_healthcare_data`
- **Cloud Storage bucket (`healthcare-data-bucket-amarkhatri`) was created automatically by the Python script**

<p align="center">
  <img src="./images/gcs-bucket-files.png" alt="GCS Bucket with Raw Files" width="700"/>
</p>
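The dataset step can be scripted just like the bucket creation. A minimal sketch, assuming the `google-cloud-bigquery` client and application-default credentials; the function and constant names here are illustrative, not the repo's actual code:

```python
from typing import List

PROJECT_ID = "root-matrix-457217-p5"
DATASETS = ["dev_healthcare_data", "prod_healthcare_data"]

def qualified_dataset_ids(project_id: str, names: List[str]) -> List[str]:
    # BigQuery identifies datasets as "<project>.<dataset>".
    return [f"{project_id}.{name}" for name in names]

def create_datasets(project_id: str, names: List[str]) -> None:
    # Requires the google-cloud-bigquery package and application-default
    # credentials; imported lazily so the module loads without them.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    for dataset_id in qualified_dataset_ids(project_id, names):
        # exists_ok=True makes the call idempotent across re-runs.
        client.create_dataset(dataset_id, exists_ok=True)
```

`create_dataset(..., exists_ok=True)` keeps the setup re-runnable, mirroring the bucket-creation check in the script below.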
### 4.3 Data Generation Script (Python)

To simulate a real-world healthcare data pipeline, I wrote a Python script that:

- Generates synthetic data using the **Faker** library for:
  - Patient demographics (`CSV`)
  - Electronic health records (`JSON`)
  - Insurance claims (`Parquet`)
- Creates a **Cloud Storage bucket** if it doesn't already exist
- **Cleans the target folders** before uploading new files
- Uploads raw data directly to GCS (`dev/` and `prod/` folders)
- Writes all files in appropriate formats using:
  - `pandas` for CSV
  - `json` for newline-delimited JSON
  - `pyarrow` for Parquet

✅ The script performs **all ingestion + staging steps programmatically**, without manual uploads.

> 📁 Script location: [`data_generator/synthetic_data_generator.py`](./data_generator/synthetic_data_generator.py)

#### 🔑 Key Logic Overview

```python
# Create the bucket if it doesn't exist
def create_bucket():
    bucket = storage_client.bucket(BUCKET_NAME)
    if not bucket.exists():
        storage_client.create_bucket(BUCKET_NAME)

# Generate synthetic patients as a DataFrame
def generate_patients(num_records):
    ...
    return pd.DataFrame(patients)

# Upload CSV, JSON, or Parquet to GCS
def upload_to_gcs(data, path, filename, file_format):
    ...
```
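The elided record-generation and newline-delimited JSON steps boil down to building a list of flat dicts and writing one JSON object per line. A dependency-free sketch; the real script uses Faker, so the `random`/`uuid` stand-ins and field names here are illustrative only:

```python
import io
import json
import random
import uuid

def generate_patients(num_records):
    # Stand-in for the Faker-based generator: each record is a flat dict,
    # which pandas can later turn into a DataFrame for the CSV export.
    random.seed(42)  # reproducible output for the sketch
    return [
        {
            "patient_id": str(uuid.uuid4()),
            "age": random.randint(0, 99),
            "gender": random.choice(["F", "M"]),
        }
        for _ in range(num_records)
    ]

def to_ndjson(records):
    # Newline-delimited JSON: one JSON object per line, the layout
    # BigQuery expects when loading or externalizing JSON files.
    buf = io.StringIO()
    for record in records:
        buf.write(json.dumps(record) + "\n")
    return buf.getvalue()
```

The same one-object-per-line layout is what makes the JSON files directly usable as BigQuery external table sources in the next step.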
### 4.4 External Table Creation

- Used BigQuery to create external tables from GCS
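An external table of this kind can be declared with BigQuery DDL. A sketch under assumed names; the table name and URI pattern follow the bucket layout above and are not taken from the repo:

```sql
-- Hypothetical example: expose the raw patient CSVs in GCS as a
-- queryable external table in the dev dataset.
CREATE OR REPLACE EXTERNAL TABLE `dev_healthcare_data.ext_patients`
OPTIONS (
  format = 'CSV',
  uris = ['gs://healthcare-data-bucket-amarkhatri/dev/patients/*.csv'],
  skip_leading_rows = 1
);
```

Because the table only references the GCS objects, re-running the Python generator refreshes the data without recreating the table.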
---

## 🧬 5. Data Lineage & Model Flow

> _Include a screenshot of your DBT Cloud lineage graph here_

The lineage graph shows how raw data from GCS flows into staging, transformation, and final analytical tables.

---

## 📸 6. Screenshots & Walkthrough

Add screenshots of:

- BigQuery datasets (`dev` and `prod`)
- GCS bucket with raw files
- DBT models + directory structure
- GitHub Actions CI passing