This project is a fully automated ELT pipeline that collects, processes, and visualizes funding data for the 100 most recently funded Indian startups.
The goal is to demonstrate how raw, unstructured web data can be transformed into a reliable, analytics-ready dataset using modern data engineering principles.
The final output is a Power BI dashboard that refreshes on a schedule.
Public startup funding data is often:
- Unstructured
- Spread across different websites
- Updated irregularly
Traditional scraping tools fail on such sources, and one-time data pulls do not support historical analysis.
I built an end-to-end ELT pipeline that:
- Scrapes GrowthList weekly with a headless browser to collect raw data on the 100 most recently funded Indian startups
- Loads raw and historical data safely into BigQuery (a serverless SQL data warehouse)
- Cleans and transforms the data using SQL inside the warehouse
- Connects the BigQuery dataset to Power BI for reporting
- Automates the entire process via GitHub Actions and Scheduled Refresh
| Layer | Technology |
|---|---|
| Extraction | Python, Playwright (Headless Chromium) |
| Orchestration | GitHub Actions (CI/CD, CRON scheduling) |
| Data Warehouse | Google BigQuery (Serverless SQL) |
| Analytics | Power BI (Microsoft Fabric) |
| Security | GitHub Secrets, GCP Service Accounts |
This pipeline follows a cloud-native ELT design, where raw data is loaded first and transformed later inside the warehouse.
The source website loads data dynamically and requires user-like interaction (clicking a button) before the full list is visible on the page. Below is a comparison between the tools considered for this step:
| Feature | BeautifulSoup4 | Scrapy | Selenium | Playwright (Recommended) |
|---|---|---|---|---|
| Can click buttons? | ❌ No | ❌ No (not natively) | ✅ Yes | ✅ Yes |
| Handles JavaScript? | ❌ No | ❌ No | ✅ Yes | ✅ Yes (best-in-class) |
| Speed | ⚡ Very fast | ⚡ Fast (async) | 🔴 Slow | 🟢 Fast |
| Learning curve | 🟢 Very easy | 🔴 Hard | 🟡 Medium | 🟡 Medium |
| Headless mode | N/A | N/A | ✅ Yes | ✅ Yes (native & optimized) |
Playwright was chosen over BeautifulSoup, Scrapy, and Selenium because of:

- **Web interaction:** The scraper must drive the browser (move to a button and click it), which disqualifies BeautifulSoup and Scrapy.
- **Auto-waiting:** Unlike Selenium, Playwright waits for elements to be visible and clickable before interacting, so the script does not crash if the page takes a couple of seconds to load.

  ```python
  page.locator(".btn").click()  # auto-waits for the button to appear before clicking
  ```

- **Headless mode:** Runs the web browser in the background without a GUI, so it is faster and uses fewer resources.
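A fuller sketch of the scrape step is below. The URL, button selector, and column names are assumptions for illustration; the real selectors depend on GrowthList's markup.

```python
# Illustrative Playwright scrape (URL and selectors are assumptions,
# not the project's actual values).

def parse_rows(rows):
    """Map raw table cells to dicts keyed by (assumed) column names."""
    cols = ["company", "country", "industry", "amount", "last_funding_date"]
    return [dict(zip(cols, r)) for r in rows]

def scrape(url):
    from playwright.sync_api import sync_playwright  # deferred import
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)   # headless Chromium
        page = browser.new_page()
        page.goto(url)
        page.locator(".load-more").click()           # auto-waits until clickable
        cells = [row.locator("td").all_inner_texts()
                 for row in page.locator("table tbody tr").all()]
        browser.close()
    return parse_rows(cells)
```

The deferred import keeps the parsing helper testable without browser binaries installed.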
The warehouse is structured using a Medallion pattern to clearly separate data responsibilities.
| Layer | Purpose |
|---|---|
| Bronze | Preserve raw data exactly as scraped. Acts as a safety net: if transformation logic has a bug or requirements change later, we still have the original data without re-scraping the website. |
| Silver | Clean and standardize fresh data |
| Gold | Maintain historical, analytics-ready facts. This is the table connected to Power BI for visualization |
| Table | Description |
|---|---|
| `stg_raw_latest` | Holds only the latest scrape (truncate & load) |
| `fct_raw_history` | Append-only table storing all raw records |
Why two raw tables?
- `stg_raw_latest` keeps transformations fast and cheap
- `fct_raw_history` preserves the historical raw data collected over time
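The two write strategies can be sketched in plain Python. The in-memory "warehouse" below is an illustration only; in the real pipeline these are BigQuery load jobs with truncate and append write dispositions.

```python
# In-memory sketch of the Bronze-layer write strategies (illustrative;
# not the BigQuery client code).
warehouse = {"stg_raw_latest": [], "fct_raw_history": []}

def load_weekly_scrape(rows):
    # stg_raw_latest: truncate & load -- only the newest scrape survives
    warehouse["stg_raw_latest"] = list(rows)
    # fct_raw_history: append-only -- every scrape is preserved
    warehouse["fct_raw_history"].extend(rows)

load_weekly_scrape([{"company": "Acme", "amount": "1M"}])
load_weekly_scrape([{"company": "Beta", "amount": "2M"}])
print(len(warehouse["stg_raw_latest"]))   # 1 -- latest scrape only
print(len(warehouse["fct_raw_history"]))  # 2 -- full history
```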
| Table | Description |
|---|---|
| `stg_clean_latest` | Cleaned version of the latest scrape |
Transformations include:
- Uppercasing categorical fields
- Parsing funding dates
- Normalizing multi-valued industries using `UNNEST`
- Splitting the scrape timestamp into date and time (UTC)
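The transformations above can be sketched in Python. The field names and date format are assumptions for illustration; the real pipeline performs these steps in SQL inside BigQuery.

```python
# Sketch of the Silver-layer cleaning logic (field names are assumptions;
# the actual transformations run as SQL in BigQuery).
from datetime import datetime

def clean_record(raw):
    scraped_at = datetime.fromisoformat(raw["scraped_at"])
    base = {
        "company_name": raw["company"].strip(),
        "country": raw["country"].strip().upper(),   # uppercase categorical field
        "last_funding_date": datetime.strptime(
            raw["last_funding_date"], "%d %b %Y").date(),  # parse funding date
        "scrape_date": scraped_at.date(),            # timestamp split into
        "scrape_time": scraped_at.time(),            # date and time (UTC)
    }
    # Normalize multi-valued industries: one row per industry, like UNNEST in SQL
    return [dict(base, industry=i.strip().upper())
            for i in raw["industry"].split(",")]

rows = clean_record({
    "company": "Acme Labs", "country": "india",
    "industry": "Fintech, AI", "last_funding_date": "05 Jan 2025",
    "scraped_at": "2025-01-12T18:30:00+00:00",
})
print([r["industry"] for r in rows])  # ['FINTECH', 'AI']
```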
| Table | Description |
|---|---|
| `fct_clean_history` | Historical fact table for reporting |
This table is populated using a MERGE (UPSERT) strategy to ensure:
- No duplicate funding events
- Existing records are updated if source data changes
The diagram below represents the schema and enforced data types for the fct_clean_history table.
A funding record is considered unique based on:
- Company Name
- Country
- Industry Type
- Last Funding Date
Why this matters:
- Startups raise multiple funding rounds
- Company name alone is not a stable identifier
Behavior:
- If a matching record exists β update non-key fields
- If no match exists β insert a new row
This ensures the Gold table always reflects the most accurate version of each funding event.
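The MERGE behavior can be sketched in plain Python. The real pipeline issues a BigQuery `MERGE` statement; this in-memory version only demonstrates the matching logic on the composite key defined above.

```python
# In-memory sketch of the MERGE (UPSERT) into fct_clean_history
# (illustrative; the real load is a BigQuery MERGE statement).

def merge_key(row):
    # The composite business key that makes a funding event unique
    return (row["company_name"], row["country"],
            row["industry"], row["last_funding_date"])

def upsert(gold, incoming):
    index = {merge_key(r): r for r in gold}
    for row in incoming:
        key = merge_key(row)
        if key in index:
            index[key].update(row)   # match -> update non-key fields
        else:
            gold.append(row)         # no match -> insert a new row
    return gold

gold = [{"company_name": "ACME", "country": "INDIA",
         "industry": "FINTECH", "last_funding_date": "2025-01-05",
         "amount": "1M"}]
upsert(gold, [{"company_name": "ACME", "country": "INDIA",
               "industry": "FINTECH", "last_funding_date": "2025-01-05",
               "amount": "1.2M"}])   # same key -> amount is corrected in place
print(len(gold), gold[0]["amount"])  # 1 1.2M
```

Re-running the load with identical source data leaves the table unchanged, which is what makes the pipeline idempotent.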
- Direct integration between Google BigQuery and Microsoft Power BI
- No local gateway required
- Fully automated using Scheduled Refresh
The entire pipeline is orchestrated using GitHub Actions
To enable full automation, we need to add the following GitHub Secrets:
| Secret Name | Description |
|---|---|
| GCP_SA_KEY | Entire contents of `bigquery-key.json` |
- Installs Playwright browser binaries
- Sets up Python dependencies
- Authenticates with BigQuery
- Executes the ELT pipeline on schedule
`30 18 * * 0`: runs every Sunday at 6:30 PM UTC (i.e., Monday at 12:00 AM IST)
- Works directly with GitHub
- Free of cost
- Easy CI/CD-style reproducibility
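The workflow steps above can be sketched as a minimal GitHub Actions file. The file path, script name (`pipeline.py`), and action versions are illustrative assumptions; the secret name matches the table above.

```yaml
# .github/workflows/pipeline.yml -- illustrative sketch, not the exact file
name: weekly-elt
on:
  schedule:
    - cron: "30 18 * * 0"    # every Sunday at 6:30 PM UTC
  workflow_dispatch:          # allow manual runs for debugging
jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt   # Python dependencies
      - run: playwright install chromium       # Playwright browser binaries
      - env:
          GCP_SA_KEY: ${{ secrets.GCP_SA_KEY }}  # BigQuery authentication
        run: python pipeline.py                  # execute the ELT pipeline
```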
The Gold table (`fct_clean_history`) is connected directly to Power BI using Import connectivity mode.
- Where does the funding go?
- Industry-wise funding distribution
- Funding stage analysis
This project mirrors how modern data platforms are built in production, using techniques such as:
- Headless browser-based data extraction
- Medallion (Bronze / Silver / Gold) architecture
- Cloud-native ELT pipelines
- Idempotent data loading using MERGE (UPSERT)
- CI/CD-based orchestration
- Direct cloud-to-cloud BI integration
- Add data quality checks (Great Expectations)
- Add alerting on pipeline failures
- Extend to additional startup ecosystems