
Commit 854944e

feat(etl): Script to export github data into bigquery
1 parent bada097 commit 854944e

File tree

7 files changed: +1160 -70 lines changed


Dockerfile.mock

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+# Dockerfile for mock GitHub API service
+FROM python:3.11-slim
+
+WORKDIR /app
+
+# Install Flask
+RUN pip install --no-cache-dir flask
+
+# Copy mock API script
+COPY mock_github_api.py .
+
+# Expose port
+EXPOSE 5000
+
+# Run the mock API
+CMD ["python", "mock_github_api.py"]
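The `mock_github_api.py` script that this image copies is part of the same commit but not shown in this diff. As a rough sketch of what such a mock has to provide — GitHub-style pagination over generated pull requests — the data and paging logic might look like the following (the endpoint shape and field names are assumptions; the real script wraps this in Flask routes):

```python
# Hypothetical sketch of the data behind mock_github_api.py (not shown in this
# diff). The real script serves these payloads via Flask; only the generation
# and pagination logic is sketched here.
import json
from datetime import datetime, timedelta, timezone

TOTAL_PRS = 250   # README: the mock generates 250 sample pull requests
PER_PAGE = 100    # matches the ETL's chunk size


def make_pr(number: int) -> dict:
    """Build one GitHub-like pull request payload (fields are assumptions)."""
    created = datetime(2024, 1, 1, tzinfo=timezone.utc) + timedelta(hours=number)
    return {
        "number": number,
        "title": f"Sample PR #{number}",
        "state": "closed" if number % 2 else "open",
        "created_at": created.isoformat(),
        "user": {"login": f"user{number % 7}"},
        "labels": [{"name": "test"}],
    }


def list_pulls(page: int = 1, per_page: int = PER_PAGE) -> str:
    """JSON body for GET /repos/<owner>/<repo>/pulls?page=N&per_page=M."""
    start = (page - 1) * per_page + 1
    end = min(page * per_page, TOTAL_PRS)
    prs = [make_pr(n) for n in range(start, end + 1)]
    return json.dumps(prs)
```

With 250 PRs and `per_page=100`, pages 1-3 return 100, 100, and 50 items, and page 4 returns an empty list — which is how the ETL's pagination loop knows to stop.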

README.md

Lines changed: 110 additions & 9 deletions
@@ -1,19 +1,32 @@
 # github-etl
+
 An ETL for the Mozilla Organization Firefox repositories
 
 ## Overview
 
-This repository contains a Python-based ETL (Extract, Transform, Load) script designed to process data from Mozilla Organization Firefox repositories on GitHub. The application runs in a Docker container for easy deployment and isolation.
+This repository contains a Python-based ETL (Extract, Transform, Load) script
+designed to process pull request data from Mozilla Organization Firefox
+repositories on GitHub and load it into Google BigQuery. The application
+runs in a Docker container for easy deployment and isolation.
 
 ## Features
 
 - **Containerized**: Runs in a Docker container using the latest stable Python
 - **Secure**: Runs as a non-root user (`app`) inside the container
-- **Structured**: Follows ETL patterns with separate extract, transform, and load phases
-- **Logging**: Comprehensive logging for monitoring and debugging
+- **Streaming Architecture**: Processes pull requests in chunks of 100 for memory efficiency
+- **BigQuery Integration**: Loads data directly into BigQuery using the Python client library
+- **Rate Limit Handling**: Automatically handles GitHub API rate limits
+- **Comprehensive Logging**: Detailed logging for monitoring and debugging
 
 ## Quick Start
 
+### Prerequisites
+
+1. **GitHub Personal Access Token**: Create a [token](https://github.com/settings/tokens)
+2. **Google Cloud Project**: Set up a GCP project with BigQuery enabled
+3. **BigQuery Dataset**: Create a dataset in your GCP project
+4. **Authentication**: Configure GCP credentials (see Authentication section below)
+
 ### Building the Docker Image
 
 ```bash
@@ -23,9 +36,26 @@ docker build -t github-etl .
 ### Running the Container
 
 ```bash
-docker run --rm github-etl
+docker run --rm \
+  -e GITHUB_REPOS="mozilla/firefox" \
+  -e GITHUB_TOKEN="your_github_token" \
+  -e BIGQUERY_PROJECT="your-gcp-project" \
+  -e BIGQUERY_DATASET="your_dataset" \
+  -e GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json" \
+  -v /local/path/to/credentials.json:/path/to/credentials.json \
+  github-etl
 ```
 
+### Environment Variables
+
+| Variable | Required | Default | Description |
+|----------|----------|---------|-------------|
+| `GITHUB_REPOS` | Yes | - | Comma-separated repositories in the format "owner/repo" (e.g., "mozilla/firefox") |
+| `GITHUB_TOKEN` | No | - | GitHub Personal Access Token (recommended to avoid rate limits) |
+| `BIGQUERY_PROJECT` | Yes | - | Google Cloud Project ID |
+| `BIGQUERY_DATASET` | Yes | - | BigQuery dataset ID |
+| `GOOGLE_APPLICATION_CREDENTIALS` | Yes* | - | Path to GCP service account JSON file (*or use Workload Identity) |
+
 ## Architecture
 
 ### Components
@@ -43,24 +73,95 @@ docker run --rm github-etl
 
 ### ETL Process
 
-1. **Extract**: Retrieves data from GitHub repositories
-2. **Transform**: Processes and structures the data
-3. **Load**: Stores the processed data in the target destination
+The pipeline uses a **streaming/chunked architecture** that processes pull
+requests in batches of 100:
+
+1. **Extract**: A generator yields chunks of 100 PRs from the GitHub API
+   - Implements pagination and rate limit handling
+   - Fetches all pull requests (open, closed, merged) sorted by creation date
+
+2. **Transform**: Flattens and structures PR data for BigQuery
+   - Extracts key fields (number, title, state, timestamps, user info)
+   - Flattens nested objects (user, head/base branches)
+   - Converts arrays (labels, assignees) to JSON strings
+
+3. **Load**: Inserts transformed data into BigQuery
+   - Uses the BigQuery Python client library
+   - Adds a snapshot_date timestamp to all rows
+   - Inserts immediately after each chunk is transformed
+
+**Benefits of Chunked Processing**:
+
+- Memory-efficient for large repositories
+- Incremental progress visibility
+- Early failure detection
+- Supports streaming data pipelines
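The chunked Extract phase described above can be sketched as a generator. This is an assumed shape — `main.py` is not shown in this diff — with `fetch_page` standing in for the real GitHub API call (pagination and rate-limit handling would live inside it):

```python
# Sketch of the chunked Extract phase (assumed shape; main.py is not shown
# in this diff).
from typing import Callable, Iterator

CHUNK_SIZE = 100  # README: PRs are processed in chunks of 100


def extract_pr_chunks(fetch_page: Callable[[int], list]) -> Iterator[list]:
    """Yield pull requests one page (= one chunk of up to CHUNK_SIZE) at a time.

    fetch_page(page) returns one page of PR dicts from the GitHub API and an
    empty list once the pages are exhausted.
    """
    page = 1
    while True:
        prs = fetch_page(page)
        if not prs:
            return        # no more pages: extraction is done
        yield prs         # caller transforms and loads this chunk immediately
        page += 1


def fake_api(page: int) -> list:
    """Stand-in for the GitHub API: 250 fake PRs, 100 per page."""
    start = (page - 1) * CHUNK_SIZE + 1
    end = min(page * CHUNK_SIZE, 250)
    return [{"number": n} for n in range(start, end + 1)]


chunks = list(extract_pr_chunks(fake_api))
```

Because each chunk is handed to Transform and Load before the next page is fetched, memory stays bounded by the chunk size rather than the repository's total PR count.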
+
+## Authentication
+
+### Google Cloud Authentication
+
+The script uses the BigQuery Python client library, which supports multiple
+authentication methods:
+
+1. **Service Account Key File** (Recommended for local development):
+
+   ```bash
+   export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
+   ```
+
+2. **Workload Identity** (Recommended for Kubernetes):
+   - Configure Workload Identity on your GKE cluster
+   - No explicit credentials file needed
+
+3. **Application Default Credentials** (For local development):
+
+   ```bash
+   gcloud auth application-default login
+   ```
 
 ## Development
 
 ### Local Development
 
-You can run the script directly with Python:
+Set up the environment variables and run the script:
 
 ```bash
+export GITHUB_REPOS="mozilla/firefox"
+export GITHUB_TOKEN="your_github_token"
+export BIGQUERY_PROJECT="your-gcp-project"
+export BIGQUERY_DATASET="your_dataset"
+
 python3 main.py
 ```
 
+### Local Testing with Docker Compose
+
+For local development and testing, you can use Docker Compose to run the ETL
+with mocked services (no GitHub API rate limits or GCP credentials required):
+
+```bash
+# Start all services (mock GitHub API, BigQuery emulator, and ETL)
+docker-compose up --build
+
+# View logs
+docker-compose logs -f github-etl
+
+# Stop services
+docker-compose down
+```
+
+This setup includes:
+
+- **Mock GitHub API**: Generates 250 sample pull requests
+- **BigQuery Emulator**: Local BigQuery instance for testing
+- **ETL Service**: Configured to use both mock services
+
 ### Adding Dependencies
 
 Add new Python packages to `requirements.txt` and rebuild the Docker image.
 
 ## License
 
-This project is licensed under the Mozilla Public License Version 2.0. See the [LICENSE](LICENSE) file for details.
+This project is licensed under the Mozilla Public License Version 2.0. See the
+[LICENSE](LICENSE) file for details.

data.yml

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+projects:
+  - id: test
+    datasets:
+      - id: github_etl
+        tables:
+          - id: pull_requests
+            columns:
+              - name: pull_request_id
+                type: INTEGER
+              - name: current_status
+                type: STRING
+              - name: date_created
+                type: TIMESTAMP
+              - name: date_modified
+                type: TIMESTAMP
+              - name: target_repository
+                type: STRING
+              - name: bug_id
+                type: INTEGER
+              - name: date_landed
+                type: TIMESTAMP
+              - name: date_approved
+                type: TIMESTAMP
+              - name: labels
+                type: STRING
+                mode: REPEATED
+              - name: snapshot_date
+                type: DATE
+          - id: commits
+            columns:
+              - name: pull_request_id
+                type: INTEGER
+              - name: target_repository
+                type: STRING
+              - name: commit_sha
+                type: STRING
+              - name: date_created
+                type: TIMESTAMP
+              - name: author_username
+                type: STRING
+              - name: author_email
+                type: STRING
+              - name: filename
+                type: STRING
+              - name: lines_removed
+                type: INTEGER
+              - name: lines_added
+                type: INTEGER
+              - name: snapshot_date
+                type: DATE
+          - id: reviewers
+            columns:
+              - name: pull_request_id
+                type: INTEGER
+              - name: target_repository
+                type: STRING
+              - name: date_reviewed
+                type: TIMESTAMP
+              - name: reviewer_email
+                type: STRING
+              - name: reviewer_username
+                type: STRING
+              - name: status
+                type: STRING
+              - name: snapshot_date
+                type: DATE
+          - id: comments
+            columns:
+              - name: pull_request_id
+                type: INTEGER
+              - name: target_repository
+                type: STRING
+              - name: comment_id
+                type: INTEGER
+              - name: date_created
+                type: TIMESTAMP
+              - name: author_email
+                type: STRING
+              - name: author_username
+                type: STRING
+              - name: character_count
+                type: INTEGER
+              - name: status
+                type: STRING
+              - name: snapshot_date
+                type: DATE
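The `pull_requests` table above defines the row shape the Transform phase must produce. A hypothetical flattening of a GitHub PR payload into that shape could look like this (the actual transform in `main.py` is not shown in this diff, and the source-field mapping here is an assumption):

```python
# Hypothetical sketch: flattening a GitHub PR payload into a row matching the
# pull_requests table defined in data.yml. The real transform in main.py is
# not shown in this diff; the field mapping below is an assumption.
from datetime import date


def pr_to_row(pr: dict, repo: str, snapshot: date) -> dict:
    return {
        "pull_request_id": pr["number"],
        "current_status": pr["state"],
        "date_created": pr.get("created_at"),
        "date_modified": pr.get("updated_at"),
        "target_repository": repo,
        "bug_id": None,          # would be parsed from the PR title/branch
        "date_landed": pr.get("merged_at"),
        "date_approved": None,   # would come from review data
        "labels": [l["name"] for l in pr.get("labels", [])],  # REPEATED STRING
        "snapshot_date": snapshot.isoformat(),
    }
```

Every table in the schema carries a `snapshot_date` column, matching the Load phase's behavior of stamping each row with the run date.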

docker-compose.yml

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+services:
+  # Mock GitHub API service for testing without rate limits
+  mock-github-api:
+    build:
+      context: .
+      dockerfile: Dockerfile.mock
+    ports:
+      - "5000:5000"
+    networks:
+      - github_etl
+
+  # BigQuery emulator for local testing
+  bigquery-emulator:
+    image: ghcr.io/goccy/bigquery-emulator:latest
+    platform: linux/amd64
+    ports:
+      - "9050:9050"
+      - "9060:9060"
+    volumes:
+      - ./data.yml:/data.yml
+    command: |
+      --project=test --data-from-yaml=/data.yml --log-level=debug
+    networks:
+      - github_etl
+
+  # GitHub ETL service
+  github-etl:
+    build: .
+    depends_on:
+      - mock-github-api
+      - bigquery-emulator
+    environment:
+      # GitHub Configuration
+      GITHUB_REPOS: "mozilla-firefox/firefox"
+      GITHUB_TOKEN: "" # Not needed for mock API
+
+      # Use the mock GitHub API instead of the real API
+      GITHUB_API_URL: "http://mock-github-api:5000"
+
+      # BigQuery Configuration
+      BIGQUERY_PROJECT: "test"
+      BIGQUERY_DATASET: "github_etl"
+
+      # Point to the BigQuery emulator
+      BIGQUERY_EMULATOR_HOST: "http://bigquery-emulator:9050"
+    volumes:
+      - ./main.py:/app/main.py
+    networks:
+      - github_etl
+
+networks:
+  github_etl:
+    driver: bridge
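For the `BIGQUERY_EMULATOR_HOST` variable above to take effect, the ETL has to point its BigQuery client at the emulator instead of the real service. How `main.py` does this is not shown in this diff; one plausible sketch, using the standard `google-cloud-bigquery` client's `api_endpoint` option (which is how the goccy/bigquery-emulator is typically addressed), is:

```python
# Sketch of emulator-aware client construction (an assumption; main.py is not
# shown in this diff). goccy/bigquery-emulator can be reached by pointing the
# standard google-cloud-bigquery client at it via the api_endpoint option.
import os


def bigquery_client_kwargs(env: dict) -> dict:
    """Build Client(...) kwargs from the environment set in docker-compose.yml."""
    kwargs = {"project": env["BIGQUERY_PROJECT"]}
    emulator = env.get("BIGQUERY_EMULATOR_HOST")
    if emulator:
        # Route all API calls to the emulator; no real GCP project is touched.
        # (In practice anonymous credentials may also be needed here.)
        kwargs["client_options"] = {"api_endpoint": emulator}
    return kwargs


def make_client(env=None):
    # Import deferred so the kwargs helper stays usable without the library.
    from google.cloud import bigquery
    return bigquery.Client(**bigquery_client_kwargs(dict(env or os.environ)))
```

With `BIGQUERY_EMULATOR_HOST` unset, the same code falls back to the library's normal credential discovery, so one code path serves both the Compose setup and production.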
