# An ETL for the Mozilla Organization Firefox repositories
This repository contains a Python-based ETL (Extract, Transform, Load) script designed to process pull request data from Mozilla Organization Firefox repositories on GitHub and load it into Google BigQuery. The application runs in a Docker container for easy deployment and isolation.
## Features

- Containerized: Runs in a Docker container using the latest stable Python
- Secure: Runs as a non-root user (`app`) inside the container
- Streaming Architecture: Processes pull requests in chunks of 100 for memory efficiency
- BigQuery Integration: Loads data directly into BigQuery using the Python client library
- Rate Limit Handling: Automatically handles GitHub API rate limits
- Comprehensive Logging: Detailed logging for monitoring and debugging
## Prerequisites

- GitHub Personal Access Token: Create a token
- Google Cloud Project: Set up a GCP project with BigQuery enabled
- BigQuery Dataset: Create a dataset in your GCP project (a programmatic sketch follows this list)
- Authentication: Configure GCP credentials (see the Authentication section below)
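If you prefer to create the dataset programmatically rather than in the console, here is a minimal sketch using the BigQuery client library (the project and dataset names are placeholders):

```python
# Minimal sketch: create the target dataset with the BigQuery client library.
# "your-gcp-project" and "your_dataset" are placeholders for your own values.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")
dataset = bigquery.Dataset("your-gcp-project.your_dataset")
dataset.location = "US"  # choose the location that matches your project

# exists_ok=True makes the call idempotent if the dataset already exists
client.create_dataset(dataset, exists_ok=True)
```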
## Quick Start

Build the image:

```bash
docker build -t github-etl .
```

Run the container:

```bash
docker run --rm \
  -e GITHUB_REPOS="mozilla/firefox" \
  -e GITHUB_TOKEN="your_github_token" \
  -e BIGQUERY_PROJECT="your-gcp-project" \
  -e BIGQUERY_DATASET="your_dataset" \
  -e GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json" \
  -v /local/path/to/credentials.json:/path/to/credentials.json \
  github-etl
```

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `GITHUB_REPOS` | Yes | - | Comma-separated repositories in "owner/repo" format (e.g., "mozilla/firefox") |
| `GITHUB_TOKEN` | No | - | GitHub Personal Access Token (recommended to avoid rate limits) |
| `BIGQUERY_PROJECT` | Yes | - | Google Cloud Project ID |
| `BIGQUERY_DATASET` | Yes | - | BigQuery dataset ID |
| `GOOGLE_APPLICATION_CREDENTIALS` | Yes* | - | Path to the GCP service account JSON file (*or use Workload Identity) |
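How the script reads this configuration is internal to `main.py`, but a minimal sketch of the kind of validation involved looks like this (the function name and structure are illustrative, not the script's actual API):

```python
# Illustrative config loading; the real main.py may differ.
# Required variables fail fast with a clear error message.
import os

def load_config() -> dict:
    config = {
        "repos": os.environ.get("GITHUB_REPOS", ""),
        "token": os.environ.get("GITHUB_TOKEN"),  # optional, avoids rate limits
        "project": os.environ.get("BIGQUERY_PROJECT", ""),
        "dataset": os.environ.get("BIGQUERY_DATASET", ""),
    }
    required = {"GITHUB_REPOS": "repos", "BIGQUERY_PROJECT": "project",
                "BIGQUERY_DATASET": "dataset"}
    missing = [env for env, key in required.items() if not config[key]]
    if missing:
        raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")
    # GITHUB_REPOS is comma separated: "owner/repo,owner/other-repo"
    config["repos"] = [r.strip() for r in config["repos"].split(",") if r.strip()]
    return config
```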
## Project Structure

- `main.py`: The main ETL script containing the business logic
- `requirements.txt`: Python dependencies
- `Dockerfile`: Container configuration
## Container Details

- Base Image: `python:3.14.2-slim` (latest stable Python)
- User: `app` (uid: 1000, gid: 1000)
- Working Directory: `/app`
- Ownership: All files in `/app` are owned by the `app` user
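A Dockerfile consistent with these details might look like the following (a sketch, not necessarily the repository's exact file):

```dockerfile
FROM python:3.14.2-slim

# Create the non-root user described above (uid/gid 1000)
RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid 1000 --create-home app

WORKDIR /app

# Install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and hand ownership to the app user
COPY --chown=app:app . .

USER app
CMD ["python3", "main.py"]
```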
## Architecture

The pipeline uses a streaming/chunked architecture that processes pull requests in batches of 100 (sketched in code at the end of this section):
- Extract: A generator yields chunks of 100 PRs from the GitHub API
  - Implements pagination and rate limit handling
  - Fetches all pull requests (open, closed, merged) sorted by creation date
- Transform: Flattens and structures PR data for BigQuery
  - Extracts key fields (number, title, state, timestamps, user info)
  - Flattens nested objects (user, head/base branches)
  - Converts arrays (labels, assignees) to JSON strings
- Load: Inserts transformed data into BigQuery
  - Uses the BigQuery Python client library
  - Adds a snapshot_date timestamp to all rows
  - Inserts each chunk immediately after it is transformed
Benefits of Chunked Processing:
- Memory-efficient for large repositories
- Incremental progress visibility
- Early failure detection
- Supports streaming data pipelines
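A condensed sketch of this flow, assuming the GitHub REST `pulls` endpoint and a `pull_requests` destination table (the helper names are made up for the example; `main.py` is the reference implementation):

```python
# Illustrative extract -> transform -> load flow in chunks of 100.
# Function and table names are examples only; see main.py for the real code.
import json
import os
import time
from datetime import datetime, timezone

import requests
from google.cloud import bigquery

CHUNK_SIZE = 100

def extract_chunks(repo: str, token: str | None):
    """Yield lists of up to CHUNK_SIZE pull requests from the paginated GitHub API."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    page = 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/pulls",
            params={"state": "all", "sort": "created", "per_page": CHUNK_SIZE, "page": page},
            headers=headers,
            timeout=30,
        )
        if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
            # Rate limited: wait until the window resets, then retry this page
            reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
            time.sleep(max(reset - time.time(), 1))
            continue
        resp.raise_for_status()
        prs = resp.json()
        if not prs:
            return  # no more pages
        yield prs
        page += 1

def transform(pr: dict) -> dict:
    """Flatten one PR into a BigQuery-friendly row."""
    return {
        "number": pr["number"],
        "title": pr["title"],
        "state": pr["state"],
        "created_at": pr["created_at"],
        "user_login": (pr.get("user") or {}).get("login"),  # flattened nested object
        "labels": json.dumps([l["name"] for l in pr.get("labels", [])]),  # array -> JSON string
    }

def load(client: bigquery.Client, table_id: str, rows: list[dict]) -> None:
    """Insert one transformed chunk, stamping every row with snapshot_date."""
    snapshot = datetime.now(timezone.utc).isoformat()
    for row in rows:
        row["snapshot_date"] = snapshot
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")

if __name__ == "__main__":
    client = bigquery.Client(project=os.environ["BIGQUERY_PROJECT"])
    table = f"{os.environ['BIGQUERY_PROJECT']}.{os.environ['BIGQUERY_DATASET']}.pull_requests"
    for chunk in extract_chunks("mozilla/firefox", os.environ.get("GITHUB_TOKEN")):
        load(client, table, [transform(pr) for pr in chunk])  # load right after transform
```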
## Authentication

The script uses the BigQuery Python client library, which supports multiple authentication methods:
- Service Account Key File (Recommended for local development):

  ```bash
  export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
  ```

- Workload Identity (Recommended for Kubernetes):
  - Configure Workload Identity on your GKE cluster
  - No explicit credentials file needed

- Application Default Credentials (For local development):

  ```bash
  gcloud auth application-default login
  ```
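In all three cases the BigQuery client resolves credentials on its own, so the script needs no credential-handling code:

```python
# The client resolves credentials via Application Default Credentials:
# GOOGLE_APPLICATION_CREDENTIALS, Workload Identity, or
# `gcloud auth application-default login` -- whichever the environment provides.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project ID
print(client.project)  # confirms which project the credentials resolved to
```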
## Running Locally

Set up the environment variables and run the script:

```bash
export GITHUB_REPOS="mozilla/firefox"
export GITHUB_TOKEN="your_github_token"
export BIGQUERY_PROJECT="your-gcp-project"
export BIGQUERY_DATASET="your_dataset"

python3 main.py
```

## Local Development with Docker Compose

For local development and testing, you can use Docker Compose to run the ETL with mocked services (no GitHub API rate limits or GCP credentials required):

```bash
# Start all services (mock GitHub API, BigQuery emulator, and ETL)
docker-compose up --build
# View logs
docker-compose logs -f github-etl
# Stop services
docker-compose down
```

This setup includes:
- Mock GitHub API: Generates 250 sample pull requests
- BigQuery Emulator: Local BigQuery instance for testing
- ETL Service: Configured to use both mock services
## Testing

The project includes a test suite using pytest. Tests are in the `tests/` directory.
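For a sense of shape, a minimal test might look like this (the imported `transform` function and its output fields are assumptions for illustration, not the suite's actual contents):

```python
# tests/test_transform.py -- illustrative only; the real suite may differ.
from main import transform  # assumes main.py exposes a transform() helper

def test_transform_flattens_user_login():
    pr = {
        "number": 1,
        "title": "Fix crash on startup",
        "state": "closed",
        "created_at": "2024-01-01T00:00:00Z",
        "user": {"login": "octocat"},
        "labels": [{"name": "bug"}],
    }
    row = transform(pr)
    assert row["number"] == 1
    assert row["user_login"] == "octocat"
```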
Install development dependencies:

```bash
pip install -e ".[dev]"
```

Run tests:

```bash
pytest
```

Run with coverage:

```bash
pytest --cov=. --cov-report=html
```

## Dependencies

Add new Python packages to `requirements.txt` and rebuild the Docker image.
## License

This project is licensed under the Mozilla Public License Version 2.0. See the LICENSE file for details.