
Commit 854944e

feat(etl): Script to export github data into bigquery
1 parent bada097 commit 854944e

File tree

7 files changed: +1160 -70 lines changed


Dockerfile.mock

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+# Dockerfile for mock GitHub API service
+FROM python:3.11-slim
+
+WORKDIR /app
+
+# Install Flask
+RUN pip install --no-cache-dir flask
+
+# Copy mock API script
+COPY mock_github_api.py .
+
+# Expose port
+EXPOSE 5000
+
+# Run the mock API
+CMD ["python", "mock_github_api.py"]
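The `mock_github_api.py` script that this image copies is part of the same commit but not shown in this diff. As a rough sketch of what such a mock has to provide — GitHub-style pagination over generated pull requests — the data and paging logic might look like the following (the endpoint shape and field names are assumptions; the real script wraps this in Flask routes):

```python
# Hypothetical sketch of the data behind mock_github_api.py (not shown in this
# diff). The real script serves these payloads via Flask; only the generation
# and pagination logic is sketched here.
import json
from datetime import datetime, timedelta, timezone

TOTAL_PRS = 250   # README: the mock generates 250 sample pull requests
PER_PAGE = 100    # matches the ETL's chunk size


def make_pr(number: int) -> dict:
    """Build one GitHub-like pull request payload (fields are assumptions)."""
    created = datetime(2024, 1, 1, tzinfo=timezone.utc) + timedelta(hours=number)
    return {
        "number": number,
        "title": f"Sample PR #{number}",
        "state": "closed" if number % 2 else "open",
        "created_at": created.isoformat(),
        "user": {"login": f"user{number % 7}"},
        "labels": [{"name": "test"}],
    }


def list_pulls(page: int = 1, per_page: int = PER_PAGE) -> str:
    """JSON body for GET /repos/<owner>/<repo>/pulls?page=N&per_page=M."""
    start = (page - 1) * per_page + 1
    end = min(page * per_page, TOTAL_PRS)
    prs = [make_pr(n) for n in range(start, end + 1)]
    return json.dumps(prs)
```

With 250 PRs and `per_page=100`, pages 1-3 return 100, 100, and 50 items, and page 4 returns an empty list — which is how the ETL's pagination loop knows to stop.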

README.md

Lines changed: 110 additions & 9 deletions
@@ -1,19 +1,32 @@
 # github-etl
+
 An ETL for the Mozilla Organization Firefox repositories
 
 ## Overview
 
-This repository contains a Python-based ETL (Extract, Transform, Load) script designed to process data from Mozilla Organization Firefox repositories on GitHub. The application runs in a Docker container for easy deployment and isolation.
+This repository contains a Python-based ETL (Extract, Transform, Load) script
+designed to process pull request data from Mozilla Organization Firefox
+repositories on GitHub and load it into Google BigQuery. The application
+runs in a Docker container for easy deployment and isolation.
 
 ## Features
 
 - **Containerized**: Runs in a Docker container using the latest stable Python
 - **Secure**: Runs as a non-root user (`app`) inside the container
-- **Structured**: Follows ETL patterns with separate extract, transform, and load phases
-- **Logging**: Comprehensive logging for monitoring and debugging
+- **Streaming Architecture**: Processes pull requests in chunks of 100 for memory efficiency
+- **BigQuery Integration**: Loads data directly into BigQuery using the Python client library
+- **Rate Limit Handling**: Automatically handles GitHub API rate limits
+- **Comprehensive Logging**: Detailed logging for monitoring and debugging
 
 ## Quick Start
 
+### Prerequisites
+
+1. **GitHub Personal Access Token**: Create a [token](https://github.com/settings/tokens)
+2. **Google Cloud Project**: Set up a GCP project with BigQuery enabled
+3. **BigQuery Dataset**: Create a dataset in your GCP project
+4. **Authentication**: Configure GCP credentials (see Authentication section below)
+
 ### Building the Docker Image
 
 ```bash
@@ -23,9 +36,26 @@ docker build -t github-etl .
 ### Running the Container
 
 ```bash
-docker run --rm github-etl
+docker run --rm \
+  -e GITHUB_REPOS="mozilla/firefox" \
+  -e GITHUB_TOKEN="your_github_token" \
+  -e BIGQUERY_PROJECT="your-gcp-project" \
+  -e BIGQUERY_DATASET="your_dataset" \
+  -e GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json" \
+  -v /local/path/to/credentials.json:/path/to/credentials.json \
+  github-etl
 ```
 
+### Environment Variables
+
+| Variable | Required | Default | Description |
+|----------|----------|---------|-------------|
+| `GITHUB_REPOS` | Yes | - | Comma-separated repositories in the format "owner/repo" (e.g., "mozilla/firefox") |
+| `GITHUB_TOKEN` | No | - | GitHub Personal Access Token (recommended to avoid rate limits) |
+| `BIGQUERY_PROJECT` | Yes | - | Google Cloud Project ID |
+| `BIGQUERY_DATASET` | Yes | - | BigQuery dataset ID |
+| `GOOGLE_APPLICATION_CREDENTIALS` | Yes* | - | Path to GCP service account JSON file (*or use Workload Identity) |
+
 ## Architecture
 
 ### Components
@@ -43,24 +73,95 @@ docker run --rm github-etl
 
 ### ETL Process
 
-1. **Extract**: Retrieves data from GitHub repositories
-2. **Transform**: Processes and structures the data
-3. **Load**: Stores the processed data in the target destination
+The pipeline uses a **streaming/chunked architecture** that processes pull
+requests in batches of 100:
+
+1. **Extract**: A generator yields chunks of 100 PRs from the GitHub API
+   - Implements pagination and rate limit handling
+   - Fetches all pull requests (open, closed, merged) sorted by creation date
+
+2. **Transform**: Flattens and structures PR data for BigQuery
+   - Extracts key fields (number, title, state, timestamps, user info)
+   - Flattens nested objects (user, head/base branches)
+   - Converts arrays (labels, assignees) to JSON strings
+
+3. **Load**: Inserts transformed data into BigQuery
+   - Uses the BigQuery Python client library
+   - Adds a snapshot_date timestamp to all rows
+   - Inserts immediately after each chunk is transformed
+
+**Benefits of Chunked Processing**:
+
+- Memory-efficient for large repositories
+- Incremental progress visibility
+- Early failure detection
+- Supports streaming data pipelines
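The chunked Extract phase described above can be sketched as a generator. This is an assumed shape — `main.py` is not shown in this diff — with `fetch_page` standing in for the real GitHub API call (pagination and rate-limit handling would live inside it):

```python
# Sketch of the chunked Extract phase (assumed shape; main.py is not shown
# in this diff).
from typing import Callable, Iterator

CHUNK_SIZE = 100  # README: PRs are processed in chunks of 100


def extract_pr_chunks(fetch_page: Callable[[int], list]) -> Iterator[list]:
    """Yield pull requests one page (= one chunk of up to CHUNK_SIZE) at a time.

    fetch_page(page) returns one page of PR dicts from the GitHub API and an
    empty list once the pages are exhausted.
    """
    page = 1
    while True:
        prs = fetch_page(page)
        if not prs:
            return        # no more pages: extraction is done
        yield prs         # caller transforms and loads this chunk immediately
        page += 1


def fake_api(page: int) -> list:
    """Stand-in for the GitHub API: 250 fake PRs, 100 per page."""
    start = (page - 1) * CHUNK_SIZE + 1
    end = min(page * CHUNK_SIZE, 250)
    return [{"number": n} for n in range(start, end + 1)]


chunks = list(extract_pr_chunks(fake_api))
```

Because each chunk is handed to Transform and Load before the next page is fetched, memory stays bounded by the chunk size rather than the repository's total PR count.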
+
+## Authentication
+
+### Google Cloud Authentication
+
+The script uses the BigQuery Python client library, which supports multiple
+authentication methods:
+
+1. **Service Account Key File** (Recommended for local development):
+
+   ```bash
+   export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
+   ```
+
+2. **Workload Identity** (Recommended for Kubernetes):
+   - Configure Workload Identity on your GKE cluster
+   - No explicit credentials file needed
+
+3. **Application Default Credentials** (For local development):
+
+   ```bash
+   gcloud auth application-default login
+   ```
 
 ## Development
 
 ### Local Development
 
-You can run the script directly with Python:
+Set up the environment variables and run the script:
 
 ```bash
+export GITHUB_REPOS="mozilla/firefox"
+export GITHUB_TOKEN="your_github_token"
+export BIGQUERY_PROJECT="your-gcp-project"
+export BIGQUERY_DATASET="your_dataset"
+
 python3 main.py
 ```
 
+### Local Testing with Docker Compose
+
+For local development and testing, you can use Docker Compose to run the ETL
+with mocked services (no GitHub API rate limits or GCP credentials required):
+
+```bash
+# Start all services (mock GitHub API, BigQuery emulator, and ETL)
+docker-compose up --build
+
+# View logs
+docker-compose logs -f github-etl
+
+# Stop services
+docker-compose down
+```
+
+This setup includes:
+
+- **Mock GitHub API**: Generates 250 sample pull requests
+- **BigQuery Emulator**: Local BigQuery instance for testing
+- **ETL Service**: Configured to use both mock services
+
 ### Adding Dependencies
 
 Add new Python packages to `requirements.txt` and rebuild the Docker image.
 
 ## License
 
-This project is licensed under the Mozilla Public License Version 2.0. See the [LICENSE](LICENSE) file for details.
+This project is licensed under the Mozilla Public License Version 2.0. See the
+[LICENSE](LICENSE) file for details.

data.yml

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+projects:
+  - id: test
+    datasets:
+      - id: github_etl
+        tables:
+          - id: pull_requests
+            columns:
+              - name: pull_request_id
+                type: INTEGER
+              - name: current_status
+                type: STRING
+              - name: date_created
+                type: TIMESTAMP
+              - name: date_modified
+                type: TIMESTAMP
+              - name: target_repository
+                type: STRING
+              - name: bug_id
+                type: INTEGER
+              - name: date_landed
+                type: TIMESTAMP
+              - name: date_approved
+                type: TIMESTAMP
+              - name: labels
+                type: STRING
+                mode: REPEATED
+              - name: snapshot_date
+                type: DATE
+          - id: commits
+            columns:
+              - name: pull_request_id
+                type: INTEGER
+              - name: target_repository
+                type: STRING
+              - name: commit_sha
+                type: STRING
+              - name: date_created
+                type: TIMESTAMP
+              - name: author_username
+                type: STRING
+              - name: author_email
+                type: STRING
+              - name: filename
+                type: STRING
+              - name: lines_removed
+                type: INTEGER
+              - name: lines_added
+                type: INTEGER
+              - name: snapshot_date
+                type: DATE
+          - id: reviewers
+            columns:
+              - name: pull_request_id
+                type: INTEGER
+              - name: target_repository
+                type: STRING
+              - name: date_reviewed
+                type: TIMESTAMP
+              - name: reviewer_email
+                type: STRING
+              - name: reviewer_username
+                type: STRING
+              - name: status
+                type: STRING
+              - name: snapshot_date
+                type: DATE
+          - id: comments
+            columns:
+              - name: pull_request_id
+                type: INTEGER
+              - name: target_repository
+                type: STRING
+              - name: comment_id
+                type: INTEGER
+              - name: date_created
+                type: TIMESTAMP
+              - name: author_email
+                type: STRING
+              - name: author_username
+                type: STRING
+              - name: character_count
+                type: INTEGER
+              - name: status
+                type: STRING
+              - name: snapshot_date
+                type: DATE
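The `pull_requests` table above defines the row shape the Transform phase must produce. A hypothetical flattening of a GitHub PR payload into that shape could look like this (the actual transform in `main.py` is not shown in this diff, and the source-field mapping here is an assumption):

```python
# Hypothetical sketch: flattening a GitHub PR payload into a row matching the
# pull_requests table defined in data.yml. The real transform in main.py is
# not shown in this diff; the field mapping below is an assumption.
from datetime import date


def pr_to_row(pr: dict, repo: str, snapshot: date) -> dict:
    return {
        "pull_request_id": pr["number"],
        "current_status": pr["state"],
        "date_created": pr.get("created_at"),
        "date_modified": pr.get("updated_at"),
        "target_repository": repo,
        "bug_id": None,          # would be parsed from the PR title/branch
        "date_landed": pr.get("merged_at"),
        "date_approved": None,   # would come from review data
        "labels": [l["name"] for l in pr.get("labels", [])],  # REPEATED STRING
        "snapshot_date": snapshot.isoformat(),
    }
```

Every table in the schema carries a `snapshot_date` column, matching the Load phase's behavior of stamping each row with the run date.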

docker-compose.yml

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+services:
+  # Mock GitHub API service for testing without rate limits
+  mock-github-api:
+    build:
+      context: .
+      dockerfile: Dockerfile.mock
+    ports:
+      - "5000:5000"
+    networks:
+      - github_etl
+
+  # BigQuery emulator for local testing
+  bigquery-emulator:
+    image: ghcr.io/goccy/bigquery-emulator:latest
+    platform: linux/amd64
+    ports:
+      - "9050:9050"
+      - "9060:9060"
+    volumes:
+      - ./data.yml:/data.yml
+    command: |
+      --project=test --data-from-yaml=/data.yml --log-level=debug
+    networks:
+      - github_etl
+
+  # GitHub ETL service
+  github-etl:
+    build: .
+    depends_on:
+      - mock-github-api
+      - bigquery-emulator
+    environment:
+      # GitHub Configuration
+      GITHUB_REPOS: "mozilla-firefox/firefox"
+      GITHUB_TOKEN: "" # Not needed for mock API
+
+      # Use the mock GitHub API instead of the real API
+      GITHUB_API_URL: "http://mock-github-api:5000"
+
+      # BigQuery Configuration
+      BIGQUERY_PROJECT: "test"
+      BIGQUERY_DATASET: "github_etl"
+
+      # Point to the BigQuery emulator
+      BIGQUERY_EMULATOR_HOST: "http://bigquery-emulator:9050"
+    volumes:
+      - ./main.py:/app/main.py
+    networks:
+      - github_etl
+
+networks:
+  github_etl:
+    driver: bridge
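For the `BIGQUERY_EMULATOR_HOST` variable above to take effect, the ETL has to point its BigQuery client at the emulator instead of the real service. How `main.py` does this is not shown in this diff; one plausible sketch, using the standard `google-cloud-bigquery` client's `api_endpoint` option (which is how the goccy/bigquery-emulator is typically addressed), is:

```python
# Sketch of emulator-aware client construction (an assumption; main.py is not
# shown in this diff). goccy/bigquery-emulator can be reached by pointing the
# standard google-cloud-bigquery client at it via the api_endpoint option.
import os


def bigquery_client_kwargs(env: dict) -> dict:
    """Build Client(...) kwargs from the environment set in docker-compose.yml."""
    kwargs = {"project": env["BIGQUERY_PROJECT"]}
    emulator = env.get("BIGQUERY_EMULATOR_HOST")
    if emulator:
        # Route all API calls to the emulator; no real GCP project is touched.
        # (In practice anonymous credentials may also be needed here.)
        kwargs["client_options"] = {"api_endpoint": emulator}
    return kwargs


def make_client(env=None):
    # Import deferred so the kwargs helper stays usable without the library.
    from google.cloud import bigquery
    return bigquery.Client(**bigquery_client_kwargs(dict(env or os.environ)))
```

With `BIGQUERY_EMULATOR_HOST` unset, the same code falls back to the library's normal credential discovery, so one code path serves both the Compose setup and production.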
