README.md: 110 additions & 9 deletions
@@ -1,19 +1,32 @@
 # github-etl
+
 An ETL for the Mozilla Organization Firefox repositories

 ## Overview

-This repository contains a Python-based ETL (Extract, Transform, Load) script designed to process data from Mozilla Organization Firefox repositories on GitHub. The application runs in a Docker container for easy deployment and isolation.
+This repository contains a Python-based ETL (Extract, Transform, Load) script
+designed to process pull request data from Mozilla Organization Firefox
+repositories on GitHub and load them into Google BigQuery. The application
+runs in a Docker container for easy deployment and isolation.

 ## Features

 - **Containerized**: Runs in a Docker container using the latest stable Python
 - **Secure**: Runs as a non-root user (`app`) inside the container
-- **Structured**: Follows ETL patterns with separate extract, transform, and load phases
-- **Logging**: Comprehensive logging for monitoring and debugging
+- **Streaming Architecture**: Processes pull requests in chunks of 100 for memory efficiency
+- **BigQuery Integration**: Loads data directly into BigQuery using the Python client library
+- **Rate Limit Handling**: Automatically handles GitHub API rate limits
+- **Comprehensive Logging**: Detailed logging for monitoring and debugging

 ## Quick Start

+### Prerequisites
+
+1. **GitHub Personal Access Token**: Create a [token](https://github.com/settings/tokens)
+2. **Google Cloud Project**: Set up a GCP project with BigQuery enabled
+3. **BigQuery Dataset**: Create a dataset in your GCP project
+4. **Authentication**: Configure GCP credentials (see Authentication section below)
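
The Streaming Architecture bullet added in this diff says pull requests are processed in chunks of 100 for memory efficiency. The sketch below shows one way such batching could look. It is not taken from the repository: the names `chunked` and `CHUNK_SIZE` and the sample records are hypothetical, and the actual script may fetch and load data differently.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Chunk size stated in the README; the constant name itself is hypothetical.
CHUNK_SIZE = 100


def chunked(items: Iterable[T], size: int = CHUNK_SIZE) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items, consuming the input lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for pull requests streamed from the GitHub API.
pull_requests = ({"number": n} for n in range(1, 251))

# In an ETL run, each batch would be transformed and loaded before the next is read.
batch_sizes = [len(batch) for batch in chunked(pull_requests)]
print(batch_sizes)  # prints [100, 100, 50]
```

Because `chunked` is a generator over a lazy input, only one batch of records is held in memory at a time, which is the property the bullet attributes to the streaming design.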