
Commit 1a4e481

Merge pull request #12 from JarvusInnovations/themightychris/python-download
Simplify workflow and improve beginner experience
2 parents 0695a91 + 3780b80 commit 1a4e481

23 files changed: +643 −540 lines changed

.devcontainer/devcontainer.json

Lines changed: 1 addition & 1 deletion
````diff
@@ -11,7 +11,7 @@
     "ghcr.io/eitsupi/devcontainer-features/duckdb-cli:1": {}
   },
 
-  "postCreateCommand": "uv sync && uv run dbt deps",
+  "postCreateCommand": "uv sync && uv run dbt deps && uv run python scripts/download_data.py --defaults",
 
   "customizations": {
     "vscode": {
````
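The updated `postCreateCommand` chains three setup steps with `&&`, so a failure in any step stops the ones after it. A minimal Python sketch of that fail-fast behavior; the placeholder commands stand in for the real `uv` invocations and are not the actual setup commands:

```python
import subprocess
import sys

def run_chain(commands):
    """Run commands in order, stopping at the first non-zero exit,
    mirroring the shell's `a && b && c` short-circuit behavior."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return False  # later commands never run
    return True

if __name__ == "__main__":
    # Placeholders standing in for `uv sync`, `uv run dbt deps`, etc.
    ok = run_chain([
        [sys.executable, "-c", "print('step 1')"],
        [sys.executable, "-c", "print('step 2')"],
    ])
    print("setup complete" if ok else "setup failed")
```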

.github/workflows/ci.yml

Lines changed: 6 additions & 3 deletions
````diff
@@ -51,6 +51,9 @@ jobs:
 - name: Install dbt dependencies
   run: uv run dbt deps
 
+- name: Download sample data
+  run: uv run python scripts/download_data.py --defaults
+
 - name: Load seed data
   run: uv run dbt seed
 
@@ -65,8 +68,8 @@ jobs:
 
 - name: Verify database was created
   run: |
-    if [ ! -f workshop.duckdb ]; then
-      echo "Error: workshop.duckdb was not created"
+    if [ ! -f sandbox.duckdb ]; then
+      echo "Error: sandbox.duckdb was not created"
       exit 1
     fi
-    echo "workshop.duckdb exists ($(stat -c%s workshop.duckdb) bytes)"
+    echo "sandbox.duckdb exists ($(stat -c%s sandbox.duckdb) bytes)"
````
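The CI verification step is plain shell; the same existence-and-size check can be sketched in Python for anyone scripting it elsewhere (the `verify_database` helper is illustrative, not part of the repo):

```python
import os

def verify_database(path="sandbox.duckdb"):
    """Return True if the DuckDB file exists, printing its size,
    mirroring the CI 'Verify database was created' step."""
    if not os.path.isfile(path):
        print(f"Error: {path} was not created")
        return False
    size = os.path.getsize(path)
    print(f"{path} exists ({size} bytes)")
    return True
```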

README.md

Lines changed: 50 additions & 116 deletions
````diff
@@ -1,25 +1,15 @@
-# GTFS-RT DuckDB Workshop
+# GTFS-RT Sandbox
 
-Query real-time transit data using DuckDB and dbt.
+A sandbox environment for exploring transit operational data transformation patterns using DuckDB and dbt. Part of the **Common Transit Operations Data Framework**, this demo shows how raw operational data can be transformed into [TIDES](https://tides-transit.org/)-compliant analytics tables using architectural patterns that scale from a laptop to enterprise cloud infrastructure.
 
-## Overview
-
-This workshop demonstrates how to query GTFS Realtime parquet data from a public GCS bucket using DuckDB's httpfs extension and dbt for data transformation.
-
-**Data source**: `gs://parquet.gtfsrt.io/` (also available at <http://parquet.gtfsrt.io/>)
-
-Three feed types are available:
-
-- **vehicle_positions** - Real-time vehicle locations
-- **trip_updates** - Arrival/departure predictions
-- **service_alerts** - Service disruption notices
+This sandbox uses publicly available GTFS-RT feeds as source data. In production, you would typically use raw AVL system exports which contain richer data, but GTFS-RT provides an accessible starting point for learning the patterns.
 
 ## Quick Start
 
 ### Option 1: GitHub Codespaces (Recommended)
 
 1. Click the green "Code" button → "Open with Codespaces"
-2. Wait for the container to build (~2 minutes)
+2. Wait for setup (~3 minutes, includes sample data download)
 3. Run dbt:
 
    ```bash
@@ -29,7 +19,7 @@ Three feed types are available:
 4. Query your data:
 
    ```bash
-   duckdb workshop.duckdb -ui
+   duckdb sandbox.duckdb -ui
   ```
 
 ### Option 2: Local Setup
@@ -44,147 +34,88 @@ Three feed types are available:
 git clone https://github.com/JarvusInnovations/gtfsrt-sandbox.git
 cd gtfsrt-sandbox
 
-# Install Python dependencies
-uv sync
-uv run dbt deps
+# Install dependencies
+uv sync && uv run dbt deps
+
+# Download sample data (~30 seconds)
+uv run python scripts/download_data.py --defaults
 
-# Run dbt to download and transform data
+# Run dbt to create views
 uv run dbt run
 
 # Query the data
-duckdb workshop.duckdb -ui
+duckdb sandbox.duckdb -ui
 ```
 
 > **Note:** If you get a "Failed to download extension" error with `-ui`, see [DuckDB UI Extension Error](docs/troubleshooting.md#duckdb-ui-extension-error).
 
-## Choosing a Feed
-
-Available feeds are listed in `seeds/available_feeds.csv`. To use a different feed:
+## How It Works
 
-```bash
-# View available feeds
-duckdb -c "SELECT * FROM read_csv_auto('seeds/available_feeds.csv')"
-
-# Run dbt with specific feeds (one variable per feed type)
-uv run dbt run --vars '{
-  "vehicle_positions_feed": "aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L3ZlaGljbGVwb3NpdGlvbnM_YWdlbmN5PVND",
-  "trip_updates_feed": "aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L3RyaXB1cGRhdGVzP2FnZW5jeT1TQw",
-  "service_alerts_feed": "aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L3NlcnZpY2VhbGVydHM_YWdlbmN5PVND",
-  "start_date": "2026-01-04",
-  "end_date": "2026-01-04"
-}'
-```
+This sandbox uses a two-phase approach:
 
-### Feed Examples
+1. **Download data** (`download_data.py`) - fetches parquet files to `data/`
+2. **Transform data** (`dbt run`) - creates views in DuckDB reading from local files
 
-| Agency | Feed Type | base64url |
-|--------|-----------|-----------|
-| SEPTA Regional Rail | vehicle_positions | `aHR0cHM6Ly93d3czLnNlcHRhLm9yZy9ndGZzcnQvc2VwdGEtcGEtdXMvVmVoaWNsZS9ydFZlaGljbGVQb3NpdGlvbi5wYg` |
-| 511.org SC | vehicle_positions | `aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L3ZlaGljbGVwb3NpdGlvbnM_YWdlbmN5PVND` |
-| AC Transit | vehicle_positions | `aHR0cHM6Ly9hcGkuYWN0cmFuc2l0Lm9yZy90cmFuc2l0L2d0ZnNydC92ZWhpY2xlcw` |
-| Metrolink | vehicle_positions | `aHR0cHM6Ly9tZXRyb2xpbmstZ3Rmc3J0Lmdic2RpZ2l0YWwudXMvZmVlZC9ndGZzcnQtdmVoaWNsZXM` |
+This separation keeps dbt runs fast and makes the workflow easier to understand.
 
 ## Project Structure
 
 ```
 gtfsrt-sandbox/
-├── dbt_project.yml            # dbt configuration
-├── profiles.yml               # DuckDB connection settings
+├── data/                      # Downloaded parquet data (gitignored)
+│   ├── vehicle_positions/
+│   ├── trip_updates/
+│   └── service_alerts/
 ├── models/
-│   ├── staging/               # Data download & caching
+│   ├── staging/               # Views reading from data/
 │   │   ├── stg_vehicle_positions.sql
 │   │   ├── stg_trip_updates.sql
 │   │   └── stg_service_alerts.sql
-│   └── marts/                 # Analytics views
-│       ├── feed_summary.sql
-│       └── vehicle_activity.sql
-├── macros/
-│   └── read_gtfs_parquet.sql  # URL generation macro
-├── seeds/
-│   └── available_feeds.csv    # List of available feeds
+│   ├── intermediate/          # Transformations
+│   └── marts/                 # Analytics views
 └── scripts/
-    ├── explore_feeds.sql      # Direct DuckDB queries
-    ├── generate_feed_list.py  # Refresh feed list
-    └── prefetch_data.py       # Pre-download for offline use
+    └── download_data.py       # Data download script
 ```
 
-## How It Works
-
-1. **Staging models** download parquet data from the public GCS bucket
-2. Data is **cached locally** in `workshop.duckdb` as tables
-3. **Mart models** are views that query the cached staging tables
-4. Subsequent queries use **local data** (no repeated downloads)
-
-To refresh data: `uv run dbt run --full-refresh`
+## Downloading Different Data
 
-## Direct DuckDB Queries
+### See what's available
 
-You can query the data directly without dbt using `gs://` URLs with glob patterns:
-
-```sql
--- Start DuckDB CLI
-duckdb
-
--- Load httpfs extension
-INSTALL httpfs;
-LOAD httpfs;
-
--- Query with glob pattern (all dates for a feed)
-SELECT date, COUNT(*) as records
-FROM read_parquet(
-  'gs://parquet.gtfsrt.io/vehicle_positions/date=*/base64url=aHR0cHM6Ly93d3czLnNlcHRhLm9yZy9ndGZzcnQvc2VwdGEtcGEtdXMvVmVoaWNsZS9ydFZlaGljbGVQb3NpdGlvbi5wYg/data.parquet',
-  hive_partitioning=true
-)
-GROUP BY date;
-
--- Query all feeds for a date
-SELECT base64url, COUNT(*) as records
-FROM read_parquet(
-  'gs://parquet.gtfsrt.io/vehicle_positions/date=2026-01-04/base64url=*/data.parquet',
-  hive_partitioning=true
-)
-GROUP BY base64url;
+```bash
+uv run python scripts/download_data.py --list
 ```
 
-**Key advantage**: `gs://` URLs support glob patterns (`*`) for directory listing, while `http://` URLs do not.
-
-See `scripts/explore_feeds.sql` for more examples.
+### Download a different agency
 
-## Offline Use
+```bash
+uv run python scripts/download_data.py --agency septa --date 2026-01-20
+```
 
-To pre-download data for offline use:
+### Use a different date
 
 ```bash
-uv run python scripts/prefetch_data.py \
-  --feed-type vehicle_positions \
-  --feed-base64 aHR0cHM6Ly93d3czLnNlcHRhLm9yZy9ndGZzcnQvc2VwdGEtcGEtdXMvVmVoaWNsZS9ydFZlaGljbGVQb3NpdGlvbi5wYg \
-  --start-date 2026-01-01 \
-  --end-date 2026-01-07
+uv run python scripts/download_data.py --defaults --date 2026-01-20
 ```
 
-Files are saved to `data/` with the same Hive partition structure.
+See [docs/downloading_data.md](docs/downloading_data.md) for advanced options.
 
 ## Useful Commands
 
 ```bash
+# Download sample data
+uv run python scripts/download_data.py --defaults
+
 # Run all models
 uv run dbt run
 
 # Run specific model
 uv run dbt run --select stg_vehicle_positions
 
-# Force re-download (full refresh)
-uv run dbt run --full-refresh
-
-# Load seed data
-uv run dbt seed
-
-# Generate docs
-uv run dbt docs generate
-uv run dbt docs serve
+# Generate and view docs
+uv run dbt docs generate && uv run dbt docs serve
 
 # Query the database
-duckdb workshop.duckdb
+duckdb sandbox.duckdb -ui
```
 
 ## Data Schema
@@ -193,7 +124,8 @@ duckdb workshop.duckdb
 
 | Column | Type | Description |
 |--------|------|-------------|
-| partition_date | date | Date partition (from Hive partitioning) |
+| partition_date | date | Date partition |
+| feed_base64 | string | Base64url-encoded feed URL |
 | feed_timestamp | timestamp | When the feed was fetched |
 | vehicle_id | string | Vehicle identifier |
 | trip_id | string | Trip identifier |
@@ -206,7 +138,8 @@ duckdb workshop.duckdb
 
 | Column | Type | Description |
 |--------|------|-------------|
-| partition_date | date | Date partition (from Hive partitioning) |
+| partition_date | date | Date partition |
+| feed_base64 | string | Base64url-encoded feed URL |
 | feed_timestamp | timestamp | When the feed was fetched |
 | trip_id | string | Trip identifier |
 | stop_id | string | Stop identifier |
@@ -217,7 +150,8 @@ duckdb workshop.duckdb
 
 | Column | Type | Description |
 |--------|------|-------------|
-| partition_date | date | Date partition (from Hive partitioning) |
+| partition_date | date | Date partition |
+| feed_base64 | string | Base64url-encoded feed URL |
 | feed_timestamp | timestamp | When the feed was fetched |
 | header_text | string | Alert title |
 | description_text | string | Alert details |
@@ -226,8 +160,8 @@ duckdb workshop.duckdb
 
 ## Need Help?
 
-See [docs/troubleshooting.md](docs/troubleshooting.md) for common issues and solutions.
+See [docs/troubleshooting.md](docs/troubleshooting.md) for common issues and solutions, or [open an issue](https://github.com/JarvusInnovations/gtfsrt-sandbox/issues) if you're stuck.
 
 ## License
 
-Data sourced from public GTFS-RT feeds. Workshop materials are MIT licensed.
+Data sourced from public GTFS-RT feeds. Sandbox materials are MIT licensed.
````
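The new `feed_base64` column in each schema table holds the source feed URL encoded as base64url without padding, the same scheme used for the `base64url=` partition keys shown elsewhere in the diff. A sketch of producing and reversing such tokens with Python's standard library; the helper names are mine, not the repo's:

```python
import base64

def encode_feed_url(url: str) -> str:
    """Encode a feed URL as base64url with the '=' padding stripped."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")

def decode_feed_url(token: str) -> str:
    """Reverse the encoding, restoring padding to a multiple of 4."""
    padded = token + "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(padded).decode()

if __name__ == "__main__":
    # AC Transit vehicle positions feed, one of the example feeds in the old README
    url = "https://api.actransit.org/transit/gtfsrt/vehicles"
    token = encode_feed_url(url)
    print(token)
    assert decode_feed_url(token) == url
```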

dbt_project.yml

Lines changed: 1 addition & 17 deletions
````diff
@@ -7,32 +7,16 @@ profile: 'gtfsrt_sandbox'
 model-paths: ["models"]
 seed-paths: ["seeds"]
 macro-paths: ["macros"]
-docs-paths: ["docs"]
+docs-paths: ["docs/data"]
 target-path: "target"
 clean-targets:
   - "target"
   - "dbt_packages"
 
-vars:
-  # Feed URLs encoded as base64url (no padding)
-  # Default: AC Transit feeds (smaller, reliable, 106 routes)
-  vehicle_positions_feed: 'aHR0cHM6Ly9hcGkuYWN0cmFuc2l0Lm9yZy90cmFuc2l0L2d0ZnNydC92ZWhpY2xlcw'
-  trip_updates_feed: 'aHR0cHM6Ly9hcGkuYWN0cmFuc2l0Lm9yZy90cmFuc2l0L2d0ZnNydC90cmlwdXBkYXRlcw'
-  service_alerts_feed: 'aHR0cHM6Ly9hcGkuYWN0cmFuc2l0Lm9yZy90cmFuc2l0L2d0ZnNydC9hbGVydHM'
-  # directory prefix for where DuckDB should look for raw parquet files
-  # if it starts with gs://, will look to GCS
-  parquet_prefix: 'gs://parquet.gtfsrt.io'
-
-  # Date range for data query (adjust to available data)
-  start_date: '2026-01-24'
-  end_date: '2026-01-24'
-
 models:
   gtfsrt_sandbox:
     staging:
       +materialized: view
-      base:
-        +materialized: incremental
     intermediate:
       +materialized: view
     marts:
````
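The removed `parquet_prefix` variable pointed DuckDB at `gs://parquet.gtfsrt.io`, where parquet files sit under a `{feed_type}/date=YYYY-MM-DD/base64url={token}/data.parquet` Hive partition layout (the pattern visible in the old README's glob queries). A small sketch of assembling such a path; the function is hypothetical, not part of the repo:

```python
def parquet_path(prefix: str, feed_type: str, date: str, base64url: str) -> str:
    """Build a Hive-partitioned parquet path of the kind the old staging
    models read, under either gs://parquet.gtfsrt.io or a local data/ dir."""
    return f"{prefix}/{feed_type}/date={date}/base64url={base64url}/data.parquet"

if __name__ == "__main__":
    # The date and token here are placeholders, not real partition values.
    print(parquet_path("gs://parquet.gtfsrt.io", "vehicle_positions",
                       "2026-01-24", "abc123"))
```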
File renamed without changes.
File renamed without changes.
