
Commit 1a4e481

Merge pull request #12 from JarvusInnovations/themightychris/python-download
Simplify workflow and improve beginner experience
2 parents 0695a91 + 3780b80 commit 1a4e481

23 files changed: +643 −540 lines changed

.devcontainer/devcontainer.json

Lines changed: 1 addition & 1 deletion
````diff
@@ -11,7 +11,7 @@
     "ghcr.io/eitsupi/devcontainer-features/duckdb-cli:1": {}
   },
 
-  "postCreateCommand": "uv sync && uv run dbt deps",
+  "postCreateCommand": "uv sync && uv run dbt deps && uv run python scripts/download_data.py --defaults",
 
   "customizations": {
     "vscode": {
````
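The updated `postCreateCommand` chains three setup steps with `&&`, so a failure in any step stops the ones after it. A minimal Python sketch of that fail-fast behavior; the placeholder commands stand in for the real `uv` invocations and are not the actual setup commands:

```python
import subprocess
import sys

def run_chain(commands):
    """Run commands in order, stopping at the first non-zero exit,
    mirroring the shell's `a && b && c` short-circuit behavior."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return False  # later commands never run
    return True

if __name__ == "__main__":
    # Placeholders standing in for `uv sync`, `uv run dbt deps`, etc.
    ok = run_chain([
        [sys.executable, "-c", "print('step 1')"],
        [sys.executable, "-c", "print('step 2')"],
    ])
    print("setup complete" if ok else "setup failed")
```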

.github/workflows/ci.yml

Lines changed: 6 additions & 3 deletions
````diff
@@ -51,6 +51,9 @@ jobs:
 - name: Install dbt dependencies
   run: uv run dbt deps
 
+- name: Download sample data
+  run: uv run python scripts/download_data.py --defaults
+
 - name: Load seed data
   run: uv run dbt seed
 
@@ -65,8 +68,8 @@ jobs:
 
 - name: Verify database was created
   run: |
-    if [ ! -f workshop.duckdb ]; then
-      echo "Error: workshop.duckdb was not created"
+    if [ ! -f sandbox.duckdb ]; then
+      echo "Error: sandbox.duckdb was not created"
       exit 1
     fi
-    echo "workshop.duckdb exists ($(stat -c%s workshop.duckdb) bytes)"
+    echo "sandbox.duckdb exists ($(stat -c%s sandbox.duckdb) bytes)"
````
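The CI verification step is plain shell; the same existence-and-size check can be sketched in Python for anyone scripting it elsewhere (the `verify_database` helper is illustrative, not part of the repo):

```python
import os

def verify_database(path="sandbox.duckdb"):
    """Return True if the DuckDB file exists, printing its size,
    mirroring the CI 'Verify database was created' step."""
    if not os.path.isfile(path):
        print(f"Error: {path} was not created")
        return False
    size = os.path.getsize(path)
    print(f"{path} exists ({size} bytes)")
    return True
```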

README.md

Lines changed: 50 additions & 116 deletions
````diff
@@ -1,25 +1,15 @@
-# GTFS-RT DuckDB Workshop
+# GTFS-RT Sandbox
 
-Query real-time transit data using DuckDB and dbt.
+A sandbox environment for exploring transit operational data transformation patterns using DuckDB and dbt. Part of the **Common Transit Operations Data Framework**, this demo shows how raw operational data can be transformed into [TIDES](https://tides-transit.org/)-compliant analytics tables using architectural patterns that scale from a laptop to enterprise cloud infrastructure.
 
-## Overview
-
-This workshop demonstrates how to query GTFS Realtime parquet data from a public GCS bucket using DuckDB's httpfs extension and dbt for data transformation.
-
-**Data source**: `gs://parquet.gtfsrt.io/` (also available at <http://parquet.gtfsrt.io/>)
-
-Three feed types are available:
-
-- **vehicle_positions** - Real-time vehicle locations
-- **trip_updates** - Arrival/departure predictions
-- **service_alerts** - Service disruption notices
+This sandbox uses publicly available GTFS-RT feeds as source data. In production, you would typically use raw AVL system exports which contain richer data, but GTFS-RT provides an accessible starting point for learning the patterns.
 
 ## Quick Start
 
 ### Option 1: GitHub Codespaces (Recommended)
 
 1. Click the green "Code" button → "Open with Codespaces"
-2. Wait for the container to build (~2 minutes)
+2. Wait for setup (~3 minutes, includes sample data download)
 3. Run dbt:
 
    ```bash
@@ -29,7 +19,7 @@ Three feed types are available:
 4. Query your data:
 
    ```bash
-   duckdb workshop.duckdb -ui
+   duckdb sandbox.duckdb -ui
   ```
 
 ### Option 2: Local Setup
@@ -44,147 +34,88 @@ Three feed types are available:
 git clone https://github.com/JarvusInnovations/gtfsrt-sandbox.git
 cd gtfsrt-sandbox
 
-# Install Python dependencies
-uv sync
-uv run dbt deps
+# Install dependencies
+uv sync && uv run dbt deps
+
+# Download sample data (~30 seconds)
+uv run python scripts/download_data.py --defaults
 
-# Run dbt to download and transform data
+# Run dbt to create views
 uv run dbt run
 
 # Query the data
-duckdb workshop.duckdb -ui
+duckdb sandbox.duckdb -ui
 ```
 
 > **Note:** If you get a "Failed to download extension" error with `-ui`, see [DuckDB UI Extension Error](docs/troubleshooting.md#duckdb-ui-extension-error).
 
-## Choosing a Feed
-
-Available feeds are listed in `seeds/available_feeds.csv`. To use a different feed:
+## How It Works
 
-```bash
-# View available feeds
-duckdb -c "SELECT * FROM read_csv_auto('seeds/available_feeds.csv')"
-
-# Run dbt with specific feeds (one variable per feed type)
-uv run dbt run --vars '{
-  "vehicle_positions_feed": "aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L3ZlaGljbGVwb3NpdGlvbnM_YWdlbmN5PVND",
-  "trip_updates_feed": "aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L3RyaXB1cGRhdGVzP2FnZW5jeT1TQw",
-  "service_alerts_feed": "aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L3NlcnZpY2VhbGVydHM_YWdlbmN5PVND",
-  "start_date": "2026-01-04",
-  "end_date": "2026-01-04"
-}'
-```
+This sandbox uses a two-phase approach:
 
-### Feed Examples
+1. **Download data** (`download_data.py`) - fetches parquet files to `data/`
+2. **Transform data** (`dbt run`) - creates views in DuckDB reading from local files
 
-| Agency | Feed Type | base64url |
-|--------|-----------|-----------|
-| SEPTA Regional Rail | vehicle_positions | `aHR0cHM6Ly93d3czLnNlcHRhLm9yZy9ndGZzcnQvc2VwdGEtcGEtdXMvVmVoaWNsZS9ydFZlaGljbGVQb3NpdGlvbi5wYg` |
-| 511.org SC | vehicle_positions | `aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L3ZlaGljbGVwb3NpdGlvbnM_YWdlbmN5PVND` |
-| AC Transit | vehicle_positions | `aHR0cHM6Ly9hcGkuYWN0cmFuc2l0Lm9yZy90cmFuc2l0L2d0ZnNydC92ZWhpY2xlcw` |
-| Metrolink | vehicle_positions | `aHR0cHM6Ly9tZXRyb2xpbmstZ3Rmc3J0Lmdic2RpZ2l0YWwudXMvZmVlZC9ndGZzcnQtdmVoaWNsZXM` |
+This separation keeps dbt runs fast and makes the workflow easier to understand.
 
 ## Project Structure
 
 ```
 gtfsrt-sandbox/
-├── dbt_project.yml            # dbt configuration
-├── profiles.yml               # DuckDB connection settings
+├── data/                      # Downloaded parquet data (gitignored)
+│   ├── vehicle_positions/
+│   ├── trip_updates/
+│   └── service_alerts/
 ├── models/
-│   ├── staging/               # Data download & caching
+│   ├── staging/               # Views reading from data/
 │   │   ├── stg_vehicle_positions.sql
 │   │   ├── stg_trip_updates.sql
 │   │   └── stg_service_alerts.sql
-│   └── marts/                 # Analytics views
-│       ├── feed_summary.sql
-│       └── vehicle_activity.sql
-├── macros/
-│   └── read_gtfs_parquet.sql  # URL generation macro
-├── seeds/
-│   └── available_feeds.csv    # List of available feeds
+│   ├── intermediate/          # Transformations
+│   └── marts/                 # Analytics views
 └── scripts/
-    ├── explore_feeds.sql      # Direct DuckDB queries
-    ├── generate_feed_list.py  # Refresh feed list
-    └── prefetch_data.py       # Pre-download for offline use
+    └── download_data.py       # Data download script
 ```
 
-## How It Works
-
-1. **Staging models** download parquet data from the public GCS bucket
-2. Data is **cached locally** in `workshop.duckdb` as tables
-3. **Mart models** are views that query the cached staging tables
-4. Subsequent queries use **local data** (no repeated downloads)
-
-To refresh data: `uv run dbt run --full-refresh`
+## Downloading Different Data
 
-## Direct DuckDB Queries
+### See what's available
 
-You can query the data directly without dbt using `gs://` URLs with glob patterns:
-
-```sql
--- Start DuckDB CLI
-duckdb
-
--- Load httpfs extension
-INSTALL httpfs;
-LOAD httpfs;
-
--- Query with glob pattern (all dates for a feed)
-SELECT date, COUNT(*) as records
-FROM read_parquet(
-  'gs://parquet.gtfsrt.io/vehicle_positions/date=*/base64url=aHR0cHM6Ly93d3czLnNlcHRhLm9yZy9ndGZzcnQvc2VwdGEtcGEtdXMvVmVoaWNsZS9ydFZlaGljbGVQb3NpdGlvbi5wYg/data.parquet',
-  hive_partitioning=true
-)
-GROUP BY date;
-
--- Query all feeds for a date
-SELECT base64url, COUNT(*) as records
-FROM read_parquet(
-  'gs://parquet.gtfsrt.io/vehicle_positions/date=2026-01-04/base64url=*/data.parquet',
-  hive_partitioning=true
-)
-GROUP BY base64url;
+```bash
+uv run python scripts/download_data.py --list
 ```
 
-**Key advantage**: `gs://` URLs support glob patterns (`*`) for directory listing, while `http://` URLs do not.
-
-See `scripts/explore_feeds.sql` for more examples.
+### Download a different agency
 
-## Offline Use
+```bash
+uv run python scripts/download_data.py --agency septa --date 2026-01-20
+```
 
-To pre-download data for offline use:
+### Use a different date
 
 ```bash
-uv run python scripts/prefetch_data.py \
-  --feed-type vehicle_positions \
-  --feed-base64 aHR0cHM6Ly93d3czLnNlcHRhLm9yZy9ndGZzcnQvc2VwdGEtcGEtdXMvVmVoaWNsZS9ydFZlaGljbGVQb3NpdGlvbi5wYg \
-  --start-date 2026-01-01 \
-  --end-date 2026-01-07
+uv run python scripts/download_data.py --defaults --date 2026-01-20
 ```
 
-Files are saved to `data/` with the same Hive partition structure.
+See [docs/downloading_data.md](docs/downloading_data.md) for advanced options.
 
 ## Useful Commands
 
 ```bash
+# Download sample data
+uv run python scripts/download_data.py --defaults
+
 # Run all models
 uv run dbt run
 
 # Run specific model
 uv run dbt run --select stg_vehicle_positions
 
-# Force re-download (full refresh)
-uv run dbt run --full-refresh
-
-# Load seed data
-uv run dbt seed
-
-# Generate docs
-uv run dbt docs generate
-uv run dbt docs serve
+# Generate and view docs
+uv run dbt docs generate && uv run dbt docs serve
 
 # Query the database
-duckdb workshop.duckdb
+duckdb sandbox.duckdb -ui
```
 
 ## Data Schema
@@ -193,7 +124,8 @@ duckdb workshop.duckdb
 
 | Column | Type | Description |
 |--------|------|-------------|
-| partition_date | date | Date partition (from Hive partitioning) |
+| partition_date | date | Date partition |
+| feed_base64 | string | Base64url-encoded feed URL |
 | feed_timestamp | timestamp | When the feed was fetched |
 | vehicle_id | string | Vehicle identifier |
 | trip_id | string | Trip identifier |
@@ -206,7 +138,8 @@ duckdb workshop.duckdb
 
 | Column | Type | Description |
 |--------|------|-------------|
-| partition_date | date | Date partition (from Hive partitioning) |
+| partition_date | date | Date partition |
+| feed_base64 | string | Base64url-encoded feed URL |
 | feed_timestamp | timestamp | When the feed was fetched |
 | trip_id | string | Trip identifier |
 | stop_id | string | Stop identifier |
@@ -217,7 +150,8 @@ duckdb workshop.duckdb
 
 | Column | Type | Description |
 |--------|------|-------------|
-| partition_date | date | Date partition (from Hive partitioning) |
+| partition_date | date | Date partition |
+| feed_base64 | string | Base64url-encoded feed URL |
 | feed_timestamp | timestamp | When the feed was fetched |
 | header_text | string | Alert title |
 | description_text | string | Alert details |
@@ -226,8 +160,8 @@ duckdb workshop.duckdb
 
 ## Need Help?
 
-See [docs/troubleshooting.md](docs/troubleshooting.md) for common issues and solutions.
+See [docs/troubleshooting.md](docs/troubleshooting.md) for common issues and solutions, or [open an issue](https://github.com/JarvusInnovations/gtfsrt-sandbox/issues) if you're stuck.
 
 ## License
 
-Data sourced from public GTFS-RT feeds. Workshop materials are MIT licensed.
+Data sourced from public GTFS-RT feeds. Sandbox materials are MIT licensed.
````
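The new `feed_base64` column in each schema table holds the source feed URL encoded as base64url without padding, the same scheme used for the `base64url=` partition keys shown elsewhere in the diff. A sketch of producing and reversing such tokens with Python's standard library; the helper names are mine, not the repo's:

```python
import base64

def encode_feed_url(url: str) -> str:
    """Encode a feed URL as base64url with the '=' padding stripped."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")

def decode_feed_url(token: str) -> str:
    """Reverse the encoding, restoring padding to a multiple of 4."""
    padded = token + "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(padded).decode()

if __name__ == "__main__":
    # AC Transit vehicle positions feed, one of the example feeds in the old README
    url = "https://api.actransit.org/transit/gtfsrt/vehicles"
    token = encode_feed_url(url)
    print(token)
    assert decode_feed_url(token) == url
```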

dbt_project.yml

Lines changed: 1 addition & 17 deletions
````diff
@@ -7,32 +7,16 @@ profile: 'gtfsrt_sandbox'
 model-paths: ["models"]
 seed-paths: ["seeds"]
 macro-paths: ["macros"]
-docs-paths: ["docs"]
+docs-paths: ["docs/data"]
 target-path: "target"
 clean-targets:
   - "target"
   - "dbt_packages"
 
-vars:
-  # Feed URLs encoded as base64url (no padding)
-  # Default: AC Transit feeds (smaller, reliable, 106 routes)
-  vehicle_positions_feed: 'aHR0cHM6Ly9hcGkuYWN0cmFuc2l0Lm9yZy90cmFuc2l0L2d0ZnNydC92ZWhpY2xlcw'
-  trip_updates_feed: 'aHR0cHM6Ly9hcGkuYWN0cmFuc2l0Lm9yZy90cmFuc2l0L2d0ZnNydC90cmlwdXBkYXRlcw'
-  service_alerts_feed: 'aHR0cHM6Ly9hcGkuYWN0cmFuc2l0Lm9yZy90cmFuc2l0L2d0ZnNydC9hbGVydHM'
-  # directory prefix for where DuckDB should look for raw parquet files
-  # if it starts with gs://, will look to GCS
-  parquet_prefix: 'gs://parquet.gtfsrt.io'
-
-  # Date range for data query (adjust to available data)
-  start_date: '2026-01-24'
-  end_date: '2026-01-24'
-
 models:
   gtfsrt_sandbox:
     staging:
       +materialized: view
-      base:
-        +materialized: incremental
     intermediate:
       +materialized: view
     marts:
````
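The removed `parquet_prefix` variable pointed DuckDB at `gs://parquet.gtfsrt.io`, where parquet files sit under a `{feed_type}/date=YYYY-MM-DD/base64url={token}/data.parquet` Hive partition layout (the pattern visible in the old README's glob queries). A small sketch of assembling such a path; the function is hypothetical, not part of the repo:

```python
def parquet_path(prefix: str, feed_type: str, date: str, base64url: str) -> str:
    """Build a Hive-partitioned parquet path of the kind the old staging
    models read, under either gs://parquet.gtfsrt.io or a local data/ dir."""
    return f"{prefix}/{feed_type}/date={date}/base64url={base64url}/data.parquet"

if __name__ == "__main__":
    # The date and token here are placeholders, not real partition values.
    print(parquet_path("gs://parquet.gtfsrt.io", "vehicle_positions",
                       "2026-01-24", "abc123"))
```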
File renamed without changes.
File renamed without changes.
