Simplify workflow and improve beginner experience #12

Merged
themightychris merged 4 commits into main from themightychris/python-download on Jan 28, 2026

@themightychris commented Jan 28, 2026

Summary

Major refactoring to simplify the project workflow and improve the beginner experience:

  • Python-first data download: Replaced direct GCS access with a Python script for downloading data locally before running dbt
  • Simplified dbt models: Removed base models and feed/date variables; staging models now read directly from the local data/ directory
  • Dynamic inventory: Added --list and --agency commands to discover and download data for any available agency
  • Better docs organization: Separated human-readable docs from dbt doc fragments
  • Renamed to sandbox: Updated terminology from "workshop" to "sandbox" throughout
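As a rough illustration of what agency discovery from an inventory could look like, here is a minimal sketch. The schema of inventory.json is not shown in this PR, so the field names (`agency`, `feed_type`, `date`, `bytes`), the sample entries, and the helper names `summarize_by_agency` and `format_size` are all assumptions made for illustration, not the script's actual implementation.

```python
import json
from collections import defaultdict

# Hypothetical inventory layout: the real inventory.json schema is not shown
# in this PR, so these field names and values are placeholders.
SAMPLE_INVENTORY = json.loads("""
[
  {"agency": "AC Transit", "feed_type": "example_feed_a", "date": "2026-01-05", "bytes": 40000000},
  {"agency": "AC Transit", "feed_type": "example_feed_b", "date": "2026-01-06", "bytes": 60000000},
  {"agency": "Example Transit", "feed_type": "example_feed_a", "date": "2026-01-02", "bytes": 1500000}
]
""")


def summarize_by_agency(entries):
    """Group inventory entries by agency with a date range and total size."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["agency"]].append(entry)

    summary = {}
    for agency, items in grouped.items():
        dates = [item["date"] for item in items]  # ISO dates sort lexically
        summary[agency] = {
            "first_date": min(dates),
            "last_date": max(dates),
            "total_bytes": sum(item["bytes"] for item in items),
        }
    return summary


def format_size(num_bytes):
    """Render a byte count as a rough human-readable size (decimal units)."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1000 or unit == "GB":
            if unit == "B":
                return f"{num_bytes:.0f} {unit}"
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000


if __name__ == "__main__":
    for agency, info in sorted(summarize_by_agency(SAMPLE_INVENTORY).items()):
        print(f"{agency}: {info['first_date']} to {info['last_date']}, "
              f"~{format_size(info['total_bytes'])}")
```

Summing sizes per agency before any transfer starts is one way to show an estimated download size up front.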

Key Changes

Data Download Script (scripts/download_data.py)

  • Zero-config downloads with --defaults flag for AC Transit sample data
  • --list command to browse all available agencies from inventory.json
  • --agency command to download all feeds for a specific agency
  • Shows estimated download sizes before downloading
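The three entry points listed above could be wired up with `argparse` roughly as follows. This is a sketch, not the contents of scripts/download_data.py: only the flag names `--defaults`, `--list`, and `--agency` come from this PR, while treating them as mutually exclusive, the help texts, and the helper names `build_parser` and `describe_action` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser exposing the three modes named in the PR description."""
    parser = argparse.ArgumentParser(
        description="Download data locally before running dbt (illustrative sketch)."
    )
    # Assumption: exactly one mode per invocation. The real script may differ.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument(
        "--defaults",
        action="store_true",
        help="download the default sample data without further configuration",
    )
    mode.add_argument(
        "--list",
        action="store_true",
        dest="list_agencies",  # avoid shadowing the built-in name `list`
        help="list agencies available in the inventory",
    )
    mode.add_argument(
        "--agency",
        metavar="NAME",
        help="download all feeds for the named agency",
    )
    return parser


def describe_action(args):
    """Return a short description of what the parsed arguments request."""
    if args.defaults:
        return "download default sample data"
    if args.list_agencies:
        return "list available agencies"
    return f"download all feeds for {args.agency}"


if __name__ == "__main__":
    # Parse an explicit argument list so the demo does not depend on sys.argv.
    demo_args = build_parser().parse_args(["--agency", "AC Transit"])
    print(describe_action(demo_args))
```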

Simplified dbt Architecture

  • Removed macros/read_gtfs_parquet.sql - logic inlined into staging models
  • Deleted base models (models/staging/base/)
  • Removed feed URL and date variables from dbt_project.yml
  • Removed httpfs extension from profiles (no longer needed for GCS)
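Because the staging models now depend on files already present under data/, a small pre-flight check can make a missing download obvious before dbt runs. The sketch below is not part of this PR; the function names and the assumption that downloads land as `.parquet` files somewhere under data/ are illustrative only.

```python
import tempfile
from pathlib import Path


def find_parquet_files(data_dir="data"):
    """Return sorted Parquet files under data_dir, searching recursively."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(root.glob("**/*.parquet"))


def check_data_ready(data_dir="data"):
    """Return (ok, message) describing whether local data is available."""
    files = find_parquet_files(data_dir)
    if not files:
        return False, f"No Parquet files found under {data_dir}; run the download script first."
    return True, f"Found {len(files)} Parquet file(s) under {data_dir}."


if __name__ == "__main__":
    # Demonstrate against a temporary directory instead of a real checkout.
    with tempfile.TemporaryDirectory() as tmp:
        print(check_data_ready(tmp)[1])
        nested = Path(tmp) / "example_feed"  # hypothetical subdirectory name
        nested.mkdir()
        (nested / "part-0.parquet").touch()
        print(check_data_ready(tmp)[1])
```

A check like this could run in CI or a devcontainer hook between the download step and `dbt run`.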

Developer Experience

  • Codespaces auto-download sample data on creation
  • CI workflow includes data download step
  • New docs/downloading_data.md with comprehensive usage examples
  • Reorganized docs/ - dbt fragments moved to docs/data/

Terminology

  • Renamed database from workshop.duckdb to sandbox.duckdb
  • Updated all documentation to use "sandbox" terminology

Testing

  • Download script tested (3 feed types, ~110MB total)
  • dbt run completed (8 models, all passing)
  • dbt test passed (5 tests)
  • Data queries verified

Generated with Claude Code

themightychris and others added 4 commits January 27, 2026 19:08
- Rename prefetch_data.py → download_data.py with --defaults flag for zero-config download
- Simplify macro to read all data from local data/ directory with glob patterns
- Remove base models layer (incremental models with date filtering)
- Update staging models to use macro directly, add feed_base64 column
- Remove feed/date variables from dbt_project.yml (no longer needed)
- Remove httpfs extension from profiles.yml (no longer reading from gs://)
- Auto-download sample data in devcontainer and CI workflow
- Rewrite README for new two-phase workflow (download → transform)
- Create comprehensive docs/downloading_data.md with examples
- Update schema documentation to reflect new architecture

This change improves the beginner experience by:
- Separating download time from dbt run time
- Removing complex configuration (feed variables, date ranges)
- Simplifying the dbt layer (no incremental materialization)
- Making it obvious that data must be downloaded before running dbt

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

- Add inventory.json integration for live feed discovery
- Add --list command to show available agencies with date ranges
- Add --agency flag for downloading all feeds by agency name
- Show estimated download sizes before downloading
- Remove read_gtfs_parquet macro (inline read_parquet in staging models)
- Delete seeds/available_feeds.csv and scripts/generate_feed_list.py
- Update README and docs with simplified workflow

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Separates dbt documentation fragments from human-facing docs by moving
them to docs/data/ and updating docs-paths configuration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Update project terminology to better reflect its purpose as a sandbox
environment for exploring transit operational data transformation patterns.

- Rename database file from workshop.duckdb to sandbox.duckdb
- Update README with new introduction describing the project context
- Add link to TIDES specification and mention Common Transit Operations
  Data Framework
- Update all references in CI, scripts, and documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@themightychris changed the title from "Simplify workflow: Python-first data download" to "Simplify workflow and improve beginner experience" on Jan 28, 2026
@themightychris merged commit 1a4e481 into main on Jan 28, 2026
2 checks passed
@themightychris deleted the themightychris/python-download branch on January 28, 2026 01:04