Basic design brain dump, feel free to make suggestions / comments / challenge anything.
Raw GTFS Scraper
A script on a timer (1 minute? probably configurable) that reads the API endpoint and just dumps the raw data to an S3 file (gzipped or similar if we care about compression / want to save some space). The files can be named based on scrape time.
The purpose of having this as an independent script is to ensure that even if later steps fail, we are still storing the raw data and can backfill once the downstream pieces are fixed. This job should be as simple as possible, since a failure here means unrecoverable data loss.
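A minimal sketch of what that scraper could look like; the feed URL, bucket name, and the schedule (a sleep loop vs. cron) are all placeholders / up for discussion:

```python
import gzip
import time
from datetime import datetime, timezone

import boto3
import requests

FEED_URL = "https://example.com/gtfs-rt/vehicle-positions"  # placeholder
BUCKET = "gtfs-raw-dumps"                                    # placeholder

s3 = boto3.client("s3")


def scrape_once() -> None:
    # Grab the raw protobuf bytes; do NOT parse here -- parsing belongs to the
    # normalizer, so a schema bug can't cost us raw data.
    resp = requests.get(FEED_URL, timeout=30)
    resp.raise_for_status()

    # Name the object after the scrape time so backfills can be ordered.
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    s3.put_object(Bucket=BUCKET, Key=f"raw/{ts}.pb.gz", Body=gzip.compress(resp.content))


if __name__ == "__main__":
    while True:  # or drop the loop and schedule via cron every minute
        try:
            scrape_once()
        except Exception as exc:  # keep the loop alive; a missed minute is lost data
            print(f"scrape failed: {exc}")
        time.sleep(60)
```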
Normalizer
Reads the S3 files produced by the scraper, does some normalization (actual processing TBD based on what the data looks like), and writes to a data lake table. I think it makes sense for this to read the output table to get the latest timestamp (high watermark) and then load ALL scraper files newer than the current watermark.
We should spend some time to look at the actual protobuf definition and come up with a standardized schema for the table we're looking to build.
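For illustration, here's a rough sketch of the normalizer's inner loop, assuming the timestamped key layout from the scraper sketch above and the official gtfs-realtime-bindings package. The columns are just a guess at what the standardized schema might contain, and the watermark lookup / table write depend on the storage format we pick:

```python
import gzip

import boto3
from google.transit import gtfs_realtime_pb2

BUCKET = "gtfs-raw-dumps"  # placeholder; must match the scraper

s3 = boto3.client("s3")


def pending_keys(high_watermark_key: str) -> list[str]:
    # S3 keys sort lexicographically, so with the timestamped names used by
    # the scraper, StartAfter returns exactly the objects past the watermark.
    paginator = s3.get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=BUCKET, Prefix="raw/", StartAfter=high_watermark_key):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def normalize_object(key: str) -> list[dict]:
    # Turn one raw dump into flat rows; the fields below are illustrative,
    # not the final schema.
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(gzip.decompress(body))

    rows = []
    for entity in feed.entity:
        if not entity.HasField("vehicle"):
            continue
        v = entity.vehicle
        rows.append({
            "source_key": key,
            "feed_timestamp": feed.header.timestamp,
            "vehicle_id": v.vehicle.id,
            "trip_id": v.trip.trip_id,
            "route_id": v.trip.route_id,
            "latitude": v.position.latitude,
            "longitude": v.position.longitude,
        })
    return rows
```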
Storage Data Lake Table
There are a bunch of solid open source solutions we can use. I think Anthony mentioned some familiarity with Databricks (they have a pseudo-open source solution called Delta Lake)? I personally like Apache Iceberg. I've heard good things about Apache Hudi but have never personally worked with it.
The table should be partitioned by date (day) and indexed by either time or bus system (probably time).
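If we went with Iceberg, the table could look something like this (Spark SQL DDL; `spark` is assumed to be an existing SparkSession wired up to an Iceberg catalog I'm calling `lake`, and the column names are placeholders until we settle the schema):

```python
# Daily partitions via Iceberg's hidden partitioning on the event timestamp.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.gtfs.vehicle_positions (
        event_time   TIMESTAMP,
        vehicle_id   STRING,
        trip_id      STRING,
        route_id     STRING,
        latitude     DOUBLE,
        longitude    DOUBLE
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")

# Iceberg's SQL extensions let us declare a sort order, so data files are
# clustered by time within each daily partition (the "index" mentioned above).
spark.sql("ALTER TABLE lake.gtfs.vehicle_positions WRITE ORDERED BY event_time")
```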
Compaction
Given we're essentially writing 1-minute files, queries will hit a ton of files, which is inefficient on the read side. We have some options here:
1. Rewrite the full partition to a single file on each write operation (a little wasteful: each write operation re-writes the full day's data)
2. Do compaction on some scheduled interval, like end of day (queries for recent data will be slow until the compaction occurs)
3. Don't do compaction (just deal with poor read performance)
(1) or (2) is probably the right decision here; I'm personally in favor of (2). If applicable to our data table of choice, we should also delete old table snapshots / metadata files.
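If we pick Iceberg and go with (2), the end-of-day maintenance could be a small scheduled Spark job along these lines (catalog/table names follow the DDL sketch above; the retention window is just an example):

```python
from datetime import datetime, timedelta, timezone

# Collapse the day's many 1-minute files into a few larger ones. A `where`
# argument could scope this to just the most recent partition if rewriting
# everything turns out to be too slow.
spark.sql("CALL lake.system.rewrite_data_files(table => 'gtfs.vehicle_positions')")

# Expire snapshots (and their metadata files) older than a week so the table
# doesn't accumulate cruft from minute-level writes.
cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL lake.system.expire_snapshots(
        table => 'gtfs.vehicle_positions',
        older_than => TIMESTAMP '{cutoff}'
    )
""")
```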
Client
Maybe it would be nice to provide some lightweight Python client that reads data from the table and serves it up as a pandas DataFrame or something?
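One possible shape for that client, assuming we land on Iceberg and use pyiceberg (catalog/table names follow the earlier sketches; the filter syntax is illustrative):

```python
import pandas as pd
from pyiceberg.catalog import load_catalog


def load_positions(start: str, end: str) -> pd.DataFrame:
    """Return vehicle positions with start <= event_time < end as a DataFrame."""
    catalog = load_catalog("lake")  # connection details come from pyiceberg config
    table = catalog.load_table("gtfs.vehicle_positions")
    scan = table.scan(
        row_filter=f"event_time >= '{start}' and event_time < '{end}'",
    )
    return scan.to_arrow().to_pandas()


# e.g. df = load_positions("2025-01-01T00:00:00", "2025-01-02T00:00:00")
```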
Data Visualizations / Nice to Haves
I like Apache Superset (Anthony mentioned using this), but I realistically think it could get expensive if you host it and open it up for anyone to use your compute for random visualizations / analysis.
Deployment
I like the idea of a docker-compose file that anyone can use on an EC2 instance or something. I'm not a cloud expert, but if you're running this 24/7 it would probably be cheaper this way than calling AWS Lambda once a minute.
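A rough sketch of what that compose file might look like; service names, build paths, and environment variables are all placeholders for whatever we end up building:

```yaml
services:
  scraper:
    build: ./scraper
    restart: unless-stopped          # keep the 1-minute loop running 24/7
    environment:
      FEED_URL: https://example.com/gtfs-rt/vehicle-positions
      RAW_BUCKET: gtfs-raw-dumps
  normalizer:
    build: ./normalizer
    restart: unless-stopped          # polls for new raw files past the watermark
    environment:
      RAW_BUCKET: gtfs-raw-dumps
```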