Basic design brain dump, feel free to make suggestions / comments / challenge anything.
Raw GTFS Scraper
A script on a timer (1 minute? probably configurable) that reads the API endpoint and just dumps the raw data to an S3 file (gzipped or similar if we care about compression / want to save some space). The files can be named based on scrape time.
The purpose of having this as an independent script is to ensure that even if later steps fail, we are still storing the raw data and can backfill once the downstream pieces are fixed. This job should be as simple as possible, since a failure here means unrecoverable data loss.
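A minimal sketch of what that scraper could look like; the feed URL, bucket name, and the schedule (a sleep loop vs. cron) are all placeholders / up for discussion:

```python
import gzip
import time
from datetime import datetime, timezone

import boto3
import requests

FEED_URL = "https://example.com/gtfs-rt/vehicle-positions"  # placeholder
BUCKET = "gtfs-raw-dumps"                                    # placeholder

s3 = boto3.client("s3")


def scrape_once() -> None:
    # Grab the raw protobuf bytes; do NOT parse here -- parsing belongs to the
    # normalizer, so a schema bug can't cost us raw data.
    resp = requests.get(FEED_URL, timeout=30)
    resp.raise_for_status()

    # Name the object after the scrape time so backfills can be ordered.
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    s3.put_object(Bucket=BUCKET, Key=f"raw/{ts}.pb.gz", Body=gzip.compress(resp.content))


if __name__ == "__main__":
    while True:  # or drop the loop and schedule via cron every minute
        try:
            scrape_once()
        except Exception as exc:  # keep the loop alive; a missed minute is lost data
            print(f"scrape failed: {exc}")
        time.sleep(60)
```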
Normalizer
Reads the S3 files produced by the scraper, does some normalization (actual processing TBD based on what the data looks like), and writes to a data lake table. I think it makes sense for this to read the output table to get the latest timestamp (high watermark) and then load ALL scraper files newer than the current watermark.
We should spend some time to look at the actual protobuf definition and come up with a standardized schema for the table we're looking to build.
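For illustration, here's a rough sketch of the normalizer's inner loop, assuming the timestamped key layout from the scraper sketch above and the official gtfs-realtime-bindings package. The columns are just a guess at what the standardized schema might contain, and the watermark lookup / table write depend on the storage format we pick:

```python
import gzip

import boto3
from google.transit import gtfs_realtime_pb2

BUCKET = "gtfs-raw-dumps"  # placeholder; must match the scraper

s3 = boto3.client("s3")


def pending_keys(high_watermark_key: str) -> list[str]:
    # S3 keys sort lexicographically, so with the timestamped names used by
    # the scraper, StartAfter returns exactly the objects past the watermark.
    paginator = s3.get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=BUCKET, Prefix="raw/", StartAfter=high_watermark_key):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def normalize_object(key: str) -> list[dict]:
    # Turn one raw dump into flat rows; the fields below are illustrative,
    # not the final schema.
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(gzip.decompress(body))

    rows = []
    for entity in feed.entity:
        if not entity.HasField("vehicle"):
            continue
        v = entity.vehicle
        rows.append({
            "source_key": key,
            "feed_timestamp": feed.header.timestamp,
            "vehicle_id": v.vehicle.id,
            "trip_id": v.trip.trip_id,
            "route_id": v.trip.route_id,
            "latitude": v.position.latitude,
            "longitude": v.position.longitude,
        })
    return rows
```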
Storage Data Lake Table
There are a bunch of solid open source solutions we can use. I think Anthony mentioned some familiarity with Databricks (they have a pseudo-open source solution called Delta Lake)? I personally like Apache Iceberg. I've heard good things about Apache Hudi but have never personally worked with it.
The table should be partitioned by date (day) and indexed by either time or bus system (probably time).
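If we went with Iceberg, the table could look something like this (Spark SQL DDL; `spark` is assumed to be an existing SparkSession wired up to an Iceberg catalog I'm calling `lake`, and the column names are placeholders until we settle the schema):

```python
# Daily partitions via Iceberg's hidden partitioning on the event timestamp.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.gtfs.vehicle_positions (
        event_time   TIMESTAMP,
        vehicle_id   STRING,
        trip_id      STRING,
        route_id     STRING,
        latitude     DOUBLE,
        longitude    DOUBLE
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")

# Iceberg's SQL extensions let us declare a sort order, so data files are
# clustered by time within each daily partition (the "index" mentioned above).
spark.sql("ALTER TABLE lake.gtfs.vehicle_positions WRITE ORDERED BY event_time")
```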
Compaction
Given we're essentially writing 1-minute files, queries will hit a ton of files, which is inefficient on the read side. We have some options here:
1. Rewrite the full partition to a single file on each write operation (a little wasteful: each write operation re-writes the full day's data)
2. Do compaction on some scheduled interval, like end of day (queries for recent data will be slow until the compaction occurs)
3. Don't do compaction (just deal with poor read performance)
(1) or (2) is probably the right decision here; I'm personally in favor of (2). If applicable to our data table of choice, we should also delete old table snapshots / metadata files.
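If we pick Iceberg and go with (2), the end-of-day maintenance could be a small scheduled Spark job along these lines (catalog/table names follow the DDL sketch above; the retention window is just an example):

```python
from datetime import datetime, timedelta, timezone

# Collapse the day's many 1-minute files into a few larger ones. A `where`
# argument could scope this to just the most recent partition if rewriting
# everything turns out to be too slow.
spark.sql("CALL lake.system.rewrite_data_files(table => 'gtfs.vehicle_positions')")

# Expire snapshots (and their metadata files) older than a week so the table
# doesn't accumulate cruft from minute-level writes.
cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL lake.system.expire_snapshots(
        table => 'gtfs.vehicle_positions',
        older_than => TIMESTAMP '{cutoff}'
    )
""")
```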
Client
Maybe it would be nice to provide some lightweight Python client that reads data from the table and serves it up as a pandas DataFrame or something?
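One possible shape for that client, assuming we land on Iceberg and use pyiceberg (catalog/table names follow the earlier sketches; the filter syntax is illustrative):

```python
import pandas as pd
from pyiceberg.catalog import load_catalog


def load_positions(start: str, end: str) -> pd.DataFrame:
    """Return vehicle positions with start <= event_time < end as a DataFrame."""
    catalog = load_catalog("lake")  # connection details come from pyiceberg config
    table = catalog.load_table("gtfs.vehicle_positions")
    scan = table.scan(
        row_filter=f"event_time >= '{start}' and event_time < '{end}'",
    )
    return scan.to_arrow().to_pandas()


# e.g. df = load_positions("2025-01-01T00:00:00", "2025-01-02T00:00:00")
```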
Data Visualizations / Nice to Haves
I like Apache Superset (Anthony mentioned using this), but I realistically think it could get expensive if you host it and open it up for anyone to use your compute for random visualizations / analysis.
Deployment
I like the idea of a docker-compose file that anyone can use on an EC2 instance or something. I'm not a cloud expert, but if you're running this 24/7 it would probably be cheaper this way than calling AWS Lambda once a minute.
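A rough sketch of what that compose file might look like; service names, build paths, and environment variables are all placeholders for whatever we end up building:

```yaml
services:
  scraper:
    build: ./scraper
    restart: unless-stopped          # keep the 1-minute loop running 24/7
    environment:
      FEED_URL: https://example.com/gtfs-rt/vehicle-positions
      RAW_BUCKET: gtfs-raw-dumps
  normalizer:
    build: ./normalizer
    restart: unless-stopped          # polls for new raw files past the watermark
    environment:
      RAW_BUCKET: gtfs-raw-dumps
```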