Description
We’ve identified a recurring pattern where multiple GTFS-Realtime feeds reference the same GTFS schedule feed, which can result in trips from different RT feeds resolving to the same trip_instance_key. When this happens, trip-level metrics in downstream tables (e.g., fct_observed_trip) and dashboards may unintentionally accumulate data from multiple RT feeds.
We’ve seen this scenario at multiple agencies (e.g., agencies that are transitioning systems, running parallel feeds, or exposing both customer-facing and non-customer-facing/test feeds). In at least one historical case (Torrance Transit), this was addressed via a hard-coded exclusion of a specific RT feed in the ETL/model logic feeding fct_observed_trip.
While that approach resolves the immediate issue, it does not scale well and requires maintaining one-off exclusions by agency or feed identifier.
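To make the accumulation concrete, here is a minimal Python sketch with made-up rows. The feed names, keys, and the `message_count` column are hypothetical and only stand in for whatever trip-level measures the real tables carry; the actual models are not reproduced here.

```python
from collections import defaultdict

# Hypothetical rows: two RT feeds reference the same schedule feed, so the
# same scheduled trip resolves to one trip_instance_key for both of them.
rows = [
    {"rt_feed": "feed_a", "trip_instance_key": "trip_001", "message_count": 120},
    {"rt_feed": "feed_b", "trip_instance_key": "trip_001", "message_count": 45},
    {"rt_feed": "feed_a", "trip_instance_key": "trip_002", "message_count": 80},
]


def summarize_by_trip(rows):
    """Aggregate a trip-level metric keyed only on trip_instance_key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["trip_instance_key"]] += row["message_count"]
    return dict(totals)


trip_totals = summarize_by_trip(rows)
# trip_001 now mixes feed_a and feed_b (120 + 45), which is the unintended
# accumulation described above; trip_002 is unaffected.
print(trip_totals)
```

Because the aggregation key does not include the RT feed, nothing in the grouping itself can tell the two feeds apart.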
Proposed direction
Instead of hard-coded exclusions, we should generalize the filtering logic in the formation of fct_observed_trip by:
- Systematically excluding non-customer-facing RT feeds using existing metadata/flags (e.g., customer_facing, private_dataset, or equivalent), and
- Applying this logic uniformly so that trip-level metrics are derived only from the intended customer-facing feed(s).
This would make the pipeline more robust to similar scenarios in the future and reduce the need for agency-specific exceptions.
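As a rough illustration of the proposed direction, the sketch below filters observations by feed metadata instead of by a hard-coded feed identifier. It is written in Python for readability rather than in the pipeline's own modeling language. The `RtFeed` class, the field names, and the rule "customer_facing and not private_dataset" are assumptions; which flag is authoritative is still an open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RtFeed:
    # Hypothetical metadata record; field names mirror the flags mentioned
    # in this issue but are not taken from the actual schema.
    feed_key: str
    customer_facing: bool
    private_dataset: bool


def contributes_to_observed_trips(feed: RtFeed) -> bool:
    """Assumed rule: keep customer-facing feeds that are not private."""
    return feed.customer_facing and not feed.private_dataset


def filter_observations(observations, feeds):
    """Keep only observations whose RT feed passes the metadata rule."""
    allowed = {f.feed_key for f in feeds if contributes_to_observed_trips(f)}
    return [o for o in observations if o["rt_feed_key"] in allowed]


feeds = [
    RtFeed("rt_prod", customer_facing=True, private_dataset=False),
    RtFeed("rt_test", customer_facing=False, private_dataset=True),
]
observations = [
    {"rt_feed_key": "rt_prod", "trip_instance_key": "trip_001"},
    {"rt_feed_key": "rt_test", "trip_instance_key": "trip_001"},
]

kept = filter_observations(observations, feeds)
print(kept)
```

The rule lives in one function, so changing the authoritative flag would not require touching any agency-specific logic.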
Open questions
- Which flag(s) should be considered authoritative for determining whether an RT feed should contribute to fct_observed_trip?
- Should this filtering be enforced strictly at the fct_observed_trip layer, or earlier in the pipeline (e.g., fct_trip_updates_trip_summaries)?
Considerations / Caveats
In some cases, we intentionally define a GTFS-RT feed as a temporary or test feed for research or exploratory analysis (e.g., feeds where the private_dataset flag is set to true). Automatically excluding these feeds from fct_observed_trip could limit analysts’ ability to perform certain analyses or validations that rely on test data.
Because of this, any generalized filtering approach should consider:
- whether test / private datasets need to be accessible for analysis in specific contexts, and
- whether there should be a documented way to opt in to including these feeds (e.g., via alternative tables, query overrides, or explicit flags) when needed for research or debugging.
This suggests that the filtering logic should be systematic but flexible, rather than a hard exclusion with no escape hatch.
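One possible shape for such an escape hatch, sketched in Python under the same assumptions as the flags discussed above: the default selection excludes non-customer-facing and private feeds, and an analyst can explicitly opt specific feeds back in. The function name, the `opt_in` parameter, and the feed keys are hypothetical.

```python
def select_feed_keys(feeds, opt_in=frozenset()):
    """Return feed keys that should contribute to trip-level metrics.

    Default rule (an assumption): customer-facing and not private.
    Feeds named in `opt_in` are included regardless of their flags.
    """
    selected = set()
    for feed in feeds:
        default_ok = feed["customer_facing"] and not feed["private_dataset"]
        if default_ok or feed["feed_key"] in opt_in:
            selected.add(feed["feed_key"])
    return selected


feeds = [
    {"feed_key": "rt_prod", "customer_facing": True, "private_dataset": False},
    {"feed_key": "rt_test", "customer_facing": False, "private_dataset": True},
]

default_keys = select_feed_keys(feeds)
research_keys = select_feed_keys(feeds, opt_in={"rt_test"})
print(default_keys, research_keys)
```

Keeping the opt-in explicit and per-feed means the default output stays clean while research or debugging queries can still reach test data on purpose.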