feat(flags): enable etag on hypercache writes#57049

Draft
matheus-vb wants to merge 3 commits into master from matheus-vb/enable-etag-write

Conversation

@matheus-vb
Member

Problem

PR #54793 shipped an in-memory FlagDefinitionsCache keyed on (team_id, etag), gated by a Redis GET on cache/teams/{team_id}/feature_flags/flags.json:etag. After deploy, flags_definitions_inmem_cache_no_version_total{reason="etag_missing"} held at ~10 000/s, so ~80% of /flags requests bypassed the cache and ran the full payload load (Redis fetch, ZSTD decompression, pickle and JSON deserialization, regex compilation) on every request. There was no correctness regression, but almost none of the expected performance gain materialized.
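To illustrate why a missing etag defeats the cache, here is a minimal Python sketch of the gating logic described above. The real reader is Rust; the class name, the callables, and the counters here are hypothetical stand-ins, not the merged implementation.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class FlagDefinitionsCacheSketch:
    """Toy model of an in-memory cache keyed on (team_id, etag)."""

    def __init__(
        self,
        get_etag: Callable[[int], Optional[str]],
        load_payload: Callable[[int], Any],
    ) -> None:
        self._get_etag = get_etag          # stands in for the Redis GET on the :etag key
        self._load_payload = load_payload  # stands in for the full payload load
        self._entries: Dict[Tuple[int, str], Any] = {}
        self.etag_missing = 0              # models the reason="etag_missing" counter
        self.hits = 0

    def get(self, team_id: int) -> Any:
        etag = self._get_etag(team_id)
        if etag is None:
            # No version to key on: bypass the cache and pay the full load.
            self.etag_missing += 1
            return self._load_payload(team_id)
        key = (team_id, etag)
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        value = self._load_payload(team_id)
        self._entries[key] = value
        return value
```

With an etag lookup that always returns None, every call falls through to the full load, which matches the sustained etag_missing rate seen after deploy.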

The Rust hot path calls get_etag on flags_hypercache_reader (namespace feature_flags, value flags.json), which the Django writer flags_hypercache at posthog/models/feature_flag/flags_cache.py:448 populated without enable_etag=True. Under that default, _set_cache_value_redis writes the payload and actively deletes any etag key, so the etag was never present for the reader to find. The neighbouring flag_definitions_hypercache (value flags_with_cohorts.json) already had enable_etag=True, but it feeds a different Rust reader used by the SDK /flags/definitions endpoint, not the production /flags hot path.

Changes

flags_hypercache now passes enable_etag=True. _set_cache_value_redis takes the etag-aware branch and writes the payload and etag together via set_many, with the same TTL, on every cache write (signal-driven, scheduled refresh, daily mass sync). The __missing__ sentinel still deletes the etag, so empty teams land on the sentinel reason rather than etag_missing, per the merged Rust loader logic.
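The three write paths described above can be sketched roughly as follows. This is a simplified model, not the body of _set_cache_value_redis: a plain dict stands in for Redis, dict.update stands in for set_many, and the etag derivation (truncated SHA-256 of the serialized payload) is an assumption chosen only to produce a 16-character hex string.

```python
import hashlib
import json
from typing import Any, Dict, Tuple

MISSING = "__missing__"  # sentinel value named in the PR description

Store = Dict[str, Tuple[Any, int]]  # key -> (value, ttl_seconds); stand-in for Redis


def etag_key(payload_key: str) -> str:
    return f"{payload_key}:etag"


def compute_etag(serialized: bytes) -> str:
    # Assumption: any stable 16-hex-character digest would do for this sketch.
    return hashlib.sha256(serialized).hexdigest()[:16]


def set_cache_value_sketch(
    store: Store, key: str, data: Any, ttl: int, enable_etag: bool
) -> None:
    if data == MISSING:
        # Sentinel write: keep the sentinel, drop any prior etag.
        store[key] = (MISSING, ttl)
        store.pop(etag_key(key), None)
        return
    serialized = json.dumps(data, sort_keys=True).encode()
    if enable_etag:
        # Etag-aware branch: payload and etag land together with the same TTL.
        store.update(
            {
                key: (serialized, ttl),
                etag_key(key): (compute_etag(serialized), ttl),
            }
        )
    else:
        # Default branch: write the payload and actively delete the etag key.
        store[key] = (serialized, ttl)
        store.pop(etag_key(key), None)
```

The default branch is the behaviour that left the reader without an etag; switching the flag only changes which branch runs, not when writes happen.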

verify_team_flags gains a MISSING_ETAG mismatch branch after the existing MISSING_EVALUATION_METADATA check. The daily verify_flags_cache management command now reports MISSING_ETAG aggregates, so future regressions of this class (someone disables enable_etag, eviction skew empties the etag, a new hypercache lands without it) cannot recur silently.
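A rough sketch of the check ordering and the aggregate report, under stated assumptions: the function names, the boolean input, and the "match" status are hypothetical, and only the issue names and the status="mismatch" value come from this PR.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def verify_team_flags_sketch(
    has_evaluation_metadata: bool, etag: Optional[str]
) -> Dict[str, Optional[str]]:
    # Existing check runs first.
    if not has_evaluation_metadata:
        return {"status": "mismatch", "issue": "MISSING_EVALUATION_METADATA"}
    # New branch: a payload without an etag is reported as a mismatch.
    if not etag:
        return {"status": "mismatch", "issue": "MISSING_ETAG"}
    return {"status": "match", "issue": None}  # "match" is a placeholder status


def aggregate_issues(results: Iterable[Dict[str, Optional[str]]]) -> Counter:
    # Count mismatches per issue, the way a daily report might summarise them.
    return Counter(r["issue"] for r in results if r["status"] == "mismatch")
```

Placing the etag check last means a team with several problems is reported under the earlier issue, so MISSING_ETAG counts only teams whose payload is otherwise healthy.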

No Rust changes. The merged PR's Ok(Some(etag)), Ok(None), and Err branches are forward-compatible. No S3 changes; the Rust get_etag reads only from Redis. No new triggers; etags appear at the existing write cadence, and the daily mass sync covers all 363 674 teams within ~24h.

How did you test this code?

hogli test posthog/models/feature_flag/test/test_flags_cache.py runs 177 of 177 green. Four new regression tests cover the etag round trip:

  • test_update_flags_cache_writes_etag confirms update_flags_cache produces a 16-character hex etag.
  • test_clear_flags_cache_clears_etag confirms clear_cache removes both the payload and the etag.
  • test_missing_sentinel_clears_etag confirms the __missing__ sentinel deletes any prior etag.
  • test_verify_detects_missing_etag primes a payload without etag state and asserts verify_team_flags returns status="mismatch", issue="MISSING_ETAG".

hogli test posthog/storage/test/test_hypercache.py runs 55 of 55 green (storage layer untouched). ruff check and ruff format pass on both changed files.

Production impact was verified against current prod metrics. The dedicated flags Redis sees pipelined SET ops double (from 1 to 2 per cache write), at peaks of ~22/sec during the daily mass sync window, plus ~31 MB of additional storage across 363 674 teams. Payload bandwidth on the Rust read side drops ~95% (~72 MB/s to ~4 MB/s) once etags are populated, so net Redis load decreases. No backfill is triggered.
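As a sanity check on the figures above, the per-team storage cost and the bandwidth reduction can be recomputed from the quoted numbers alone. This sketch assumes 1 MB = 10^6 bytes and treats the quoted values as exact, which they are not.

```python
TEAMS = 363_674
EXTRA_STORAGE_BYTES = 31 * 1_000_000  # ~31 MB, assuming decimal megabytes
READ_BEFORE_MB_S = 72
READ_AFTER_MB_S = 4
SECONDS_PER_DAY = 24 * 60 * 60

# Average extra bytes per team (etag key plus 16-character value).
bytes_per_team = EXTRA_STORAGE_BYTES / TEAMS

# Fractional drop in read-side payload bandwidth.
read_reduction = (READ_BEFORE_MB_S - READ_AFTER_MB_S) / READ_BEFORE_MB_S

# Average team writes per second if the mass sync were spread evenly over 24h.
avg_sync_rate = TEAMS / SECONDS_PER_DAY

print(f"{bytes_per_team:.0f} bytes/team, "
      f"{read_reduction:.1%} read reduction, "
      f"{avg_sync_rate:.1f} teams/s average")
```

Roughly 85 bytes per team is in line with one key of the form cache/teams/{team_id}/feature_flags/flags.json:etag holding a 16-character value, and the bandwidth ratio comes out slightly under the quoted ~95%.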

Watch on rollout: flags_definitions_inmem_cache_no_version_total{reason="etag_missing"} falls from ~10 000/s toward single digits over 24h while {reason="sentinel"} stays flat. Cache hit rate rises from ~165/s toward the /flags request rate. Pyroscope CPU on serde_json::Value::deserialize, serde_pickle::de::parse_value, ZSTD_decompressContinue, and prepare_regexes_in_place collapses on the feature flags service pods.

Publish to changelog?

no

@tests-posthog
Contributor

tests-posthog Bot commented Apr 29, 2026

Query snapshots: Backend query snapshots updated

Changes: 1 snapshots (1 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →

@matheus-vb matheus-vb force-pushed the matheus-vb/enable-etag-write branch from 0460065 to a210447 Compare April 30, 2026 03:40
@tests-posthog
Contributor

tests-posthog Bot commented Apr 30, 2026

Query snapshots: Backend query snapshots updated

Changes: 2 snapshots (2 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →
