
GVL Startup Fetch Creates Thundering Herd at Scale — Need Cluster-Wide Caching #4687

@scr-oath


Problem

Every Prebid Server pod fetches the entire Global Vendor List history from vendor-list.consensu.org at startup. As of today, that's ~369 sequential HTTP requests per pod (TCF v2: 223 versions, TCF v3: 146 versions). Each pod builds its own in-memory cache from scratch — nothing is shared across pods, and the upstream cache-control: max-age=604800 (7 days) is not leveraged.

This creates a classic thundering-herd problem during Kubernetes rollouts, where many pods start simultaneously.

Impact — Back-of-Napkin Numbers

| Deployment Size | Pods | GVL Fetches/Pod | Requests per Rollout | Requests/Week (daily deploys) |
|-----------------|------|-----------------|----------------------|-------------------------------|
| Small           | 20   | ~369            | ~7,400               | ~52,000                       |
| Medium          | 50   | ~369            | ~18,500              | ~130,000                      |
| Large           | 100  | ~369            | ~36,900              | ~258,000                      |

With caching honored (7-day TTL), the same data would require only ~369 requests per week regardless of fleet size — a 99%+ reduction.
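The arithmetic behind the table is just pods × versions × deploys; a quick sketch in Go, using the figures above (the table rounds the exact products):

```go
package main

import "fmt"

func main() {
	const gvlVersions = 369  // ~223 TCF v2 + ~146 TCF v3 versions
	const deploysPerWeek = 7 // daily deploys

	for _, pods := range []int{20, 50, 100} {
		perRollout := pods * gvlVersions
		perWeek := perRollout * deploysPerWeek
		fmt.Printf("%3d pods: %6d requests/rollout, %7d requests/week\n",
			pods, perRollout, perWeek)
	}
}
```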

Observed Failures

  • Bursts of concurrent requests to consensu.org (fronted by CloudFront) trigger transient failures (HTTP errors, timeouts, rate-limiting).
  • Even one failure during the init window can leave a pod without a complete GVL, causing GDPR consent processing errors (see: "Cookie syncs may be affected" warnings in logs).
  • In Kubernetes, a pod that can't process consent correctly may fail health checks and restart — creating a cascading restart loop that amplifies the herd further.
  • Deployment reliability becomes coupled to an external third-party CDN's ability to absorb bursty traffic — a fragile dependency for production rollouts.

The data is highly cacheable

```console
$ curl -I https://vendor-list.consensu.org/v2/vendor-list.json
cache-control: max-age=604800
x-cache: Hit from cloudfront
```
  • Archived versions (e.g., v2/archives/vendor-list-v100.json) are immutable — they never change.
  • Only the "latest" endpoint updates, roughly weekly.
  • Yet today, every pod re-fetches all ~369 URLs from the origin on every startup.

What's Needed

A way to cache GVL data once for the cluster so that only the first requester fetches from origin, and all subsequent pods (and restarts) are served from a local, cluster-internal cache.


Technical Context

Current Implementation

In gdpr/vendorlist-fetching.go:

  • preloadCache() loops over TCF v2 and v3, fetching every archived version sequentially via saveOne().
  • saveOne() makes a plain http.Get — no retry, no backoff, no shared caching.
  • VendorListURLMaker() is hardcoded to https://vendor-list.consensu.org/... with no configuration to override the base URL.
  • The in-memory cache (sync.Map) is per-process only — lost on restart, not shared.
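The flow described above can be sketched as follows. This is a simplified, runnable illustration, not the actual prebid-server source — the function names match the bullets, but bodies and loop bounds are illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// Per-process cache, rebuilt from scratch on every pod start.
var cache sync.Map

// vendorListURL mirrors the hardcoded VendorListURLMaker: the base URL
// cannot be overridden via configuration.
func vendorListURL(tcfVersion, listVersion int) string {
	return fmt.Sprintf("https://vendor-list.consensu.org/v%d/archives/vendor-list-v%d.json",
		tcfVersion, listVersion)
}

// saveOne fetches a single version with a plain http.Get — no retry,
// no backoff, no shared caching.
func saveOne(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	cache.Store(url, resp) // the real code parses and stores the vendor list
	return nil
}

// preloadCache fetches every archived version sequentially — ~369 origin
// requests per pod at startup (loop bounds are illustrative).
func preloadCache(latestV2, latestV3 int) {
	for v := 1; v <= latestV2; v++ {
		_ = saveOne(vendorListURL(2, v)) // errors are only logged in practice
	}
	for v := 1; v <= latestV3; v++ {
		_ = saveOne(vendorListURL(3, v))
	}
}

func main() {
	// Not calling preloadCache here to avoid real network traffic;
	// just show the hardcoded URL shape.
	fmt.Println(vendorListURL(2, 100))
}
```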

Proposed Solution: Cooperative Caching via Prebid Cache

Prebid Cache is already deployed as a cluster-wide microservice alongside Prebid Server in most production setups. It has mature storage backends (Redis, Aerospike, Memcache, etc.) and is owned by the Prebid project.

Proposal: Add a GVL caching endpoint to Prebid Cache that:

  1. Exposes a GVL-compatible URL path (e.g., /gvl/v2/vendor-list.json, /gvl/v3/archives/vendor-list-v100.json) that Prebid Server can be configured to use instead of vendor-list.consensu.org.
  2. Fetches from origin on cache miss, stores in its backend (Redis, etc.), and serves subsequent requests from cache.
  3. Respects cache-control / TTL — archived versions cached indefinitely (immutable); latest version cached for up to 7 days per upstream headers (a library like pquerna/cachecontrol could help here).
  4. Deduplicates concurrent origin fetches (singleflight pattern) to prevent the cache itself from herding against the origin.

On the Prebid Server side:

  1. Make the GVL base URL configurable — e.g., a new config parameter gdpr.vendorlist_base_url (default: https://vendor-list.consensu.org) that can be pointed at the local Prebid Cache instance.
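In PBS's YAML configuration, that could look like the fragment below — hedged: `gdpr.vendorlist_base_url` is the parameter proposed above, not an existing Prebid Server option, and the service URL is a placeholder for the operator's Prebid Cache endpoint:

```yaml
# pbs.yaml — illustrative; vendorlist_base_url is the proposed new parameter.
gdpr:
  # Default would preserve today's behavior:
  # vendorlist_base_url: https://vendor-list.consensu.org
  # Point at the cluster-internal Prebid Cache GVL endpoint instead:
  vendorlist_base_url: http://prebid-cache.default.svc.cluster.local:2424/gvl
```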

Why Prebid Cache?

  • Already deployed: Most PBS operators already run Prebid Cache as a cluster-internal service — no new infrastructure needed.
  • Both are Prebid-owned: This is a natural cooperation between two projects in the same ecosystem.
  • Proven storage backends: Redis/Aerospike/Memcache already handle TTL-based caching at scale.
  • Simple integration: PBS only needs a configurable base URL; all the caching intelligence lives in Prebid Cache.

Result

| Scenario                        | External Requests to consensu.org |
|---------------------------------|-----------------------------------|
| Today (100 pods, daily deploys) | ~258,000/week                     |
| With cluster cache              | ~369/week (one cache fill)        |
| Reduction                       | ~99.9%                            |

This eliminates the Thundering Herd, decouples deployment reliability from a third-party CDN, and is architecturally clean — leveraging infrastructure that already exists in the Prebid ecosystem.

