
GVL Startup Fetch Creates Thundering Herd at Scale — Need Cluster-Wide Caching #4687

@scr-oath


Problem

Every Prebid Server pod fetches the entire Global Vendor List history from vendor-list.consensu.org at startup. As of today, that's ~369 sequential HTTP requests per pod (TCF v2: 223 versions, TCF v3: 146 versions). Each pod builds its own in-memory cache from scratch — nothing is shared across pods, and the upstream cache-control: max-age=604800 (7 days) is not leveraged.

This creates a classic thundering-herd problem during Kubernetes rollouts, where many pods start simultaneously.

Impact — Back-of-Napkin Numbers

| Deployment Size | Pods | GVL Fetches/Pod | Requests per Rollout | Requests/Week (daily deploys) |
|-----------------|------|-----------------|----------------------|-------------------------------|
| Small           | 20   | ~369            | ~7,400               | ~52,000                       |
| Medium          | 50   | ~369            | ~18,500              | ~130,000                      |
| Large           | 100  | ~369            | ~36,900              | ~258,000                      |

With caching honored (7-day TTL), the same data would require only ~369 requests per week regardless of fleet size — a 99%+ reduction.
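The arithmetic behind the table is just pods × versions × deploys; a quick sketch in Go, using the figures above (the table rounds the exact products):

```go
package main

import "fmt"

func main() {
	const gvlVersions = 369  // ~223 TCF v2 + ~146 TCF v3 versions
	const deploysPerWeek = 7 // daily deploys

	for _, pods := range []int{20, 50, 100} {
		perRollout := pods * gvlVersions
		perWeek := perRollout * deploysPerWeek
		fmt.Printf("%3d pods: %6d requests/rollout, %7d requests/week\n",
			pods, perRollout, perWeek)
	}
}
```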

Observed Failures

  • Bursts of concurrent requests to consensu.org (fronted by CloudFront) trigger transient failures (HTTP errors, timeouts, rate-limiting).
  • Even one failure during the init window can leave a pod without a complete GVL, causing GDPR consent processing errors (see: "Cookie syncs may be affected" warnings in logs).
  • In Kubernetes, a pod that can't process consent correctly may fail health checks and restart — creating a cascading restart loop that amplifies the herd further.
  • Deployment reliability becomes coupled to an external third-party CDN's ability to absorb bursty traffic — a fragile dependency for production rollouts.

The data is highly cacheable

```console
$ curl -I https://vendor-list.consensu.org/v2/vendor-list.json
cache-control: max-age=604800
x-cache: Hit from cloudfront
```
  • Archived versions (e.g., v2/archives/vendor-list-v100.json) are immutable — they never change.
  • Only the "latest" endpoint updates, roughly weekly.
  • Yet today, every pod re-fetches all ~369 URLs from the origin on every startup.

What's Needed

A way to cache GVL data once for the cluster so that only the first requester fetches from origin, and all subsequent pods (and restarts) are served from a local, cluster-internal cache.


Technical Context

Current Implementation

In gdpr/vendorlist-fetching.go:

  • preloadCache() loops over TCF v2 and v3, fetching every archived version sequentially via saveOne().
  • saveOne() makes a plain http.Get — no retry, no backoff, no shared caching.
  • VendorListURLMaker() is hardcoded to https://vendor-list.consensu.org/... with no configuration to override the base URL.
  • The in-memory cache (sync.Map) is per-process only — lost on restart, not shared.
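The flow described above can be sketched as follows. This is a simplified, runnable illustration, not the actual prebid-server source — the function names match the bullets, but bodies and loop bounds are illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// Per-process cache, rebuilt from scratch on every pod start.
var cache sync.Map

// vendorListURL mirrors the hardcoded VendorListURLMaker: the base URL
// cannot be overridden via configuration.
func vendorListURL(tcfVersion, listVersion int) string {
	return fmt.Sprintf("https://vendor-list.consensu.org/v%d/archives/vendor-list-v%d.json",
		tcfVersion, listVersion)
}

// saveOne fetches a single version with a plain http.Get — no retry,
// no backoff, no shared caching.
func saveOne(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	cache.Store(url, resp) // the real code parses and stores the vendor list
	return nil
}

// preloadCache fetches every archived version sequentially — ~369 origin
// requests per pod at startup (loop bounds are illustrative).
func preloadCache(latestV2, latestV3 int) {
	for v := 1; v <= latestV2; v++ {
		_ = saveOne(vendorListURL(2, v)) // errors are only logged in practice
	}
	for v := 1; v <= latestV3; v++ {
		_ = saveOne(vendorListURL(3, v))
	}
}

func main() {
	// Not calling preloadCache here to avoid real network traffic;
	// just show the hardcoded URL shape.
	fmt.Println(vendorListURL(2, 100))
}
```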

Proposed Solution: Cooperative Caching via Prebid Cache

Prebid Cache is already deployed as a cluster-wide microservice alongside Prebid Server in most production setups. It has mature storage backends (Redis, Aerospike, Memcache, etc.) and is owned by the Prebid project.

Proposal: Add a GVL caching endpoint to Prebid Cache that:

  1. Exposes a GVL-compatible URL path (e.g., /gvl/v2/vendor-list.json, /gvl/v3/archives/vendor-list-v100.json) that Prebid Server can be configured to use instead of vendor-list.consensu.org.
  2. Fetches from origin on cache miss, stores in its backend (Redis, etc.), and serves subsequent requests from cache.
  3. Respects cache-control / TTL — archived versions cached indefinitely (immutable); latest version cached for up to 7 days per upstream headers (a library like pquerna/cachecontrol could help here).
  4. Deduplicates concurrent origin fetches (singleflight pattern) to prevent the cache itself from herding against the origin.

On the Prebid Server side:

  1. Make the GVL base URL configurable — e.g., a new config parameter gdpr.vendorlist_base_url (default: https://vendor-list.consensu.org) that can be pointed at the local Prebid Cache instance.
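In PBS's YAML configuration, that could look like the fragment below — hedged: `gdpr.vendorlist_base_url` is the parameter proposed above, not an existing Prebid Server option, and the service URL is a placeholder for the operator's Prebid Cache endpoint:

```yaml
# pbs.yaml — illustrative; vendorlist_base_url is the proposed new parameter.
gdpr:
  # Default would preserve today's behavior:
  # vendorlist_base_url: https://vendor-list.consensu.org
  # Point at the cluster-internal Prebid Cache GVL endpoint instead:
  vendorlist_base_url: http://prebid-cache.default.svc.cluster.local:2424/gvl
```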

Why Prebid Cache?

  • Already deployed: Most PBS operators already run Prebid Cache as a cluster-internal service — no new infrastructure needed.
  • Both are Prebid-owned: This is a natural cooperation between two projects in the same ecosystem.
  • Proven storage backends: Redis/Aerospike/Memcache already handle TTL-based caching at scale.
  • Simple integration: PBS only needs a configurable base URL; all the caching intelligence lives in Prebid Cache.

Result

| Scenario                        | External Requests to consensu.org |
|---------------------------------|-----------------------------------|
| Today (100 pods, daily deploys) | ~258,000/week                     |
| With cluster cache              | ~369/week (one cache fill)        |
| Reduction                       | ~99.9%                            |

This eliminates the Thundering Herd, decouples deployment reliability from a third-party CDN, and is architecturally clean — leveraging infrastructure that already exists in the Prebid ecosystem.

