
feat/gossip #5015

Merged: Flo4604 merged 30 commits into main from feat/gossip on Feb 16, 2026
Conversation

Flo4604 (Member) commented Feb 12, 2026

What does this PR do?

Adds a specific gossip implementation that would work for us - in theory.

We have 2 separate gossip memberlists: one for intra-cluster messages and one for cross-region messages.
The idea is to:

Have a single node act as the broadcaster that talks to other clusters. When we publish a message in us-east-1, one of our 3 nodes sends it to eu-central-1, and that node then distributes the message to its own local members.

That way we don't need every node to know about every other node, and only a single request per message pays the cross-globe latency cost.
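
In code terms, a minimal sketch of the two-pool idea using hashicorp/memberlist (seed addresses, the WAN port, and the isAmbassador helper are illustrative stand-ins, not this PR's actual pkg/cluster API):

```go
package main

import (
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Intra-cluster pool: every node in the region joins this one (port 7946).
	lan, err := memberlist.Create(memberlist.DefaultLANConfig())
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical headless-service seed; the real seeds come from CLI flags.
	if _, err := lan.Join([]string{"api-gossip-lan:7946"}); err != nil {
		log.Fatal(err)
	}

	// Cross-region pool: only the single elected node (the ambassador) joins,
	// so the other nodes never need to know about remote regions.
	if isAmbassador(lan) {
		cfg := memberlist.DefaultWANConfig()
		cfg.BindPort = 7947 // keep the WAN listener off the LAN port (assumption)
		cfg.AdvertisePort = 7947
		wan, err := memberlist.Create(cfg)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := wan.Join([]string{"gossip-wan.example.com:7947"}); err != nil {
			log.Fatal(err)
		}
		_ = wan // the ambassador relays messages between the two pools
	}
}

// isAmbassador elects the member with the smallest name; the real election
// logic lives in pkg/cluster and re-evaluates periodically.
func isAmbassador(ml *memberlist.Memberlist) bool {
	self := ml.LocalNode().Name
	for _, m := range ml.Members() {
		if m.Name < self {
			return false
		}
	}
	return true
}
```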

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Chore (refactoring code, technical debt, workflow improvements)
  • Enhancement (small improvements)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How should this be tested?

  • Test A
  • Test B

Checklist

Required

  • Filled out the "How to test" section in this PR
  • Read Contributing Guide
  • Self-reviewed my own code
  • Commented on my code in hard-to-understand areas
  • Ran pnpm build
  • Ran pnpm fmt
  • Ran make fmt on /go directory
  • Checked for warnings, there are none
  • Removed all console.logs
  • Merged the latest changes from main onto my branch with git pull origin main
  • My changes don't cause any responsiveness issues

Appreciated

  • If a UI change was made: Added a screen recording or screenshots to this PR
  • Updated the Unkey Docs if changes were necessary

vercel bot commented Feb 12, 2026

The latest updates on your projects:

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| dashboard | Ready | Preview, Comment | Feb 16, 2026 7:03pm |
| engineering | Ignored (1 skipped deployment) | Preview | Feb 16, 2026 7:03pm |

coderabbitai bot (Contributor) commented Feb 12, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

This pull request replaces Kafka-based distributed cache invalidation with a gossip-based cluster membership system using HashiCorp memberlist. Changes include introducing a new cluster package implementing two-tier LAN/WAN gossip with automatic ambassador election, updating cache clustering to use a Broadcaster interface, removing the eventstream infrastructure, and wiring gossip configuration across API, Frontline, and Sentinel services with corresponding CLI flags and Kubernetes manifests.
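
A hedged sketch of what that Broadcaster seam could look like; the method signatures and the cachev1 import path are assumptions, not the PR's exact code:

```go
import (
	"context"

	cachev1 "github.com/unkeyed/unkey/gen/proto/cache/v1" // assumed import path
)

// Broadcaster decouples cluster caches from the transport (gossip or none).
type Broadcaster interface {
	// Broadcast fans an invalidation event out to the rest of the cluster.
	Broadcast(ctx context.Context, ev *cachev1.CacheInvalidationEvent) error
	// Subscribe registers the handler invoked for events from other nodes.
	Subscribe(handler func(ctx context.Context, ev *cachev1.CacheInvalidationEvent) error)
	Close() error
}

// NoopBroadcaster satisfies the interface when gossip is disabled.
type NoopBroadcaster struct{}

func (NoopBroadcaster) Broadcast(context.Context, *cachev1.CacheInvalidationEvent) error {
	return nil
}
func (NoopBroadcaster) Subscribe(func(ctx context.Context, ev *cachev1.CacheInvalidationEvent) error) {
}
func (NoopBroadcaster) Close() error { return nil }
```

A no-op implementation like this is what lets the services run with gossip disabled without branching at every call site.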

Changes

Cohort / File(s) Summary
Infrastructure & Configuration Removal
.github/workflows/job_bazel.yaml, dev/docker-compose.yaml, Makefile, dev/Tiltfile
Removed Kafka from Docker Compose, CI workflows, and Makefile targets. Updated development environment to exclude Kafka container and dependencies.
Go Dependencies
go.mod, MODULE.bazel, tools/exportoneof/...
Replaced kafka-go with hashicorp/memberlist dependency. Added new exportoneof tool for proto code generation. Updated Bazel modules configuration.
Cluster Package Implementation
pkg/cluster/...
New 10-file cluster package implementing gossip-based two-tier membership (LAN/WAN) with SWIM protocol via memberlist. Includes bridge/ambassador election, DNS seed resolution, message multiplexing, and comprehensive tests.
Cache Clustering Updates
pkg/cache/clustering/broadcaster.go, broadcaster_gossip.go, broadcaster_noop.go, cluster_cache.go, dispatcher.go, gossip_e2e_test.go, BUILD.bazel
Introduced Broadcaster interface replacing eventstream-based invalidation. Added GossipBroadcaster for cluster-based propagation and NoopBroadcaster for disabled mode. Removed Kafka-backed tests and added gossip E2E tests.
Eventstream Package Removal
pkg/events/*, pkg/eventstream/*
Completely removed pub/sub Topic infrastructure, Producer/Consumer interfaces, and Kafka integration code. Deleted integration tests and no-op implementations.
API Service Integration
cmd/api/main.go, svc/api/config.go, svc/api/run.go, svc/api/BUILD.bazel
Replaced Kafka broker configuration with Gossip cluster flags (gossip-enabled, gossip-bind-addr, LAN/WAN ports and seeds, secret-key). Updated config struct and wiring logic.
Frontline Service Integration
cmd/frontline/main.go, svc/frontline/config.go, svc/frontline/run.go, svc/frontline/services/caches/...
Added Gossip configuration to CLI and service config. Updated cache service to use Broadcaster for distributed invalidation and NodeID for cluster identity.
Sentinel Service Integration
cmd/sentinel/main.go, svc/sentinel/config.go, svc/sentinel/run.go, svc/sentinel/services/router/...
Added Gossip cluster configuration and wiring. Updated router service with Broadcaster and NodeID fields for cache invalidation propagation.
Kubernetes Manifests
dev/k8s/manifests/api.yaml, dev/k8s/manifests/frontline.yaml, dev/k8s/manifests/cilium-policies.yaml
Added gossip LAN ports (7946 TCP/UDP) and environment variables to API and Frontline deployments. Created headless Services for gossip endpoints. Added CiliumNetworkPolicy rules for inter-pod gossip communication.
Proto Definitions
proto/cache/v1/invalidation.proto, proto/cluster/v1/envelope.proto
Refactored CacheInvalidationEvent with oneof action field supporting cache_key or clear_all. Created new ClusterMessage envelope with Direction enum and payload routing.
Build Configuration Updates
internal/services/caches/BUILD.bazel, svc/api/integration/cluster/cache/BUILD.bazel, svc/api/BUILD.bazel, svc/frontline/BUILD.bazel, svc/sentinel/BUILD.bazel, svc/sentinel/services/router/BUILD.bazel, pkg/cluster/BUILD.bazel, tools/exportoneof/BUILD.bazel
Added clustering and cluster dependencies across services. Removed eventstream and kafka-go dependencies. Narrowed test targets to new gossip-based implementations.
Kubernetes Controller Updates
svc/krane/internal/sentinel/apply.go, svc/krane/internal/sentinel/delete.go, svc/krane/internal/sentinel/controller.go, svc/krane/internal/sentinel/consts.go, svc/krane/pkg/labels/labels.go, svc/krane/run.go
Extended Sentinel K8s controller to manage gossip headless Services and CiliumNetworkPolicy resources. Added dynamic client integration and gossip LAN port constant. Added ComponentGossipLAN label method.
Test Removals & Refactoring
pkg/cache/clustering/consume_events_test.go, pkg/cache/clustering/e2e_test.go, pkg/cache/clustering/produce_events_test.go, svc/api/integration/cluster/cache/consume_events_test.go, svc/api/integration/cluster/cache/produce_events_test.go, pkg/eventstream/eventstream_integration_test.go
Removed all Kafka-backed integration tests. Replaced with new gossip E2E tests validating cross-node invalidation (Remove and Clear operations).
Integration Harness Updates
svc/api/integration/harness.go, svc/api/internal/testutil/http.go
Removed Docker-based Kafka orchestration. Updated caches config to use Broadcaster instead of CacheInvalidationTopic.
Documentation & Tooling
web/apps/engineering/content/docs/architecture/services/cluster-service.mdx, tools/exportoneof/main.go
Added comprehensive cluster architecture documentation. Introduced exportoneof code generation tool for proto oneof interface export.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Node1 as Node 1<br/>(API Instance)
    participant LAN1 as LAN Pool<br/>(memberlist)
    participant Node2 as Node 2<br/>(API Instance)
    participant LAN2 as LAN Pool<br/>(memberlist)
    participant WAN as WAN Pool<br/>(Ambassador)

    rect rgba(100, 150, 200, 0.5)
        Note over Node1,Node2: Same Region (LAN) Invalidation
        Node1->>LAN1: Broadcast(CacheInvalidation)
        LAN1->>Node2: NotifyMsg(ClusterMessage)
        Note over Node2: Deserialize & Apply<br/>Cache Invalidation
    end

    rect rgba(150, 100, 200, 0.5)
        Note over Node1,WAN: Inter-Region (WAN) Invalidation
        Node1->>LAN1: Broadcast(CacheInvalidation)
        LAN1->>WAN: Bridge relays to WAN<br/>(direction=DIRECTION_WAN)
        WAN->>LAN2: Ambassador notifies<br/>remote LAN pool
        LAN2->>Node2: NotifyMsg(ClusterMessage)
        Note over Node2: Deserialize & Apply<br/>Cache Invalidation
    end
```

```mermaid
sequenceDiagram
    participant App as Service Start
    participant Cluster as cluster.New()
    participant LAN as LAN Memberlist
    participant Seeds as LAN Seeds
    participant Bridge as Bridge Eval Loop
    participant WAN as WAN Memberlist
    participant WanSeeds as WAN Seeds

    App->>Cluster: New(cfg Config)
    activate Cluster
    Cluster->>LAN: Create with DefaultLANConfig
    Cluster->>LAN: Add Delegate & EventDelegate
    Cluster->>LAN: Create TransmitLimitedQueue
    Cluster->>Bridge: Start bridgeEvalLoop goroutine
    Cluster->>Seeds: joinSeeds(LANSeeds)
    activate Seeds
    Seeds->>LAN: Join with backoff/retry
    Seeds-->>Cluster: Success callback
    deactivate Seeds
    
    Note over Bridge: Periodic evaluation
    Bridge->>LAN: Get smallest member by name
    alt Is this node smallest?
        Bridge->>WAN: promoteToBridge
        activate WAN
        WAN->>WAN: Create with DefaultWANConfig
        WAN->>WAN: Add WAN delegate
        WAN->>WanSeeds: joinSeeds(WANSeeds)
        WanSeeds->>WAN: Join with backoff
        WAN-->>Bridge: Success
        deactivate WAN
    else Is not smallest
        Bridge->>WAN: demoteFromBridge (if currently bridge)
    end
    
    Cluster-->>App: Return Cluster instance
    deactivate Cluster
```
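
To make the two flows concrete, a rough sketch of routing on the envelope's Direction; only ClusterMessage and DIRECTION_WAN appear in this PR, so the type layout, import path, and helpers below are assumptions:

```go
import (
	"sync/atomic"

	clusterv1 "github.com/unkeyed/unkey/gen/proto/cluster/v1" // assumed import path
)

type Cluster struct {
	bridge atomic.Bool // flipped by the bridge election loop
	// ... LAN/WAN memberlists, delegates, queues ...
}

// handleEnvelope would run on the delegate's NotifyMsg path after decoding
// the ClusterMessage envelope.
func (c *Cluster) handleEnvelope(msg *clusterv1.ClusterMessage) {
	c.applyLocally(msg) // every receiver applies the payload, e.g. an invalidation

	// Only the elected bridge forwards WAN-directed messages; the remote
	// ambassador then re-broadcasts them into its own LAN pool.
	if c.bridge.Load() && msg.GetDirection() == clusterv1.Direction_DIRECTION_WAN {
		c.relayToWAN(msg)
	}
}

func (c *Cluster) applyLocally(*clusterv1.ClusterMessage) { /* dispatch to caches */ }
func (c *Cluster) relayToWAN(*clusterv1.ClusterMessage)   { /* broadcast on WAN pool */ }
```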

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Description check (⚠️ Warning): The PR description provides context on the gossip implementation and architectural goal but lacks critical information: testing steps, checklist items, and issue references are all missing or unchecked, failing to meet template requirements. Resolution: complete the PR template, reference a tracking issue, provide concrete testing steps, run all required checks (fmt, build, etc.), and check off template items before merging.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 45.65%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title "feat/gossip" is vague and generic. While it indicates a feature related to gossip, it does not clearly convey the primary change (replacing Kafka-based cache invalidation with a two-tier gossip cluster for distributed cache invalidation). Resolution: use a more descriptive title such as "Replace Kafka-based cache invalidation with gossip cluster" to clearly summarize the main architectural change.
✅ Passed checks (1 passed)
  • Merge Conflict Detection (✅ Passed): No merge conflicts detected when merging into main.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@internal/services/caches/caches.go`:
- Around line 168-249: The dispatcher created in New() must be closed when any
subsequent cache creation fails to avoid leaking resources: after the dispatcher
is successfully created (variable name dispatcher in New), ensure you call
dispatcher.Close() on every early return that follows (e.g., every "return
Caches{}, err" that occurs after calls to createCache such as when building
ratelimitNamespace, verificationKeyByHash, liveApiByID, clickhouseSetting,
keyAuthToApiRow, apiToKeyAuthRow, etc.), or preferably add a deferred cleanup
like "defer func(){ if !initialized { dispatcher.Close() } }()" immediately
after creating dispatcher and set initialized=true only on the final successful
return; update all error paths accordingly so dispatcher.Close() runs on
failure.
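
A minimal sketch of the suggested `initialized` guard; New, Caches, and NewInvalidationDispatcher come from these comments, everything else is illustrative:

```go
func New(config Config) (Caches, error) {
	dispatcher, err := clustering.NewInvalidationDispatcher(config.Broadcaster)
	if err != nil {
		return Caches{}, err
	}

	initialized := false
	defer func() {
		if !initialized {
			_ = dispatcher.Close() // runs on every failed early return below
		}
	}()

	// ... the existing createCache calls; each `return Caches{}, err`
	// now releases the dispatcher automatically ...

	initialized = true
	return Caches{ /* ... */ }, nil
}
```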

In `@pkg/cache/clustering/broadcaster_gossip.go`:
- Around line 60-63: GossipBroadcaster.Close currently forwards to
b.cluster.Close but ownership is ambiguous and can result in double-close;
modify GossipBroadcaster to make Close idempotent by adding a sync.Once (or
equivalent boolean + mutex) on the GossipBroadcaster struct and invoke
b.cluster.Close inside that Once, or clearly transfer/document ownership so only
one caller closes the cluster (e.g., remove cluster.Close from
GossipBroadcaster.Close if run.go defers closing); update the Close method on
GossipBroadcaster to use the Once/guard and ensure subsequent Close calls return
nil (or the original error) without calling cluster.Close again.
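
A minimal sketch of the sync.Once approach; the cluster.Cluster type name and struct fields beyond `cluster` are assumptions:

```go
import (
	"sync"

	"github.com/unkeyed/unkey/pkg/cluster" // assumed import path
)

// Close becomes idempotent: the cluster is closed exactly once, and later
// calls return the original result without touching the cluster again.
type GossipBroadcaster struct {
	cluster   *cluster.Cluster
	closeOnce sync.Once
	closeErr  error
}

func (b *GossipBroadcaster) Close() error {
	b.closeOnce.Do(func() {
		b.closeErr = b.cluster.Close()
	})
	return b.closeErr
}
```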

In `@svc/frontline/services/caches/caches.go` (same dispatcher-leak pattern as above):
- Around line 104-160: When
clustering.NewInvalidationDispatcher(config.Broadcaster) succeeds but a
subsequent createCache call fails, the dispatcher is leaked; update the New()
path to call dispatcher.Close() (or dispatcher.Close(context?) depending on its
API) before each early return after dispatcher initialization (i.e., before each
fmt.Errorf return after createCache for frontlineRoute, sentinelsByEnvironment,
tlsCertificate). Guard the Close call with a nil check on dispatcher and ensure
you preserve the original returned error; do the same for any other early
returns in this function after dispatcher was set.
🧹 Nitpick comments (6)
svc/krane/internal/sentinel/apply.go (2)

392-446: Multiple gossip services with identical selectors per environment.

Each sentinel creates its own gossip service (<k8sName>-gossip-lan) but the selector matches ALL sentinels in the environment via EnvironmentID + ComponentSentinel. This means multiple headless services will resolve to the same set of pods.

While this works (DNS will resolve any of them to the same pod IPs), it creates redundant services. Consider either:

  1. Use a single environment-scoped gossip service name (idempotent across sentinels)
  2. Keep per-sentinel services but scope the selector to that sentinel

This isn't blocking since it functions correctly, but adds unnecessary resources.


448-524: Same redundancy applies to CiliumNetworkPolicy.

Similar to the gossip service, each sentinel creates its own policy with the same environment-scoped selector. Multiple policies with identical selectors are functionally equivalent but redundant.

pkg/cache/clustering/gossip_e2e_test.go (1)

54-55: Magic sleep may be fragile.

The 50ms sleep before node 2 creation appears to be a timing workaround. Consider documenting why this is needed or using a more deterministic approach (e.g., waiting for node 1 to be ready to accept connections).
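
For example, a hypothetical polling helper built on memberlist's real NumMembers() would make the test deterministic; the helper name and timings are illustrative:

```go
import (
	"testing"
	"time"

	"github.com/hashicorp/memberlist"
)

// waitForMembers polls until ml sees the expected cluster size, replacing
// the fixed 50ms sleep with an explicit readiness condition.
func waitForMembers(t *testing.T, ml *memberlist.Memberlist, want int, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if ml.NumMembers() >= want {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatalf("expected %d members, saw %d", want, ml.NumMembers())
}
```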

dev/k8s/manifests/api.yaml (1)

78-84: Consider adding UNKEY_GOSSIP_BIND_ADDR.

Gossip enabled but bind address not specified. If the default (likely 0.0.0.0 or pod IP) is intentional, this is fine, but explicit config aids clarity.

svc/sentinel/services/router/service.go (1)

45-82: Consider extracting clusterOpts and createCache to a shared package.

This pattern is duplicated in svc/frontline/services/caches/caches.go. Could be a shared helper in pkg/cache/clustering.

pkg/cache/clustering/broadcaster_gossip.go (1)

31-39: Handler invocation uses context.Background() instead of propagating context.

The handler signature accepts a context, but HandleCacheInvalidation always passes context.Background(). Consider storing the subscription context or accepting context as a parameter if cancellation/deadline propagation is needed.
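
One possible shape, assuming Subscribe can take the caller's context; the field names and signatures here are made up for illustration:

```go
import (
	"context"

	cachev1 "github.com/unkeyed/unkey/gen/proto/cache/v1" // assumed import path
)

type GossipBroadcaster struct {
	subCtx  context.Context // captured at Subscribe time (assumption)
	handler func(context.Context, *cachev1.CacheInvalidationEvent) error
}

func (b *GossipBroadcaster) Subscribe(ctx context.Context, handler func(context.Context, *cachev1.CacheInvalidationEvent) error) {
	b.subCtx = ctx
	b.handler = handler
}

func (b *GossipBroadcaster) HandleCacheInvalidation(ev *cachev1.CacheInvalidationEvent) {
	if b.handler == nil {
		return
	}
	_ = b.handler(b.subCtx, ev) // previously context.Background()
}
```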

chronark (Collaborator) left a comment

maybe reorder the proto fields, but it's not super important

Flo4604 merged commit c5bef06 into main on Feb 16, 2026, with 11 of 12 checks passed.
Flo4604 deleted the feat/gossip branch on February 16, 2026 at 19:08.
