
feat/gossip #5015

Merged: Flo4604 merged 30 commits into main from feat/gossip on Feb 16, 2026
Conversation

Flo4604 (Member) commented Feb 12, 2026

What does this PR do?

Adds a specific gossip implementation that would work for us - in theory.

We have 2 separate gossip memberlists: one for intra-cluster messages and one for cross-region messages.
The idea is to:

Have a single node act as the broadcaster that talks to other clusters. When we publish a message in us-east-1, one of our 3 nodes sends it to eu-central-1, and that node then distributes the message to its own local members.

That way we don't need every node to know about every other node, and only a single request per message pays the cross-globe latency cost.
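
In code terms, a minimal sketch of the two-pool idea using hashicorp/memberlist (seed addresses, the WAN port, and the isAmbassador helper are illustrative stand-ins, not this PR's actual pkg/cluster API):

```go
package main

import (
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Intra-cluster pool: every node in the region joins this one (port 7946).
	lan, err := memberlist.Create(memberlist.DefaultLANConfig())
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical headless-service seed; the real seeds come from CLI flags.
	if _, err := lan.Join([]string{"api-gossip-lan:7946"}); err != nil {
		log.Fatal(err)
	}

	// Cross-region pool: only the single elected node (the ambassador) joins,
	// so the other nodes never need to know about remote regions.
	if isAmbassador(lan) {
		cfg := memberlist.DefaultWANConfig()
		cfg.BindPort = 7947 // keep the WAN listener off the LAN port (assumption)
		cfg.AdvertisePort = 7947
		wan, err := memberlist.Create(cfg)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := wan.Join([]string{"gossip-wan.example.com:7947"}); err != nil {
			log.Fatal(err)
		}
		_ = wan // the ambassador relays messages between the two pools
	}
}

// isAmbassador elects the member with the smallest name; the real election
// logic lives in pkg/cluster and re-evaluates periodically.
func isAmbassador(ml *memberlist.Memberlist) bool {
	self := ml.LocalNode().Name
	for _, m := range ml.Members() {
		if m.Name < self {
			return false
		}
	}
	return true
}
```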

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Chore (refactoring code, technical debt, workflow improvements)
  • Enhancement (small improvements)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How should this be tested?

  • Test A
  • Test B

Checklist

Required

  • Filled out the "How to test" section in this PR
  • Read Contributing Guide
  • Self-reviewed my own code
  • Commented on my code in hard-to-understand areas
  • Ran pnpm build
  • Ran pnpm fmt
  • Ran make fmt on /go directory
  • Checked for warnings, there are none
  • Removed all console.logs
  • Merged the latest changes from main onto my branch with git pull origin main
  • My changes don't cause any responsiveness issues

Appreciated

  • If a UI change was made: Added a screen recording or screenshots to this PR
  • Updated the Unkey Docs if changes were necessary

vercel bot commented Feb 12, 2026

The latest updates on your projects:

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| dashboard | Ready | Preview, Comment | Feb 16, 2026 7:03pm |
| engineering | Ignored (1 skipped deployment) | Preview | Feb 16, 2026 7:03pm |

coderabbitai bot (Contributor) commented Feb 12, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

This pull request replaces Kafka-based distributed cache invalidation with a gossip-based cluster membership system using HashiCorp memberlist. Changes include introducing a new cluster package implementing two-tier LAN/WAN gossip with automatic ambassador election, updating cache clustering to use a Broadcaster interface, removing the eventstream infrastructure, and wiring gossip configuration across API, Frontline, and Sentinel services with corresponding CLI flags and Kubernetes manifests.
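
A hedged sketch of what that Broadcaster seam could look like; the method signatures and the cachev1 import path are assumptions, not the PR's exact code:

```go
import (
	"context"

	cachev1 "github.com/unkeyed/unkey/gen/proto/cache/v1" // assumed import path
)

// Broadcaster decouples cluster caches from the transport (gossip or none).
type Broadcaster interface {
	// Broadcast fans an invalidation event out to the rest of the cluster.
	Broadcast(ctx context.Context, ev *cachev1.CacheInvalidationEvent) error
	// Subscribe registers the handler invoked for events from other nodes.
	Subscribe(handler func(ctx context.Context, ev *cachev1.CacheInvalidationEvent) error)
	Close() error
}

// NoopBroadcaster satisfies the interface when gossip is disabled.
type NoopBroadcaster struct{}

func (NoopBroadcaster) Broadcast(context.Context, *cachev1.CacheInvalidationEvent) error {
	return nil
}
func (NoopBroadcaster) Subscribe(func(ctx context.Context, ev *cachev1.CacheInvalidationEvent) error) {
}
func (NoopBroadcaster) Close() error { return nil }
```

A no-op implementation like this is what lets the services run with gossip disabled without branching at every call site.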

Changes

Cohort / File(s) Summary
Infrastructure & Configuration Removal
.github/workflows/job_bazel.yaml, dev/docker-compose.yaml, Makefile, dev/Tiltfile
Removed Kafka from Docker Compose, CI workflows, and Makefile targets. Updated development environment to exclude Kafka container and dependencies.
Go Dependencies
go.mod, MODULE.bazel, tools/exportoneof/...
Replaced kafka-go with hashicorp/memberlist dependency. Added new exportoneof tool for proto code generation. Updated Bazel modules configuration.
Cluster Package Implementation
pkg/cluster/...
New 10-file cluster package implementing gossip-based two-tier membership (LAN/WAN) with SWIM protocol via memberlist. Includes bridge/ambassador election, DNS seed resolution, message multiplexing, and comprehensive tests.
Cache Clustering Updates
pkg/cache/clustering/broadcaster.go, broadcaster_gossip.go, broadcaster_noop.go, cluster_cache.go, dispatcher.go, gossip_e2e_test.go, BUILD.bazel
Introduced Broadcaster interface replacing eventstream-based invalidation. Added GossipBroadcaster for cluster-based propagation and NoopBroadcaster for disabled mode. Removed Kafka-backed tests and added gossip E2E tests.
Eventstream Package Removal
pkg/events/*, pkg/eventstream/*
Completely removed pub/sub Topic infrastructure, Producer/Consumer interfaces, and Kafka integration code. Deleted integration tests and no-op implementations.
API Service Integration
cmd/api/main.go, svc/api/config.go, svc/api/run.go, svc/api/BUILD.bazel
Replaced Kafka broker configuration with Gossip cluster flags (gossip-enabled, gossip-bind-addr, LAN/WAN ports and seeds, secret-key). Updated config struct and wiring logic.
Frontline Service Integration
cmd/frontline/main.go, svc/frontline/config.go, svc/frontline/run.go, svc/frontline/services/caches/...
Added Gossip configuration to CLI and service config. Updated cache service to use Broadcaster for distributed invalidation and NodeID for cluster identity.
Sentinel Service Integration
cmd/sentinel/main.go, svc/sentinel/config.go, svc/sentinel/run.go, svc/sentinel/services/router/...
Added Gossip cluster configuration and wiring. Updated router service with Broadcaster and NodeID fields for cache invalidation propagation.
Kubernetes Manifests
dev/k8s/manifests/api.yaml, dev/k8s/manifests/frontline.yaml, dev/k8s/manifests/cilium-policies.yaml
Added gossip LAN ports (7946 TCP/UDP) and environment variables to API and Frontline deployments. Created headless Services for gossip endpoints. Added CiliumNetworkPolicy rules for inter-pod gossip communication.
Proto Definitions
proto/cache/v1/invalidation.proto, proto/cluster/v1/envelope.proto
Refactored CacheInvalidationEvent with oneof action field supporting cache_key or clear_all. Created new ClusterMessage envelope with Direction enum and payload routing.
Build Configuration Updates
internal/services/caches/BUILD.bazel, svc/api/integration/cluster/cache/BUILD.bazel, svc/api/BUILD.bazel, svc/frontline/BUILD.bazel, svc/sentinel/BUILD.bazel, svc/sentinel/services/router/BUILD.bazel, pkg/cluster/BUILD.bazel, tools/exportoneof/BUILD.bazel
Added clustering and cluster dependencies across services. Removed eventstream and kafka-go dependencies. Narrowed test targets to new gossip-based implementations.
Kubernetes Controller Updates
svc/krane/internal/sentinel/apply.go, svc/krane/internal/sentinel/delete.go, svc/krane/internal/sentinel/controller.go, svc/krane/internal/sentinel/consts.go, svc/krane/pkg/labels/labels.go, svc/krane/run.go
Extended Sentinel K8s controller to manage gossip headless Services and CiliumNetworkPolicy resources. Added dynamic client integration and gossip LAN port constant. Added ComponentGossipLAN label method.
Test Removals & Refactoring
pkg/cache/clustering/consume_events_test.go, pkg/cache/clustering/e2e_test.go, pkg/cache/clustering/produce_events_test.go, svc/api/integration/cluster/cache/consume_events_test.go, svc/api/integration/cluster/cache/produce_events_test.go, pkg/eventstream/eventstream_integration_test.go
Removed all Kafka-backed integration tests. Replaced with new gossip E2E tests validating cross-node invalidation (Remove and Clear operations).
Integration Harness Updates
svc/api/integration/harness.go, svc/api/internal/testutil/http.go
Removed Docker-based Kafka orchestration. Updated caches config to use Broadcaster instead of CacheInvalidationTopic.
Documentation & Tooling
web/apps/engineering/content/docs/architecture/services/cluster-service.mdx, tools/exportoneof/main.go
Added comprehensive cluster architecture documentation. Introduced exportoneof code generation tool for proto oneof interface export.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Node1 as Node 1<br/>(API Instance)
    participant LAN1 as LAN Pool<br/>(memberlist)
    participant Node2 as Node 2<br/>(API Instance)
    participant LAN2 as LAN Pool<br/>(memberlist)
    participant WAN as WAN Pool<br/>(Ambassador)

    rect rgba(100, 150, 200, 0.5)
        Note over Node1,Node2: Same Region (LAN) Invalidation
        Node1->>LAN1: Broadcast(CacheInvalidation)
        LAN1->>Node2: NotifyMsg(ClusterMessage)
        Note over Node2: Deserialize & Apply<br/>Cache Invalidation
    end

    rect rgba(150, 100, 200, 0.5)
        Note over Node1,WAN: Inter-Region (WAN) Invalidation
        Node1->>LAN1: Broadcast(CacheInvalidation)
        LAN1->>WAN: Bridge relays to WAN<br/>(direction=DIRECTION_WAN)
        WAN->>LAN2: Ambassador notifies<br/>remote LAN pool
        LAN2->>Node2: NotifyMsg(ClusterMessage)
        Note over Node2: Deserialize & Apply<br/>Cache Invalidation
    end
```

```mermaid
sequenceDiagram
    participant App as Service Start
    participant Cluster as cluster.New()
    participant LAN as LAN Memberlist
    participant Seeds as LAN Seeds
    participant Bridge as Bridge Eval Loop
    participant WAN as WAN Memberlist
    participant WanSeeds as WAN Seeds

    App->>Cluster: New(cfg Config)
    activate Cluster
    Cluster->>LAN: Create with DefaultLANConfig
    Cluster->>LAN: Add Delegate & EventDelegate
    Cluster->>LAN: Create TransmitLimitedQueue
    Cluster->>Bridge: Start bridgeEvalLoop goroutine
    Cluster->>Seeds: joinSeeds(LANSeeds)
    activate Seeds
    Seeds->>LAN: Join with backoff/retry
    Seeds-->>Cluster: Success callback
    deactivate Seeds
    
    Note over Bridge: Periodic evaluation
    Bridge->>LAN: Get smallest member by name
    alt Is this node smallest?
        Bridge->>WAN: promoteToBridge
        activate WAN
        WAN->>WAN: Create with DefaultWANConfig
        WAN->>WAN: Add WAN delegate
        WAN->>WanSeeds: joinSeeds(WANSeeds)
        WanSeeds->>WAN: Join with backoff
        WAN-->>Bridge: Success
        deactivate WAN
    else Is not smallest
        Bridge->>WAN: demoteFromBridge (if currently bridge)
    end
    
    Cluster-->>App: Return Cluster instance
    deactivate Cluster
```
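
To make the two flows concrete, a rough sketch of routing on the envelope's Direction; only ClusterMessage and DIRECTION_WAN appear in this PR, so the type layout, import path, and helpers below are assumptions:

```go
import (
	"sync/atomic"

	clusterv1 "github.com/unkeyed/unkey/gen/proto/cluster/v1" // assumed import path
)

type Cluster struct {
	bridge atomic.Bool // flipped by the bridge election loop
	// ... LAN/WAN memberlists, delegates, queues ...
}

// handleEnvelope would run on the delegate's NotifyMsg path after decoding
// the ClusterMessage envelope.
func (c *Cluster) handleEnvelope(msg *clusterv1.ClusterMessage) {
	c.applyLocally(msg) // every receiver applies the payload, e.g. an invalidation

	// Only the elected bridge forwards WAN-directed messages; the remote
	// ambassador then re-broadcasts them into its own LAN pool.
	if c.bridge.Load() && msg.GetDirection() == clusterv1.Direction_DIRECTION_WAN {
		c.relayToWAN(msg)
	}
}

func (c *Cluster) applyLocally(*clusterv1.ClusterMessage) { /* dispatch to caches */ }
func (c *Cluster) relayToWAN(*clusterv1.ClusterMessage)   { /* broadcast on WAN pool */ }
```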

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Description check (⚠️ Warning): The PR description provides context on the gossip implementation and architectural goal but lacks critical information: testing steps, checklist items, and issue references are all missing or unchecked, failing to meet template requirements. Resolution: complete the PR template, reference a tracking issue, provide concrete testing steps, run all required checks (fmt, build, etc.), and check off template items before merging.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 45.65%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title "feat/gossip" is vague and generic. While it indicates a feature related to gossip, it does not clearly convey the primary change (replacing Kafka-based cache invalidation with a two-tier gossip cluster for distributed cache invalidation). Resolution: use a more descriptive title such as "Replace Kafka-based cache invalidation with gossip cluster" to clearly summarize the main architectural change.
✅ Passed checks (1 passed)
  • Merge Conflict Detection (✅ Passed): No merge conflicts detected when merging into main.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@internal/services/caches/caches.go`:
- Around line 168-249: The dispatcher created in New() must be closed when any
subsequent cache creation fails to avoid leaking resources: after the dispatcher
is successfully created (variable name dispatcher in New), ensure you call
dispatcher.Close() on every early return that follows (e.g., every "return
Caches{}, err" that occurs after calls to createCache such as when building
ratelimitNamespace, verificationKeyByHash, liveApiByID, clickhouseSetting,
keyAuthToApiRow, apiToKeyAuthRow, etc.), or preferably add a deferred cleanup
like "defer func(){ if !initialized { dispatcher.Close() } }()" immediately
after creating dispatcher and set initialized=true only on the final successful
return; update all error paths accordingly so dispatcher.Close() runs on
failure.
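
A minimal sketch of the suggested `initialized` guard; New, Caches, and NewInvalidationDispatcher come from these comments, everything else is illustrative:

```go
func New(config Config) (Caches, error) {
	dispatcher, err := clustering.NewInvalidationDispatcher(config.Broadcaster)
	if err != nil {
		return Caches{}, err
	}

	initialized := false
	defer func() {
		if !initialized {
			_ = dispatcher.Close() // runs on every failed early return below
		}
	}()

	// ... the existing createCache calls; each `return Caches{}, err`
	// now releases the dispatcher automatically ...

	initialized = true
	return Caches{ /* ... */ }, nil
}
```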

In `@pkg/cache/clustering/broadcaster_gossip.go`:
- Around line 60-63: GossipBroadcaster.Close currently forwards to
b.cluster.Close but ownership is ambiguous and can result in double-close;
modify GossipBroadcaster to make Close idempotent by adding a sync.Once (or
equivalent boolean + mutex) on the GossipBroadcaster struct and invoke
b.cluster.Close inside that Once, or clearly transfer/document ownership so only
one caller closes the cluster (e.g., remove cluster.Close from
GossipBroadcaster.Close if run.go defers closing); update the Close method on
GossipBroadcaster to use the Once/guard and ensure subsequent Close calls return
nil (or the original error) without calling cluster.Close again.
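
A minimal sketch of the sync.Once approach; the cluster.Cluster type name and struct fields beyond `cluster` are assumptions:

```go
import (
	"sync"

	"github.com/unkeyed/unkey/pkg/cluster" // assumed import path
)

// Close becomes idempotent: the cluster is closed exactly once, and later
// calls return the original result without touching the cluster again.
type GossipBroadcaster struct {
	cluster   *cluster.Cluster
	closeOnce sync.Once
	closeErr  error
}

func (b *GossipBroadcaster) Close() error {
	b.closeOnce.Do(func() {
		b.closeErr = b.cluster.Close()
	})
	return b.closeErr
}
```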

In `@svc/frontline/services/caches/caches.go` (same dispatcher-leak pattern as above):
- Around line 104-160: When
clustering.NewInvalidationDispatcher(config.Broadcaster) succeeds but a
subsequent createCache call fails, the dispatcher is leaked; update the New()
path to call dispatcher.Close() (or dispatcher.Close(context?) depending on its
API) before each early return after dispatcher initialization (i.e., before each
fmt.Errorf return after createCache for frontlineRoute, sentinelsByEnvironment,
tlsCertificate). Guard the Close call with a nil check on dispatcher and ensure
you preserve the original returned error; do the same for any other early
returns in this function after dispatcher was set.
🧹 Nitpick comments (6)
svc/krane/internal/sentinel/apply.go (2)

392-446: Multiple gossip services with identical selectors per environment.

Each sentinel creates its own gossip service (<k8sName>-gossip-lan) but the selector matches ALL sentinels in the environment via EnvironmentID + ComponentSentinel. This means multiple headless services will resolve to the same set of pods.

While this works (DNS will resolve any of them to the same pod IPs), it creates redundant services. Consider either:

  1. Use a single environment-scoped gossip service name (idempotent across sentinels)
  2. Keep per-sentinel services but scope the selector to that sentinel

This isn't blocking since it functions correctly, but adds unnecessary resources.


448-524: Same redundancy applies to CiliumNetworkPolicy.

Similar to the gossip service, each sentinel creates its own policy with the same environment-scoped selector. Multiple policies with identical selectors are functionally equivalent but redundant.

pkg/cache/clustering/gossip_e2e_test.go (1)

54-55: Magic sleep may be fragile.

The 50ms sleep before node 2 creation appears to be a timing workaround. Consider documenting why this is needed or using a more deterministic approach (e.g., waiting for node 1 to be ready to accept connections).
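
For example, a hypothetical polling helper built on memberlist's real NumMembers() would make the test deterministic; the helper name and timings are illustrative:

```go
import (
	"testing"
	"time"

	"github.com/hashicorp/memberlist"
)

// waitForMembers polls until ml sees the expected cluster size, replacing
// the fixed 50ms sleep with an explicit readiness condition.
func waitForMembers(t *testing.T, ml *memberlist.Memberlist, want int, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if ml.NumMembers() >= want {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatalf("expected %d members, saw %d", want, ml.NumMembers())
}
```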

dev/k8s/manifests/api.yaml (1)

78-84: Consider adding UNKEY_GOSSIP_BIND_ADDR.

Gossip enabled but bind address not specified. If the default (likely 0.0.0.0 or pod IP) is intentional, this is fine, but explicit config aids clarity.

svc/sentinel/services/router/service.go (1)

45-82: Consider extracting clusterOpts and createCache to a shared package.

This pattern is duplicated in svc/frontline/services/caches/caches.go. Could be a shared helper in pkg/cache/clustering.

pkg/cache/clustering/broadcaster_gossip.go (1)

31-39: Handler invocation uses context.Background() instead of propagating context.

The handler signature accepts a context, but HandleCacheInvalidation always passes context.Background(). Consider storing the subscription context or accepting context as a parameter if cancellation/deadline propagation is needed.
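
One possible shape, assuming Subscribe can take the caller's context; the field names and signatures here are made up for illustration:

```go
import (
	"context"

	cachev1 "github.com/unkeyed/unkey/gen/proto/cache/v1" // assumed import path
)

type GossipBroadcaster struct {
	subCtx  context.Context // captured at Subscribe time (assumption)
	handler func(context.Context, *cachev1.CacheInvalidationEvent) error
}

func (b *GossipBroadcaster) Subscribe(ctx context.Context, handler func(context.Context, *cachev1.CacheInvalidationEvent) error) {
	b.subCtx = ctx
	b.handler = handler
}

func (b *GossipBroadcaster) HandleCacheInvalidation(ev *cachev1.CacheInvalidationEvent) {
	if b.handler == nil {
		return
	}
	_ = b.handler(b.subCtx, ev) // previously context.Background()
}
```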

chronark (Collaborator) left a comment

maybe reorder the proto fields, but it's not super important

Flo4604 merged commit c5bef06 into main on Feb 16, 2026, with 11 of 12 checks passed.
Flo4604 deleted the feat/gossip branch on February 16, 2026 at 19:08.
