bug: Discovery synchronization fails for large services due to shared memory payload constraints #12846

@jizhuozhi

Description

Current Behavior

While integrating APISIX as an ingress / access gateway for large downstream services, I encountered a discovery synchronization issue that appears to be structural rather than configuration-related.

During routine testing, APISIX failed to update service discovery state with the following error:

2025/12/26 12:32:39 [error] 49#49: *734 [lua] init.lua:583: post_event failure with discovery_consul_update_all_services, update all services error: failed to publish event: payload exceeds the limitation (65536), context: ngx.timer

This happened during a full discovery update from the registry (Consul in this case), triggered by a timer.


Expected Behavior

APISIX should be able to reliably synchronize service discovery state from the control plane to all workers, regardless of downstream service scale or autoscaling behavior.

Specifically:

  • Discovery updates should not fail when the number of endpoints grows due to elastic scaling (e.g. Kubernetes HPA).
  • Newly added instances should be correctly propagated to all workers and become eligible to receive traffic.
  • Restarting an APISIX instance should always result in a complete and consistent discovery state being loaded from the registry.

In other words, service discovery should remain correct and complete as endpoint count grows.


Actual Behavior

When the number of instances for a single service grows beyond a certain point, APISIX exhibits the following failure modes:

  1. After elastic scaling, newly added instances cannot receive traffic

    As endpoint count increases, the serialized discovery snapshot exceeds the payload size that can be delivered via worker events.
    When this happens, the discovery update event fails to be published, and newly registered instances are not propagated to workers.

    As a result, these instances exist in the registry (e.g. Consul) but are invisible to APISIX workers and never receive traffic.

  2. After an APISIX restart, the node cannot load the full discovery state

    If APISIX is restarted while the total discovery payload already exceeds the event delivery limit, the initial full discovery synchronization also fails.

    In this case, the restarted APISIX node is unable to load all instances from the registry, leading to an incomplete local discovery state from startup.

Both behaviors result in stale or incomplete endpoint information, and the impact becomes more severe as service scale increases.
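
To make the divergence concrete, one way to observe it is to compare the instance count Consul reports with what APISIX has actually loaded. The following sketch assumes a default local deployment (Consul on 127.0.0.1:8500, the APISIX control API on 127.0.0.1:9090) and that the consul discovery module's dump endpoint is available; both are assumptions about the environment, not details from this report.

# Diagnostic sketch (assumptions: local addresses, dump endpoint enabled).
import requests

SERVICE = "billing.ad.oversea.cost-entry"

# Instances Consul itself reports as registered for the service.
consul_nodes = requests.get(
    f"http://127.0.0.1:8500/v1/health/service/{SERVICE}", timeout=5
).json()
print(f"Consul reports {len(consul_nodes)} instances")

# Instances APISIX actually loaded, via the consul discovery dump
# exposed on the control API (requires the dump feature to be configured).
dump = requests.get(
    "http://127.0.0.1:9090/v1/discovery/consul/dump", timeout=5
).json()
apisix_nodes = dump.get("services", {}).get(SERVICE, [])
print(f"APISIX loaded {len(apisix_nodes)} instances")

When the payload limit is hit, the first number keeps growing as the service scales, while the second stays frozen at the last successfully delivered snapshot.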


Why this is a structural issue

The current design assumes that discovery state can be delivered as a single event payload.

In large-scale environments:

  • Endpoint count grows naturally with traffic and autoscaling
  • Event payload size grows linearly with endpoint count
  • Delivery via a single shared-memory-backed event becomes a hard bottleneck

As a result, this failure mode is inevitable in sufficiently large deployments and cannot be fully mitigated through configuration tuning alone.
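
A back-of-the-envelope estimate makes the linear growth visible. The per-node shape below is an illustrative assumption, not APISIX's actual internal serialization format, but any record of roughly this size crosses the 65536-byte limit from the error log at around a thousand endpoints:

# Rough estimate of serialized snapshot size vs. endpoint count.
# The node shape is an illustrative assumption, not the real format.
import json

LIMIT = 65536  # payload cap reported in the error log

def snapshot_size(n):
    nodes = [
        {
            "host": "127.0.0.1",
            "port": 8080,
            "weight": 1,
            "metadata": {"version": "v1", "region": "us-east-1"},
        }
        for _ in range(n)
    ]
    return len(json.dumps(nodes))

for n in (100, 500, 1000, 2000):
    size = snapshot_size(n)
    status = "over" if size > LIMIT else "under"
    print(f"{n:>5} endpoints -> ~{size:>6} bytes ({status} the limit)")

At roughly 100 bytes per endpoint, the limit is exhausted somewhere between 500 and 1000 instances, which matches the reproduction threshold below.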

Error Logs

2025/12/26 12:32:39 [error] 49#49: *734 [lua] init.lua:583: post_event failure with discovery_consul_update_all_services, update all services error: failed to publish event: payload exceeds the limitation (65536), context: ngx.timer

Steps to Reproduce

  1. Prepare a Consul registry (local or remote) and configure APISIX to use Consul as the service discovery backend.

  2. Register a large number of instances for a single service in Consul.

    This can be done by repeatedly calling Consul’s /v1/agent/service/register API to register hundreds or thousands of instances under the same service name (a full script is included after these steps).
    Each instance uses a unique ID, but shares the same Name, Address, and Port, and includes instance-level metadata via the Meta field (used for endpoint matching in APISIX).

  3. Ensure that the total number of registered instances is large enough (e.g. 1000+) so that the serialized discovery snapshot (service + endpoints + metadata) exceeds the size that can be carried by a single worker event.

  4. Wait for APISIX to trigger a full discovery synchronization (e.g. via the periodic discovery timer), or restart APISIX to force an initial full sync.

  5. Observe the APISIX error log.
    When the full discovery snapshot is broadcast via worker events, APISIX fails to publish the event and logs an error similar to:

[lua] init.lua:583: post_event failure with discovery_consul_update_all_services, update all services error: failed to publish event: payload exceeds the limitation (65536), context: ngx.timer

  6. At this point, workers stop receiving updated discovery state for the affected service, resulting in stale or inconsistent endpoint information.

The following script automates step 2, registering many instances of a single service with Meta fields:
#!/usr/bin/env python3
"""
Bulk register multiple services in Consul to reproduce APISIX discovery payload overflow,
using Meta fields instead of tags.
"""

import requests

CONSUL_HOST = "http://127.0.0.1:8500"  # Consul address
SERVICE_NAME = "billing.ad.oversea.cost-entry"
INSTANCE_COUNT = 1000  # adjust to trigger payload limit
SERVICE_PORT = 8080

for i in range(1, INSTANCE_COUNT + 1):
    # Each instance gets a unique ID but shares the same Name, Address,
    # and Port; the Meta field carries the instance-level metadata that
    # APISIX uses for endpoint matching.
    service_id = f"{SERVICE_NAME}-{i}"
    payload = {
        "ID": service_id,
        "Name": SERVICE_NAME,
        "Address": "127.0.0.1",
        "Port": SERVICE_PORT,
        "Meta": {
            "version": "v1",
            "region": "us-east-1"
        }
    }
    resp = requests.put(
        f"{CONSUL_HOST}/v1/agent/service/register",
        json=payload,
        timeout=5,
    )
    if resp.status_code != 200:
        print(f"Failed to register {service_id}: {resp.text}")
    elif i % 100 == 0:
        print(f"Registered {i} instances...")

print("Done. Now trigger APISIX discovery sync to reproduce the issue.")

Environment

  • APISIX version (run apisix version):
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V):
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
