Description
Current Behavior
While integrating APISIX as an ingress / access gateway for large downstream services, I encountered a discovery synchronization issue that appears to be structural rather than configuration-related.
During routine testing last night (this surfaced as part of ongoing work), APISIX failed to update its service discovery state, logging the following error:
2025/12/26 12:32:39 [error] 49#49: *734 [lua] init.lua:583: post_event failure with discovery_consul_update_all_services, update all services error: failed to publish event: payload exceeds the limitation (65536), context: ngx.timer
This happened during a full discovery update from the registry (Consul in this case), triggered by a timer.
Expected Behavior
APISIX should be able to reliably synchronize service discovery state from the control plane to all workers, regardless of downstream service scale or autoscaling behavior.
Specifically:
- Discovery updates should not fail when the number of endpoints grows due to elastic scaling (e.g. Kubernetes HPA).
- Newly added instances should be correctly propagated to all workers and become eligible to receive traffic.
- Restarting an APISIX instance should always result in a complete and consistent discovery state being loaded from the registry.
In other words, service discovery should remain correct and complete as endpoint count grows.
Actual Behavior
When the number of instances for a single service grows beyond a certain point, APISIX exhibits the following failure modes:
- After elastic scaling, newly added instances cannot receive traffic.
  As endpoint count increases, the serialized discovery snapshot exceeds the payload size that can be delivered via worker events.
  When this happens, the discovery update event fails to be published, and newly registered instances are not propagated to workers. As a result, these instances exist in the registry (e.g. Consul) but are invisible to APISIX workers and never receive traffic.
- After an APISIX restart, the node cannot load the full discovery state.
  If APISIX is restarted while the total discovery payload already exceeds the event delivery limit, the initial full discovery synchronization also fails.
  In this case, the restarted APISIX node is unable to load all instances from the registry, leading to an incomplete local discovery state from startup.
Both behaviors result in stale or incomplete endpoint information, and the impact becomes more severe as service scale increases.
Why this is a structural issue
The current design assumes that discovery state can be delivered as a single event payload.
In large-scale environments:
- Endpoint count grows naturally with traffic and autoscaling
- Event payload size grows linearly with endpoint count
- Delivery via a single shared-memory-backed event becomes a hard bottleneck
As a result, this failure mode is inevitable in sufficiently large deployments and cannot be fully mitigated through configuration tuning alone.
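To make the linear-growth claim concrete, here is a rough back-of-the-envelope estimate under stated assumptions: it serializes endpoint records shaped like the ones in the reproduction script below as plain JSON. This is not APISIX's actual worker-event encoding, but the growth rate is the point.

```python
#!/usr/bin/env python3
"""
Back-of-the-envelope estimate of discovery snapshot size vs. endpoint count.
NOT APISIX's actual event encoding; it serializes endpoint records shaped like
the ones in the reproduction script below as JSON to show how quickly a full
snapshot crosses the 65536-byte event limit.
"""
import json

EVENT_PAYLOAD_LIMIT = 65536  # the limit reported in the error log above

def snapshot_size(instance_count: int) -> int:
    """Approximate serialized size of a full snapshot with N instances."""
    endpoints = [
        {
            "ID": f"billing.ad.oversea.cost-entry-{i}",
            "Name": "billing.ad.oversea.cost-entry",
            "Address": "127.0.0.1",
            "Port": 8080,
            "Meta": {"version": "v1", "region": "us-east-1"},
        }
        for i in range(1, instance_count + 1)
    ]
    return len(json.dumps(endpoints).encode())

for n in (100, 500, 1000, 2000):
    size = snapshot_size(n)
    status = "exceeds" if size > EVENT_PAYLOAD_LIMIT else "fits within"
    print(f"{n:>5} instances -> ~{size} bytes ({status} the {EVENT_PAYLOAD_LIMIT}-byte limit)")
```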
Error Logs
2025/12/26 12:32:39 [error] 49#49: *734 [lua] init.lua:583: post_event failure with discovery_consul_update_all_services, update all services error: failed to publish event: payload exceeds the limitation (65536), context: ngx.timer
Steps to Reproduce
- Prepare a Consul registry (local or remote) and configure APISIX to use Consul as the service discovery backend (a minimal config sketch follows this list).
- Register a large number of instances for a single service in Consul.
  This can be done by repeatedly calling Consul's /v1/agent/service/register API to register hundreds or thousands of instances under the same service name (see the registration script below).
  Each instance uses a unique ID, but shares the same Name, Address, and Port, and includes instance-level metadata via the Meta field (used for endpoint matching in APISIX).
- Ensure that the total number of registered instances is large enough (e.g. 1000+) so that the serialized discovery snapshot (service + endpoints + metadata) exceeds the size that can be carried by a single worker event.
- Wait for APISIX to trigger a full discovery synchronization (e.g. via the periodic discovery timer), or restart APISIX to force an initial full sync.
- Observe the APISIX error log.
  When the full discovery snapshot is broadcast via worker events, APISIX fails to publish the event and logs an error similar to:
  [lua] init.lua:583: post_event failure with discovery_consul_update_all_services, update all services error: failed to publish event: payload exceeds the limitation (65536), context: ngx.timer
- At this point, workers stop receiving updated discovery state for the affected service, resulting in stale or inconsistent endpoint information.
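For the first step, a minimal config.yaml sketch for enabling Consul-based discovery might look like the following, assuming a local Consul agent on 127.0.0.1:8500; field names follow the APISIX consul discovery module as I understand it, so verify them against the docs for your APISIX version:

```yaml
discovery:
  consul:
    servers:
      - "http://127.0.0.1:8500"   # Consul address, same as in the registration script below
    fetch_interval: 3             # seconds between full service-list fetches
```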
```python
#!/usr/bin/env python3
"""
Bulk register many instances of a single service in Consul to reproduce the
APISIX discovery payload overflow, using Meta fields instead of tags.
"""
import requests

CONSUL_HOST = "http://127.0.0.1:8500"  # Consul address
SERVICE_NAME = "billing.ad.oversea.cost-entry"
INSTANCE_COUNT = 1000  # adjust to trigger payload limit
SERVICE_PORT = 8080

for i in range(1, INSTANCE_COUNT + 1):
    service_id = f"{SERVICE_NAME}-{i}"
    payload = {
        "ID": service_id,        # unique per instance
        "Name": SERVICE_NAME,    # shared service name
        "Address": "127.0.0.1",
        "Port": SERVICE_PORT,
        "Meta": {                # instance-level metadata used for endpoint matching
            "version": "v1",
            "region": "us-east-1"
        }
    }
    resp = requests.put(f"{CONSUL_HOST}/v1/agent/service/register", json=payload)
    if resp.status_code != 200:
        print(f"Failed to register {service_id}: {resp.text}")
    elif i % 100 == 0:
        print(f"Registered {i} instances...")

print("Done. Now trigger APISIX discovery sync to reproduce the issue.")
```
Environment
- APISIX version (run apisix version):
- Operating system (run uname -a):
- OpenResty / Nginx version (run openresty -V or nginx -V):
- etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run luarocks --version):