MeshCore Telemetry Scaling Problem
Problem Statement
MeshCore moved from Meshtastic’s flood-push model to a PULL model (DM request when a path is known; flood request as fallback). While this reduces constant background chatter at small scale, it introduces risks as networks grow:
- REQ/RES doubling: Every telemetry read requires a request and a response; on lossy links this compounds with retries.
- Inversion of control: Harvesters can't tell whether a node is offline, a link is flaky, or a route is stale. That opacity drives retries and flood fallbacks that create bursts.
- Reliance on user good faith: Meshtastic did not use polling; its failure mode was automatic flood telemetry saturating airtime. But Meshtastic still depended on users to configure responsibly (power, intervals, channels, topology). MeshCore risks a similar dynamic if it relies on external harvesters behaving well; a single misconfigured harvester can overwhelm shared airtime. Left as-is, users will inevitably abuse the network in ways that impact it for everyone.
- Multi-request patterns for repeaters: Fetching “details” and “telemetry” separately costs airtime when it could be a single operation.
- Zombie harvesters: Once telemetry harvesters are set up, they are likely to run indefinitely, polling telemetry sensors that may no longer exist.
These combine into airtime saturation, unfairness (some nodes starve), and instability during config changes or partial outages.
Pain Observations
- Traffic inflation from PULL: REQ + RES per read; retries multiply airtime.
- Flailing under uncertainty: Harvesters over-poll when responses are intermittent, often escalating to flood requests.
- Good-faith failure mode: Even one motivated user monitoring many telemetry sensors can unintentionally bring down a mesh, despite local rate-limits.
- Operational churn: Config changes correlate with reconnect storms and restarts; lifecycle gaps can leave background pollers running across reloads.
- Redundant queries for repeaters: Separate requests for details and telemetry increase contention and latency.
Example
PugetMesh Health. A single well-intentioned telemetry harvester monitored multiple deployed repeaters. When responses were intermittent, the harvester retried and escalated (including flood fallbacks), which led to:
- One origin saturating airtime: one harvester inadvertently degraded the mesh while “watching” many repeaters.
- Retry/flood amplification: intermittent replies triggered aggressive retries and flood requests, increasing contention for everyone.
- Collateral impact: retry bursts from that harvester hurt unrelated nodes’ ability to exchange traffic.
- Partial mitigation only: a local token bucket (~20 requests/hour per origin) helped, but it is easy to bypass outside HA (see the sketch after this list).
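For context, a minimal sketch of the kind of per-origin token bucket described above. Class and method names are illustrative, not the actual HA code; only the ~20 requests/hour figure comes from the mitigation mentioned here.

```python
import time
from collections import defaultdict


class OriginTokenBucket:
    """Per-origin token bucket limiting telemetry requests (~20/hour by default)."""

    def __init__(self, capacity: float = 20.0, refill_per_hour: float = 20.0):
        self.capacity = capacity
        self.refill_rate = refill_per_hour / 3600.0   # tokens added per second
        self._tokens = defaultdict(lambda: capacity)  # origin -> remaining tokens
        self._last = defaultdict(time.monotonic)      # origin -> last refill time

    def allow(self, origin: str) -> bool:
        """Return True if this origin may send another telemetry request now."""
        now = time.monotonic()
        elapsed = now - self._last[origin]
        self._last[origin] = now
        self._tokens[origin] = min(self.capacity,
                                   self._tokens[origin] + elapsed * self.refill_rate)
        if self._tokens[origin] >= 1.0:
            self._tokens[origin] -= 1.0
            return True
        return False


bucket = OriginTokenBucket()
if not bucket.allow("harvester-1"):
    pass  # drop or defer the request instead of retrying immediately
```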
Potential alternatives
A) Combine repeater details + telemetry
- Single request/response using LPP channels.
- Cuts airtime/latency for common repeater checks; falls back to the two-step flow if the peer lacks support (see the sketch below).
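A rough sketch of how a client could prefer the combined call and fall back to the two-step flow. The payload fields and the `request_status` / `request_details` / `request_telemetry` calls are hypothetical placeholders, not existing MeshCore APIs.

```python
from dataclasses import dataclass


@dataclass
class RepeaterStatus:
    """Hypothetical combined payload: repeater details plus LPP-encoded telemetry."""
    name: str
    firmware_version: str
    uptime_s: int
    lpp_payload: bytes  # Cayenne LPP channels (battery voltage, temperature, ...)


def fetch_repeater_status(node, repeater_id) -> RepeaterStatus:
    """Prefer the single combined REQ/RES; fall back to two requests for older peers."""
    combined = node.request_status(repeater_id)      # one request, one response
    if combined is not None:
        return combined
    details = node.request_details(repeater_id)      # legacy two-step fallback
    telemetry = node.request_telemetry(repeater_id)  # second airtime-costly round trip
    return RepeaterStatus(
        name=details.name,
        firmware_version=details.firmware_version,
        uptime_s=details.uptime_s,
        lpp_payload=telemetry,
    )
```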
B) Hybrid telemetry hooks (includes event-driven sensors)
- Harvester registers a telemetry hook with destination and preferred route.
- Firmware-enforced rate limits; no automatic flood fallback.
- Send on change with optional thresholds/debounce; bound with min_interval / max_interval. Optional heartbeat at max_interval for liveness.
- Hooks must be renewed; if renewal fails or expires, the node stops sending.
- Keeps airtime bounded and prevents a single harvester from saturating the mesh; places control with core firmware rather than external clients.
- Similar to the room servers' pub/sub implementation (a sketch follows below).
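A minimal sketch of the node-side gating such a hook could use. The field names (`threshold`, `min_interval`, `max_interval`, `lease_expiry`) follow the bullets above but are illustrative only, not a proposed wire format or firmware API.

```python
import time
from dataclasses import dataclass


@dataclass
class TelemetryHook:
    """Hypothetical harvester-registered hook; the node, not the harvester, enforces limits."""
    destination: str      # where reports go, via the preferred route (never flood)
    threshold: float      # minimum change that counts as significant
    min_interval: float   # seconds; firmware-enforced floor between reports
    max_interval: float   # seconds; heartbeat ceiling for liveness
    lease_expiry: float   # hook stops sending unless renewed before this time
    last_sent: float = 0.0
    last_value: float = 0.0

    def should_send(self, value: float, now: float) -> bool:
        """Change-driven, rate-limited, lease-bound; no automatic flood fallback."""
        if now >= self.lease_expiry:                  # expired or unrenewed hooks go silent
            return False
        if now - self.last_sent < self.min_interval:  # rate limit lives in firmware
            return False
        changed = abs(value - self.last_value) >= self.threshold
        heartbeat_due = now - self.last_sent >= self.max_interval
        return changed or heartbeat_due

    def mark_sent(self, value: float, now: float) -> None:
        self.last_sent = now
        self.last_value = value


hook = TelemetryHook(destination="harvester-1", threshold=0.5,
                     min_interval=300, max_interval=3600,
                     lease_expiry=time.time() + 24 * 3600)
```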
C) Application level mitigation
- In HA we have implemented rate-limiting, auto-disabling of unreachable nodes, and other performance improvements such as exponential back-off (see the sketch below), but we can't guarantee that all future scripts, automations, and clients will implement the same safeguards.
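For completeness, a sketch of the back-off and auto-disable behaviour described above. The class name and interval values are illustrative and do not mirror the HA integration's actual code.

```python
import random


class PollBackoff:
    """Exponential back-off with jitter, plus auto-disable after repeated failures."""

    def __init__(self, base=60.0, factor=2.0, max_interval=3600.0, max_failures=10):
        self.base = base                  # seconds between polls when healthy
        self.factor = factor
        self.max_interval = max_interval
        self.max_failures = max_failures
        self.failures = 0
        self.disabled = False

    def on_success(self) -> float:
        """Reset after a good response; resume the normal polling interval."""
        self.failures = 0
        self.disabled = False
        return self.base

    def on_failure(self) -> float:
        """Back off exponentially; disable the node entirely after too many misses."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.disabled = True          # stop polling an unreachable node
            return float("inf")
        interval = min(self.max_interval, self.base * self.factor ** self.failures)
        return interval * random.uniform(0.8, 1.2)  # jitter avoids synchronized retries
```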