MeshCore Telemetry Scaling Problem
Problem Statement
MeshCore moved from Meshtastic’s flood-push model to a PULL model (DM request when a path is known; flood request as fallback). While this reduces constant background chatter at small scale, it introduces risks as networks grow:
- REQ/RES doubling: Every telemetry read requires a request and a response; on lossy links this compounds with retries.
- Inversion of control: Harvesters can't tell whether a node is offline, a link is flaky, or a route is stale. That opacity drives retries and flood fallbacks that create bursts.
- Reliance on user good faith: Meshtastic did not use polling; its failure mode was automatic flood telemetry saturating airtime. But Meshtastic still depended on users to configure responsibly (power, intervals, channels, topology). MeshCore risks a similar dynamic if it relies on external harvesters behaving well; a single misconfigured harvester can overwhelm shared airtime. Left as-is, users will inevitably abuse the network in ways that impact it for everyone.
- Multi-request patterns for repeaters: Fetching “details” and “telemetry” separately costs airtime when it could be a single operation.
- Zombie harvesters: Once telemetry harvesters are set up, they are likely to run indefinitely, polling telemetry sensors that may no longer exist.
These combine into airtime saturation, unfairness (some nodes starve), and instability during config changes or partial outages.
Pain Observations
- Traffic inflation from PULL: REQ + RES per read; retries multiply airtime.
- Flailing under uncertainty: Harvesters over-poll when responses are intermittent, often escalating to flood requests.
- Good-faith failure mode: Even one motivated user monitoring many telemetry sensors can unintentionally bring down a mesh, despite local rate-limits.
- Operational churn: Config changes correlate with reconnect storms and restarts; lifecycle gaps can leave background pollers running across reloads.
- Redundant queries for repeaters: Separate requests for details and telemetry increase contention and latency.
Example
PugetMesh Health. A single well-intentioned telemetry harvester monitored multiple deployed repeaters. When responses were intermittent, the harvester retried and escalated (including flood fallbacks), which led to:
- One origin saturating airtime: one harvester inadvertently degraded the mesh while “watching” many repeaters.
- Retry/flood amplification: intermittent replies triggered aggressive retries and flood requests, increasing contention for everyone.
- Collateral impact: retry bursts from that harvester hurt unrelated nodes’ ability to exchange traffic.
- Partial mitigation only: a local token bucket (~20 requests/hour per origin) helped, but it is easy to bypass outside HA (see the sketch after this list).
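For context, a minimal sketch of the kind of per-origin token bucket described above. Class and method names are illustrative, not the actual HA code; only the ~20 requests/hour figure comes from the mitigation mentioned here.

```python
import time
from collections import defaultdict


class OriginTokenBucket:
    """Per-origin token bucket limiting telemetry requests (~20/hour by default)."""

    def __init__(self, capacity: float = 20.0, refill_per_hour: float = 20.0):
        self.capacity = capacity
        self.refill_rate = refill_per_hour / 3600.0   # tokens added per second
        self._tokens = defaultdict(lambda: capacity)  # origin -> remaining tokens
        self._last = defaultdict(time.monotonic)      # origin -> last refill time

    def allow(self, origin: str) -> bool:
        """Return True if this origin may send another telemetry request now."""
        now = time.monotonic()
        elapsed = now - self._last[origin]
        self._last[origin] = now
        self._tokens[origin] = min(self.capacity,
                                   self._tokens[origin] + elapsed * self.refill_rate)
        if self._tokens[origin] >= 1.0:
            self._tokens[origin] -= 1.0
            return True
        return False


bucket = OriginTokenBucket()
if not bucket.allow("harvester-1"):
    pass  # drop or defer the request instead of retrying immediately
```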
Potential alternatives
A) Combine repeater details + telemetry
- Single request/response using LPP channels.
- Cuts airtime/latency for common repeater checks; falls back to the two-step flow if the peer lacks support (see the sketch below).
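A rough sketch of how a client could prefer the combined call and fall back to the two-step flow. The payload fields and the `request_status` / `request_details` / `request_telemetry` calls are hypothetical placeholders, not existing MeshCore APIs.

```python
from dataclasses import dataclass


@dataclass
class RepeaterStatus:
    """Hypothetical combined payload: repeater details plus LPP-encoded telemetry."""
    name: str
    firmware_version: str
    uptime_s: int
    lpp_payload: bytes  # Cayenne LPP channels (battery voltage, temperature, ...)


def fetch_repeater_status(node, repeater_id) -> RepeaterStatus:
    """Prefer the single combined REQ/RES; fall back to two requests for older peers."""
    combined = node.request_status(repeater_id)      # one request, one response
    if combined is not None:
        return combined
    details = node.request_details(repeater_id)      # legacy two-step fallback
    telemetry = node.request_telemetry(repeater_id)  # second airtime-costly round trip
    return RepeaterStatus(
        name=details.name,
        firmware_version=details.firmware_version,
        uptime_s=details.uptime_s,
        lpp_payload=telemetry,
    )
```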
B) Hybrid telemetry hooks (includes event-driven sensors)
- Harvester registers a telemetry hook with destination and preferred route.
- Firmware-enforced rate limits; no automatic flood fallback.
- Send on change with optional thresholds/debounce; bound with min_interval / max_interval. Optional heartbeat at max_interval for liveness.
- Hooks must be renewed; if renewal fails or expires, the node stops sending.
- Keeps airtime bounded and prevents a single harvester from saturating the mesh; places control with core firmware rather than external clients.
- Similar to the room servers' pub/sub implementation (a sketch follows below).
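A minimal sketch of the node-side gating such a hook could use. The field names (`threshold`, `min_interval`, `max_interval`, `lease_expiry`) follow the bullets above but are illustrative only, not a proposed wire format or firmware API.

```python
import time
from dataclasses import dataclass


@dataclass
class TelemetryHook:
    """Hypothetical harvester-registered hook; the node, not the harvester, enforces limits."""
    destination: str      # where reports go, via the preferred route (never flood)
    threshold: float      # minimum change that counts as significant
    min_interval: float   # seconds; firmware-enforced floor between reports
    max_interval: float   # seconds; heartbeat ceiling for liveness
    lease_expiry: float   # hook stops sending unless renewed before this time
    last_sent: float = 0.0
    last_value: float = 0.0

    def should_send(self, value: float, now: float) -> bool:
        """Change-driven, rate-limited, lease-bound; no automatic flood fallback."""
        if now >= self.lease_expiry:                  # expired or unrenewed hooks go silent
            return False
        if now - self.last_sent < self.min_interval:  # rate limit lives in firmware
            return False
        changed = abs(value - self.last_value) >= self.threshold
        heartbeat_due = now - self.last_sent >= self.max_interval
        return changed or heartbeat_due

    def mark_sent(self, value: float, now: float) -> None:
        self.last_sent = now
        self.last_value = value


hook = TelemetryHook(destination="harvester-1", threshold=0.5,
                     min_interval=300, max_interval=3600,
                     lease_expiry=time.time() + 24 * 3600)
```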
C) Application level mitigation
- In HA we have implemented rate-limiting, auto-disabling of unreachable nodes, and other performance improvements such as exponential back-off (see the sketch below), but we can't guarantee that all future scripts, automations, and clients will implement the same safeguards.
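For completeness, a sketch of the back-off and auto-disable behaviour described above. The class name and interval values are illustrative and do not mirror the HA integration's actual code.

```python
import random


class PollBackoff:
    """Exponential back-off with jitter, plus auto-disable after repeated failures."""

    def __init__(self, base=60.0, factor=2.0, max_interval=3600.0, max_failures=10):
        self.base = base                  # seconds between polls when healthy
        self.factor = factor
        self.max_interval = max_interval
        self.max_failures = max_failures
        self.failures = 0
        self.disabled = False

    def on_success(self) -> float:
        """Reset after a good response; resume the normal polling interval."""
        self.failures = 0
        self.disabled = False
        return self.base

    def on_failure(self) -> float:
        """Back off exponentially; disable the node entirely after too many misses."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.disabled = True          # stop polling an unreachable node
            return float("inf")
        interval = min(self.max_interval, self.base * self.factor ** self.failures)
        return interval * random.uniform(0.8, 1.2)  # jitter avoids synchronized retries
```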