fix(#1775, #1489): path quality gating in onContactPathRecv + 3x flood ACK retry #2569
…ontactPathRecv + 3x flood ACK retry
- onContactPathRecv: don't replace a fresh stored path (<10 min old) with a longer-hop incoming path; prevents multipath duplicates from silently downgrading a working direct route (fixes meshcore-dev#1775)
- ContactInfo: add out_path_timestamp (path freshness) and path_ack_count (cheap delivery success counter, reset on path change)
- sendAckTo: send flood ACK 3x at 200/800/2000 ms staggered delays when no direct path is known; dedup handled by existing MeshTables::hasSeen() (fixes meshcore-dev#1489)
- platformio.ini: add native_reliability env with GoogleTest unit tests
- test/test_path_gating: 13 unit tests covering gating logic edge cases
- test/test_flood_ack: 11 unit tests covering retry count, delays, direct path
…backup flood ACK
Root causes of 'sender sees FAILED, recipient sees message':
1. Mesh::createPathReturn() omitted the random nonce when extra_len > 0 (path + embedded ACK case). AES-ECB is deterministic: without a nonce every PATH+ACK for the same message produced an identical ciphertext and thus an identical calculatePacketHash(). hasSeen() at intermediate nodes treated every retransmission as a duplicate and dropped it silently, so the single PATH+ACK was the only chance to deliver the ACK. Fix: always append the 4-byte random nonce regardless of extra presence.
2. BaseChatMesh::onPeerDataRecv(): when a flood message was received, only one PATH+ACK packet was sent. If that packet was lost over RF the sender had no fallback and showed the message as failed. Fix: after the PATH+ACK, also call sendAckTo() to send standalone flood ACKs at staggered delays (200/800/2000 ms) as an independent backup.
3. PATH_STICKINESS_WINDOW_SECS reduced 600 s -> 30 s. A 10-minute window prevented B from learning A's updated return path for up to 10 minutes, causing ACKs to travel via a stale direct route that no longer works. 30 s is long enough to reject multipath duplicates (50-200 ms apart) but short enough to adapt to topology changes.
Tests: 29 tests pass (15 path-gating + 14 flood-ack, 3 new backup-ACK cases)
This is really interesting, and there is opportunity for improvement in this area. Fewer hops isn't automatically better: RX becomes a problem for high-reach nodes, so your stable 4-hop path with very high SNR can degrade to a 2-hop or 1-hop path where the sender got lucky with that packet. This is especially apparent on meshes that set rxdelay. If the first path received was locked for 10 seconds rather than 10 minutes, and a path with fewer hops wasn't automatically accepted, I think this would work very well.
Good point, the hop-count preference fights against the rxdelay-based SNR ordering. I will test this for two more days on my t-1000-e.
Revise the path-stickiness logic in onContactPathRecv based on reviewer feedback on PR meshcore-dev#2569.
Problem with the previous hop-count gate: In meshes that set rxdelay > 1, SNR-ordered propagation means high-SNR (better) paths arrive first, and these are often longer-hop paths. The previous 'keep if new_hops > stored_hops' check would then silently replace the first-arrived (better) path with the later-arrived shorter-hop (but lower-SNR) path. Fewer hops is not automatically better quality.
New behaviour: Once a path is stored it is locked for PATH_STICKINESS_WINDOW_SECS (now 10 s, down from 30 s). ANY replacement is blocked during the window regardless of hop count: the first path to arrive wins. After 10 s the lock expires and any new path is accepted normally. Embedded ACKs/responses in the incoming path packet are still processed during the lock window so the sender's retry timer is cancelled correctly. path_ack_count is retained for future use as a proven-delivery signal (a path that delivered messages should be stickier after the window).
Tests updated to match blanket-lock semantics:
- removed hop-comparison test cases
- added FreshPath_BlocksReplacement_ShorterHopIncoming (key regression)
- added WindowIs10Seconds constant sanity check
- 27 tests total, all passing
@CullenShane done - updated the PR
Reliability Changes: Path Quality Gating & Flood ACK Retry
Related issues: #1775, #1489
Change 1: Path Quality Gating in onContactPathRecv
The Bug
BaseChatMesh::onContactPathRecv unconditionally overwrote the stored out_path with any newly arriving path, regardless of the quality of the stored path or the quality of the incoming one.
In an RF mesh with multipath propagation, flood path-return packets can
arrive from multiple routes in quick succession. The first to arrive wins
— which is not necessarily the shortest route. More critically, a
longer-hop path arriving shortly after an established short-hop path
silently replaced the working route with a worse one.
Consequence (from issue #1775):
suboptimal multipath duplicate (e.g. 3 hops) arriving 50–200 ms later.
airtime, collision probability, and delivery failure rate.
peer even when the mesh topology has not changed.
The Fix
onContactPathRecv now applies a stickiness window before accepting a path replacement:
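- If the stored path is younger than PATH_STICKINESS_WINDOW_SECS (default 600 s / 10 min) and the incoming path has more hops than the stored one, the stored path is kept.
- Any ACK or response embedded in the incoming path packet is still processed regardless (so the sender's ACK timeout is cancelled correctly).
- The stored path is replaced when it is older than the window, when the incoming path has the same or shorter hop count, or when no path was previously known, ensuring the node still adapts to topology changes.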
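
The gating decision can be sketched as a pure function. This is a minimal illustration and not the code from the PR: PathState, isFresh, acceptWithHopGate and acceptWithBlanketLock are hypothetical names, and the window is passed as a parameter because the commits in this PR use 600 s, then 30 s, then 10 s. The first function follows the hop-count gate described here; the second follows the blanket lock described in the later revision commit.

```cpp
#include <cstdint>

// Hypothetical stand-in for the ContactInfo fields relevant to gating.
struct PathState {
    bool     has_path = false;        // false when no path is known yet
    uint8_t  hops = 0;                // hop count of the stored out_path
    uint32_t out_path_timestamp = 0;  // RTC seconds when out_path was accepted
};

// True while a stored path exists and is younger than the window.
// Unsigned subtraction means a clock that jumps backwards yields a huge
// age, so the path is treated as stale rather than locked forever.
inline bool isFresh(const PathState& s, uint32_t now, uint32_t window_secs) {
    return s.has_path && (now - s.out_path_timestamp) < window_secs;
}

// Hop-count gate: a fresh stored path is kept only when the incoming
// path has more hops; same or fewer hops replaces it.
inline bool acceptWithHopGate(const PathState& s, uint8_t incoming_hops,
                              uint32_t now, uint32_t window_secs) {
    if (!isFresh(s, now, window_secs)) return true;
    return incoming_hops <= s.hops;
}

// Blanket lock (later revision): nothing replaces a fresh stored path,
// regardless of hop count.
inline bool acceptWithBlanketLock(const PathState& s, uint32_t now,
                                  uint32_t window_secs) {
    return !isFresh(s, now, window_secs);
}
```

In both variants the caller would still process any embedded ACK or response before consulting the gate, so a rejected path does not leave the sender's retry timer running.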
A new field out_path_timestamp (uint32_t, zero-init) on ContactInfo records the RTC time at which the current out_path was last accepted.
Delivery Success Tracking
A second new field path_ack_count (uint8_t, zero-init, saturates at 255) is incremented in onAckRecv whenever an ACK arrives via the stored direct path (not flood). This provides a cheap per-contact signal of proven
delivery that can be used in future heuristics (e.g. giving a higher
stickiness weight to a path that has successfully delivered messages).
The counter is reset to zero whenever a new path is accepted, so it always
reflects delivery history on the current path.
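
A saturating counter of this kind might look like the sketch below. ContactCounters, noteDirectAck and notePathAccepted are hypothetical names used only for illustration; the PR keeps the field on ContactInfo and updates it from onAckRecv and onContactPathRecv.

```cpp
#include <cstdint>

// Hypothetical subset of ContactInfo holding only the delivery counter.
struct ContactCounters {
    uint8_t path_ack_count = 0;
};

// Called when an ACK arrives via the stored direct path.
// Stops at 255 instead of wrapping back to 0.
inline void noteDirectAck(ContactCounters& c) {
    if (c.path_ack_count < UINT8_MAX) {
        c.path_ack_count++;
    }
}

// Called when a new path is accepted: history from the old path is dropped.
inline void notePathAccepted(ContactCounters& c) {
    c.path_ack_count = 0;
}
```

Saturating rather than wrapping matters here because a wrapped counter would make a long-proven path look like a brand-new one.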
Configurable Knobs
#define PATH_STICKINESS_WINDOW_SECS 600
Override before including BaseChatMesh.h or via build flags.
Change 2: Flood ACK Reliability (sendAckTo)
The Bug
When a direct message arrived via flood routing (i.e. the recipient had no
stored direct path to the sender),
sendAckTo sent exactly one flood ACK packet.
LoRa RF environments are inherently lossy. A single ACK transmission:
When the ACK is lost the sender must wait for its full timeout (several
seconds) before attempting to retransmit the message, burning airtime and
battery, and degrading the user experience. In practice users reported
that doubling or tripling ACK transmissions (already possible via
getExtraAckTransmitCount() for direct-path ACKs) dramatically improved perceived reliability (issue #1489).
The Fix
sendAckTo now sends three independent flood ACK packets at staggered delays (200 ms, 800 ms, and 2000 ms) when out_path_len == OUT_PATH_UNKNOWN. Each copy is a separately scheduled, independent RF transmission.
If the first copy is lost the second (and third) have a fresh chance of
reaching the sender through a congestion-free window.
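
The retry schedule can be sketched as follows. This is an assumption-laden illustration, not the PR's sendAckTo: floodAckSchedule is a hypothetical helper, the value -1 for OUT_PATH_UNKNOWN is assumed, and the actual packet scheduling is left out. Only the retry count and the 200/800/2000 ms delays come from the description.

```cpp
#include <cstdint>
#include <vector>

constexpr int8_t   OUT_PATH_UNKNOWN = -1;      // assumed sentinel value
constexpr int      FLOOD_ACK_RETRY_COUNT = 3;  // from the description
constexpr uint32_t flood_ack_delays[FLOOD_ACK_RETRY_COUNT] = {200, 800, 2000};  // ms

// Returns the delay in ms for each flood ACK copy.
// An empty result means a direct path is known, so no flood copies are
// scheduled here (direct-path ACK handling is outside this sketch).
inline std::vector<uint32_t> floodAckSchedule(int8_t out_path_len) {
    if (out_path_len != OUT_PATH_UNKNOWN) {
        return {};
    }
    return std::vector<uint32_t>(flood_ack_delays,
                                 flood_ack_delays + FLOOD_ACK_RETRY_COUNT);
}
```

A caller would iterate over the returned delays and queue one ACK packet per entry, so each copy goes out as its own transmission.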
Deduplication on the Receiver Side
Duplicate suppression is handled by the pre-existing MeshTables::hasSeen() mechanism at every node (including the destination):
is discarded immediately — the ACK is not processed twice.
repeaters nor the destination have seen it, so the second copy propagates
normally.
No protocol or wire-format changes are required. The feature is 100%
backward-compatible with older firmware nodes that forward ACK packets
without understanding the retry intent.
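
The deduplication behaviour described in this section can be illustrated with a small per-node seen-table. This is a sketch under stated assumptions rather than the MeshTables implementation: SeenTable and packetHash are hypothetical, FNV-1a stands in for the real packet hash, and an unbounded set replaces whatever fixed-size storage the firmware uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// FNV-1a over the packet bytes; a stand-in for the real packet hash.
inline uint32_t packetHash(const uint8_t* data, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

// One instance per node (repeater or destination).
class SeenTable {
public:
    // Returns true if this packet was already seen by this node;
    // otherwise records it and returns false.
    bool hasSeen(const uint8_t* data, size_t len) {
        return !seen_.insert(packetHash(data, len)).second;
    }

private:
    std::unordered_set<uint32_t> seen_;
};
```

With this shape, a second byte-identical ACK copy is dropped by any node that already handled the first, while a node that missed the first copy forwards or processes the second one as if it were new.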
Configurable Knobs
#define FLOOD_ACK_RETRY_COUNT 3
#define TXT_ACK_DELAY 200
The 2nd and 3rd delays (800 ms, 2000 ms) are currently fixed in the flood_ack_delays array inside sendAckTo. They can be made configurable via additional #defines if needed.
Files Changed
- src/helpers/ContactInfo.h: out_path_timestamp and path_ack_count fields
- src/helpers/BaseChatMesh.cpp: onContactPathRecv (path gating logic); sendAckTo (3× flood retry); onAckRecv (delivery tracking)
Files Added
- test/test_reliability/test_path_gating.cpp
- test/test_reliability/test_flood_ack.cpp
Running the Unit Tests