quic: fix crash scenario by handling stale pkt_meta in stream retry path, add unit tests#8923
quic: fix crash scenario by handling stale pkt_meta in stream retry path, add unit tests#8923cmoyes-jump wants to merge 3 commits intomainfrom
pkt_meta in stream retry path, add unit tests#8923Conversation
Performance Measurements ⏳
|
There was a problem hiding this comment.
Pull request overview
This PR addresses a QUIC crash in fd_quic_pkt_meta_retry when ACK-driven loss detection leaves behind a stale pkt_meta for stream data that has already been fully acknowledged via retransmission. It adds a targeted unit-test suite to exercise stream ACK/loss/retry lifecycle behavior and prevent regressions.
Changes:
- Add a stale-
pkt_metaguard in the stream retry path to skip retries when the stream has no remaining data/FIN to send. - Introduce a new unit test
test_quic_pkt_meta_lifecyclecovering stream lifecycle, ACK ordering/gaps, retry behavior, and the stale-pkt_metaregression scenario. - Register the new unit test in the QUIC tests makefile.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/waltz/quic/fd_quic.c | Adds a stale-pkt_meta skip check in the stream retry path to prevent a crash on completed streams. |
| src/waltz/quic/tests/test_quic_pkt_meta_lifecycle.c | New lifecycle/regression unit tests for pkt_meta handling, including the stale-pkt_meta crash reproduction. |
| src/waltz/quic/tests/Local.mk | Adds the new unit test target to the build/test list. |
You can also share your feedback on Copilot code review. Take the survey.
Performance Measurements ⏳
|
There was a problem hiding this comment.
Pull request overview
This PR fixes a QUIC retransmission crash by treating certain expired pkt_meta entries as stale when the associated stream data has already been fully acknowledged, and adds a new unit test intended to prevent regressions in pkt_meta lifecycle handling.
Changes:
- Add a stale-
pkt_metaguard infd_quic_pkt_meta_retry()to skip retrying stream data that is already fully ACKed. - Introduce a new
test_quic_pkt_meta_lifecycleunit test covering stream send/ACK/retry flows and the stale-pkt_metascenario. - Register the new unit test target in the QUIC tests
Local.mk.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/waltz/quic/fd_quic.c |
Skips retry work for stale stream pkt_meta entries to avoid completed-stream assertions/crashes. |
src/waltz/quic/tests/test_quic_pkt_meta_lifecycle.c |
Adds lifecycle + regression coverage for ACK/loss/retry interactions around pkt_meta. |
src/waltz/quic/tests/Local.mk |
Adds the new unit test build target. |
You can also share your feedback on Copilot code review. Take the survey.
Performance Measurements ⏳
|
There was a problem hiding this comment.
Pull request overview
This PR fixes a QUIC retransmit crash by detecting and skipping stale pkt_meta entries during stream retry, and expands/refactors multiple runtime components (program cache, leaders/stakes plumbing, shred validation, etc.) to align with newer protocol/runtime behaviors.
Changes:
- QUIC: Skip retries for stale stream
pkt_metawhose byte range is already covered by the stream delivery watermark. - Runtime: Large refactor of program cache APIs/record handling (introducing explicit record closing), plus substantial updates across system/stake/vote plumbing and leader/stake weight handling.
- Tests/tooling: Add/adjust unit tests, fuzz target(s), and CI/backtest scripts/configuration.
Reviewed changes
Copilot reviewed 150 out of 195 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| src/waltz/quic/tests/Local.mk | Adds new QUIC unit test target for pkt_meta lifecycle |
| src/waltz/quic/fd_quic.c | Fixes stale pkt_meta retry crash; tightens free list validation |
| src/util/wksp/fd_wksp_user.c | Adds TLS flag to silence OOM warnings (conditionally) |
| src/util/tmpl/fd_map_chain_para.c | Adds MAP_CUSTOM_CHAIN option to customize chain header generation |
| src/flamenco/vm/syscall/test_cpi_shared_data_addr.c | Updates progcache initialization/join APIs used by tests |
| src/flamenco/types/fd_types_custom.h | Adds dummy pubkey constant for compressed stake weights |
| src/flamenco/stakes/fd_vote_stakes.c | Minor formatting change |
| src/flamenco/stakes/fd_stakes.h | Adds stakes APIs and constants; removes old prototypes |
| src/flamenco/stakes/Local.mk | Builds fd_stakes only when FD_HAS_DOUBLE |
| src/flamenco/runtime/tests/test_accounts_resize_delta.c | Updates progcache + runtime stack allocation APIs in tests |
| src/flamenco/runtime/tests/run_ledger_backtest.sh | Changes epoch timing reporting |
| src/flamenco/runtime/tests/fd_txn_harness.c | Updates progcache usage/compaction to new APIs |
| src/flamenco/runtime/tests/fd_solfuzz.h | Removes progcache admin from runner struct |
| src/flamenco/runtime/tests/fd_instr_harness.c | Updates progcache usage + ALUT/v4 loader ownership plumbing |
| src/flamenco/runtime/tests/fd_dump_pb.c | Switches to fd_alut.h; fixes stake delegation iteration constness; adjusts epoch credits source |
| src/flamenco/runtime/test_runtime_alut.c | Switches ALUT include to fd_alut.h |
| src/flamenco/runtime/test_instr_acct_bounds.c | Updates progcache lifecycle in tests |
| src/flamenco/runtime/test_hashes.c | Adds unit test for fd_hashes_apply_hard_forks |
| src/flamenco/runtime/test_deprecate_rent_exemption_threshold.c | Updates stake delegations API usage; updates runtime stack allocation |
| src/flamenco/runtime/test_bundle_exec.c | Updates runtime stack allocation APIs in test |
| src/flamenco/runtime/program/vote/fd_vote_state_v4.c | Uses bank rent query instead of sysvar cache rent read |
| src/flamenco/runtime/program/vote/fd_vote_state_v3.c | Uses bank rent query instead of sysvar cache rent read |
| src/flamenco/runtime/program/test_vote_program.c | Updates progcache + runtime stack allocation in tests |
| src/flamenco/runtime/program/fd_vote_program.h | Removes fd_vote_convert_to_current from public header |
| src/flamenco/runtime/program/fd_vote_program.c | Removes fd_vote_convert_to_current implementation |
| src/flamenco/runtime/program/fd_system_program.h | Adds create-account-allow-prefund handler decl; reflows prototypes |
| src/flamenco/runtime/program/fd_system_program.c | Implements create_account_allow_prefund and dispatch case |
| src/flamenco/runtime/program/fd_stake_program.h | Deletes old stake program header |
| src/flamenco/runtime/program/fd_loader_v4_program.c | Updates progcache pull API and ensures record close |
| src/flamenco/runtime/program/fd_config_program.h | Deletes old config program header |
| src/flamenco/runtime/program/fd_bpf_loader_program.c | Updates progcache record accessors and ensures record close |
| src/flamenco/runtime/program/fd_address_lookup_table_program.h | Deletes old ALUT program header |
| src/flamenco/runtime/program/Local.mk | Removes config/stake/alut program build; adds new unit test target for allow-prefund |
| src/flamenco/runtime/fd_system_ids_pp.h | Removes some buffer IDs no longer referenced |
| src/flamenco/runtime/fd_system_ids.h | Removes externs for deprecated buffer addresses |
| src/flamenco/runtime/fd_system_ids.c | Removes deprecated buffer constants; keeps remaining IDs |
| src/flamenco/runtime/fd_runtime_const.h | Updates runtime vote/stake account bounds and documentation |
| src/flamenco/runtime/fd_runtime.h | Removes stake_program scratch union from runtime struct |
| src/flamenco/runtime/fd_hashes.h | Adds public API doc + prototype for hard fork mixing |
| src/flamenco/runtime/fd_hashes.c | Implements fd_hashes_apply_hard_forks |
| src/flamenco/runtime/fd_executor.c | Removes old native programs; updates perfect-hash table; adds bundle duplicate-msg-hash check |
| src/flamenco/runtime/fd_cost_tracker.c | Extends account-data allocation size costing for new system instruction |
| src/flamenco/runtime/fd_alut_interp.h | Moves ALUT helper logic into interpreter header (and removes old dependency) |
| src/flamenco/rewards/fd_stake_rewards.c | Replaces index pool impl with linear array allocation and counter |
| src/flamenco/progcache/test_progcache_common.c | Updates progcache test env to new shmem + join APIs |
| src/flamenco/progcache/fd_progcache_user.h | Major progcache API changes: join/leave signature, record handle semantics, metrics, record close |
| src/flamenco/progcache/fd_progcache_rec.h | Refactors record layout; adds lock; moves value storage to wksp gaddr; refactors accessors |
| src/flamenco/progcache/fd_progcache_rec.c | Implements new value alloc/free and record load/noexec helpers |
| src/flamenco/progcache/fd_progcache_admin.h | Refactors admin API to operate on fd_progcache_join_t; adds metrics + wksp metrics update |
| src/flamenco/progcache/Local.mk | Adds new progcache core object; removes racesan unit test target |
| src/flamenco/leaders/test_leaders.c | Adds tests for leader membership bitset |
| src/flamenco/leaders/fd_multi_epoch_leaders.h | Updates stake weight storage bound for compressed stake weights |
| src/flamenco/leaders/fd_multi_epoch_leaders.c | Validates against new compressed stake weight bound |
| src/flamenco/leaders/fd_leaders_base.h | Adjusts stake/vote bounds and documents compressed stake weights scheme |
| src/flamenco/leaders/fd_leaders.h | Adds leader membership bitset APIs + footprint update |
| src/flamenco/leaders/fd_leaders.c | Builds and stores leader membership bitset |
| src/flamenco/gossip/fd_gossip.h | Updates stakes update API to consume fd_vote_stake_weight_t |
| src/flamenco/gossip/fd_gossip.c | Refactors stake map/pool sizing; dedups stakes keyed on identity; skips dummy weights |
| src/flamenco/genesis/fd_genesis_create.c | Switches stake include to new stakes header |
| src/flamenco/features/fd_features_generated.h | Adds new feature flag: create_account_allow_prefund; updates set size/id |
| src/flamenco/features/fd_features_generated.c | Adds new feature ID; updates an existing feature ID and marks some cleaned_up |
| src/flamenco/fd_rwlock.h | Adds tryread/trywrite/demote; fixes fd_rwlock_new return; adds fences/volatile stores |
| src/flamenco/accdb/fd_accdb_lineage.h | Treats root XID as always present in lineage queries |
| src/flamenco/accdb/fd_accdb_lineage.c | Removes implicit root insertion into fork list |
| src/flamenco/accdb/fd_accdb_admin_v1.c | Fixes typo; moves child head/tail reset after clearing list |
| src/discoh/bank/fd_bank_tile.c | Removes temporary “simple vote” compute override logic |
| src/discof/txsend/fd_txsend_tile.c | Updates epoch stake count bound check |
| src/discof/tower/fd_tower_tile.c | Replaces stake_ci usage with multi-epoch leaders |
| src/discof/rpc/fd_rpc_tile.c | Fixes JSON-RPC error code sign |
| src/discof/restore/utils/fd_ssmsg.h | Hard-codes snapshot manifest bounds with FIXMEs for larger support |
| src/discof/restore/utils/fd_ssload.c | Uses root stake delegations update; bounds epoch credits loading |
| src/discof/restore/fd_snapin_tile_funk.c | Updates OOM error message to point to accounts sizing |
| src/discof/restore/fd_snapin_tile.c | Applies hard fork mixing into computed bank hash |
| src/discof/resolv/fd_resolv_tile.c | Switches to fd_alut.h include |
| src/discof/replay/fd_sched.h | Adds fd_sched_metrics_write prototype |
| src/discof/repair/fuzz_repair_serde.c | Adds fuzz target for ping serde roundtrip |
| src/discof/repair/fd_repair.h | Exposes ping de/ser APIs |
| src/discof/repair/fd_repair.c | Implements ping de/ser |
| src/discof/repair/fd_policy.h | Updates high-level policy docs; changes peer ping state field |
| src/discof/repair/fd_policy.c | Updates peer init for ping state field |
| src/discof/repair/Local.mk | Adds fuzz test target for repair serde |
| src/discof/poh/fd_poh.h | Clarifies microblock limit semantics; removes slot_done comment/field |
| src/discof/poh/fd_poh.c | Adjusts done_packing accounting to reserve phantom “microblock” |
| src/discof/gossip/fd_gossip_tile.h | Removes stake conversion scratch; hard-codes snapshot bounds with FIXMEs |
| src/discof/gossip/fd_gossip_tile.c | Uses vote stake weights directly; adjusts snapshot bounds assertion; removes prior workspace alloc |
| src/discof/forest/fd_forest.h | Updates documentation for new repair policy/forest iteration behavior |
| src/discof/forest/fd_forest.c | Removes some FD_TEST assertions; replaces one with comment |
| src/discof/fd_accdb_topo.c | Updates progcache topo init to new join signature (no locks obj) |
| src/discof/execrp/fd_execrp_tile.c | Updates progcache metrics; adds TLS OOM silencing |
| src/discof/execle/fd_execle_tile.c | Fixes typo in comment |
| src/disco/topo/fd_topob.c | Resets agave affinity count during auto-layout |
| src/disco/shred/test_stake_ci.c | Updates test bounds for stake/validator counts |
| src/disco/shred/test_shred_dest_conformance.c | Increases stake_ci buffers to 64 MiB |
| src/disco/shred/test_shred_dest.c | Minor assignment formatting |
| src/disco/shred/test_fec_resolver.c | Fixes parity shred test setup for fec_set_idx/idx |
| src/disco/shred/fd_stake_ci.h | Updates MAX footprint and stake weight array sizing bound |
| src/disco/shred/fd_stake_ci.c | Accepts compressed weights; filters dummy stakes; updates bounds checks |
| src/disco/shred/fd_shred_tile.c | Wires discard_unexpected_data_complete_shreds feature into fec_resolver |
| src/disco/shred/fd_shred_dest.c | Reformat scratch alloc append lines |
| src/disco/shred/fd_fec_resolver.c | Tightens shred validation (idx/fec_set_idx invariants, merkle depth checks, recovered shred checks) |
| src/disco/plugin/fd_plugin_tile.c | Updates peer count bounds checks |
| src/disco/metrics/generated/fd_metrics_shred.h | Updates histogram max for contact info count |
| src/disco/metrics/generated/fd_metrics_replay.c | Adds scheduler + progcache metrics entries |
| src/disco/metrics/generated/fd_metrics_repair.h | Adds repair failed-sigverify metric; updates total |
| src/disco/metrics/generated/fd_metrics_repair.c | Adds repair failed-sigverify metric |
| src/disco/metrics/generated/fd_metrics_execrp.c | Replaces old progcache metrics with expanded set |
| src/disco/gui/fd_gui_peers.h | Sizes GUI epoch stakes by MAX_STAKED_LEADERS |
| src/disco/gui/fd_gui_peers.c | Filters dummy stake weights; enforces new bounds |
| src/disco/gui/fd_gui.h | Updates GUI peer count bound |
| src/disco/gui/fd_gui.c | Updates GUI bound handling and txn flags init logic |
| src/disco/gui/dist_stable/version | Updates bundled GUI build version hash |
| src/disco/gui/dist_stable/index.html | Updates bundled GUI asset URLs/preloads for new build |
| src/ballet/secp256r1/fd_secp256r1_s2n.c | Adds mixed-add helper to handle doubling/identity case in double-scalar mul; whitespace cleanups |
| src/app/shared/fd_config.c | Validates new program cache config parameters |
| src/app/firedancer/topology.c | Replaces progcache backing objects with new progcache topo obj model |
| src/app/firedancer/main.c | Registers progcache topology callbacks |
| src/app/firedancer/config/default.toml | Changes default program cache heap size |
| src/app/firedancer/callbacks.c | Adds progcache object callbacks for topo init |
| src/app/firedancer-dev/main.h | Registers progcache topology callbacks in dev build |
| src/app/firedancer-dev/commands/sim.c | Increases replay_epoch link MTU for compressed stakes |
| src/app/firedancer-dev/commands/send_test/send_test_helpers.c | Minor formatting |
| src/app/firedancer-dev/commands/backtest.c | Updates progcache topo uses to new obj model |
| src/app/fdctl/with-version.mk | Switches to Agave-aware version parsing/encoding |
| src/app/fdctl/with-agave.mk | Adds Agave semver/prerelease parsing helpers |
| src/app/fdctl/topology.c | Updates link MTUs for new larger stake/peer bounds |
| contrib/test/test-vectors-commit-sha.txt | Updates test vectors commit SHA |
| contrib/test/run_solcap_tests.sh | Updates ledger/test configuration; changes solcap-tools usage |
| contrib/test/ledger_common.sh | Makes echo_error robust when arg2 missing |
| contrib/tag-release.py | Updates prerelease patch parsing logic |
| config/everything.mk | Excludes fuzz targets from coverage collection |
| agave | Updates Agave submodule revision |
| .github/workflows/benchmark.yml | Updates benchmark ledger slots |
| .github/workflows/backtest.yml | Increases CI timeouts |
Comments suppressed due to low confidence (1)
src/discof/repair/fd_repair.c:1
fd_repair_ping_derejects non-FD_REPAIR_KIND_PINGkinds, butfd_repair_ping_serwill serialize whateverping->kindcontains. For a pair of*_ping_*APIs, this asymmetry is surprising and can lead to non-roundtrippable output. Consider enforcingping->kind == FD_REPAIR_KIND_PINGin the serializer (return error otherwise), or rename/generalize the APIs if the intent is to support additional kinds.
You can also share your feedback on Copilot code review. Take the survey.
c597e64 to
89f6550
Compare
Test Vectors CoverageOverall (lines / functions / branches across all instrumented files)
Per directory (line coverage, recursive)
Full HTML report: coverage report. |
Performance Measurements ⏳
|
89f6550 to
7d5acd2
Compare
Performance Measurements ⏳
|
There was a problem hiding this comment.
Pull request overview
This PR addresses a QUIC retransmission crash caused by stale pkt_meta entries that can remain after stream data has already been fully acknowledged via retransmission. It adds a guard in the retry path to skip retrying such stale entries and introduces a dedicated unit test suite covering stream lifecycle, ACK handling, and pkt_meta retry/loss behavior (including the regression).
Changes:
- Prevent
fd_quic_pkt_meta_retryfrom attempting to retry stale streampkt_metaentries when the stream has no remaining data to send. - Add
test_quic_pkt_meta_lifecycleunit test covering stream send/ACK cleanup, ACK ordering/gaps, retry behavior, and a regression test for the stalepkt_metacrash. - Register the new unit test target in the QUIC tests
Local.mk.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/waltz/quic/fd_quic.c |
Adds a stale-pkt-meta guard in the stream retry path to avoid retrying completed streams. |
src/waltz/quic/tests/test_quic_pkt_meta_lifecycle.c |
Adds comprehensive unit tests for pkt_meta lifecycle and the crash regression scenario. |
src/waltz/quic/tests/Local.mk |
Hooks the new lifecycle unit test into the build and test run list. |
| ulong last_pkt = conn->pkt_number[ APP_PN_SPACE ]; | ||
| for( uint j=0; j<n; j++ ) { | ||
| conn->flags = ( conn->flags & ~FD_QUIC_CONN_FLAGS_PING_SENT ) | FD_QUIC_CONN_FLAGS_PING; | ||
| conn->upd_pkt_number = FD_QUIC_PKT_NUM_PENDING; | ||
| sandbox->wallclock += (long)1e6; | ||
| conn->svc_meta.next_timeout = sandbox->wallclock; | ||
| fd_quic_svc_timers_schedule( state->svc_timers, conn, sandbox->wallclock ); | ||
| fd_quic_service( quic, sandbox->wallclock ); | ||
| while( fd_quic_sandbox_next_packet( sandbox ) ) {} | ||
| last_pkt = conn->pkt_number[ APP_PN_SPACE ] - 1UL; | ||
| } |
There was a problem hiding this comment.
send_pings sets last_pkt = conn->pkt_number[ APP_PN_SPACE ] - 1UL without verifying that fd_quic_service actually transmitted a packet. If no packet is sent (e.g. due to pkt_meta exhaustion or other gating), this underflows and returns a bogus packet number. Consider asserting conn->pkt_number[APP_PN_SPACE] advanced (similar to other QUIC tests), or compute last_pkt from emitted packets / remove the return value if it’s not needed.
Our QUIC protocol implementation keeps a ledger of every packet it sent (
pkt_meta). When a peer ACK's a packet, its ledger entry is crossed off. When a packet is presumed lost, the ledger entry triggers a re-transmit.RFC 9002 Section 6 "Loss Detection" Subsection 1 "Acknowledgment-Based Detection" states that "a packet is declared lost if ... the packet was sent
kPacketThresholdpackets before an acknowledged packet (Section 6.1.1), or it was sent long enough in the past (Section 6.1.2)."A single stream's data can span multiple packets. Suppose e.g. packet 5 carries bytes [0, 100) and that packet 8 carries bytes [100, 200). Each gets its own ledger entry. With
kPacketThreshold=3, this means an ACK for e.g. for packets 9 through 11 force-declares packet 5 lost (three packets above it were ACK'ed), but packet 8 is not declared lost because it's only one below the ACK'ed range. Note that QUIC does not require contiguous packets to contain contiguous byte ranges. For example, packets 6-7 could be another ACK, a PING, or data for a different stream on the same connection.Here's an example sequence that currently crashes with
FD_LOG_CRITdue to the assertion on( !( (stream->tx_sent < stream->tx_buf.head) | (stream->state & FD_QUIC_STREAM_STATE_TX_FIN) ) ):unacked_low) advances to 200. Every byte is now confirmed delivered.Fix: Before the panic, check whether the ledger entry is simply stale (i.e. the delivery watermark already covers its byte range). If so, skip the retry, and let the entry be cleaned up normally. With this fix, the existing
FD_FD_LOG_CRITassertion can still be preserved below this newly-added check as a safety net for any genuinely unexpected state.