Skip to content

Conversation

Jiloc
Copy link
Contributor

@Jiloc Jiloc commented Aug 5, 2025

This PR builds on top of #6344 and will remain in review until that one is merged.

Previously, the /v3/health endpoint relied on data from NakamotoDownloadStateMachine, which is consumed after evaluation. This led to inconsistent behavior: during syncing, the endpoint worked correctly, but once the node reached chaintip, the necessary data was often already consumed, resulting in missing neighbor info and a 500 error.

This PR introduces a new approach builds on top of the changes in #6344, which ensure that all RPC endpoints include the X-Canonical-Stacks-Tip-Height header in their responses. Now, the node caches the highest tip received from neighbors during normal operations and uses that as the reference point for the /v3/health endpoint.

Key changes:

  • Removed old logic and associated functions like get_max_stacks_height_of_neighbors.
  • Instead of adding a dedicated 10-minute integration test with multiple nodes, I extended the existing multiple_miner test. This ensures that ConversationHttp::chat is called and that the peer height is updated correctly during normal operation.
  • Updated StacksNodeState::canonical_stacks_tip_height to return PeerNetwork::stacks_tip.height instead of PeerNetwork::burnchain_tip.stacks_block_height.

To validate the reliability, I analyzed the logs from a mainnet node at chaintip over the past 24 hours. I measured how often it receives responses from peers

--- Request Time Interval Analysis (Last 24 Hours) ---
Average Time: 6.4775 seconds
Minimum Time: 0.0002 seconds
Maximum Time: 55.2894 seconds
Standard Deviation: 8.1859
Total Intervals Found: 13,440

This suggests the new method won't always have the absolute latest tip but provides a close and stable approximation.

Applicable issues

Additional info (benefits, drawbacks, caveats)

Checklist

  • Test coverage for new or modified code paths
  • Changelog is updated
  • Required documentation changes (e.g., docs/rpc/openapi.yaml and rpc-endpoints.md for v2 endpoints, event-dispatcher.md for new events)
  • New clarity functions have corresponding PR in clarity-benchmarking repo
  • New integration test(s) added to bitcoin-tests.yml

@Jiloc Jiloc added this to the 3.2.0.0.1 milestone Aug 5, 2025
@Jiloc Jiloc self-assigned this Aug 5, 2025
@Jiloc Jiloc moved this to Status: 💻 In Progress in Stacks Core Eng Aug 5, 2025
Copy link

codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 74.48980% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.00%. Comparing base (54a7412) to head (e4c57fb).
⚠️ Report is 57 commits behind head on develop.

Files with missing lines Patch % Lines
stacks-node/src/tests/neon_integrations.rs 0.00% 11 Missing ⚠️
stacks-node/src/tests/nakamoto_integrations.rs 0.00% 8 Missing ⚠️
stackslib/src/net/mod.rs 88.23% 4 Missing ⚠️
stackslib/src/net/api/tests/gethealth.rs 86.66% 2 Missing ⚠️

❌ Your project status has failed because the head coverage (77.00%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #6347      +/-   ##
===========================================
- Coverage    79.28%   77.00%   -2.28%     
===========================================
  Files          556      556              
  Lines       351170   350660     -510     
===========================================
- Hits        278424   270042    -8382     
- Misses       72746    80618    +7872     
Files with missing lines Coverage Δ
stackslib/src/net/api/gethealth.rs 77.64% <100.00%> (-11.83%) ⬇️
stackslib/src/net/db.rs 95.65% <ø> (-0.01%) ⬇️
...rc/net/download/nakamoto/download_state_machine.rs 86.19% <ø> (-0.32%) ⬇️
stackslib/src/net/inv/epoch2x.rs 79.57% <ø> (-2.48%) ⬇️
stackslib/src/net/p2p.rs 69.00% <100.00%> (-3.05%) ⬇️
stackslib/src/net/rpc.rs 83.87% <100.00%> (+0.09%) ⬆️
stackslib/src/net/tests/inv/epoch2x.rs 64.10% <ø> (-3.55%) ⬇️
stackslib/src/net/tests/mod.rs 96.86% <100.00%> (+<0.01%) ⬆️
stackslib/src/net/api/tests/gethealth.rs 94.59% <86.66%> (-0.08%) ⬇️
stackslib/src/net/mod.rs 77.94% <88.23%> (-0.03%) ⬇️
... and 2 more

... and 105 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54a7412...e4c57fb. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Jiloc Jiloc linked an issue Aug 6, 2025 that may be closed by this pull request
@aldur aldur requested a review from jcnelson August 7, 2025 14:37
@aldur aldur moved this from Status: 💻 In Progress to Status: In Review in Stacks Core Eng Aug 7, 2025
@jcnelson
Copy link
Member

Looks like some of your tests are failing.

@aldur aldur modified the milestones: 3.2.0.0.1, 3.2.0.0.2 Aug 20, 2025
@Jiloc Jiloc requested a review from fdefelici August 26, 2025 08:06
@Jiloc Jiloc marked this pull request as ready for review August 26, 2025 14:45
@Jiloc Jiloc requested a review from a team as a code owner August 26, 2025 14:45
@Jiloc Jiloc requested a review from a team as a code owner August 26, 2025 14:45
@Jiloc Jiloc requested a review from jcnelson August 26, 2025 14:48
jcnelson
jcnelson previously approved these changes Aug 26, 2025
Copy link
Member

@jcnelson jcnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please address the nit before merging. Thanks!

Copy link
Contributor

@fdefelici fdefelici left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT!

Waiting for the nit to be addressed before providing final approval

@Jiloc
Copy link
Contributor Author

Jiloc commented Aug 27, 2025

QQ for both of you: I’ve already been bitten a couple of times by my own assumptions about how this endpoint gets the information (first I was extracting the chaintip from unconfirmed tenures only, then I added confirmed ones… both times I was solving the wrong problem). Tests were passing, but they didn’t really reflect real operations.

For the previous PRs I could have catched this by running the updates on a synced node, but in this case I can’t since the canonical chaintip header isn’t yet released on the operational nodes.

For this PR, I validated the approach by:

  • a more robust check into the multiple_miners integration test
  • checking how often a node receives RPC responses in a real mainnet envirnoment (which always include the peers’ chaintip eventually).

Do you think that’s solid enough validation, or would you suggest adding anything else?

@fdefelici
Copy link
Contributor

QQ for both of you: I’ve already been bitten a couple of times by my own assumptions about how this endpoint gets the information (first I was extracting the chaintip from unconfirmed tenures only, then I added confirmed ones… both times I was solving the wrong problem). Tests were passing, but they didn’t really reflect real operations.

For the previous PRs I could have catched this by running the updates on a synced node, but in this case I can’t since the canonical chaintip header isn’t yet released on the operational nodes.

For this PR, I validated the approach by:

  • a more robust check into the multiple_miners integration test
  • checking how often a node receives RPC responses in a real mainnet envirnoment (which always include the peers’ chaintip eventually).

Do you think that’s solid enough validation, or would you suggest adding anything else?

Actually I'm fine with this. Aside from running on a live network, this seems like the most thorough validation we can do at this stage.

@jcnelson
Copy link
Member

Do you think that’s solid enough validation, or would you suggest adding anything else?

I think that's fine for now.

@Jiloc Jiloc added this pull request to the merge queue Aug 28, 2025
Merged via the queue into stacks-network:develop with commit 03d84fa Aug 28, 2025
570 of 574 checks passed
@Jiloc Jiloc deleted the fix/health-v3-use-canonical-stacks-height-header branch August 28, 2025 12:33
@github-project-automation github-project-automation bot moved this from Status: In Review to Status: ✅ Done in Stacks Core Eng Aug 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[Bug] /v3/health always fails with HTTP 500
4 participants