Reduce flakiness in JWT initial-refresh test (post-failover timeout) #8022
Open
achamayou wants to merge 2 commits into
test_jwt_key_initial_refresh waits for the one-off JWKS refresh with a flat 5s timeout. The check also runs immediately after a primary failover, where the newly-elected primary must restart the refresh and re-fetch the keys; under CI load this cold start can exceed 5s, causing intermittent 'assert kid in latest_jwt_signing_keys' failures (observed in a bucket_c programmability_and_jwt run). The refresh itself succeeds - no node error - so this raises the timeout to 15s.
Pull request overview
This PR reduces intermittent failures in the JWT “manual” e2e scenario by increasing the wait time for the one-off JWKS refresh that occurs after adding a new JWT issuer (including the second run immediately after a forced primary failover).
Changes:
- Increase the `with_timeout(...)` timeout in `test_jwt_key_initial_refresh` from 5s to 15s.
- Add an explanatory comment noting the post-failover “cold start” behavior on the newly elected primary.
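To illustrate why a slow but successful refresh trips a 5s deadline and passes a 15s one, here is a minimal sketch of a poll-until-deadline helper. It is not the `with_timeout` helper used by the test: `wait_until`, `FakeClock`, `refresh_visible_after`, and the 8-second cold-start delay are all hypothetical names and values chosen for illustration, and the fake clock only exists so the example runs instantly.

```python
import time


def wait_until(predicate, timeout, interval=0.1,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    Hypothetical stand-in for a test-side timeout helper.
    Returns True if the predicate held before the deadline, else False.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


class FakeClock:
    """Deterministic clock so the sketch does not actually sleep."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def refresh_visible_after(clock, delay_s):
    """Predicate that becomes True once `delay_s` simulated seconds have passed.

    Models a refresh that succeeds, just late (assumed delay, not measured).
    """
    return lambda: clock.time() >= delay_s


def run(timeout_s, cold_start_s=8.0):
    clock = FakeClock()
    return wait_until(refresh_visible_after(clock, cold_start_s),
                      timeout=timeout_s, interval=0.5,
                      clock=clock.time, sleep=clock.sleep)


if __name__ == "__main__":
    print("timeout=5  ->", run(5))    # key not yet visible at the deadline
    print("timeout=15 ->", run(15))   # same refresh, observed in time
```

With the assumed 8-second cold start, the same refresh is reported as a failure under the 5s deadline and as a success under the 15s one, which mirrors the "succeeds, just late" behaviour described in this PR.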
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Problem
`test_jwt_key_initial_refresh` (the JWT "manual" scenario, part of the `programmability_and_jwt` / bucket_c e2e test) intermittently fails on the `assert kid in latest_jwt_signing_keys` check.
Observed failing job: VMSS Virtual C - `programmability_and_jwt` (from run 28668826918).
Root cause
The test waits for the one-off JWKS refresh with a flat `timeout=5`. It runs the check twice: once on the initial primary (fast, ~0.6s in the failing run) and again immediately after a deliberate primary failover ("initial refresh also works on backups").
On the newly-elected primary the refresh has to be (re)started and the keys re-fetched over TLS from the local OpenID server and applied via governance - a cold start that can occasionally exceed 5s under CI load. In the observed failure the `manual` scenario logged no node error (the refresh succeeds, just late); it simply had not populated the key within 5s when the assertion fired. It is plausibly more variable now that JWK fetching was migrated to the new curl-based client (#8005), and this test has a history of flakiness (#7543).
Fix
Raise the timeout for this specific check from 5s to 15s, matching the intent of the sibling auto-refresh check which already uses a scaled timeout (`max(5, args.jwt_key_refresh_interval_s * 5)`).
Test-only change; no product code or user-facing behaviour is affected, so no CHANGELOG entry.
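For reference, the sibling check's expression can be read as a 5s floor combined with five refresh intervals. The sketch below just evaluates that expression for a few interval values; `scaled_timeout` and the sample intervals are hypothetical and used only for illustration, and only the `max(5, interval * 5)` form comes from the description above.

```python
def scaled_timeout(refresh_interval_s, floor_s=5, factor=5):
    """Timeout that grows with the refresh interval but never drops below a floor.

    Mirrors the shape of max(5, args.jwt_key_refresh_interval_s * 5);
    the function name and keyword defaults are illustrative assumptions.
    """
    return max(floor_s, refresh_interval_s * factor)


if __name__ == "__main__":
    for interval in (0.5, 1, 3, 4):
        print(f"interval={interval}s -> timeout={scaled_timeout(interval)}s")
    # interval=0.5s -> timeout=5s   (floor applies)
    # interval=1s   -> timeout=5s
    # interval=3s   -> timeout=15s
    # interval=4s   -> timeout=20s
```

The initial-refresh check has no interval to scale from, since it is a one-off refresh, so a fixed, more generous value such as the 15s chosen here plays the same role as the floor in the scaled form.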