Reduce flakiness in JWT initial-refresh test (post-failover timeout) #8022

Open
achamayou wants to merge 2 commits into main from fix-flaky-jwt-initial-refresh-timeout

achamayou (Member) commented Jul 3, 2026
Problem

test_jwt_key_initial_refresh (the JWT "manual" scenario, part of the programmability_and_jwt / bucket_c e2e test) intermittently fails with:

AssertionError: assert kid in latest_jwt_signing_keys
RuntimeError: ['Failure in manual: AssertionError()']

Observed failing job: VMSS Virtual C - programmability_and_jwt (from run 28668826918).

Root cause

The test waits for the one-off JWKS refresh with a flat timeout=5. It runs the check twice: once on the initial primary (fast, ~0.6s in the failing run) and again immediately after a deliberate primary failover ("initial refresh also works on backups").

On the newly elected primary, the refresh has to be (re)started and the keys re-fetched over TLS from the local OpenID server and applied via governance. This cold start can occasionally exceed 5s under CI load. In the observed failure the manual scenario logged no node error (the refresh succeeds, just late); the key simply had not been populated within 5s when the assertion fired. Refresh latency is plausibly more variable now that JWK fetching was migrated to the new curl-based client (#8005), and this test has a history of flakiness (#7543).
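The failure mode described above can be illustrated with a minimal polling sketch. This is not the test's actual `with_timeout` helper: `wait_until`, `FakeClock`, `key_visible_after`, and the 7s cold-start delay are all hypothetical, chosen only to show how a refresh that succeeds late fails a 5s deadline but passes a 15s one.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float,
    poll_interval: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration; not the helper used by the test.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def key_visible_after(delay_s: float, timeout_s: float) -> bool:
    """Simulate a key that becomes visible `delay_s` seconds after the wait starts."""
    clock = FakeClock()
    return wait_until(
        lambda: clock.time() >= delay_s,
        timeout_s,
        poll_interval=0.5,
        clock=clock.time,
        sleep=clock.sleep,
    )


# Warm path (~0.6s, as in the failing run's first check) fits in 5s.
print(key_visible_after(0.6, 5))
# Assumed 7s cold start after failover: too slow for 5s, fine for 15s.
print(key_visible_after(7.0, 5))
print(key_visible_after(7.0, 15))
```

In this sketch the refresh never errors; only the deadline decides whether the assertion would fire, which matches the absence of node errors in the observed failure.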

Fix

Raise the timeout for this specific check from 5s to 15s, matching the intent of the sibling auto-refresh check which already uses a scaled timeout (max(5, args.jwt_key_refresh_interval_s * 5)).
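For comparison, the scaled expression quoted above for the sibling check can be sketched as a small function. The name `scaled_timeout` and the example interval values are assumptions for illustration; only the `max(5, interval * 5)` shape comes from the description. This PR uses a flat 15s rather than a scaled value.

```python
def scaled_timeout(refresh_interval_s: float, floor_s: float = 5, factor: float = 5) -> float:
    """Return a timeout that grows with the refresh interval but never drops below `floor_s`.

    Hypothetical wrapper around the max(5, interval * 5) expression.
    """
    return max(floor_s, refresh_interval_s * factor)


# Illustrative intervals only; the real value comes from the test arguments.
for interval in (0.5, 1, 3):
    print(interval, scaled_timeout(interval))
```

With these assumed defaults, any interval of 1s or less yields the 5s floor, and a 3s interval yields 15s.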

Test-only change; no product code or user-facing behaviour is affected, so no CHANGELOG entry.

test_jwt_key_initial_refresh waits for the one-off JWKS refresh with a flat 5s timeout. The check also runs immediately after a primary failover, where the newly-elected primary must restart the refresh and re-fetch the keys; under CI load this cold start can exceed 5s, causing intermittent 'assert kid in latest_jwt_signing_keys' failures (observed in a bucket_c programmability_and_jwt run). The refresh itself succeeds - no node error - so this raises the timeout to 15s.
Copilot AI review requested due to automatic review settings July 3, 2026 16:01
@achamayou achamayou requested a review from a team as a code owner July 3, 2026 16:01

Copilot AI (Contributor) left a comment

Pull request overview

This PR reduces intermittent failures in the JWT “manual” e2e scenario by increasing the wait time for the one-off JWKS refresh that occurs after adding a new JWT issuer (including the second run immediately after a forced primary failover).

Changes:

  • Increase the with_timeout(...) timeout in test_jwt_key_initial_refresh from 5s to 15s.
  • Add an explanatory comment noting the post-failover “cold start” behavior on the newly elected primary.

Custom instructions used

  • .github/copilot-instructions.md
  • .github/instructions/reviewing.instructions.md

Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
