MWI: Add certificate expiry check leeway for tbot#64293
Conversation
This adds some leeway to the application-tunnel service's certificate expiration check. For context, the app tunnel service does not follow Machine ID's usual certificate renewal cycle and instead opts to renew certificates just-in-time before completing a request if the certificate has expired per the local clock. Unfortunately, the local clock is not always accurate, and if the certificate or underlying app session has already expired from the server's perspective, client requests can still fail until the local clock catches up and the certificate is refreshed. To mitigate this, this change adds a `leeway` parameter to the service, configurable via YAML, with a default value of 1m. This is added to the current time when certificate validity is checked. This means that, in the worst case, certificates will be refreshed to early rather than too late. See also: #64284
|
#57697 discusses a similar issue where connection reuse can result in an "invalid session" response with a 403 error which persists for the remainder of the connection. It is possible to reproduce this situation using this new https://gist.github.com/timothyb89/24539fdfb39c9d456a7e76deedab781f This just makes repeated requests to an endpoint over the same connection. Once the certificate expires, we see the following: |
|
@codex please review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 731c5c1022
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
This makes the leeway parameter global, and additionally uses it in the main renewal loop (for the expired bot internal identity detection) and in the database tunnel service. Also adds some test coverage for app tunnel cert renewals.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fb74c464f7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // VerifyCertificateExpiryWithLeeway checks the certificate's expiration status | ||
| // with leeway. The provided leeway value is added to the current time and can | ||
| // be used to account for potential client-side clock drift. | ||
| func VerifyCertificateExpiryWithLeeway(c *x509.Certificate, clock clockwork.Clock, leeway time.Duration) error { |
There was a problem hiding this comment.
I worry a little about this function being misleading in a general-use utilities package. Adding to the local clock is the appropriate way of determining "leeway" as a client evaluating their own certificates, but, for a server examining a remote client's certificate, adding to the clock time actually does the opposite.
There was a problem hiding this comment.
I've tweaked the doc comment at least to give usage examples, including use of negative values for use from the server's perspective. Our semantics here are similar to go-jose's ValidateWithLeeway: https://github.com/go-jose/go-jose/blob/main/jwt/validation.go#L79 (though perhaps reversed).
I'd also argue we have very little reason to ever provide leeway on the server with certificate validation. In theory Teleport issued the cert, so we can probably trust our own timestamps; any client-side clock drift doesn't really matter.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1fdc6ecfe9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if conf.Leeway == 0 { | ||
| conf.Leeway = DefaultLeeway | ||
| } |
There was a problem hiding this comment.
Preserve explicit zero leeway from YAML config
This defaulting logic rewrites leeway: 0s to DefaultLeeway (1 minute), so operators cannot disable leeway even when they intentionally configure zero. That makes the setting non-representable in config and can force unwanted early-expiry behavior in renewal paths that use conf.Leeway (notably with short-lived credentials), instead of honoring an explicit opt-out.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
There's no trivial way to make 0 representable (thanks Go), so I think if any users want zero leeway, they can configure 1ns for an effectively zero value.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2c286caff6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a0e7d89797
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| leeway := cfg.Leeway | ||
| if leeway >= cfg.TTL { |
There was a problem hiding this comment.
Bound identity leeway using issued cert lifetime
This clamp only compares leeway to cfg.TTL, but non-renewable bot certs can be issued with a shorter lifetime than requested (the join/auth path caps requested expiries, e.g. in generateInitialBotCerts and generateUserCert), and renewIdentity later checks expiry from the actual cert via facade.Expiry(). In that case, a default 1m leeway can still exceed the real cert lifetime, so now := time.Now().Add(leeway) treats fresh identities as expired and repeatedly falls back to botIdentityFromToken(..., nil), causing unnecessary re-registration attempts or renewal failures in constrained join environments.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I've added a comparison against the returned cert's TTL and a note that further accounting for TTL minutiae is out of scope; the juice is not worth the squeeze to account for sub-minute differences when we're already explicitly suspicious of the local clock.
Factor in actual cert TTL in the leeway cap.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d03a217a77
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| leeway := s.leeway | ||
| if leeway >= effectiveTTL { | ||
| s.log.WarnContext(ctx, |
There was a problem hiding this comment.
Recompute tunnel leeway after each certificate reissue
leeway is derived once from the initial cert and then captured for the lifetime of OnNewConnection, but after lp.SetCert(*cert) the code never re-clamps it against the newly issued certificate's validity window. If the auth server later starts returning shorter app certs (for example after a role/session TTL reduction), a previously safe leeway can become greater than the new cert lifetime, making each fresh cert fail CheckCertExpiryWithLeeway and forcing reissue on every new connection until restart.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Noted, but IMO not worth fixing. If you configure huge leeway values (or tiny session limits) unexpected things can and will happen; otherwise we're talking 1 minute of leeway by default. If that somehow exceeds your actual cert TTL in a production environment (or the server resets your TTL to less than that) you have much larger problems to worry about.
And maybe more critically, I don't want to write the test for the "proper" fix for this 🙂
| leeway := s.leeway | ||
| if leeway >= realTTL { | ||
| s.log.WarnContext(ctx, |
There was a problem hiding this comment.
Recalculate DB tunnel leeway when certificate lifetime changes
The DB tunnel computes and caps leeway once from the startup certificate and then reuses that captured value for all future expiry checks, even after lp.SetCert(*cert) installs newly issued certs. If subsequent database certs are shorter-lived than the initial one (e.g., cluster TTL policy is tightened while the process is running), the fixed leeway can exceed the new lifetime and CheckDBCertWithLeeway will treat each new cert as expired, causing continuous per-connection reissuance.
Useful? React with 👍 / 👎.
|
@timothyb89 See the table below for backport results.
|
This adds some leeway to various parts of
tbot, including the main identity renewal loop, theapplication-tunnel, and thedatabase-tunnelservice.For context, the app and database tunnel services do not follow Machine ID's usual certificate renewal cycle and instead opt to renew certificates just-in-time when opening a connection if the certificate has expired per the local clock.
Unfortunately, the local clock is not always accurate, and if the certificate or underlying app session has already expired from the server's perspective, client requests can still fail until the local clock catches up and the certificate is refreshed.
To mitigate this, this change adds a
leewayparameter to the service, configurable via YAML, with a default value of 1m. This is added to the current time when certificate validity is checked. This means that, in the worst case, certificates will be refreshed to early rather than too late.See also: #64284
changelog: MWI: Add 1 minute configurable leeway to
application-tunnelcertificate renewalsManual Test Plan
Recommended config:
Note the very short
credential_ttlandrenewal_interval. Sincerenewal_interval + leeway >= credential_ttl, it should trigger the internal identity expiry detection and force an early renewal, e.g.: