MWI: Add certificate expiry check leeway for tbot#64293

Merged
timothyb89 merged 12 commits into master from timothyb89/app-tunnel-leeway
Mar 13, 2026

@timothyb89 timothyb89 commented Mar 5, 2026

This adds some leeway to various parts of tbot, including the main identity renewal loop, the application-tunnel, and the database-tunnel service.

For context, the app and database tunnel services do not follow Machine ID's usual certificate renewal cycle and instead opt to renew certificates just-in-time when opening a connection if the certificate has expired per the local clock.

Unfortunately, the local clock is not always accurate, and if the certificate or underlying app session has already expired from the server's perspective, client requests can still fail until the local clock catches up and the certificate is refreshed.

To mitigate this, this change adds a leeway parameter to the service, configurable via YAML, with a default value of 1m. The leeway is added to the current time when certificate validity is checked, so in the worst case certificates will be refreshed too early rather than too late.
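As a rough illustration of "leeway added to the current time", here is a minimal sketch in plain Go. The names `verifyExpiryWithLeeway`, `errCertExpired`, and `exampleCert` are hypothetical and are not taken from the PR; the real helper takes a clock rather than a `time.Time`.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
	"time"
)

// errCertExpired is a hypothetical sentinel error used only in this sketch.
var errCertExpired = errors.New("certificate expired or not yet valid")

// verifyExpiryWithLeeway is a hypothetical stand-in for the check described
// above. The leeway is added to now before comparing against the validity
// window, so a positive leeway makes a certificate look expired earlier.
func verifyExpiryWithLeeway(cert *x509.Certificate, now time.Time, leeway time.Duration) error {
	t := now.Add(leeway)
	if t.Before(cert.NotBefore) || t.After(cert.NotAfter) {
		return errCertExpired
	}
	return nil
}

// exampleCert builds a certificate value with a 10 minute validity window.
// Only the validity fields are set; it is not a usable certificate.
func exampleCert() *x509.Certificate {
	issued := time.Date(2026, 3, 5, 4, 0, 0, 0, time.UTC)
	return &x509.Certificate{NotBefore: issued, NotAfter: issued.Add(10 * time.Minute)}
}

func main() {
	cert := exampleCert()

	// The local clock reads 30 seconds before the certificate's expiry.
	now := cert.NotAfter.Add(-30 * time.Second)

	fmt.Println(verifyExpiryWithLeeway(cert, now, 0))           // <nil>
	fmt.Println(verifyExpiryWithLeeway(cert, now, time.Minute)) // certificate expired or not yet valid
}
```

With one minute of leeway the certificate is treated as expired 30 seconds before its real expiry, which is what triggers the early refresh.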

See also: #64284

changelog: MWI: Add 1 minute configurable leeway to application-tunnel certificate renewals

Manual Test Plan

Recommended config:

# tbot config file generated by `configure` command
version: v2
storage:
  type: directory
  path: ./storage
  symlinks: try-secure
  acls: "off"
services:
  - type: application-tunnel
    name: application-tunnel-1
    listen: tcp://localhost:1234
    app_name: dumper
debug: false
join_uri: ...
credential_ttl: 1m30s
renewal_interval: 30s
oneshot: false
fips: false
leeway: 1m

Note the very short credential_ttl and renewal_interval. Since renewal_interval + leeway >= credential_ttl, it should trigger the internal identity expiry detection and force an early renewal, e.g.:

2026-03-06T22:12:39.843-07:00 WARN  The bot identity appears to be expired and will not be used to authenticate the identity renewal. If it is possible to rejoin, a new bot instance will be created. If this occurs repeatedly, ensure the local machine's clock is properly synchronized, the certificate TTL is adjusted to your environment, and that no external issues will prevent the bot from renewing its identity on schedule. now:2026-03-06T22:13:39.842-07:00 expiry:2026-03-07T05:13:39.000Z ttl:1m30s renewal_interval:30s leeway:1m0s identity/service.go:558
  • Internal identity: observe renewal failure with negative leeway
  • Internal identity: observe early renewal with positive leeway (if expiry detection is triggered, requires renewal interval config as noted above)
  • App tunnel: observe failures with negative leeway
  • App tunnel: observe early renewal with positive leeway
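The arithmetic behind the warning above can be checked by hand. This sketch assumes the `now` field in the log line already includes the leeway (it is one minute ahead of the log timestamp); `expiredWithLeeway` and `mustParse` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// expiredWithLeeway reports whether a certificate expiring at notAfter looks
// expired at local time now once leeway has been added to now.
func expiredWithLeeway(now, notAfter time.Time, leeway time.Duration) bool {
	return now.Add(leeway).After(notAfter)
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
// It exists only to keep this sketch short.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339Nano, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	leeway := time.Minute

	// Values copied from the log line. ASSUMPTION: "now" is leeway-adjusted,
	// so the local clock reading is recovered by subtracting the leeway.
	adjustedNow := mustParse("2026-03-06T22:13:39.842-07:00")
	expiry := mustParse("2026-03-07T05:13:39.000Z")
	local := adjustedNow.Add(-leeway)

	fmt.Println(expiry.Sub(local))                        // 59.158s
	fmt.Println(expiredWithLeeway(local, expiry, 0))      // false
	fmt.Println(expiredWithLeeway(local, expiry, leeway)) // true
}
```

Roughly 30s after issuing a 1m30s certificate there is just under a minute of validity left, so a 1m leeway is enough to make the identity appear expired.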

This adds some leeway to the application-tunnel service's certificate
expiration check.

For context, the app tunnel service does not follow Machine ID's
usual certificate renewal cycle and instead opts to renew
certificates just-in-time before completing a request if the
certificate has expired per the local clock.

Unfortunately, the local clock is not always accurate, and if the
certificate or underlying app session has already expired from the
server's perspective, client requests can still fail until the local
clock catches up and the certificate is refreshed.

To mitigate this, this change adds a `leeway` parameter to the
service, configurable via YAML, with a default value of 1m. This is
added to the current time when certificate validity is checked. This
means that, in the worst case, certificates will be refreshed too early
rather than too late.

See also: #64284
@timothyb89
Contributor Author

#57697 discusses a similar issue where connection reuse can result in an "invalid session" response with a 403 error which persists for the remainder of the connection. It is possible to reproduce this situation using this new leeway parameter with a negative value, e.g. leeway: -1m along with this script courtesy of ChatGPT:

https://gist.github.com/timothyb89/24539fdfb39c9d456a7e76deedab781f

This just makes repeated requests to an endpoint over the same connection. Once the certificate expires, we see the following:

$ python3 tools/app_tunnel_keepalive_repro.py --target localhost:1234 --interval 1 --duration 90
[...]
2026-03-05T04:03:22.391822Z req=58 status=200 bytes=1043 latency_ms=3.1 conn=ConnInfo(local="('127.0.0.1', 49301)", remote="('127.0.0.1', 1234)", fileno=5) body_prefix='GET / HTTP/1.1\r\\nHost: 127.0.0.1:35853\r\\nAccept-Encoding: identity\r\\nTeleport-Jwt-Assertion: foo'
2026-03-05T04:03:23.395282Z req=59 status=200 bytes=1043 latency_ms=3.0 conn=ConnInfo(local="('127.0.0.1', 49301)", remote="('127.0.0.1', 1234)", fileno=5) body_prefix='GET / HTTP/1.1\r\\nHost: 127.0.0.1:35853\r\\nAccept-Encoding: identity\r\\nTeleport-Jwt-Assertion: foo'
2026-03-05T04:03:24.400089Z req=60 status=200 bytes=1043 latency_ms=3.0 conn=ConnInfo(local="('127.0.0.1', 49301)", remote="('127.0.0.1', 1234)", fileno=5) body_prefix='GET / HTTP/1.1\r\\nHost: 127.0.0.1:35853\r\\nAccept-Encoding: identity\r\\nTeleport-Jwt-Assertion: foo'
2026-03-05T04:03:25.403445Z req=61 status=200 bytes=1043 latency_ms=3.2 conn=ConnInfo(local="('127.0.0.1', 49301)", remote="('127.0.0.1', 1234)", fileno=5) body_prefix='GET / HTTP/1.1\r\\nHost: 127.0.0.1:35853\r\\nAccept-Encoding: identity\r\\nTeleport-Jwt-Assertion: foo'
2026-03-05T04:03:26.411185Z req=62 status=403 bytes=16 latency_ms=6.9 conn=ConnInfo(local="('127.0.0.1', 49301)", remote="('127.0.0.1', 1234)", fileno=5) body_prefix='invalid session\\n'
2026-03-05T04:03:27.415869Z req=63 status=403 bytes=16 latency_ms=8.1 conn=ConnInfo(local="('127.0.0.1', 49301)", remote="('127.0.0.1', 1234)", fileno=5) body_prefix='invalid session\\n'
2026-03-05T04:03:28.413774Z req=64 status=403 bytes=16 latency_ms=3.2 conn=ConnInfo(local="('127.0.0.1', 49301)", remote="('127.0.0.1', 1234)", fileno=5) body_prefix='invalid session\\n'
2026-03-05T04:03:29.418353Z req=65 status=403 bytes=16 latency_ms=2.7 conn=ConnInfo(local="('127.0.0.1', 49301)", remote="('127.0.0.1', 1234)", fileno=5) body_prefix='invalid session\\n'
2026-03-05T04:03:30.423470Z req=66 status=403 bytes=16 latency_ms=3.2 conn=ConnInfo(local="('127.0.0.1', 49301)", remote="('127.0.0.1', 1234)", fileno=5) body_prefix='invalid session\\n'
2026-03-05T04:03:31.423029Z req=67 status=403 bytes=16 latency_ms=2.4 conn=ConnInfo(local="('127.0.0.1', 49301)", remote="('127.0.0.1', 1234)", fileno=5) body_prefix='invalid session\\n'

@timothyb89
Contributor Author

@codex please review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 731c5c1022

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@timothyb89 timothyb89 marked this pull request as ready for review March 5, 2026 05:54
This makes the leeway parameter global, and additionally uses it in
the main renewal loop (for the expired bot internal identity
detection) and in the database tunnel service.

Also adds some test coverage for app tunnel cert renewals.
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: fb74c464f7

@timothyb89 timothyb89 changed the title from "MWI: Add certificate expiry check leeway for app tunnel service" to "MWI: Add certificate expiry check leeway for tbot" on Mar 7, 2026
// VerifyCertificateExpiryWithLeeway checks the certificate's expiration status
// with leeway. The provided leeway value is added to the current time and can
// be used to account for potential client-side clock drift.
func VerifyCertificateExpiryWithLeeway(c *x509.Certificate, clock clockwork.Clock, leeway time.Duration) error {
Contributor

I worry a little about this function being misleading in a general-use utilities package. Adding to the local clock is the appropriate way of determining "leeway" as a client evaluating their own certificates, but, for a server examining a remote client's certificate, adding to the clock time actually does the opposite.

Contributor Author

I've tweaked the doc comment at least to give usage examples, including use of negative values for use from the server's perspective. Our semantics here are similar to go-jose's ValidateWithLeeway: https://github.com/go-jose/go-jose/blob/main/jwt/validation.go#L79 (though perhaps reversed).

I'd also argue we have very little reason to ever provide leeway on the server with certificate validation. In theory Teleport issued the cert, so we can probably trust our own timestamps; any client-side clock drift doesn't really matter.
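To make the sign convention concrete: a check that adds leeway to the clock passes as long as `now + leeway <= NotAfter`, which is the same as `now <= NotAfter - leeway`. The sketch below computes that effective deadline; `effectiveDeadline` and `exampleNotAfter` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// exampleNotAfter is an arbitrary certificate expiry used for illustration.
var exampleNotAfter = time.Date(2026, 3, 5, 4, 10, 0, 0, time.UTC)

// effectiveDeadline returns the last local-clock instant at which a check
// that adds leeway to the current time would still accept the certificate.
func effectiveDeadline(notAfter time.Time, leeway time.Duration) time.Time {
	return notAfter.Add(-leeway)
}

func main() {
	// Positive leeway (client checking its own cert): stop trusting it early.
	fmt.Println(effectiveDeadline(exampleNotAfter, time.Minute).Format(time.RFC3339)) // 2026-03-05T04:09:00Z

	// Negative leeway (server-style tolerance): accept it slightly past expiry.
	fmt.Println(effectiveDeadline(exampleNotAfter, -time.Minute).Format(time.RFC3339)) // 2026-03-05T04:11:00Z
}
```

Positive values make the check stricter and negative values make it more lenient, which is why a server wanting to tolerate client drift would need a negative value with these semantics.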

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 1fdc6ecfe9

Comment on lines +271 to +273
if conf.Leeway == 0 {
	conf.Leeway = DefaultLeeway
}

P2: Preserve explicit zero leeway from YAML config

This defaulting logic rewrites leeway: 0s to DefaultLeeway (1 minute), so operators cannot disable leeway even when they intentionally configure zero. That makes the setting non-representable in config and can force unwanted early-expiry behavior in renewal paths that use conf.Leeway (notably with short-lived credentials), instead of honoring an explicit opt-out.

Contributor Author

There's no trivial way to make 0 representable (thanks Go), so I think if any users want zero leeway, they can configure 1ns for an effectively zero value.
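For readers less familiar with the Go zero-value issue, here is a small sketch. `valueConfig` mirrors the defaulting pattern in the reviewed snippet, while `pointerConfig` is a hypothetical alternative (nil meaning "unset") that this PR does not adopt; all names here are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

const defaultLeeway = time.Minute

// valueConfig mirrors the pattern in the reviewed snippet: a plain
// time.Duration cannot distinguish "unset" from an explicit 0.
type valueConfig struct {
	Leeway time.Duration
}

func (c *valueConfig) applyDefaults() {
	if c.Leeway == 0 {
		c.Leeway = defaultLeeway
	}
}

// pointerConfig is a hypothetical alternative where nil means "unset".
// It costs extra nil handling everywhere the field is read.
type pointerConfig struct {
	Leeway *time.Duration
}

func (c pointerConfig) effectiveLeeway() time.Duration {
	if c.Leeway == nil {
		return defaultLeeway
	}
	return *c.Leeway
}

func main() {
	explicitZero := valueConfig{Leeway: 0}
	explicitZero.applyDefaults()
	fmt.Println(explicitZero.Leeway) // 1m0s: the explicit zero is lost

	workaround := valueConfig{Leeway: time.Nanosecond}
	workaround.applyDefaults()
	fmt.Println(workaround.Leeway) // 1ns: effectively zero, as suggested above

	zero := time.Duration(0)
	fmt.Println(pointerConfig{Leeway: &zero}.effectiveLeeway()) // 0s
	fmt.Println(pointerConfig{}.effectiveLeeway())              // 1m0s
}
```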

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 2c286caff6

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: a0e7d89797

}

leeway := cfg.Leeway
if leeway >= cfg.TTL {

P2: Bound identity leeway using issued cert lifetime

This clamp only compares leeway to cfg.TTL, but non-renewable bot certs can be issued with a shorter lifetime than requested (the join/auth path caps requested expiries, e.g. in generateInitialBotCerts and generateUserCert), and renewIdentity later checks expiry from the actual cert via facade.Expiry(). In that case, a default 1m leeway can still exceed the real cert lifetime, so now := time.Now().Add(leeway) treats fresh identities as expired and repeatedly falls back to botIdentityFromToken(..., nil), causing unnecessary re-registration attempts or renewal failures in constrained join environments.

Contributor Author

I've added a comparison against the returned cert's TTL and a note that further accounting for TTL minutiae is out of scope; the juice is not worth the squeeze to account for sub-minute differences when we're already explicitly suspicious of the local clock.
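A minimal sketch of a cap that considers both the configured TTL and the issued certificate's lifetime. The snippet under review only shows the comparison, so the fallback used here (dropping to zero leeway) is an assumption, and `clampLeeway` and its parameters are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// clampLeeway returns the leeway to use and whether it had to be capped.
// ASSUMPTION: when the leeway would cover the whole certificate lifetime,
// this sketch disables it (returns 0); the real fallback may differ.
func clampLeeway(leeway, requestedTTL, issuedTTL time.Duration) (time.Duration, bool) {
	limit := requestedTTL
	if issuedTTL < limit {
		// The server may issue a shorter certificate than was requested.
		limit = issuedTTL
	}
	if leeway >= limit {
		return 0, true
	}
	return leeway, false
}

func main() {
	// Requested 1h, but the server only issued 45s: a 1m leeway is capped.
	fmt.Println(clampLeeway(time.Minute, time.Hour, 45*time.Second)) // 0s true

	// Issued lifetime matches the request: the leeway is used as configured.
	fmt.Println(clampLeeway(time.Minute, time.Hour, time.Hour)) // 1m0s false
}
```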

Factor in actual cert TTL in the leeway cap.
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: d03a217a77

Comment on lines +202 to +204
leeway := s.leeway
if leeway >= effectiveTTL {
	s.log.WarnContext(ctx,

P2: Recompute tunnel leeway after each certificate reissue

leeway is derived once from the initial cert and then captured for the lifetime of OnNewConnection, but after lp.SetCert(*cert) the code never re-clamps it against the newly issued certificate's validity window. If the auth server later starts returning shorter app certs (for example after a role/session TTL reduction), a previously safe leeway can become greater than the new cert lifetime, making each fresh cert fail CheckCertExpiryWithLeeway and forcing reissue on every new connection until restart.

Contributor Author

Noted, but IMO not worth fixing. If you configure huge leeway values (or tiny session limits) unexpected things can and will happen; otherwise we're talking 1 minute of leeway by default. If that somehow exceeds your actual cert TTL in a production environment (or the server resets your TTL to less than that) you have much larger problems to worry about.

And maybe more critically, I don't want to write the test for the "proper" fix for this 🙂
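For reference, the failure mode Codex describes can be shown with a toy model: if a fixed leeway exceeds the lifetime of newly issued certificates, every new connection sees an "expired" certificate and triggers a reissue. The `tunnel` type and `simulate` function are hypothetical and do not reflect tbot's actual types.

```go
package main

import (
	"fmt"
	"time"
)

// tunnel is a toy model of a just-in-time renewing tunnel.
type tunnel struct {
	leeway   time.Duration // fixed at startup, never re-clamped
	certTTL  time.Duration // lifetime of each certificate the server issues
	notAfter time.Time     // expiry of the currently installed certificate
	reissues int
}

// onNewConnection reissues the certificate if it looks expired once the
// leeway has been added to the current time.
func (t *tunnel) onNewConnection(now time.Time) {
	if now.Add(t.leeway).After(t.notAfter) {
		t.notAfter = now.Add(t.certTTL)
		t.reissues++
	}
}

// simulate opens one connection per second for n seconds and returns the
// number of certificate reissues that were triggered.
func simulate(leeway, certTTL time.Duration, n int) int {
	start := time.Date(2026, 3, 5, 4, 0, 0, 0, time.UTC)
	t := &tunnel{leeway: leeway, certTTL: certTTL, notAfter: start.Add(certTTL)}
	for i := 0; i < n; i++ {
		t.onNewConnection(start.Add(time.Duration(i) * time.Second))
	}
	return t.reissues
}

func main() {
	// Leeway (1m) exceeds the certificate lifetime (30s): reissue every time.
	fmt.Println(simulate(time.Minute, 30*time.Second, 10)) // 10

	// Leeway (1m) well below the certificate lifetime (10m): no reissues.
	fmt.Println(simulate(time.Minute, 10*time.Minute, 10)) // 0
}
```

This only happens when the leeway is at least as long as the certificate lifetime, which matches the point above that it takes an unusual configuration to hit.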

Comment on lines +154 to +156
leeway := s.leeway
if leeway >= realTTL {
	s.log.WarnContext(ctx,

P2: Recalculate DB tunnel leeway when certificate lifetime changes

The DB tunnel computes and caps leeway once from the startup certificate and then reuses that captured value for all future expiry checks, even after lp.SetCert(*cert) installs newly issued certs. If subsequent database certs are shorter-lived than the initial one (e.g., cluster TTL policy is tightened while the process is running), the fixed leeway can exceed the new lifetime and CheckDBCertWithLeeway will treat each new cert as expired, causing continuous per-connection reissuance.

@timothyb89 timothyb89 enabled auto-merge March 13, 2026 00:46
@timothyb89 timothyb89 added this pull request to the merge queue Mar 13, 2026
Merged via the queue into master with commit 5a22c83 Mar 13, 2026
44 checks passed
@timothyb89 timothyb89 deleted the timothyb89/app-tunnel-leeway branch March 13, 2026 01:08
@backport-bot-workflows
Contributor

@timothyb89 See the table below for backport results.

Branch Result
branch/v18 Create PR
