plugins/rest: various changes re: TLS, *http.Client caching #8376

Open
srenatus wants to merge 10 commits into open-policy-agent:main from srenatus:sr/rest-auth-tls-reload

@srenatus
Contributor

@srenatus srenatus commented Feb 25, 2026

This hasn't been asked for, but it seems like a logical addition. When the modification time (a filesystem attribute) of the certificate or key file changes, we'll reload them.

An alternative I had considered was comparing a hash of the file contents. We can do that too, but it's a little more computationally expensive, and we'd do it on every TLS handshake. In that case, I think we should add another configurable (or a good default) for "how often to compare the hash". The mod time approach seemed a lot simpler in comparison.

Another alternative is using a path watcher goroutine, but due to how our rest clients work, there's no natural location for a "Close()" call that would allow us to clean up the routine. So I didn't take this route.
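
A minimal sketch of the mod-time approach described above, assuming the check runs inside `tls.Config.GetClientCertificate` (a commit message further down mentions that hook). The `certReloader` type, its fields, and `newClientTLSConfig` are hypothetical names for illustration, not the code in this PR.

```go
package main

import (
	"crypto/tls"
	"os"
	"sync"
	"time"
)

// certReloader is a hypothetical helper. It caches a parsed key pair and
// re-parses it only when the mtime of the cert or key file has changed.
type certReloader struct {
	certPath, keyPath string

	mu          sync.Mutex
	cert        *tls.Certificate
	certModTime time.Time
	keyModTime  time.Time
}

// needsReload reports whether nothing is cached yet or either observed mtime
// differs from the cached one. The caller must hold r.mu.
func (r *certReloader) needsReload(certMod, keyMod time.Time) bool {
	return r.cert == nil || !certMod.Equal(r.certModTime) || !keyMod.Equal(r.keyModTime)
}

// GetClientCertificate has the signature of tls.Config.GetClientCertificate,
// so it runs when a server requests a client certificate during a handshake.
func (r *certReloader) GetClientCertificate(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
	certInfo, err := os.Stat(r.certPath)
	if err != nil {
		return nil, err
	}
	keyInfo, err := os.Stat(r.keyPath)
	if err != nil {
		return nil, err
	}

	r.mu.Lock()
	defer r.mu.Unlock()

	if r.needsReload(certInfo.ModTime(), keyInfo.ModTime()) {
		cert, err := tls.LoadX509KeyPair(r.certPath, r.keyPath)
		if err != nil {
			return nil, err
		}
		r.cert = &cert
		r.certModTime = certInfo.ModTime()
		r.keyModTime = keyInfo.ModTime()
	}
	return r.cert, nil
}

// newClientTLSConfig wires the reloader into a client-side TLS config.
func newClientTLSConfig(certPath, keyPath string) *tls.Config {
	r := &certReloader{certPath: certPath, keyPath: keyPath}
	return &tls.Config{
		MinVersion:           tls.VersionTLS12,
		GetClientCertificate: r.GetClientCertificate,
	}
}
```

Because the check happens lazily per handshake, this shape needs no background goroutine and therefore no `Close()` hook, which is the property the description above is after.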

🧹 I've moved the auth TLS code into its own file, but in a separate commit, for easier review of the diff.

@netlify

netlify bot commented Feb 25, 2026

Deploy Preview for openpolicyagent ready!

Name Link
🔨 Latest commit 3b8ce68
🔍 Latest deploy log https://app.netlify.com/projects/openpolicyagent/deploys/699ff8705a83da0008970244
😎 Deploy Preview https://deploy-preview-8376--openpolicyagent.netlify.app

@srenatus srenatus force-pushed the sr/rest-auth-tls-reload branch 2 times, most recently from 23b2d64 to 0d428c8 Compare February 25, 2026 12:49
Member

@philipaconrad philipaconrad left a comment

Having looked through each commit, I like the general direction of this PR a lot, and didn't see anything obviously amiss anywhere. 🙂

My main points of concern when reviewing were around the locking (which I think is correct), and around the new branching in behavior for the plugin (cert_refresh_duration_seconds > 0 and cert_refresh_duration_seconds == 0 cases).

As far as I can tell by just eyeballing it, it looks like this is all implemented correctly, and it makes sense why we'd have different behaviors. Overall, nicely done!

When you're ready to take this PR out of Draft state, feel free to ping me, and I'll give it a more in-depth review.

@srenatus srenatus force-pushed the sr/rest-auth-tls-reload branch from 0d428c8 to 3b8ce68 Compare February 26, 2026 07:38
@srenatus
Contributor Author

@philipaconrad thanks for the early review! I had seen it just now -- after having decided to switch to a simpler mod time based approach. Can you have another look? The idea here is that we have a new feature: if the certs are rotated, they'll be reloaded, without any new config or twist. The rationale being that either you don't touch your cert files, then nothing will be different; or you do and then you probably appreciate having OPA react accordingly by reloading them 😅 What do you think about that?

@srenatus srenatus marked this pull request as ready for review February 26, 2026 08:22
Member

@philipaconrad philipaconrad left a comment

I'm marking Approve, because I think this PR is in a good state to ship (possibly with a small docs update; see comments).

I think the modification time detection approach will work well, and the edge cases are fairly niche. I especially enjoy that it keeps the plugin config from sprouting additional fields.

I can't think of a good reason for mtime to update continuously for a TLS cert, but someone using touch (which updates mtime on files that already exist) in a loop could trigger the aggressive reloading behavior if the OPA instance is under load. 🤔


As a point of comparison to these changes, the localfile data source plugin in EOPA does something similar to your initial approach: it periodically polls the file on disk, hashes its contents, and then updates the file buffer in memory only if the hash is different.

I think we'd planned to use file modtimes over there as a heuristic to reduce disk reads, although we'd probably still hard-reload from disk every N-many polls, just to make sure the hashed contents weren't different.

Comment on lines +104 to +105
certModTime := certInfo.ModTime()
keyModTime := keyInfo.ModTime()
Member

So, I did some googling around, and there are a few specific cases where even on a POSIX-compliant system you might not get a correct mtime update, even after a write that changes the file contents:

  • NFS / Samba filesystems (updates may be delayed or set in the server's timezone, among other weirdness from being on a network)
  • FAT family filesystems (mtime has a 2 second resolution. Writes happening within 2 seconds of each other won't be detected.)
    • See the 0x0E entry in the linked table for a description of the time format used for both create/modify time records.
  • mmap()'d files may have delayed updates.
  • Device and virtual files (/dev, /proc, and so on) may not get mtime updates when they change.

I think mtime detection should work fine, 99%+ of the time. In all but the NFS-related cases above, the timestamp will usually only be off by a few seconds at most, which won't affect the average user.

I feel that we should still document somewhere that we're relying on mtime as the refresh trigger, or someone will get surprised by it. 🤔

Maybe adding a sentence or two to the Configuration page entry for the client TLS config items might do the job?

@srenatus srenatus marked this pull request as draft February 27, 2026 14:22
@srenatus
Contributor Author

Taken back to "draft" status. I'd like to think about this some more, thanks for the input @philipaconrad, I appreciate it a lot! 🔍 👀

@srenatus srenatus force-pushed the sr/rest-auth-tls-reload branch 3 times, most recently from 8ee76a8 to 58adc22 Compare March 4, 2026 10:30
@srenatus srenatus marked this pull request as ready for review March 4, 2026 10:30
@netlify

netlify bot commented Mar 4, 2026

Deploy Preview for openpolicyagent ready!

Name Link
🔨 Latest commit 420da4d
🔍 Latest deploy log https://app.netlify.com/projects/openpolicyagent/deploys/69a808eab8eebc00085aea13
😎 Deploy Preview https://deploy-preview-8376--openpolicyagent.netlify.app

@srenatus
Contributor Author

srenatus commented Mar 4, 2026

OK, I took a bit of a detour here, went on the scenic route, and fixed whatever I found.

  1. We're now caching the http.Client. Comments about this for custom plugins were added to the CHANGELOG. Existing in-tree plugins should be OK (as of this PR).
  2. There's a re-read interval config for our TLS auth plugin. By default it'll re-read on every handshake, more or less doing what it did before. But it'll check the hash sums first to avoid parsing certs when it's unnecessary.
  3. Min TLS versions should now be propagated from the server to the REST plugins. If nothing was set, we'll enforce TLS v1.2.
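
Item 2 can be sketched roughly as follows: a gate that skips the file read while inside the configured interval, and otherwise compares a SHA-256 sum so that the certificate is only re-parsed when its bytes actually changed. The `hashGate` type, its fields, and the `changed` method are hypothetical names; the PR's real config key and types are not shown in this thread.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"sync"
	"time"
)

// hashGate is a hypothetical helper that decides whether cached, parsed
// material must be rebuilt from the bytes on disk.
type hashGate struct {
	interval time.Duration // 0 means: read and compare on every call

	mu        sync.Mutex
	lastCheck time.Time
	lastSum   []byte
}

// changed returns true only when the content was read and its hash differs
// from the previously seen one. Within the interval it returns false without
// calling read at all.
func (g *hashGate) changed(now time.Time, read func() ([]byte, error)) (bool, error) {
	g.mu.Lock()
	defer g.mu.Unlock()

	if g.lastSum != nil && g.interval > 0 && now.Sub(g.lastCheck) < g.interval {
		return false, nil // too soon, keep using the cached certificate
	}

	data, err := read()
	if err != nil {
		return false, err
	}
	g.lastCheck = now

	sum := sha256.Sum256(data)
	if bytes.Equal(g.lastSum, sum[:]) {
		return false, nil // same bytes, no need to parse again
	}
	g.lastSum = sum[:]
	return true, nil
}
```

With `interval` left at zero, every call reads the file and compares hashes, which corresponds to the default of re-reading on every handshake while still avoiding unnecessary parsing.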

@netlify

netlify bot commented Mar 4, 2026

Deploy Preview for openpolicyagent ready!

Name Link
🔨 Latest commit 2423816
🔍 Latest deploy log https://app.netlify.com/projects/openpolicyagent/deploys/69b293e53405e20008e56955
😎 Deploy Preview https://deploy-preview-8376--openpolicyagent.netlify.app

@srenatus srenatus force-pushed the sr/rest-auth-tls-reload branch from 58adc22 to 05d309c Compare March 4, 2026 10:35
@srenatus srenatus changed the title from "plugins/rest: allow reloading certs from disk for mTLS client auth" to "plugins/rest: various changes re: TLS, *http.Client caching" Mar 4, 2026

const (
// DefaultMinTLSVersion is the minimum TLS version used by OPA server and REST clients
DefaultMinTLSVersion = tls.VersionTLS12
Contributor

This is declared in config, but it's only used in the rest package for now. Do we have plans to use it elsewhere that would make it make sense to keep it here?

The defaulting also seems to happen in the rest package.

Contributor Author

The server package uses it, too. We could move it... but since it's a constant, I didn't think it mattered too much 🤔
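
For illustration, the defaulting discussed in this thread might look like the sketch below, assuming an unset version is represented as zero. `effectiveMinTLSVersion` and `clientTLSConfig` are hypothetical helpers; only the constant's value (`tls.VersionTLS12`) comes from the diff above.

```go
package main

import "crypto/tls"

// defaultMinTLSVersion mirrors the DefaultMinTLSVersion constant from the diff.
const defaultMinTLSVersion = tls.VersionTLS12

// effectiveMinTLSVersion falls back to the default when nothing was configured.
// Treating zero as "unset" is an assumption of this sketch.
func effectiveMinTLSVersion(configured uint16) uint16 {
	if configured == 0 {
		return defaultMinTLSVersion
	}
	return configured
}

// clientTLSConfig builds a client config that honors the server's minimum
// version, so the server and REST clients enforce the same floor.
func clientTLSConfig(serverMin uint16) *tls.Config {
	return &tls.Config{MinVersion: effectiveMinTLSVersion(serverMin)}
}
```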

Contributor Author

// nolint: staticcheck // We don't want to forbid users from using this encryption.
if x509.IsEncryptedPEMBlock(block) {
if ap.PrivateKeyPassphrase == "" {
return nil, errors.New("client certificate passphrase is needed, because the certificate is password encrypted")
Contributor

is it not PK that's encrypted here? rather than the cert.

Contributor Author

This is moved code that I hadn't looked at closely, but I think you're right. Will update!


cert, err := tls.X509KeyPair(certPEMBlock, keyPEMBlock)
if err != nil {
return nil, err
Contributor

Suggested change
- return nil, err
+ return nil, fmt.Errorf("failed to parse public/private key pair: %v", err)

keys map[string]*keys.Config
logger logging.Logger
minTLSVersion uint16
cipherSuites *[]uint16
Contributor

Some of this might have been my handiwork, but we are getting a number of top level TLS configs here. TLS, AllowInsecureTLS, minTLSVersion etc.

Contributor Author

We're sharing one struct for both unmarshalling config and for storing derivatives of it... I think that's part of why it's like it is.

@srenatus srenatus force-pushed the sr/rest-auth-tls-reload branch 2 times, most recently from 7784d0b to f388b1b Compare March 5, 2026 12:41
@alex60217101990
Contributor

Could we consider using fasthttp instead of net/http? The standard library is great, but fasthttp would be more efficient for high-load scenarios.

@srenatus
Contributor Author

srenatus commented Mar 6, 2026

> Could we consider using fasthttp instead of net/http? The standard library is great, but fasthttp would be more efficient for high-load scenarios.

This would be tricky, as it's impossible to do without breaking the interfaces for custom auth plugin implementations. So we'd need a very convincing case to pull this off, like a real-world problem with decision logs (the only thing that would be affected in high-load scenarios) that would be solved by swapping net/http for fasthttp.

@alex60217101990
Contributor

With the PR caching *http.Client per rest.Client instance, each OPA service still gets its own transport (and therefore its own connection pool). If multiple services point to the same host (e.g., a single backend serving both bundles and decision log ingestion), connections aren't shared between them.

Consider allowing an opt-in shared *http.Transport keyed by (host, TLS fingerprint) so that services targeting the same endpoint can reuse the underlying TCP/TLS connections. The *http.Client can remain per-service for auth isolation, but the transport underneath could be shared.

This matters in practice: a typical deployment might configure 3-5 services (bundles, decision logs, status) all hitting the same control plane. Each one maintaining its own idle connection pool means 1-1.5x the memory for connection state and TLS session data that could otherwise be amortized.
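
A rough sketch of what such an opt-in shared transport could look like. Everything here is hypothetical: `transportPool`, `transportKey`, and the idea that the caller supplies a precomputed TLS fingerprint string are assumptions for illustration, not part of this PR.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"sync"
)

// transportKey identifies connections that are safe to share: same host and
// same TLS settings (represented here by an opaque fingerprint string).
type transportKey struct {
	host           string
	tlsFingerprint string
}

// transportPool hands out one *http.Transport per key.
type transportPool struct {
	mu         sync.Mutex
	transports map[transportKey]*http.Transport
}

// get returns the shared transport for the key, creating it on first use.
func (p *transportPool) get(host, fingerprint string, cfg *tls.Config) *http.Transport {
	p.mu.Lock()
	defer p.mu.Unlock()

	key := transportKey{host: host, tlsFingerprint: fingerprint}
	if tr, ok := p.transports[key]; ok {
		return tr
	}
	// Assumes http.DefaultTransport has not been replaced by another type.
	tr := http.DefaultTransport.(*http.Transport).Clone()
	tr.TLSClientConfig = cfg
	if p.transports == nil {
		p.transports = make(map[transportKey]*http.Transport)
	}
	p.transports[key] = tr
	return tr
}

// clientFor returns a fresh *http.Client per service (keeping auth isolated)
// on top of the shared transport and its connection pool.
func (p *transportPool) clientFor(host, fingerprint string, cfg *tls.Config) *http.Client {
	return &http.Client{Transport: p.get(host, fingerprint, cfg)}
}
```

Two services that call `clientFor` with the same host and fingerprint get distinct clients but share idle connections through the common transport.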

@srenatus
Contributor Author

srenatus commented Mar 6, 2026

This sounds good on paper, but the extra complexity needs to be warranted. Let's keep it in mind in case someone brings up a case where measurements reveal this as a bottleneck.

@alex60217101990
Contributor

oauth2ClientCredentialsAuthPlugin.requestToken() creates a throwaway *http.Client on every token refresh:

// v1/plugins/rest/auth.go:717

client := DefaultRoundTripperClient(&tls.Config{InsecureSkipVerify: ap.tlsSkipVerify}, 10)

response, err := client.Do(r)

Even with the PR caching the main *http.Client, this path is untouched. Every token refresh allocates a new transport, does a TLS handshake from scratch, and then discards everything. For short-lived tokens or high-frequency refreshes, this defeats connection reuse entirely.

The fix is straightforward: cache the token endpoint client alongside the main client, or reuse the main client for token requests (since Prepare() is where auth headers are set, using the main client for token fetches won't cause recursion as long as the token endpoint request skips Prepare).
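
The first option (caching the token endpoint client) could be sketched like this. `tokenFetcher` is a hypothetical stand-in for the plugin type, and reading the `10` in the snippet above as a timeout in seconds is an assumption.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"sync"
	"time"
)

// tokenFetcher is a hypothetical stand-in for an OAuth2 client-credentials
// plugin that builds its token-endpoint client once and then reuses it.
type tokenFetcher struct {
	tlsSkipVerify bool

	once   sync.Once
	client *http.Client
}

// tokenClient lazily creates the client on first use. Later token refreshes
// reuse the same transport, so idle connections and TLS sessions can be
// reused instead of being thrown away after every request.
func (f *tokenFetcher) tokenClient() *http.Client {
	f.once.Do(func() {
		// Assumes http.DefaultTransport has not been replaced by another type.
		tr := http.DefaultTransport.(*http.Transport).Clone()
		tr.TLSClientConfig = &tls.Config{
			MinVersion:         tls.VersionTLS12,
			InsecureSkipVerify: f.tlsSkipVerify, // mirrors the snippet above
		}
		f.client = &http.Client{
			Transport: tr,
			Timeout:   10 * time.Second, // assumed meaning of the "10" argument
		}
	})
	return f.client
}
```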

@alex60217101990
Contributor

Decision logs are uploaded as gzip-compressed chunks (default limit: 32KB, max: ~4GB). The current DefaultRoundTripperClient inherits Go's default transport settings, which are conservative:

  • WriteBufferSize: 4KB (default) — for a 32KB chunk this means multiple write syscalls per upload

  • ReadBufferSize: 4KB (default)

  • MaxIdleConnsPerHost: 2 (default) — if decision logs back up and multiple uploads happen concurrently, only 2 connections are reused

For large batch uploads (which is common in high-throughput deployments), it would help to expose transport-level tunables in the service config, or at least set better defaults:

tr := http.DefaultTransport.(*http.Transport).Clone()

tr.WriteBufferSize = 32 * 1024  // match the default upload chunk size

tr.ReadBufferSize = 8 * 1024

tr.MaxIdleConnsPerHost = 4       // allow more connection reuse under load

This is especially relevant now that the client is cached — the transport settings persist for the lifetime of the service, so getting them right matters more than before.

@srenatus srenatus requested a review from charlieegan3 March 9, 2026 13:07
@srenatus srenatus requested a review from philipaconrad March 9, 2026 13:07
@srenatus srenatus force-pushed the sr/rest-auth-tls-reload branch from f388b1b to 34cdf08 Compare March 9, 2026 13:09
srenatus added 10 commits March 12, 2026 11:22
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
This seems lighter than a hash calculation, so we can do it on every
TLS connection init.

It would lead to problems when the certificate's modtime changes frequently
without the actual cert contents changing, but I cannot imagine when
that would be the case...

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
This will require further changes to cert TLS and token auth methods to
stay compatible with the previous behaviour.

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Defaulting to re-reading all the time, more or less like we did before.

(I write "more or less" because we now do it in `GetClientCertificate()`.)

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
...as configured with the server.

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
@srenatus srenatus force-pushed the sr/rest-auth-tls-reload branch from 34cdf08 to 2423816 Compare March 12, 2026 10:22
4 participants