plugins/rest: various changes re: TLS, *http.Client caching#8376
plugins/rest: various changes re: TLS, *http.Client caching#8376srenatus wants to merge 10 commits intoopen-policy-agent:mainfrom
Conversation
✅ Deploy Preview for openpolicyagent ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
23b2d64 to
0d428c8
Compare
philipaconrad
left a comment
There was a problem hiding this comment.
Having looked through each commit, I like the general direction of this PR a lot, and didn't see anything obviously amiss anywhere. 🙂
My main points of concern when reviewing were around the locking (which I think is correct), and around the new branching in behavior for the plugin (cert_refresh_duration_seconds > 0 and cert_refresh_duration_seconds == 0 cases).
As far as I can tell by just eyeballing it, it looks like this is all implemented correctly, and it makes sense why we'd have different behaviors. Overall, nicely done!
When you're ready to take this PR out of Draft state, feel free to ping me, and I'll give it a more in-depth review.
0d428c8 to
3b8ce68
Compare
|
@philipaconrad thanks for the early review! I had seen it just now -- after having decided to switch to a simpler mod time based approach. Can you have another look? The idea here is that we have a new feature: if the certs are rotated, they'll be reloaded, without any new config or twist. The rationale being that either you don't touch your cert files, then nothing will be different; or you do and then you probably appreciate having OPA react accordingly by reloading them 😅 What do you think about that? |
There was a problem hiding this comment.
I'm marking Approve, because I think this PR is in a good state to ship (possibly with a small docs update; see comments).
I think the modification time detection approach will work well, and the edge cases are fairly niche. I especially enjoy that it keeps the plugin config from sprouting additional fields.
I can't think of a good reason for mtime to update continuously for a TLS cert, but someone using touch (which updates mtime on files that already exist) in a loop could trigger the aggressive reloading behavior if the OPA instance is under load. 🤔
As a point of comparison to these changes, the localfile data source plugin in EOPA does something similar to your initial approach: it periodically polls the file on disk, hashes its contents, and then updates the file buffer in memory only if the hash is different.
I think we'd planned to use file modtimes over there as a heuristic to reduce disk reads, although we'd probably still hard-reload from disk every N-many polls, just to make sure the hashed contents weren't different.
v1/plugins/rest/auth_tls.go
Outdated
| certModTime := certInfo.ModTime() | ||
| keyModTime := keyInfo.ModTime() |
There was a problem hiding this comment.
So, I did some googling around, and there are a few specific cases where even on a POSIX-compliant system you might not get a correct mtime update, even after a write that changes the file contents:
- NFS / Samba filesystems (updates may be delayed or set in the server's timezone, among other weirdness from being on a network)
- FAT family filesystems (
mtimehas a 2 second resolution. Writes happening within 2 seconds of each other won't be detected.)- See the
0x0Eentry in the linked table for a description of the time format used for both create/modify time records.
- See the
mmap()'d files may have delayed updates.- Device and virtual files (
/dev,/proc, and so on) may not getmtimeupdates when they change.
I think mtime detection should work fine, 99%+ of the time. In all but the NFS-related cases above, the timestamp will usually only be off by a few seconds at most, which won't affect the average user.
I feel that we should still document somewhere that we're relying on mtime as the refresh trigger, or someone will get surprised by it. 🤔
Maybe adding a sentence or two to the Configuration page entry for the client TLS config items might do the job?
|
Taken back to "draft" status. I'd like to think about this some more, thanks for the input @philipaconrad, I appreciate it a lot! 🔍 👀 |
8ee76a8 to
58adc22
Compare
✅ Deploy Preview for openpolicyagent ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
OK I took a bit of a detour here, went on the scenic route and fixed what whatever I found.
|
✅ Deploy Preview for openpolicyagent ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
58adc22 to
05d309c
Compare
|
|
||
| const ( | ||
| // DefaultMinTLSVersion is the minimum TLS version used by OPA server and REST clients | ||
| DefaultMinTLSVersion = tls.VersionTLS12 |
There was a problem hiding this comment.
This is declared in config but it only used in the rest package, for now. Do we have plans to use this elsewhere that might make it make sense to keep here?
The defaulting also seems to happen in the rest package.
There was a problem hiding this comment.
The server package uses it, too. We could move it... but since it's a constant, I didn't think it mattered too much 🤔
There was a problem hiding this comment.
v1/plugins/rest/auth_tls.go
Outdated
| // nolint: staticcheck // We don't want to forbid users from using this encryption. | ||
| if x509.IsEncryptedPEMBlock(block) { | ||
| if ap.PrivateKeyPassphrase == "" { | ||
| return nil, errors.New("client certificate passphrase is needed, because the certificate is password encrypted") |
There was a problem hiding this comment.
is it not PK that's encrypted here? rather than the cert.
There was a problem hiding this comment.
Moved code I hadn't looked at this closely, but you are right, I think. Will update!
v1/plugins/rest/auth_tls.go
Outdated
|
|
||
| cert, err := tls.X509KeyPair(certPEMBlock, keyPEMBlock) | ||
| if err != nil { | ||
| return nil, err |
There was a problem hiding this comment.
| return nil, err | |
| return nil, fmt.Errorf("failed to parse public/private key pair: %v", err) |
| keys map[string]*keys.Config | ||
| logger logging.Logger | ||
| minTLSVersion uint16 | ||
| cipherSuites *[]uint16 |
There was a problem hiding this comment.
Some of this might have been my handiwork, but we are getting a number of top level TLS configs here. TLS, AllowInsecureTLS, minTLSVersion etc.
There was a problem hiding this comment.
We're sharing one struct for both unmarshalling config and for storing derivates of it... I think that's part of why it's like it is.
7784d0b to
f388b1b
Compare
|
Could we consider using fasthttp instead of net/http? The standard library is great, but fasthttp would be more efficient for high-load scenarios. |
This would be tricky, as it's impossible to do without breaking the interfaces for custom auth plugin implementations. So, we'd need a very convincing case to pull this off -- like a real world problem with decision logs (the only thing that would be affected in high-load scenarios) that would be solved by swapping out |
|
With the PR caching Consider allowing an opt-in shared This matters in practice: a typical deployment might configure 3-5 services (bundles, decision logs, status) all hitting the same control plane. Each one maintaining its own idle connection pool means 1-1,5x the memory for connection state and TLS session data that could otherwise be amortized. |
|
This sounds good on paper, but the extra complexity needs to be warranted. Let's keep it in mind in case someone brings up a case where measurements reveal this as a bottleneck. |
|
// v1/plugins/rest/auth.go:717
client := DefaultRoundTripperClient(&tls.Config{InsecureSkipVerify: ap.tlsSkipVerify}, 10)
response, err := client.Do(r)Even with the PR caching the main The fix is straightforward: cache the token endpoint client alongside the main client, or reuse the main client for token requests (since |
|
Decision logs are uploaded as gzip-compressed chunks (default limit: 32KB, max: ~4GB). The current
For large batch uploads (which is common in high-throughput deployments), it would help to expose transport-level tunables in the service config, or at least set better defaults: tr := http.DefaultTransport.(*http.Transport).Clone()
tr.WriteBufferSize = 32 * 1024 // match the default upload chunk size
tr.ReadBufferSize = 8 * 1024
tr.MaxIdleConnsPerHost = 4 // allow more connection reuse under loadThis is especially relevant now that the client is cached — the transport settings persist for the lifetime of the service, so getting them right matters more than before. |
f388b1b to
34cdf08
Compare
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
This seems lighter than a hash calculcation, so we can do it on every TLS connection init. It would lead to problems when certificate's modtime changes frequently without the actual cert contents changing, but I cannot imagine when that would be the case... Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
This will require further changes to cert TLS and token auth methods to stay compatible with the previous behaviour. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Defaulting to re-reading all the time, more or less like we did before. (I write "more or less" because we now do it in `GetClientCertificate()`.) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
...as configured with the server. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
34cdf08 to
2423816
Compare
This hasn't been asked for, but it seems like a logical addition. When there's a change to the certificate or keys modification time (an FS attribute), we'll reload them.
An alternative I had considered was checking the hash of the times, which we can also do, but it'll be a little more computationally expensive, and we'd do this for every TLS handshake. I think we should then add another configurable (or good default) for "how often to compare the hash". The mod time seemed a lot simpler in comparison.
Another alternative is using a path watcher goroutine, but due to how our rest clients work, there's no natural location for a "Close()" call that would allow us to clean up the routine. So I didn't take this route.
🧹 I've moved the auth TLS code into its own file, but in a separate commit, for easier review of the diff.