fix: Refresh client cert when CA rotation#934
Conversation
6a61292 to
20c6162
Compare
28e50db to
3a7c320
Compare
20c6162 to
bb306f8
Compare
3a7c320 to
d237673
Compare
bb306f8 to
0ae60ad
Compare
d237673 to
1d5dd2e
Compare
95d468c to
fb0dcbf
Compare
| key := createKey(i) | ||
| if cachedC, ok := d.cache[key]; ok && cachedC == c { | ||
| delete(d.cache, key) | ||
| } |
There was a problem hiding this comment.
nit: can we add a comment around this logic?
What case is this for?
There was a problem hiding this comment.
Comments updated. removeCached() may be called multiple times for the same monitoredCache instance, for example if several sockets are open and they all error at the same time.
The logic here will ensure that the cache entry is safely removed from the cache. This accounts for the case where this instance was already removed, and for the case where it was removed and replaced with a new instance.
dialer_test.go
Outdated
| // Note: this succeeds with lazy refresh, but fails with lazy. | ||
| // because dialer.ForceRefresh does not block connections while the | ||
| // refresh is in progress. |
There was a problem hiding this comment.
I think there is a typo in this comment...?
There was a problem hiding this comment.
+1 This looks like a dev-artifact that isn't true anymore?
There was a problem hiding this comment.
Fixed. Deleted the comment - it was a dev artifact.
dialer.go
Outdated
| }, d.dialerID, cn.String()) | ||
| }, | ||
| func(err error) { | ||
| // ignore EOF |
There was a problem hiding this comment.
Could we expand this comment on why we're ignoring EOF? I think EOFs are the server sending an RST packet, but that's not obvious knowledge for future maintainers.
dialer.go
Outdated
| trace.RecordOpenConnections(context.Background(), int64(n), d.dialerID, cn.String()) | ||
| }, d.dialerID, cn.String()) | ||
| }, | ||
| func(err error) { |
There was a problem hiding this comment.
whitespace nit: putting this up a line would make it more obvious it's an argument at a glance.
Alternatively, you could pull this anonymous function out into a variable:
errFunc := func(err error) {
// ...
}
iConn := newInstrumentedConn(/*...*/, errFunc)| } | ||
| func (d *Dialer) isTLSError(err error) bool { | ||
| if nErr, ok := err.(net.Error); ok { | ||
| return !nErr.Timeout() && // it's a permanent net error |
There was a problem hiding this comment.
Have you manually tested this logic as part of doing a cert rotation to confirm it's working as intended?
There was a problem hiding this comment.
Yes. Also the new tests in dialer_test.go create genuine TLS certificate errors, so this is covered by unit tests too.
There was a problem hiding this comment.
Might be worth adding an e2e test for this, but otherwise this seems fine.
There was a problem hiding this comment.
This only happens on a support action: when the client CA is rotated, but the server certificate remains the same. There is no public API for that. So I decided e2e was overkill.
There was a problem hiding this comment.
Ah got it. I missed that detail. I agree then -- no e2e is fine.
dialer_test.go
Outdated
| // Note: this succeeds with lazy refresh, but fails with lazy. | ||
| // because dialer.ForceRefresh does not block connections while the | ||
| // refresh is in progress. |
There was a problem hiding this comment.
+1 This looks like a dev-artifact that isn't true anymore?
dialer_test.go
Outdated
| // Expect an error on read. This should trigger the dialer to refresh. | ||
| _, err = io.ReadAll(conn) | ||
| if err != nil { | ||
| t.Log("Got error on read as expected.", err) |
There was a problem hiding this comment.
t.Logf("Got error on read as expected: %v")
dialer_test.go
Outdated
| t.Log("Second attempt should fail...") | ||
| _, err := d.Dial(context.Background(), "my-project:my-region:my-instance") | ||
| if err != nil { | ||
| t.Log("Got error on dial as expected.", err) |
dialer_test.go
Outdated
| if err != nil { | ||
| t.Log("Got error on dial as expected.", err) | ||
| } else { | ||
| t.Fatal("Want dial error, got no error") |
dialer_test.go
Outdated
| t.Log("Third attempt OK.") | ||
| } | ||
|
|
||
| func TestDialerRefreshesAfterRotateServerCA(t *testing.T) { |
There was a problem hiding this comment.
Could this test and the one above be combined into a table test by parameterizing the only differences between the two tests?
There was a problem hiding this comment.
Test parameterized.
internal/mock/cloudsql.go
Outdated
| return toPEMFormat(f.certs.serverCert) | ||
| } | ||
| return toPEMFormat(f.certs.casServerCertificate, f.certs.serverIntermediateCaCert, f.certs.serverCaCert) | ||
|
|
There was a problem hiding this comment.
nit: let's omit these changes from the PR since they're unrelated
| reqPath: fmt.Sprintf("/sql/v1beta4/projects/%s/instances/%s/connectSettings", i.project, i.name), | ||
| reqCt: ct, | ||
| handle: func(resp http.ResponseWriter, _ *http.Request) { | ||
| var ips []*sqladmin.IpMapping |
There was a problem hiding this comment.
Is this a necessary change? Looks like formatting to me?
There was a problem hiding this comment.
Yes, this is necessary. The Server CA changes during the test. The ServerCaCert response field needs to hold the up-to-date value. Thus, this method needs to create the response at the time that the request is made.
There was a problem hiding this comment.
Got it -- a comment to that effect above i.serverCACert() would help future maintainers.
bb7814e to
78d62bf
Compare
4b7b13e to
27125b6
Compare
wip: merged down new tests
27125b6 to
0f5f778
Compare
| } | ||
| func (d *Dialer) isTLSError(err error) bool { | ||
| if nErr, ok := err.(net.Error); ok { | ||
| return !nErr.Timeout() && // it's a permanent net error |
There was a problem hiding this comment.
Might be worth adding an e2e test for this, but otherwise this seems fine.
| err, | ||
| ) | ||
|
|
||
| // If this instance of monitoredCache is still in the cache, remove it. |
There was a problem hiding this comment.
Would we just always call delete, regardless of whether the monitoredCache is present or not in cache?
There was a problem hiding this comment.
The removeCached() method can be called multiple times for the same monitoredCache instance, and needs to be idempotent. Therefore, it should only delete the cache entry if the monitoredCache instance is the same. Otherwise, it is not idempotent and will cause incorrect behavior.
Here's an example scenario:
Suppose at the start that d.cache["proj:reg:inst"] contains a valid monitoredCache instance.
Then, the instance "proj:reg:inst" Client CA certificates for are rotated. Now the dialer no longer has valid metadata for this instance, new sockets will fail with a TLS error after being returned from d.Dial().
The following actions occur across 3 different goroutines. By chance, they occur in this specific order.
- (goroutine 1) The application calls
d.Dial("proj:reg:inst"), it establishes a connection that is about to fail. - (goroutine 2) The application calls
d.Dial("proj:reg:inst"), it establishes a second connection that is about to fail. - (goroutine 1) The first connection fails, and
removeCached(c)is called.d.cache["proj:reg:inst"] == c, so the the invalid monitoredCache instance removed from thed.cache. Then the invalid monitoredCache instance is closed. - (goroutine 3) Then the application calls d.Dial() again, Dialer does not contain a cache entry, so it fetches the instance data and creates a new, valid monitoredCache instance. A third successful connection is created.
Now d.cache contains a new, valid monitoredCache instance that was created during the third call to d.Dial(). However the second connection hasn't failed yet. Then...
- (goroutine 2) The second connection fails, and
removeCached(c)is called for the old, invalid instance.d.cache["proj:reg:inst"] != c. Instead, the cache holds the valid, new monitoredCache instance.
The removeCache(c) method should leave the new, valid monitoredCache instance in the cache.
There was a problem hiding this comment.
I see. That's probably an edge case, but also worth solving for. Maybe add a note like, "in the case of multiple goroutines creating cache entries, only purge the cache if it's for a known failed attempt and don't overwrite separate goroutine entries." This is a subtle little detail otherwise.
There was a problem hiding this comment.
Or maybe just:
// If this instance was already removed from the cache or *a separate goroutine* replaced it with
// a new instance, do nothing.
dialer_test.go
Outdated
| wantErrorOnDial bool | ||
| wantErrorOnRead bool | ||
| useLazyRefresh bool | ||
| }{{ |
There was a problem hiding this comment.
nit: this isn't consistently formatted as below
dialer_test.go
Outdated
|
|
||
| opts := []Option{ | ||
| WithTokenSource(mock.EmptyTokenSource{}), | ||
| //WithDebugLogger(&dialerTestLogger{t: t}), |
There was a problem hiding this comment.
I uncommented it. It helps to understand the test.
| reqPath: fmt.Sprintf("/sql/v1beta4/projects/%s/instances/%s/connectSettings", i.project, i.name), | ||
| reqCt: ct, | ||
| handle: func(resp http.ResponseWriter, _ *http.Request) { | ||
| var ips []*sqladmin.IpMapping |
There was a problem hiding this comment.
Got it -- a comment to that effect above i.serverCACert() would help future maintainers.
eb3a6db to
6b9221f
Compare
| err, | ||
| ) | ||
|
|
||
| // If this instance of monitoredCache is still in the cache, remove it. |
There was a problem hiding this comment.
I see. That's probably an edge case, but also worth solving for. Maybe add a note like, "in the case of multiple goroutines creating cache entries, only purge the cache if it's for a known failed attempt and don't overwrite separate goroutine entries." This is a subtle little detail otherwise.
| err, | ||
| ) | ||
|
|
||
| // If this instance of monitoredCache is still in the cache, remove it. |
There was a problem hiding this comment.
Or maybe just:
// If this instance was already removed from the cache or *a separate goroutine* replaced it with
// a new instance, do nothing.
jackwotherspoon
left a comment
There was a problem hiding this comment.
LGTM ✅
Thanks for the extra assist here @enocom 👏
6b9221f to
76c0a59
Compare
When client CA is rotated, this can cause TLS read errors after the
Dialer.Dial()has returned. The certificate should be refreshed.This adds logic to the dialer to refresh the connection cache if a TLS error occurs after
net.Connis returned to the database driver.Fixes #932