dynamic TLS CA rotation for sidecar and query service#530
dynamic TLS CA rotation for sidecar and query service#530cendhu wants to merge 1 commit intohyperledger:mainfrom
Conversation
utils/connection/config.go
Outdated
| // When set, the gRPC server's TLS config will use GetConfigForClient to obtain | ||
| // the current TLS configuration on each new client connection. | ||
| func (c *ServerConfig) SetTLSConfigProvider(p TLSConfigProvider) { | ||
| c.tlsConfigProvider = p |
There was a problem hiding this comment.
I'm not comfortable putting tlsConfigProvider under a config struct.
We already established that "Config" structs should only contain config elements (YAML).
The only exception we have is preAllocatedListener and it is only as a hack of testing, not a production design element.
There was a problem hiding this comment.
Agreed. Alternative is to pass this tlsConfigProvider as an argument in grpcservice.StartAndServer. To avoid passing nil for majority of the services, we can make it optional with .... What do you think?
There was a problem hiding this comment.
An optional parameter (can be nil) for StartAndServe() sounds better.
|
|
||
| // TLS refresh runs as a standalone goroutine rather than in an errgroup because | ||
| // a transient failure to read config from the DB should not stop the query service. | ||
| go q.refreshTLSFromDB(ctx, pool) |
There was a problem hiding this comment.
The refreshTLSFromDB goroutine polls the database every TLSRefreshInterval regardless of whether config blocks are being received. Since config blocks are typically infrequent, this results in many unnecessary queries.
Alternative: Lazy loading on client connection (as in my PR):
- Query DB only when clients connect (not continuously)
- Use refresh interval to throttle concurrent queries during connection spikes
- Zero DB queries during idle periods
There was a problem hiding this comment.
Yes, I agree. As the impact is going to be negligible, this simple design is preferred. PostgreSQL supports LISTEN/NOTIFY. YugabyteDB has implemented it but yet to be released. Hence, this approach will be retained for a few more months and then we go with the cleaner approach.
| case <-ctx.Done(): | ||
| return | ||
| case <-ticker.C: | ||
| configTX, err := queryConfig(ctx, pool) |
There was a problem hiding this comment.
The queryConfig(ctx, pool) call has no timeout protection. If the database is slow or unresponsive, this goroutine will block indefinitely, preventing TLS certificate updates.
We should add a timeout context for each query:
case <-ticker.C:
queryCtx, cancel := context.WithTimeout(ctx, q.config.CAFetchTimeout)
configTX, err := queryConfig(queryCtx, pool)
cancel()
if err != nil {
logger.Errorf("Failed to read config transaction from DB: %v", err)
continue
}There was a problem hiding this comment.
Not needed. pgx connections already have timeout protection configured in ConfigureConnReadDeadline: 10s dial timeout and 60s per-read deadline on the underlying connection. So queryConfig won't block indefinitely even without an explicit context timeout — the read deadline on the connection will fail it.
| incomingBlockToBeCommitted <-chan *common.Block | ||
| outgoingCommittedBlock chan<- *common.Block | ||
| outgoingStatusUpdates chan<- []*committerpb.TxStatus | ||
| outgoingConfigEnvelopes chan<- []byte |
There was a problem hiding this comment.
The outgoingConfigEnvelopes channel creates unnecessary indirection between relay and sidecar. Since relay is a sub-component owned by sidecar, direct method calls are simpler and more efficient.
if mappedBlock.isConfig {
rootCAs, err := connection.GetApplicationRootCAsFromConfigBlock(block)
if err != nil {
logger.Warnf("Failed to load application root CAs: %v", err)
} else {
r.updateTLSConfig(rootCAs)
}
}There was a problem hiding this comment.
No, this is as per our existing coding style. We have outgoingCommitterBlock. It is consumed outside relay. Similarly, we have outgoingConfigEnvelopes which is consumed by a goroutine in the sidecar. We have used this coding style consistently. Earlier, we used to call a method on sidecar from relay to process the config block which I was not comfortable with. Hence, I went with this approach.
| // The design separates writers (services) from readers (server) through | ||
| // interface segregation, ensuring a linear dependency flow: | ||
| // | ||
| // Service -> TLSCertUpdater -> DynamicTLS <- TLSConfigProvider <- Server |
There was a problem hiding this comment.
Do we really need the DynamicTLS mediator?
I see the interface segregation thinking here, but it feels like we're adding layers without a clear benefit.
Suggested flow: Service → TLSCertUpdater → DynamicTLS ← TLSConfigProvider ← Server
Alternative Flow:
Service (owns atomic.Pointer) and updates it by demand ← GetTLSConfig() ← Server
Requirements: Extension of the current Service interface defined as DynamicService.
The mediator adds 3 extra components (DynamicTLS + 2 interfaces), but doesn't solve a problem that direct ownership can't handle.
There was a problem hiding this comment.
What we have here is an alternative design. In fact, I preferred Service → DynamicTLS ← Server but @liran-funaro raised a concern that it is not clear who is the writer and reader. Hence, I added the interface. This design addresses both @liran-funaro and my concerns in a middle ground.
│ Concern │ How this resolves it │
├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Circular dependency (cendhu) │ Linear flow: Service -> DynamicTLS <- Server. No callback into service. │
├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Service interface bloat (cendhu) │ Service interface unchanged. No GetDynamicTLSConfig method. │
├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Avoiding function parameters (cendhu, liran) │ Typed interfaces instead of bare func() [][]byte args. │
├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Initialization complexity (liran) │ One extra line: dynamicTLS := NewDynamicTLS(...). But the wiring is explicit and traceable. │
├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ CertPool built per-handshake (cendhu) │ CertPool built once per CA change in SetClientRootCAs. GetConfigForClient is a single atomic load. │
├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ No mediator cognitive load (liran) │ The interfaces are tiny (one method each) and named by intent. You never interact with both sides at once. │
├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Non-dynamic services (liran) │ Pass nil for TLSConfigProvider. No dummy methods on coordinator/verifier/etc. │
└──────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
|
|
||
| // GrpcServer instantiate a [grpc.Server]. | ||
| func (c *ServerConfig) GrpcServer() (*grpc.Server, error) { | ||
| func (c *ServerConfig) GrpcServer(tlsProvider TLSConfigProvider) (*grpc.Server, error) { |
There was a problem hiding this comment.
We are implicitly passing a function here as an argument. If this is something we want to avoid, we should think of another way to do so.
This is true for all the chain of functions using this argument.
There was a problem hiding this comment.
This implementation accepts an interface, not a method.
While systematically this is identical to passing a method, it adds clarity as we can easily discover all the implementations of this interface.
This is more difficult to achieve with a method (find all methods that have this specific signature).
There was a problem hiding this comment.
Valid point. I can explain why I am afraid of functions as argument or in struct. Building on Liran's comment, I generally try to avoid passing functions as arguments for core components. I'm speaking from some painful experience here—reading through the gossip code in Fabric taught me that while passing functions is super quick to write initially, it makes tracing the execution path a real headache down the road. After a few years, only the people who wrote the code were able to make modifications to gossip, even though all Fabric maintainers were experienced and great at coding.
Go standard library does use func as an argument or func in struct (like in tls.Config). But it really comes down to the use case: if we are writing a broadly used library package, leaning towards functions for configuration options makes sense because it prioritizes the convenience of the people consuming the code. One does not need to write a whole new struct and implement an interface just to pass in custom behavior. A great example of this is sort.Slice(mySlice, func(i, j int) bool { ... })—it's vastly easier for a developer to drop in an anonymous function than it is to define a custom type just to satisfy the sort.Interface.
However, since we are not writing an external library, leaning towards interfaces is usually the better move. It prioritizes strict contracts, maintainability, and traceability. We don't really need that drop-in consumer flexibility; an interface guarantees we can easily track down exactly what is implementing the behavior without having to hunt through the codebase for anonymous functions.
dd37ff7 to
afb2062
Compare
Add dynamic TLS support so that peer organization CA certificates are automatically updated when config blocks add or remove organizations. - DynamicTLSManager in utils/connection/ watches config block updates and rebuilds the trusted CA pool on the fly - Sidecar updates TLS immediately during config block processing - Query service polls the DB periodically for config changes - Integration test covers add/remove/restore of peer orgs with mTLS - Infrastructure plumbing for QueryTLSRefreshInterval and PeerOrganizationCount in runner config Signed-off-by: Senthilnathan <cendhu@gmail.com>
Type of change
Description
Add dynamic TLS support so that peer organization CA certificates are automatically updated when config blocks add or remove organizations.
Related issues