Skip to content

dynamic TLS CA rotation for sidecar and query service#530

Open
cendhu wants to merge 1 commit intohyperledger:mainfrom
cendhu:dynamic-tls
Open

dynamic TLS CA rotation for sidecar and query service#530
cendhu wants to merge 1 commit intohyperledger:mainfrom
cendhu:dynamic-tls

Conversation

@cendhu
Copy link
Copy Markdown
Contributor

@cendhu cendhu commented Apr 8, 2026

Type of change

  • New feature

Description

Add dynamic TLS support so that peer organization CA certificates are automatically updated when config blocks add or remove organizations.

  • DynamicTLSManager in utils/connection/ watches config block updates and rebuilds the trusted CA pool on the fly
  • Sidecar updates TLS immediately during config block processing
  • Query service polls the DB periodically for config changes
  • Integration test covers add/remove/restore of peer orgs with mTLS
  • Infrastructure plumbing for QueryTLSRefreshInterval and PeerOrganizationCount in runner config

Related issues

@coveralls
Copy link
Copy Markdown

coveralls commented Apr 8, 2026

Coverage Status

coverage: 91.356% (+0.06%) from 91.295% — cendhu:dynamic-tls into hyperledger:main

// When set, the gRPC server's TLS config will use GetConfigForClient to obtain
// the current TLS configuration on each new client connection.
func (c *ServerConfig) SetTLSConfigProvider(p TLSConfigProvider) {
c.tlsConfigProvider = p
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not comfortable putting tlsConfigProvider under a config struct.
We already established that "Config" structs should only contain config elements (YAML).
The only exception we have is preAllocatedListener and it is only as a hack of testing, not a production design element.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Alternative is to pass this tlsConfigProvider as an argument in grpcservice.StartAndServer. To avoid passing nil for majority of the services, we can make it optional with .... What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An optional parameter (can be nil) for StartAndServe() sounds better.


// TLS refresh runs as a standalone goroutine rather than in an errgroup because
// a transient failure to read config from the DB should not stop the query service.
go q.refreshTLSFromDB(ctx, pool)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refreshTLSFromDB goroutine polls the database every TLSRefreshInterval regardless of whether config blocks are being received. Since config blocks are typically infrequent, this results in many unnecessary queries.

Alternative: Lazy loading on client connection (as in my PR):

  • Query DB only when clients connect (not continuously)
  • Use refresh interval to throttle concurrent queries during connection spikes
  • Zero DB queries during idle periods

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. As the impact is going to be negligible, this simple design is preferred. PostgreSQL supports LISTEN/NOTIFY. YugabyteDB has implemented it but yet to be released. Hence, this approach will be retained for a few more months and then we go with the cleaner approach.

Comment on lines +357 to +360
case <-ctx.Done():
return
case <-ticker.C:
configTX, err := queryConfig(ctx, pool)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The queryConfig(ctx, pool) call has no timeout protection. If the database is slow or unresponsive, this goroutine will block indefinitely, preventing TLS certificate updates.

We should add a timeout context for each query:

case <-ticker.C:
    queryCtx, cancel := context.WithTimeout(ctx, q.config.CAFetchTimeout)
    configTX, err := queryConfig(queryCtx, pool)
    cancel()
    if err != nil {
        logger.Errorf("Failed to read config transaction from DB: %v", err)
        continue
    }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed. pgx connections already have timeout protection configured in ConfigureConnReadDeadline: 10s dial timeout and 60s per-read deadline on the underlying connection. So queryConfig won't block indefinitely even without an explicit context timeout — the read deadline on the connection will fail it.

incomingBlockToBeCommitted <-chan *common.Block
outgoingCommittedBlock chan<- *common.Block
outgoingStatusUpdates chan<- []*committerpb.TxStatus
outgoingConfigEnvelopes chan<- []byte
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outgoingConfigEnvelopes channel creates unnecessary indirection between relay and sidecar. Since relay is a sub-component owned by sidecar, direct method calls are simpler and more efficient.

if mappedBlock.isConfig {
    rootCAs, err := connection.GetApplicationRootCAsFromConfigBlock(block)
    if err != nil {
        logger.Warnf("Failed to load application root CAs: %v", err)
    } else {
        r.updateTLSConfig(rootCAs)
    }
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is as per our existing coding style. We have outgoingCommitterBlock. It is consumed outside relay. Similarly, we have outgoingConfigEnvelopes which is consumed by a goroutine in the sidecar. We have used this coding style consistently. Earlier, we used to call a method on sidecar from relay to process the config block which I was not comfortable with. Hence, I went with this approach.

// The design separates writers (services) from readers (server) through
// interface segregation, ensuring a linear dependency flow:
//
// Service -> TLSCertUpdater -> DynamicTLS <- TLSConfigProvider <- Server
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the DynamicTLS mediator?
I see the interface segregation thinking here, but it feels like we're adding layers without a clear benefit.

Suggested flow: Service → TLSCertUpdater → DynamicTLS ← TLSConfigProvider ← Server

Alternative Flow:
Service (owns atomic.Pointer) and updates it by demand ← GetTLSConfig() ← Server

Requirements: Extension of the current Service interface defined as DynamicService.

The mediator adds 3 extra components (DynamicTLS + 2 interfaces), but doesn't solve a problem that direct ownership can't handle.

Copy link
Copy Markdown
Contributor Author

@cendhu cendhu Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we have here is an alternative design. In fact, I preferred Service → DynamicTLS ← Server but @liran-funaro raised a concern that it is not clear who is the writer and reader. Hence, I added the interface. This design addresses both @liran-funaro and my concerns in a middle ground.

  │                   Concern                    │                                            How this resolves it                                            │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ Circular dependency (cendhu)                 │ Linear flow: Service -> DynamicTLS <- Server. No callback into service.                                    │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ Service interface bloat (cendhu)             │ Service interface unchanged. No GetDynamicTLSConfig method.                                                │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ Avoiding function parameters (cendhu, liran) │ Typed interfaces instead of bare func() [][]byte args.                                                     │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ Initialization complexity (liran)            │ One extra line: dynamicTLS := NewDynamicTLS(...). But the wiring is explicit and traceable.                │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ CertPool built per-handshake (cendhu)        │ CertPool built once per CA change in SetClientRootCAs. GetConfigForClient is a single atomic load.         │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ No mediator cognitive load (liran)           │ The interfaces are tiny (one method each) and named by intent. You never interact with both sides at once. │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ Non-dynamic services (liran)                 │ Pass nil for TLSConfigProvider. No dummy methods on coordinator/verifier/etc.                              │
  └──────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


// GrpcServer instantiate a [grpc.Server].
func (c *ServerConfig) GrpcServer() (*grpc.Server, error) {
func (c *ServerConfig) GrpcServer(tlsProvider TLSConfigProvider) (*grpc.Server, error) {
Copy link
Copy Markdown
Contributor

@dean-amar dean-amar Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are implicitly passing a function here as an argument. If this is something we want to avoid, we should think of another way to do so.

This is true for all the chain of functions using this argument.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation accepts an interface, not a method.
While systematically this is identical to passing a method, it adds clarity as we can easily discover all the implementations of this interface.
This is more difficult to achieve with a method (find all methods that have this specific signature).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point. I can explain why I am afraid of functions as argument or in struct. Building on Liran's comment, I generally try to avoid passing functions as arguments for core components. I'm speaking from some painful experience here—reading through the gossip code in Fabric taught me that while passing functions is super quick to write initially, it makes tracing the execution path a real headache down the road. After a few years, only the people who wrote the code were able to make modifications to gossip, even though all Fabric maintainers were experienced and great at coding.

Go standard library does use func as an argument or func in struct (like in tls.Config). But it really comes down to the use case: if we are writing a broadly used library package, leaning towards functions for configuration options makes sense because it prioritizes the convenience of the people consuming the code. One does not need to write a whole new struct and implement an interface just to pass in custom behavior. A great example of this is sort.Slice(mySlice, func(i, j int) bool { ... })—it's vastly easier for a developer to drop in an anonymous function than it is to define a custom type just to satisfy the sort.Interface.

However, since we are not writing an external library, leaning towards interfaces is usually the better move. It prioritizes strict contracts, maintainability, and traceability. We don't really need that drop-in consumer flexibility; an interface guarantees we can easily track down exactly what is implementing the behavior without having to hunt through the codebase for anonymous functions.

@cendhu cendhu force-pushed the dynamic-tls branch 2 times, most recently from dd37ff7 to afb2062 Compare April 9, 2026 16:14
Add dynamic TLS support so that peer organization CA certificates are
automatically updated when config blocks add or remove organizations.

- DynamicTLSManager in utils/connection/ watches config block updates
  and rebuilds the trusted CA pool on the fly
- Sidecar updates TLS immediately during config block processing
- Query service polls the DB periodically for config changes
- Integration test covers add/remove/restore of peer orgs with mTLS
- Infrastructure plumbing for QueryTLSRefreshInterval and
  PeerOrganizationCount in runner config

Signed-off-by: Senthilnathan <cendhu@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[sidecar, notification & query service] Dynamic updates of server root CAs according to config-block

4 participants