[ate]: Enable client-side load balancing and update docs

moidx · moidx · commit dd9f32703c60 · 2025-06-22T13:48:00.000+08:00
This change introduces gRPC client-side load balancing to the ATE client
library and documents its usage.

ATE Client Library:
- The client options now accept a `pa_target` in gRPC name-syntax format
  (e.g., "ipv4:ip1:port,ip2:port") to enable connecting to multiple
  server instances.
- A `load_balancing_policy` option has been added to allow callers to
  select policies like "round_robin".
- The client creation logic was updated to use `grpc::CreateCustomChannel`
  to apply the specified load balancing configuration.
- Test programs and integration test scripts were updated to use the new
  `--pa_target` and `--load_balancing_policy` flags.
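
As an illustration of the `pa_target` format described above, the
following minimal sketch joins a list of IPv4 "address:port" endpoints
into one target string. `BuildIpv4Target` is a hypothetical helper
written for this note; it is not part of the ATE client library, and it
assumes every endpoint is already a literal IPv4 address with a port.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the ATE library): joins IPv4
// "address:port" endpoints into a single gRPC name-syntax target string.
std::string BuildIpv4Target(const std::vector<std::string>& endpoints) {
  // An empty list would yield a target that names no server at all.
  assert(!endpoints.empty());
  std::string target = "ipv4:";
  for (std::size_t i = 0; i < endpoints.size(); ++i) {
    if (i > 0) {
      target += ",";  // Endpoints are comma-separated after the scheme.
    }
    target += endpoints[i];
  }
  return target;
}
```

The result could then be assigned to the `pa_target` option or passed
via `--pa_target`.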

Documentation (`docs/ate.md`):
- Added a section on "Client-Side Load Balancing and Failover,"
  explaining how to configure it and what to expect during partial and
  total server outages.
- Added a "Client Lifecycle and Resource Management" section outlining
  best practices for creating and destroying the client instance.
- Added a "Monitoring and Debugging" section that instructs users on
  how to enable gRPC tracing via environment variables to debug
  connection health and load balancer behavior.
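
The total-outage behavior documented above (API calls returning a code
that corresponds to gRPC `UNAVAILABLE`, 14) suggests that callers may
want a bounded retry around each call. The sketch below is one possible
shape for that, not something this change implements: `CallWithRetry`
and `kGrpcUnavailable` are hypothetical names, the operation is passed
in as a generic callable rather than a real ATE API function, and no
backoff delay is applied between attempts.

```cpp
#include <cassert>
#include <functional>

// gRPC status code UNAVAILABLE, as referenced in docs/ate.md.
constexpr int kGrpcUnavailable = 14;

// Hypothetical helper: invokes `op` up to `max_attempts` times and stops
// as soon as it returns anything other than UNAVAILABLE. Returns the
// status code of the last attempt.
int CallWithRetry(const std::function<int()>& op, int max_attempts) {
  assert(max_attempts > 0);
  int status = kGrpcUnavailable;
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    status = op();
    if (status != kGrpcUnavailable) {
      break;  // Success or a non-retryable error: hand it to the caller.
    }
  }
  return status;
}
```

If every attempt returns 14, the caller still receives 14 and should
treat it as the persistent failure described in the documentation.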

Signed-off-by: Miguel Osorio <miguelosorio@google.com>
diff --git a/docs/ate.md b/docs/ate.md
@@ -1,35 +1,149 @@
 # Automated Test Equipment (ATE) Client
 
-Standard ATE is used in various silicion manufacturing stages, as well as
-for device provisioning. The ATE system generally runs on a PC system running
-the Windows operating system.
+The Automated Test Equipment (ATE) client library and associated test programs
+are used to drive provisioning flows for OpenTitan devices. The client
+communicates with one or more
+[Provisioning Appliance (PA)](https://github.com/lowRISC/opentitan-provisioning/wiki/pa)
+servers to perform secure provisioning operations.
 
-The ATE client connects to the [Provisioning Appliance](https://github.com/lowRISC/opentitan-provisioning/wiki/pa) to perform
-provisioning operations.
+## Client-Side Load Balancing and Failover
 
-## Developer Notes
+The ATE client library supports gRPC client-side load balancing, allowing it
+to distribute requests across multiple Provisioning Appliance (PA) server
+instances. This enhances reliability and scalability.
 
-## Run ATE Client (Linux)
+### Enabling Load Balancing
 
-Run the following steps before proceeding.
+To enable load balancing, you must provide a list of server addresses in a
+gRPC-compliant format via the `--pa_target` command-line argument when running
+a test program (e.g., `cp` or `ft`).
 
-* Generate [enpoint certificates](https://github.com/lowRISC/opentitan-provisioning/wiki/auth#endpoint-certificates).
-* Start [PA server](https://github.com/lowRISC/opentitan-provisioning/wiki/pa#start-pa-server).
+*   **Target URI Format**: The target should be specified using gRPC's
+    name-syntax.
+    *   For IPv4: `ipv4:<ip_addr1>:<port1>,<ip_addr2>:<port2>,...`
+    *   For IPv6: `ipv6:[<ip_addr1>]:<port1>,[<ip_addr2>]:<port2>,...`
 
-Take note of the PA server target address and port number. In the following
-command we start the client pointing to `localhost:5001`.
+    Example: `--pa_target="ipv4:10.0.0.1:50051,10.0.0.2:50051"`
+
+### Load Balancing Policies
+
+You can select a load balancing policy using the `--load_balancing_policy`
+argument. If unspecified, gRPC's default (`pick_first`) is used.
+
+*   `pick_first` (Default): The client attempts to connect to the first
+    address in the list. All RPCs are sent to this single server. If the
+    connection fails, it will try the next address in the list. This policy
+    provides basic failover but does not distribute load.
+*   `round_robin`: The client connects to all servers in the list and
+    distributes RPCs across them in a round-robin fashion. This policy
+    provides both load balancing and high-availability failover.
+
+### Failover Scenarios
+
+The behavior of the client during server outages depends on the configured
+policy.
+
+*   **Partial Outage (with `round_robin`)**: If one server in the pool becomes
+    unavailable, the gRPC runtime will automatically detect the failed
+    connection and temporarily remove it from the pool of healthy endpoints.
+    Subsequent API calls will be transparently routed to the remaining healthy
+    servers. From the caller's perspective, new operations continue to
+    succeed, although calls in flight when the server fails may return errors.
+
+*   **Total Outage**: If all server endpoints become unavailable, any API call
+    made through the library will fail.
+    *   The C API functions (e.g., `InitSession`, `DeriveTokens`) will return a
+        non-zero status code. This code will correspond to the gRPC status
+        code `UNAVAILABLE` (14).
+    *   Callers must check the return value of every function call to handle
+        this scenario gracefully. A persistent failure with this status code
+        indicates that the client cannot reach any of the configured
+        provisioning servers.
+
+## Client Lifecycle and Resource Management
+
+The ATE client is designed to be a long-lived object that manages the
+underlying gRPC channel, including all network connections and load balancing
+state. To ensure optimal performance and efficient resource use, follow these
+best practices.
+
+### Singleton Client Instance
+
+It is strongly recommended to treat the `ate_client_ptr` as a singleton within
+your application. You should call `CreateClient` once when your program
+initializes and reuse that same client instance for all subsequent gRPC calls.
+
+Repeatedly calling `CreateClient` and `DestroyClient` for different operations
+is an anti-pattern. Each call to `CreateClient` initializes a new gRPC channel,
+which involves setting up new TCP connections, performing TLS handshakes (if
+enabled), and resolving server addresses. This process is computationally
+expensive and introduces significant latency.
+
+### When to Call `DestroyClient`
+
+The `DestroyClient` function should only be called when you are certain that no
+more gRPC calls will be made for the remainder of the program's lifetime,
+typically during application shutdown. Calling `DestroyClient` will tear down
+all underlying network connections, and any subsequent attempt to use the
+client instance will result in an error.
+
+## Monitoring and Debugging
+
+While the ATE client library does not expose a direct API to query the health
+of individual server endpoints, it is possible to monitor the underlying gRPC
+channel's behavior using gRPC's built-in tracing capabilities. This is an
+effective method for debugging connection issues and observing the load
+balancer's real-time behavior.
+
+### Enabling gRPC Tracing
+
+You can enable detailed logging by setting environment variables in your shell
+before launching the application that uses the ATE client library.
+
+```bash
+# Enable tracing for connectivity state, resolvers, and load balancing
+export GRPC_TRACE=connectivity_state,resolver,load_balancer
+
+# Set the logging verbosity for maximum detail
+export GRPC_VERBOSITY=DEBUG
+```
+
+### Interpreting the Output
+
+When tracing is enabled, the gRPC runtime will print detailed logs to `stderr`.
+If a server in the load balancing pool becomes unavailable, you will see log
+entries showing the subchannel's state changing from `READY` to `CONNECTING`
+and then to `TRANSIENT_FAILURE`. When the server becomes available again, the
+logs will show the state transitioning back to `READY`.
+
+This provides a definitive, real-time view of the connection health from the
+client's perspective and is a useful tool in active debugging sessions.
+
+## Running an ATE Test Program
+
+Before running, ensure you have:
+*   Generated the required
+    [endpoint certificates](https://github.com/lowRISC/opentitan-provisioning/wiki/auth#endpoint-certificates).
+*   Started one or more
+    [PA servers](https://github.com/lowRISC/opentitan-provisioning/wiki/pa#start-pa-server).
+
+The following example shows how to run the `cp` test program with load
+balancing enabled against two PA servers.
 
 ```console
-bazelisk build //src/ate:ate_main
-bazel-bin/src/ate/ate_main \
-    --target=localhost:5001 \
+# The specific test program can be :cp or :ft
+bazelisk run //src/ate/test_programs:cp -- \
+    --pa_target="ipv4:127.0.0.1:5001,127.0.0.1:5002" \
+    --load_balancing_policy="round_robin" \
     --enable_mtls \
     --client_key=$(pwd)/config/certs/out/ate-client-key.pem \
     --client_cert=$(pwd)/config/certs/out/ate-client-cert.pem \
-    --ca_root_certs=$(pwd)/config/certs/out/ca-cert.pem
+    --ca_root_certs=$(pwd)/config/certs/out/ca-cert.pem \
+    --sku="sival" \
+    --sku_auth_pw="test_password"
 ```
 
 ## Read More
 
-* [Provisioning Appliance](https://github.com/lowRISC/opentitan-provisioning/wiki/pa)
-* [Documentation index](https://github.com/lowRISC/opentitan-provisioning/wiki/Home)
+*   [Provisioning Appliance](https://github.com/lowRISC/opentitan-provisioning/wiki/pa)
+*   [Documentation index](https://github.com/lowRISC/opentitan-provisioning/wiki/Home)
diff --git a/run_integration_tests.sh b/run_integration_tests.sh
@@ -67,7 +67,7 @@ for OTSKU in "${FPGA_SKUS[@]}"; do
     --client_cert="${DEPLOYMENT_DIR}/certs/out/ate-client-cert.pem" \
     --client_key="${DEPLOYMENT_DIR}/certs/out/ate-client-key.pem" \
     --ca_root_certs=${DEPLOYMENT_DIR}/certs/out/ca-cert.pem \
-    --pa_socket="ipv4:${OTPROV_IP_PA}:${OTPROV_PORT_PA}" \
+    --pa_target="ipv4:${OTPROV_IP_PA}:${OTPROV_PORT_PA}" \
     --sku="${OTSKU}" \
     --sku_auth_pw="test_password" \
     --fpga="${FPGA}" \
@@ -82,7 +82,7 @@ for OTSKU in "${FPGA_SKUS[@]}"; do
     --client_cert="${DEPLOYMENT_DIR}/certs/out/ate-client-cert.pem" \
     --client_key="${DEPLOYMENT_DIR}/certs/out/ate-client-key.pem" \
     --ca_root_certs=${DEPLOYMENT_DIR}/certs/out/ca-cert.pem \
-    --pa_socket="ipv4:${OTPROV_IP_PA}:${OTPROV_PORT_PA}" \
+    --pa_target="ipv4:${OTPROV_IP_PA}:${OTPROV_PORT_PA}" \
     --sku="${OTSKU}" \
     --sku_auth_pw="test_password" \
     --fpga="${FPGA}" \
diff --git a/src/ate/ate_api.h b/src/ate/ate_api.h
@@ -78,6 +78,9 @@ enum {
    * provisioning_data.h in the lowRISC/opentitan repo.
    */
   kPersoBlobMaxSize = 8192,
+
+  /** Maximum length of an endpoint address string. */
+  kEndpointAddressMaxSize = 256,
 };
 
 /**
@@ -87,9 +90,16 @@ typedef struct {
 } * ate_client_ptr;
 
 typedef struct {
-  // Endpoint address in IP or DNS format including port number. For example:
-  // "localhost:5000".
-  const char* pa_socket;
+  // Endpoint address in gRPC name-syntax format, including port number. For
+  // example: "localhost:5000", "ipv4:127.0.0.1:5000,127.0.0.2:5000", or
+  // "ipv6:[::1]:5000,[::1]:5001".
+  // Using a single address will disable load balancing.
+  const char* pa_target;
+
+  // gRPC load balancing policy. If not set, it will be selected by the gRPC
+  // library. For example: "round_robin" or "pick_first". Leaving this field
+  // empty will use the default policy.
+  const char* load_balancing_policy;
 
   // File containing the Client certificate in PEM format. Required when
   // `enable_mtls` set to true.
diff --git a/src/ate/ate_client.cc b/src/ate/ate_client.cc
@@ -92,7 +92,12 @@ std::unique_ptr<AteClient> AteClient::Create(AteClient::Options options) {
     credentials = BuildCredentials(options);
   }
   // 2. create the grpc channel between the client and the targeted server
-  auto channel = grpc::CreateChannel(options.pa_socket, credentials);
+  grpc::ChannelArguments args;
+  if (!options.load_balancing_policy.empty()) {
+    args.SetLoadBalancingPolicyName(options.load_balancing_policy);
+  }
+  auto channel =
+      grpc::CreateCustomChannel(options.pa_target, credentials, args);
   auto ate = absl::make_unique<AteClient>(
       ProvisioningApplianceService::NewStub(channel));
 
@@ -189,7 +194,9 @@ Status AteClient::RegisterDevice(RegistrationRequest& request,
 // overloads operator<< for AteClient::Options objects printouts
 std::ostream& operator<<(std::ostream& os, const AteClient::Options& options) {
   // write obj to stream
-  os << std::endl << "options.pa_socket = " << options.pa_socket << std::endl;
+  os << std::endl << "options.pa_target = " << options.pa_target << std::endl;
+  os << "options.load_balancing_policy = " << options.load_balancing_policy
+     << std::endl;
   os << "options.enable_mtls = " << options.enable_mtls << std::endl;
   os << "options.pem_cert_chain = " << options.pem_cert_chain << std::endl;
   os << "options.pem_private_key = " << options.pem_private_key << std::endl;
diff --git a/src/ate/ate_client.h b/src/ate/ate_client.h
@@ -20,9 +20,14 @@ namespace ate {
 class AteClient {
  public:
   struct Options {
-    // Endpoint address in IP or DNS format including port number. For example:
-    // "localhost:5000".
-    std::string pa_socket;
+    // Endpoint address in gRPC name-syntax format, including port number. For
+    // example: "localhost:5000", "ipv4:127.0.0.1:5000,127.0.0.2:5000", or
+    // "ipv6:[::1]:5000,[::1]:5001".
+    std::string pa_target;
+
+    // gRPC load balancing policy. If not set, it will be selected by the gRPC
+    // library. For example: "round_robin" or "pick_first".
+    std::string load_balancing_policy;
 
     // Set to true to enable mTLS connection. When set to false, the connection
     // is established with insecure credentials.
diff --git a/src/ate/ate_dll.cc b/src/ate/ate_dll.cc
@@ -178,7 +178,10 @@ DLLEXPORT int CreateClient(
 
   // convert from ate_client_ptr to AteClient::Options
   o.enable_mtls = options->enable_mtls;
-  o.pa_socket = options->pa_socket;
+  o.pa_target = options->pa_target;
+  if (options->load_balancing_policy != nullptr) {
+    o.load_balancing_policy = options->load_balancing_policy;
+  }
   if (o.enable_mtls) {
     // Load the PEM data from the pointed files
     absl::Status s =
diff --git a/src/ate/test_programs/cp.cc b/src/ate/test_programs/cp.cc
@@ -39,7 +39,15 @@ ABSL_FLAG(std::string, cp_sram_elf, "", "CP SRAM ELF (device binary).");
 /**
  * PA configuration flags.
  */
-ABSL_FLAG(std::string, pa_socket, "", "host:port of the PA server.");
+ABSL_FLAG(std::string, pa_target, "",
+          "Endpoint address in gRPC name-syntax format, including port "
+          "number. For example: \"localhost:5000\", "
+          "\"ipv4:127.0.0.1:5000,127.0.0.2:5000\", or "
+          "\"ipv6:[::1]:5000,[::1]:5001\".");
+ABSL_FLAG(std::string, load_balancing_policy, "",
+          "gRPC load balancing policy. If not set, it will be selected by "
+          "the gRPC library. For example: \"round_robin\" or "
+          "\"pick_first\".");
 ABSL_FLAG(std::string, sku, "", "SKU string to initialize the PA session.");
 ABSL_FLAG(std::string, sku_auth_pw, "",
           "SKU authorization password string to initialize the PA session.");
@@ -62,14 +70,17 @@ using provisioning::test_programs::DutLib;
 absl::StatusOr<ate_client_ptr> AteClientNew(void) {
   client_options_t options;
 
-  std::string pa_socket = absl::GetFlag(FLAGS_pa_socket);
-  if (pa_socket.empty()) {
+  std::string pa_target = absl::GetFlag(FLAGS_pa_target);
+  if (pa_target.empty()) {
     return absl::InvalidArgumentError(
-        "--pa_socket not set. This is a required argument.");
+        "--pa_target not set. This is a required argument.");
   }
-  options.pa_socket = pa_socket.c_str();
+  options.pa_target = pa_target.c_str();
   options.enable_mtls = absl::GetFlag(FLAGS_enable_mtls);
 
+  std::string lb_policy = absl::GetFlag(FLAGS_load_balancing_policy);
+  options.load_balancing_policy = lb_policy.c_str();
+
   std::string pem_private_key = absl::GetFlag(FLAGS_client_key);
   std::string pem_cert_chain = absl::GetFlag(FLAGS_client_cert);
   std::string pem_root_certs = absl::GetFlag(FLAGS_ca_root_certs);
diff --git a/src/ate/test_programs/ft.cc b/src/ate/test_programs/ft.cc
@@ -42,7 +42,15 @@ ABSL_FLAG(std::string, ft_fw_bundle_bin, "",
 /**
  * PA configuration flags.
  */
-ABSL_FLAG(std::string, pa_socket, "", "host:port of the PA server.");
+ABSL_FLAG(std::string, pa_target, "",
+          "Endpoint address in gRPC name-syntax format, including port "
+          "number. For example: \"localhost:5000\", "
+          "\"ipv4:127.0.0.1:5000,127.0.0.2:5000\", or "
+          "\"ipv6:[::1]:5000,[::1]:5001\".");
+ABSL_FLAG(std::string, load_balancing_policy, "",
+          "gRPC load balancing policy. If not set, it will be selected by "
+          "the gRPC library. For example: \"round_robin\" or "
+          "\"pick_first\".");
 ABSL_FLAG(std::string, sku, "", "SKU string to initialize the PA session.");
 ABSL_FLAG(std::string, sku_auth_pw, "",
           "SKU authorization password string to initialize the PA session.");
@@ -65,14 +73,17 @@ using provisioning::test_programs::DutLib;
 absl::StatusOr<ate_client_ptr> AteClientNew(void) {
   client_options_t options;
 
-  std::string pa_socket = absl::GetFlag(FLAGS_pa_socket);
-  if (pa_socket.empty()) {
+  std::string pa_target = absl::GetFlag(FLAGS_pa_target);
+  if (pa_target.empty()) {
     return absl::InvalidArgumentError(
-        "--pa_socket not set. This is a required argument.");
+        "--pa_target not set. This is a required argument.");
   }
-  options.pa_socket = pa_socket.c_str();
+  options.pa_target = pa_target.c_str();
   options.enable_mtls = absl::GetFlag(FLAGS_enable_mtls);
 
+  std::string lb_policy = absl::GetFlag(FLAGS_load_balancing_policy);
+  options.load_balancing_policy = lb_policy.c_str();
+
   std::string pem_private_key = absl::GetFlag(FLAGS_client_key);
   std::string pem_cert_chain = absl::GetFlag(FLAGS_client_cert);
   std::string pem_root_certs = absl::GetFlag(FLAGS_ca_root_certs);