docs: fatal codes, re-init, and retry policy (#1818)

toddbaert · web-flow · commit ace1a7c7fdf9 · 2026-01-09T13:27:55.000-05:00
This PR specifies some provider behavior around stream health, gRPC retry policy, and FATAL codes. Specifically, it:

- publishes a retry policy that shall be used by all flagd providers
- specifies a new option for marking some gRPC status codes as FATAL, which will cause the provider to stop attempting to reconnect (generally useful and requested in open-feature/go-sdk-contrib#756)
- makes clear via state diagram that flagd providers should support re-initialization (if not in FATAL state)

---------

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
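
The fatal-code behavior described above can be sketched as a small decision function. This is a minimal illustration only, assuming status codes are modeled as plain strings and times are in seconds; `ProviderState` and `on_stream_failure` are hypothetical names and not part of any flagd provider API.

```python
from enum import Enum


class ProviderState(Enum):
    STALE = "STALE"
    ERROR = "ERROR"
    FATAL = "PROVIDER_FATAL"  # terminal: no further reconnect attempts


def on_stream_failure(status_code, fatal_status_codes, disconnected_s, retry_grace_period_s):
    """Return (next_state, should_reconnect) after a stream fails.

    Hypothetical helper for illustration; real providers may structure this differently.
    """
    # A configured fatal code stops reconnection entirely.
    if status_code in fatal_status_codes:
        return ProviderState.FATAL, False
    # Within the grace period the provider is STALE and keeps retrying.
    if disconnected_s < retry_grace_period_s:
        return ProviderState.STALE, True
    # Past the grace period the provider is in ERROR but still retries.
    return ProviderState.ERROR, True


fatal = {"UNAUTHENTICATED", "PERMISSION_DENIED"}
print(on_stream_failure("UNAVAILABLE", fatal, 2, 5))
print(on_stream_failure("UNAVAILABLE", fatal, 7, 5))
print(on_stream_failure("PERMISSION_DENIED", fatal, 0, 5))
```

Only the fatal branch disables reconnection; the STALE and ERROR branches differ in state but both keep retrying.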
diff --git a/docs/concepts/selectors.md b/docs/concepts/selectors.md
@@ -1,6 +1,7 @@
 # Selectors
 
-Selectors are query expressions that allow you to filter flag configurations from flagd's sync service. They enable providers to request only specific subsets of flags instead of receiving all flags, making flagd more efficient and flexible for complex deployments.
+Selectors are query expressions that allow you to filter flag configurations from flagd.
+They enable providers to sync or evaluate only specific subsets of flags instead of all flags, making flagd more efficient and flexible for complex deployments, and supporting basic multi-tenancy.
 
 ## Overview
 
diff --git a/docs/reference/specifications/providers.md b/docs/reference/specifications/providers.md
@@ -64,18 +64,21 @@ stateDiagram-v2
     NOT_READY --> ERROR: initialize
     READY --> ERROR: disconnected, disconnected period == 0
     READY --> STALE: disconnected, disconnect period < retry grace period
+    READY --> NOT_READY: shutdown
     STALE --> ERROR: disconnect period >= retry grace period
+    STALE --> NOT_READY: shutdown
     ERROR --> READY: reconnected
-    ERROR --> [*]: shutdown
+    ERROR --> NOT_READY: shutdown
+    ERROR --> [*]: Error code == PROVIDER_FATAL
 
-    note right of STALE
+    note left of STALE
         stream disconnected, attempting to reconnect,
         resolve from cache*
         resolve from flag set rules**
         STALE emitted
     end note
 
-    note right of READY
+    note left of READY
         stream connected,
         evaluation cache active*,
         flag set rules stored**,
@@ -84,7 +87,7 @@ stateDiagram-v2
         CHANGE emitted with stream messages
     end note
 
-    note right of ERROR
+    note left of ERROR
         stream disconnected, attempting to reconnect,
         evaluation cache purged*,
         ERROR emitted
@@ -101,25 +104,51 @@ stateDiagram-v2
 
 ### Stream Reconnection
 
-When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off.
-We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream.
-We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this)
+When either stream (sync or event) fails or completes, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream.
+The provider attempts to re-establish both the RPC and sync streams indefinitely, unless the stream response indicates a [fatal status code](#fatal-status-codes).
+This is distinct from the [gRPC retry-policy](#grpc-retry-policy), which automatically retries *all RPCs* (streams or otherwise) a limited number of times to make the provider resilient to transient errors.
+It's also distinct from the [gRPC layer 4 reconnection mechanism](https://grpc.github.io/grpc/core/md_doc_connection-backoff.html) which only reconnects the TCP connection, but not any streams.
+When the stream is reconnecting, providers transition to the [STALE](https://openfeature.dev/docs/reference/concepts/events/#provider_stale) state, and after `retryGracePeriod`, transition to the ERROR state, emitting the respective events during these transitions.
 
-| language/property | min connect timeout               | max backoff              | initial backoff          | jitter | multiplier |
-| ----------------- | --------------------------------- | ------------------------ | ------------------------ | ------ | ---------- |
-| GRPC property     | grpc.initial_reconnect_backoff_ms | max_reconnect_backoff_ms | min_reconnect_backoff_ms | 0.2    | 1.6        |
-| Flagd property    | deadlineMs                        | retryBackoffMaxMs        | retryBackoffMs           | 0.2    | 1.6        |
-| ---               | ---                               | ---                      | ---                      | ---    | ---        |
-| default [^1]      | ✅                                 | ✅                        | ✅                        | 0.2    | 1.6        |
-| js                | ✅                                 | ✅                        | ❌                        | 0.2    | 1.6        |
-| java              | ❌                                 | ❌                        | ❌                        | 0.2    | 1.6        |
+## gRPC Retry Policy
 
-[^1] : C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated)
+flagd leverages gRPC's built-in retry mechanism for all RPCs.
+In short, the retry policy retries any RPC that returns an `UNAVAILABLE` or `UNKNOWN` status code up to 3 times, with backoffs of 1s, 2s, and 4s respectively.
+No other status codes are retried.
+The flagd gRPC retry policy is specified below:
 
-When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects.
-While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode.
-When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`.
-The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`.
+```json
+{
+    "methodConfig": [
+        {
+            "name": [
+                {
+                    "service": "flagd.evaluation.v1.Service"
+                },
+                {
+                    "service": "flagd.sync.v1.FlagSyncService"
+                }
+            ],
+            "retryPolicy": {
+                "maxAttempts": 4,
+                "initialBackoff": "1s",
+                "maxBackoff": $FLAGD_RETRY_BACKOFF_MAX_MS, // from provider options
+                "backoffMultiplier": 2.0,
+                "retryableStatusCodes": [
+                    "UNAVAILABLE",
+                    "UNKNOWN"
+                ]
+            }
+        }
+    ]
+}
+```
+
+## Fatal Status Codes
+
+Providers accept an option for defining fatal gRPC status codes which, when received in the RPC or sync streams, transition the provider to the PROVIDER_FATAL state.
+This configuration is useful for situations wherein these codes indicate to a client that their configuration is invalid and must be changed (i.e., the error is non-transient).
+Examples include status codes such as `UNAUTHENTICATED` or `PERMISSION_DENIED`.
 
 ## RPC Resolver
 
@@ -262,28 +291,29 @@ precedence.
 
 Below are the supported configuration parameters (note that not all apply to both resolver modes):
 
-| Option name           | Environment variable name      | Explanation                                                                          | Type & Values                | Default                       | Compatible resolver     |
-| --------------------- | ------------------------------ | ------------------------------------------------------------------------------------ | ---------------------------- | ----------------------------- | ----------------------- |
-| resolver              | FLAGD_RESOLVER                 | mode of operation                                                                    | String - `rpc`, `in-process` | rpc                           | rpc & in-process        |
-| host                  | FLAGD_HOST                     | remote host                                                                          | String                       | localhost                     | rpc & in-process        |
-| port                  | FLAGD_PORT                     | remote port                                                                          | int                          | 8013 (rpc), 8015 (in-process) | rpc & in-process        |
-| targetUri             | FLAGD_TARGET_URI               | alternative to host/port, supporting custom name resolution                          | string                       | null                          | rpc & in-process        |
-| tls                   | FLAGD_TLS                      | connection encryption                                                                | boolean                      | false                         | rpc & in-process        |
-| socketPath            | FLAGD_SOCKET_PATH              | alternative to host port, unix socket                                                | String                       | null                          | rpc & in-process        |
-| certPath              | FLAGD_SERVER_CERT_PATH         | tls cert path                                                                        | String                       | null                          | rpc & in-process        |
-| deadlineMs            | FLAGD_DEADLINE_MS              | deadline for unary calls, and timeout for initialization                             | int                          | 500                           | rpc & in-process & file |
-| streamDeadlineMs      | FLAGD_STREAM_DEADLINE_MS       | deadline for streaming calls, useful as an application-layer keepalive               | int                          | 600000                        | rpc & in-process        |
-| retryBackoffMs        | FLAGD_RETRY_BACKOFF_MS         | initial backoff for stream retry                                                     | int                          | 1000                          | rpc & in-process        |
-| retryBackoffMaxMs     | FLAGD_RETRY_BACKOFF_MAX_MS     | maximum backoff for stream retry                                                     | int                          | 120000                        | rpc & in-process        |
-| retryGracePeriod      | FLAGD_RETRY_GRACE_PERIOD       | period in seconds before provider moves from STALE to ERROR state                    | int                          | 5                             | rpc & in-process & file |
-| keepAliveTime         | FLAGD_KEEP_ALIVE_TIME_MS       | http 2 keepalive                                                                     | long                         | 0                             | rpc & in-process        |
-| cache                 | FLAGD_CACHE                    | enable cache of static flags                                                         | String - `lru`, `disabled`   | lru                           | rpc                     |
-| maxCacheSize          | FLAGD_MAX_CACHE_SIZE           | max size of static flag cache                                                        | int                          | 1000                          | rpc                     |
-| selector              | FLAGD_SOURCE_SELECTOR          | Selector expression to filter flags (e.g., `flagSetId=my-app`, `source=config.json`) | string                       | null                          | in-process              |
-| providerId            | FLAGD_PROVIDER_ID              | A unique identifier for flagd(grpc client) initiating the request.                   | string                       | null                          | in-process              |
-| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri                  | string                       | null                          | file                    |
-| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS          | poll interval for reading offlineFlagSourcePath                                      | int                          | 5000                          | file                    |
-| contextEnricher       | -                              | sync-metadata to evaluation context mapping function                                 | function                     | identity function             | in-process              |
+| Option name           | Environment variable name      | Explanation                                                                                                     | Type & Values                | Default                       | Compatible resolver     |
+| --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- |
+| resolver              | FLAGD_RESOLVER                 | mode of operation                                                                                               | string - `rpc`, `in-process` | rpc                           | rpc & in-process        |
+| host                  | FLAGD_HOST                     | remote host                                                                                                     | string                       | localhost                     | rpc & in-process        |
+| port                  | FLAGD_PORT                     | remote port                                                                                                     | int                          | 8013 (rpc), 8015 (in-process) | rpc & in-process        |
+| targetUri             | FLAGD_TARGET_URI               | alternative to host/port, supporting custom name resolution                                                     | string                       | null                          | rpc & in-process        |
+| tls                   | FLAGD_TLS                      | connection encryption                                                                                           | boolean                      | false                         | rpc & in-process        |
+| socketPath            | FLAGD_SOCKET_PATH              | alternative to host/port, unix socket                                                                           | string                       | null                          | rpc & in-process        |
+| certPath              | FLAGD_SERVER_CERT_PATH         | tls cert path                                                                                                   | string                       | null                          | rpc & in-process        |
+| deadlineMs            | FLAGD_DEADLINE_MS              | deadline for unary calls, and timeout for initialization                                                        | int                          | 500                           | rpc & in-process & file |
+| streamDeadlineMs      | FLAGD_STREAM_DEADLINE_MS       | deadline for streaming calls, useful as an application-layer keepalive                                          | int                          | 600000                        | rpc & in-process        |
+| retryBackoffMs        | FLAGD_RETRY_BACKOFF_MS         | initial backoff for stream retry                                                                                | int                          | 1000                          | rpc & in-process        |
+| retryBackoffMaxMs     | FLAGD_RETRY_BACKOFF_MAX_MS     | maximum backoff for stream retry                                                                                | int                          | 120000                        | rpc & in-process        |
+| retryGracePeriod      | FLAGD_RETRY_GRACE_PERIOD       | period in seconds before provider moves from STALE to ERROR state                                               | int                          | 5                             | rpc & in-process & file |
+| keepAliveTime         | FLAGD_KEEP_ALIVE_TIME_MS       | HTTP/2 keepalive                                                                                                | long                         | 0                             | rpc & in-process        |
+| selector              | FLAGD_SOURCE_SELECTOR          | expression to filter flags (e.g., `flagSetId=my-app`, `source=config.json`)                                     | string                       | null                          | rpc & in-process        |
+| cache                 | FLAGD_CACHE                    | enable cache of static flags                                                                                    | string - `lru`, `disabled`   | lru                           | rpc                     |
+| maxCacheSize          | FLAGD_MAX_CACHE_SIZE           | max size of static flag cache                                                                                   | int                          | 1000                          | rpc                     |
+| providerId            | FLAGD_PROVIDER_ID              | unique identifier for the flagd gRPC client initiating the request                                              | string                       | null                          | in-process              |
+| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri                                             | string                       | null                          | file                    |
+| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS          | poll interval for reading offlineFlagSourcePath                                                                 | int                          | 5000                          | file                    |
+| contextEnricher       | -                              | sync-metadata to evaluation context mapping function                                                            | function                     | identity function             | in-process              |
+| fatalStatusCodes      | FLAGD_FATAL_STATUS_CODES       | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array                        | []                            | rpc & in-process        |
 
 ### Custom Name Resolution
 
diff --git a/test-harness b/test-harness
@@ -1 +1 @@
-Subproject commit 7d7d51848a31805b4248b1d8e8a9f295554b1aee
+Subproject commit b0057abde5d84272d6dd91f4737655c9d6cead15
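
As a worked check of the retry policy added in this change (four attempts, 1s initial backoff, multiplier 2.0), the nominal delay before each retry can be computed as below. This is a sketch under assumptions: `nominal_backoffs` is a hypothetical helper, the 120s cap assumes the default `retryBackoffMaxMs` of 120000, and gRPC implementations apply jitter, so actual delays will differ from these nominal values.

```python
def nominal_backoffs(max_attempts, initial_backoff_s, max_backoff_s, multiplier):
    """Return the nominal (pre-jitter) delay in seconds before each retry."""
    retries = max_attempts - 1  # the first attempt is not a retry
    return [
        min(initial_backoff_s * multiplier ** n, max_backoff_s)
        for n in range(retries)
    ]


# Policy values from the spec; 120.0 assumes the default retryBackoffMaxMs.
print(nominal_backoffs(4, 1.0, 120.0, 2.0))  # [1.0, 2.0, 4.0]
```

With four attempts there are three retries, which matches the "3 times, with 1s, 2s, 4s" summary in the specification text; the cap only matters for policies with more attempts or a smaller maximum.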