Commit ace1a7c

docs: fatal codes, re-init, and retry policy (#1818)
This PR specifies some provider behavior around stream health, gRPC retry policy, and FATAL codes. Specifically, it:

- publishes a retry policy that shall be used by all flagd providers
- specifies a new option for marking some gRPC status codes as FATAL, which will cause the provider to stop attempting to reconnect (generally useful and requested in open-feature/go-sdk-contrib#756)
- makes clear via state diagram that flagd providers should support re-initialization (if not in FATAL state)

---------

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
1 parent 623e5e2 commit ace1a7c

3 files changed

+75
-44
lines changed

docs/concepts/selectors.md

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
11
# Selectors
22

3-
Selectors are query expressions that allow you to filter flag configurations from flagd's sync service. They enable providers to request only specific subsets of flags instead of receiving all flags, making flagd more efficient and flexible for complex deployments.
3+
Selectors are query expressions that allow you to filter flag configurations from flagd.
4+
They enable providers to sync or evaluate only specific subsets of flags instead of all flags, making flagd more efficient and flexible for complex deployments, and supporting basic multi-tenancy.
45

56
## Overview
67

docs/reference/specifications/providers.md

Lines changed: 72 additions & 42 deletions
@@ -64,18 +64,21 @@ stateDiagram-v2
6464
NOT_READY --> ERROR: initialize
6565
READY --> ERROR: disconnected, disconnected period == 0
6666
READY --> STALE: disconnected, disconnect period < retry grace period
67+
READY --> NOT_READY: shutdown
6768
STALE --> ERROR: disconnect period >= retry grace period
69+
STALE --> NOT_READY: shutdown
6870
ERROR --> READY: reconnected
69-
ERROR --> [*]: shutdown
71+
ERROR --> NOT_READY: shutdown
72+
ERROR --> [*]: Error code == PROVIDER_FATAL
7073
71-
note right of STALE
74+
note left of STALE
7275
stream disconnected, attempting to reconnect,
7376
resolve from cache*
7477
resolve from flag set rules**
7578
STALE emitted
7679
end note
7780
78-
note right of READY
81+
note left of READY
7982
stream connected,
8083
evaluation cache active*,
8184
flag set rules stored**,
@@ -84,7 +87,7 @@ stateDiagram-v2
8487
CHANGE emitted with stream messages
8588
end note
8689
87-
note right of ERROR
90+
note left of ERROR
8891
stream disconnected, attempting to reconnect,
8992
evaluation cache purged*,
9093
ERROR emitted
@@ -101,25 +104,51 @@ stateDiagram-v2
101104

102105
### Stream Reconnection
103106

104-
When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off.
105-
We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream.
106-
We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this)
107+
When either stream (sync or event) fails or completes, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream.
108+
The provider attempts to re-establish both the RPC and sync streams indefinitely unless the stream response indicates a [fatal status code](#fatal-status-codes).
109+
This is distinct from the [gRPC retry-policy](#grpc-retry-policy), which automatically retries *all RPCs* (streams or otherwise) a limited number of times to make the provider resilient to transient errors.
110+
It's also distinct from the [gRPC layer 4 reconnection mechanism](https://grpc.github.io/grpc/core/md_doc_connection-backoff.html) which only reconnects the TCP connection, but not any streams.
111+
When the stream is reconnecting, providers transition to the [STALE](https://openfeature.dev/docs/reference/concepts/events/#provider_stale) state, and after `retryGracePeriod`, transition to the ERROR state, emitting the respective events during these transitions.
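
The STALE-to-ERROR transition described above can be sketched as a small function. This is an illustrative sketch only, not code from any flagd provider: the names `ProviderState` and `state_while_reconnecting` are hypothetical, and it assumes the disconnect duration and `retryGracePeriod` are both expressed in seconds.

```python
from enum import Enum


class ProviderState(Enum):
    STALE = "STALE"
    ERROR = "ERROR"


def state_while_reconnecting(disconnected_s: float, retry_grace_period_s: float) -> ProviderState:
    """Return the state of a provider whose stream has been down for `disconnected_s` seconds."""
    # Once the disconnect period reaches the grace period, the provider is in ERROR.
    # A grace period of 0 therefore means ERROR immediately on disconnect.
    if disconnected_s >= retry_grace_period_s:
        return ProviderState.ERROR
    # Otherwise the provider is still within the grace period and reports STALE.
    return ProviderState.STALE
```

With a `retryGracePeriod` of 5, a stream that has been down for 2 seconds would leave the provider in STALE, and one that has been down for 5 seconds or more would put it in ERROR.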
107112

108-
| language/property | min connect timeout | max backoff | initial backoff | jitter | multiplier |
109-
| ----------------- | --------------------------------- | ------------------------ | ------------------------ | ------ | ---------- |
110-
| GRPC property | grpc.initial_reconnect_backoff_ms | max_reconnect_backoff_ms | min_reconnect_backoff_ms | 0.2 | 1.6 |
111-
| Flagd property | deadlineMs | retryBackoffMaxMs | retryBackoffMs | 0.2 | 1.6 |
112-
| --- | --- | --- | --- | --- | --- |
113-
| default [^1] |||| 0.2 | 1.6 |
114-
| js |||| 0.2 | 1.6 |
115-
| java |||| 0.2 | 1.6 |
113+
## gRPC Retry Policy
116114

117-
[^1] : C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated)
115+
flagd leverages gRPC's built-in retry mechanism for all RPCs.
116+
In short, the retry policy retries all RPCs that return `UNAVAILABLE` or `UNKNOWN` status codes up to 3 times, with backoffs of 1s, 2s, and 4s respectively.
117+
No other status codes are retried.
118+
The flagd gRPC retry policy is specified below:
118119

119-
When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects.
120-
While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode.
121-
When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`.
122-
The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`.
120+
```json
121+
{
122+
"methodConfig": [
123+
{
124+
"name": [
125+
{
126+
"service": "flagd.evaluation.v1.Service"
127+
},
128+
{
129+
"service": "flagd.sync.v1.FlagSyncService"
130+
}
131+
],
132+
"retryPolicy": {
133+
"MaxAttempts": 4,
134+
"InitialBackoff": "1s",
135+
"MaxBackoff": $FLAGD_RETRY_BACKOFF_MAX_MS, // from provider options
136+
"BackoffMultiplier": 2.0,
137+
"RetryableStatusCodes": [
138+
"UNAVAILABLE",
139+
"UNKNOWN"
140+
]
141+
}
142+
}
143+
]
144+
}
145+
```
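
To see how the policy values relate to the 1s, 2s, 4s schedule, the nominal backoff before each retry can be computed as follows. This is a minimal sketch: the function name `nominal_backoffs` is hypothetical, and it ignores the jitter that gRPC implementations apply to the actual delays.

```python
def nominal_backoffs(max_attempts: int, initial_backoff_s: float,
                     max_backoff_s: float, multiplier: float) -> list[float]:
    """Return the nominal delay (in seconds) before each retry, without jitter."""
    # The attempt count includes the original call, so there is one retry fewer.
    retries = max_attempts - 1
    # Each delay grows by `multiplier` and is capped at the maximum backoff.
    return [min(initial_backoff_s * multiplier ** n, max_backoff_s) for n in range(retries)]


# Values from the policy above, assuming a maximum backoff of 120 seconds.
print(nominal_backoffs(4, 1.0, 120.0, 2.0))  # [1.0, 2.0, 4.0]
```

A maximum backoff lower than 4 seconds would cap the later delays; for example, a 3 second maximum yields 1s, 2s, 3s.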
146+
147+
## Fatal Status Codes
148+
149+
Providers accept an option for defining fatal gRPC status codes which, when received in the RPC or sync streams, transition the provider to the PROVIDER_FATAL state.
150+
This configuration is useful for situations wherein these codes indicate to a client that their configuration is invalid and must be changed (i.e., the error is non-transient).
151+
Examples include status codes such as `UNAUTHENTICATED` or `PERMISSION_DENIED`.
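
A provider might check a stream's status code against the configured fatal codes roughly as sketched below. The helper names are hypothetical, and the comma-separated format for the option value is an assumption: the specification only states that the option is an array.

```python
def parse_fatal_status_codes(raw: str) -> frozenset[str]:
    """Parse a comma-separated list of gRPC status code names (assumed format)."""
    return frozenset(code.strip().upper() for code in raw.split(",") if code.strip())


def is_fatal(status_code: str, fatal_codes: frozenset[str]) -> bool:
    """Return True if the provider should stop reconnecting for this status code."""
    return status_code.upper() in fatal_codes


fatal_codes = parse_fatal_status_codes("UNAUTHENTICATED, PERMISSION_DENIED")
print(is_fatal("UNAUTHENTICATED", fatal_codes))  # True
print(is_fatal("UNAVAILABLE", fatal_codes))      # False
```

With the default empty list, no status code is treated as fatal and the provider keeps trying to reconnect.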
123152

124153
## RPC Resolver
125154

@@ -262,28 +291,29 @@ precedence.
262291

263292
Below are the supported configuration parameters (note that not all apply to both resolver modes):
264293

265-
| Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver |
266-
| --------------------- | ------------------------------ | ------------------------------------------------------------------------------------ | ---------------------------- | ----------------------------- | ----------------------- |
267-
| resolver | FLAGD_RESOLVER | mode of operation | String - `rpc`, `in-process` | rpc | rpc & in-process |
268-
| host | FLAGD_HOST | remote host | String | localhost | rpc & in-process |
269-
| port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process |
270-
| targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process |
271-
| tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process |
272-
| socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | String | null | rpc & in-process |
273-
| certPath | FLAGD_SERVER_CERT_PATH | tls cert path | String | null | rpc & in-process |
274-
| deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file |
275-
| streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process |
276-
| retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process |
277-
| retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process |
278-
| retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file |
279-
| keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process |
280-
| cache | FLAGD_CACHE | enable cache of static flags | String - `lru`, `disabled` | lru | rpc |
281-
| maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc |
282-
| selector | FLAGD_SOURCE_SELECTOR | Selector expression to filter flags (e.g., `flagSetId=my-app`, `source=config.json`) | string | null | in-process |
283-
| providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process |
284-
| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file |
285-
| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file |
286-
| contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process |
294+
| Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver |
295+
| --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- |
296+
| resolver | FLAGD_RESOLVER | mode of operation | string - `rpc`, `in-process` | rpc | rpc & in-process |
297+
| host | FLAGD_HOST | remote host | string | localhost | rpc & in-process |
298+
| port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process |
299+
| targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process |
300+
| tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process |
301+
| socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | string | null | rpc & in-process |
302+
| certPath | FLAGD_SERVER_CERT_PATH | tls cert path | string | null | rpc & in-process |
303+
| deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file |
304+
| streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process |
305+
| retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process |
306+
| retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process |
307+
| retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file |
308+
| keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process |
309+
| selector | FLAGD_SOURCE_SELECTOR | expression to filter flags (e.g., `flagSetId=my-app`, `source=config.json`) | string | null | rpc & in-process |
310+
| cache | FLAGD_CACHE | enable cache of static flags | string - `lru`, `disabled` | lru | rpc |
311+
| maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc |
312+
| providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process |
313+
| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file |
314+
| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file |
315+
| contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process |
316+
| fatalStatusCodes | FLAGD_FATAL_STATUS_CODES | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |
287317

288318
### Custom Name Resolution
289319

test-harness
