docs: fatal codes, re-init, and retry policy (#1818)
This PR specifies some provider behavior, specifically around stream
health, gRPC retry policy, and FATAL codes.
Specifically, it:
- publishes a retry policy that shall be used by all flagd providers
- specifies a new option for marking some gRPC status codes as FATAL,
which will cause the provider to stop attempting to reconnect (generally
useful and requested in
open-feature/go-sdk-contrib#756)
- makes clear via the state diagram that flagd providers should support
re-initialization (if not in the FATAL state)
---------
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
docs/concepts/selectors.md
Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
1
1
# Selectors
2
2
3
-
Selectors are query expressions that allow you to filter flag configurations from flagd's sync service. They enable providers to request only specific subsets of flags instead of receiving all flags, making flagd more efficient and flexible for complex deployments.
3
+
Selectors are query expressions that allow you to filter flag configurations from flagd.
4
+
They enable providers to sync or evaluate only specific subsets of flags instead of all flags, making flagd more efficient and flexible for complex deployments, and supporting basic multi-tenancy.
docs/reference/specifications/providers.md
Lines changed: 72 additions & 42 deletions
@@ -64,18 +64,21 @@ stateDiagram-v2
64
64
NOT_READY --> ERROR: initialize
65
65
READY --> ERROR: disconnected, disconnected period == 0
66
66
READY --> STALE: disconnected, disconnect period < retry grace period
67
+
READY --> NOT_READY: shutdown
67
68
STALE --> ERROR: disconnect period >= retry grace period
69
+
STALE --> NOT_READY: shutdown
68
70
ERROR --> READY: reconnected
69
-
ERROR --> [*]: shutdown
71
+
ERROR --> NOT_READY: shutdown
72
+
ERROR --> [*]: Error code == PROVIDER_FATAL
70
73
71
-
note right of STALE
74
+
note left of STALE
72
75
stream disconnected, attempting to reconnect,
73
76
resolve from cache*
74
77
resolve from flag set rules**
75
78
STALE emitted
76
79
end note
77
80
78
-
note right of READY
81
+
note left of READY
79
82
stream connected,
80
83
evaluation cache active*,
81
84
flag set rules stored**,
@@ -84,7 +87,7 @@ stateDiagram-v2
84
87
CHANGE emitted with stream messages
85
88
end note
86
89
87
-
note right of ERROR
90
+
note left of ERROR
88
91
stream disconnected, attempting to reconnect,
89
92
evaluation cache purged*,
90
93
ERROR emitted
@@ -101,25 +104,51 @@ stateDiagram-v2
101
104
102
105
### Stream Reconnection
103
106
104
-
When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off.
105
-
We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream.
106
-
We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this)
107
+
When either stream (sync or event) fails or completes, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream.
108
+
The provider attempts to re-establish both the RPC and sync streams indefinitely, unless the stream response indicates a [fatal status code](#fatal-status-codes).
109
+
This is distinct from the [gRPC retry-policy](#grpc-retry-policy), which automatically retries *all RPCs* (streams or otherwise) a limited number of times to make the provider resilient to transient errors.
110
+
It's also distinct from the [gRPC layer 4 reconnection mechanism](https://grpc.github.io/grpc/core/md_doc_connection-backoff.html) which only reconnects the TCP connection, but not any streams.
111
+
When the stream is reconnecting, providers transition to the [STALE](https://openfeature.dev/docs/reference/concepts/events/#provider_stale) state, and after `retryGracePeriod`, transition to the ERROR state, emitting the respective events during these transitions.
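The STALE/ERROR transition described above can be sketched as follows. This is a minimal illustration, not provider code: the `ProviderState` enum and `state_while_disconnected` function are hypothetical names, and the comparison mirrors the state diagram (`disconnect period < retry grace period` stays STALE, `>=` moves to ERROR).

```python
from enum import Enum


class ProviderState(Enum):
    """Subset of provider states relevant while a stream is disconnected."""
    STALE = "STALE"
    ERROR = "ERROR"


def state_while_disconnected(disconnected_s: float,
                             retry_grace_period_s: float) -> ProviderState:
    """Return the state a provider would be in while reconnecting.

    Hypothetical helper: both arguments are in seconds, matching the
    unit documented for retryGracePeriod.
    """
    if disconnected_s < retry_grace_period_s:
        return ProviderState.STALE
    return ProviderState.ERROR
```

With a grace period of 0, the provider skips STALE entirely, which matches the direct READY to ERROR edge in the diagram.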
107
112
108
-
| language/property | min connect timeout | max backoff | initial backoff | jitter | multiplier |
flagd leverages gRPC's built-in retry mechanism for all RPCs.
116
+
In short, the retry policy retries all RPCs which return `UNAVAILABLE` or `UNKNOWN` status codes up to 3 times, with backoffs of 1s, 2s, and 4s respectively.
117
+
No other status codes are retried.
118
+
The flagd gRPC retry policy is specified below:
118
119
119
-
When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects.
120
-
While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode.
121
-
When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`.
122
-
The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`.
120
+
```json
121
+
{
122
+
"methodConfig": [
123
+
{
124
+
"name": [
125
+
{
126
+
"service": "flagd.evaluation.v1.Service"
127
+
},
128
+
{
129
+
"service": "flagd.sync.v1.FlagSyncService"
130
+
}
131
+
],
132
+
"retryPolicy": {
133
+
"MaxAttempts": 4,
134
+
"InitialBackoff": "1s",
135
+
"MaxBackoff": $FLAGD_RETRY_BACKOFF_MAX_MS, // from provider options
136
+
"BackoffMultiplier": 2.0,
137
+
"RetryableStatusCodes": [
138
+
"UNAVAILABLE",
139
+
"UNKNOWN"
140
+
]
141
+
}
142
+
}
143
+
]
144
+
}
145
+
```
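As a rough sketch of how a provider might fill in the `$FLAGD_RETRY_BACKOFF_MAX_MS` placeholder and arrive at the 1s, 2s, 4s schedule, consider the following. The helper names (`ms_to_duration`, `build_service_config`, `nominal_backoffs`) are hypothetical, the key casing is copied from the policy above (some gRPC implementations may expect lowerCamelCase keys such as `maxAttempts`), and the computed values are nominal upper bounds because gRPC applies randomized jitter to each backoff.

```python
import json


def ms_to_duration(ms: int) -> str:
    """Format milliseconds as a duration string such as "120s" or "1.500s"."""
    seconds, millis = divmod(ms, 1000)
    return f"{seconds}s" if millis == 0 else f"{seconds}.{millis:03d}s"


def build_service_config(retry_backoff_max_ms: int = 120000) -> str:
    """Render the retry policy with MaxBackoff taken from provider options."""
    config = {
        "methodConfig": [{
            "name": [
                {"service": "flagd.evaluation.v1.Service"},
                {"service": "flagd.sync.v1.FlagSyncService"},
            ],
            "retryPolicy": {
                "MaxAttempts": 4,  # 1 initial attempt + 3 retries
                "InitialBackoff": "1s",
                "MaxBackoff": ms_to_duration(retry_backoff_max_ms),
                "BackoffMultiplier": 2.0,
                "RetryableStatusCodes": ["UNAVAILABLE", "UNKNOWN"],
            },
        }]
    }
    return json.dumps(config)


def nominal_backoffs(max_attempts: int, initial_s: float,
                     multiplier: float, max_s: float) -> list:
    """Nominal (pre-jitter) backoff before each retry, capped at max_s."""
    return [min(initial_s * multiplier ** i, max_s)
            for i in range(max_attempts - 1)]
```

Calling `nominal_backoffs(4, 1.0, 2.0, 120.0)` yields the three backoffs of 1, 2, and 4 seconds described above; `MaxAttempts` is 4 because it counts the original attempt as well as the retries.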
146
+
147
+
## Fatal Status Codes
148
+
149
+
Providers accept an option for defining fatal gRPC status codes which, when received in the RPC or sync streams, transition the provider to the PROVIDER_FATAL state.
150
+
This configuration is useful for situations wherein these codes indicate to a client that their configuration is invalid and must be changed (i.e., the error is non-transient).
151
+
Examples for this include status codes such as `UNAUTHENTICATED` or `PERMISSION_DENIED`.
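The decision a provider makes when a stream ends with an error might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and the comma-separated format for `FLAGD_FATAL_STATUS_CODES` is assumed here because the specification only says the option is a list.

```python
def parse_fatal_status_codes(raw: str) -> frozenset:
    """Parse a comma-separated list of gRPC status code names.

    Assumption: the environment variable holds names such as
    "UNAUTHENTICATED,PERMISSION_DENIED"; the exact format is not
    defined in the text above.
    """
    return frozenset(
        code.strip().upper() for code in raw.split(",") if code.strip()
    )


def on_stream_error(status_code: str, fatal_codes: frozenset) -> str:
    """Decide what to do after a stream fails with the given status code.

    Returns "PROVIDER_FATAL" when the code is configured as fatal
    (stop reconnecting), otherwise "RECONNECT".
    """
    if status_code in fatal_codes:
        return "PROVIDER_FATAL"
    return "RECONNECT"
```

With the default empty list, no code is fatal and the provider keeps trying to re-establish the stream.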
123
152
124
153
## RPC Resolver
125
154
@@ -262,28 +291,29 @@ precedence.
262
291
263
292
Below are the supported configuration parameters (note that not all apply to both resolver modes):
264
293
265
-
| Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver |
| deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file |
275
-
| streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process |
276
-
| retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process |
277
-
| retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process |
278
-
| retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file |
| deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file |
304
+
| streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process |
305
+
| retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process |
306
+
| retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process |
307
+
| retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file |
| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file |
315
+
| contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process |
316
+
| fatalStatusCodes | FLAGD_FATAL_STATUS_CODES | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |