fix(flagd): do not retry for certain status codes (#756) #799
Conversation
As we are adding new config options, we should wait for open-feature/flagd-testbed#311 to be merged, to ensure property names are consistent across all providers based on the docs.
tests/flagd/testframework/utils.go (outdated)

	}
	return reflect.ValueOf(longVal).Convert(fieldType)
case "StringList":
	arrayVal := strings.Split(value, ",")
do we also need to trim here?
yes, good catch! will update this
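For reference, the trimming the reviewer asks about can be done by running strings.TrimSpace over each element after the split. This is a minimal sketch, not the provider's actual code; the helper name splitAndTrim is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// splitAndTrim splits a comma-separated list and trims surrounding
// whitespace from each element, so "a, b ,c" and "a,b,c" parse the same.
func splitAndTrim(value string) []string {
	parts := strings.Split(value, ",")
	for i, p := range parts {
		parts[i] = strings.TrimSpace(p)
	}
	return parts
}

func main() {
	fmt.Println(splitAndTrim("NOT_FOUND, INTERNAL ,UNAUTHENTICATED"))
	// → [NOT_FOUND INTERNAL UNAUTHENTICATED]
}
```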
func (g *Sync) initNonRetryableStatusCodesSet() {
	nonRetryableCodes = make(map[string]struct{})
	for _, code := range g.FatalStatusCodes {
		normalized := toCamelCase(code)
Why are the codes camelCase? In the retry policy up in L35 they are upper case.
Because gRPC doesn't export a function to get the upper-case representation it uses internally. However, I changed the implementation to use codes.Code for the lookup instead of the string, which is cleaner and removes the need for the case conversion.
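The change described above (keying the set by the numeric status code rather than a normalized string) can be sketched roughly as follows. To keep the snippet self-contained, a local Code type stands in for google.golang.org/grpc/codes.Code (which is also a uint32); the helper names are hypothetical and the real provider code may differ:

```go
package main

import "fmt"

// Code mirrors google.golang.org/grpc/codes.Code (a uint32) for this sketch.
type Code uint32

const (
	InvalidArgument Code = 3
	Unavailable     Code = 14
	Unauthenticated Code = 16
)

// buildNonRetryableSet builds a constant-time lookup keyed by the numeric
// code, avoiding any string normalization such as toCamelCase.
func buildNonRetryableSet(fatal []Code) map[Code]struct{} {
	set := make(map[Code]struct{}, len(fatal))
	for _, c := range fatal {
		set[c] = struct{}{}
	}
	return set
}

func isNonRetryable(set map[Code]struct{}, c Code) bool {
	_, ok := set[c]
	return ok
}

func main() {
	set := buildNonRetryableSet([]Code{InvalidArgument, Unauthenticated})
	fmt.Println(isNonRetryable(set, InvalidArgument)) // true
	fmt.Println(isNonRetryable(set, Unavailable))     // false
}
```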
func (g *Sync) Sync(ctx context.Context, dataSync chan<- sync.DataSync) error {
	g.Logger.Info("starting continuous flag synchronization")

	time.Sleep(500 * time.Millisecond)
Is this a debug leftover?
	}

	// Backoff before retrying
	time.Sleep(time.Duration(g.RetryBackOffMs) * time.Millisecond)
Suggested change:
- time.Sleep(time.Duration(g.RetryBackOffMs) * time.Millisecond)
+ time.Sleep(time.Duration(g.RetryBackOffMaxMs) * time.Millisecond)
This is a fix for a different problem and can be moved to a separate PR; it is not part of the non-retryable status codes change.
is this part of the retry discussion? or are you suggesting to open a new one to discuss this?
As we are always mixing features and functionalities, I would also love to see this as two separate PRs. That allows focusing on one implementation with its tests, and makes the changes clearer.
I'm reluctant to put this change in a separate PR, since Todd's idea was to unify the implementations (Java and Go) as feedback on the initial PR -> see comment here.
Based on this I applied @guidobrei's suggestion. I noticed just now that it was actually a mistake I made due to different variable naming in Java.
I don't feel very strongly about splitting this PR up, but @aepfli and @guidobrei seem to disagree. I think this change might also be causing the e2e test failure now, though I'm not 100% sure about that; it might be that an option needs updating, so splitting this up might actually make things easier for you.
providers/flagd/pkg/provider.go (outdated)

	CustomSyncProviderUri:   provider.providerConfiguration.CustomSyncProviderUri,
	GrpcDialOptionsOverride: provider.providerConfiguration.GrpcDialOptionsOverride,
	RetryGracePeriod:        provider.providerConfiguration.RetryGracePeriod,
	RetryBackOffMs:          provider.providerConfiguration.RetryBackoffMs,
Separate issue
	// Mark as ready on first successful stream
	g.initializer.Do(func() {
		g.ready = true
		g.Logger.Info("sync service is now ready")
	})
Note: this moved to lines 275-279 so that ready = true is set only after the first stream cycle completed successfully, not during the cycle.
Yes I think this was actually a small bug. 👍
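The readiness pattern discussed above relies on sync.Once to flip the flag exactly once, on the first successful stream cycle, no matter how many cycles follow. A self-contained sketch (field and method names are simplified stand-ins for the provider's actual struct):

```go
package main

import (
	"fmt"
	"sync"
)

// syncService is a minimal stand-in for the provider's sync type.
type syncService struct {
	initializer sync.Once
	ready       bool
}

// markReady is called at the end of each successful stream cycle;
// sync.Once guarantees the body runs only the first time.
func (s *syncService) markReady() {
	s.initializer.Do(func() {
		s.ready = true
		fmt.Println("sync service is now ready")
	})
}

func main() {
	s := &syncService{}
	fmt.Println(s.ready) // false
	s.markReady()
	s.markReady() // no-op on subsequent successful cycles
	fmt.Println(s.ready) // true
}
```

Calling markReady only after the cycle completes (rather than during it) is what fixes the small bug noted above: readiness is no longer reported before the first cycle has actually succeeded.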
I am not sure, but worst case the how-to-fix section in https://github.com/open-feature/go-sdk-contrib/pull/799/checks?check_run_id=56445883813 can be helpful and should fix this ;)
aepfli
left a comment
I think this pull request merges two features, fatalErrorCodes and backoff. To keep the changes distinct, I suggest separating them into two different pull requests (which does not mean they won't be released within one changeset), as they can also be delivered separately.
Furthermore, we should rethink our sleeps, as I think blocking sleeps are not good practice and there are alternatives; I also created an improvement for the Java provider for this.
	}

	// Backoff before retrying
	time.Sleep(time.Duration(g.RetryBackOffMaxMs) * time.Millisecond)
I am not a big fan of our blocking sleeps, as they clearly have some disadvantages. Should we maybe stick to a timer for this kind of logic? Like:

select {
case <-time.After(time.Duration(g.RetryBackOffMaxMs) * time.Millisecond):
	// ... code here ...
case <-ctx.Done():
	return // allows cancellation
}
There are disadvantages to sleep, but I think right now this is better than nothing, because nothing means a tight loop in some cases, which is a serious bug in some situations.
Please consider my comment above :)
We can do an improvement issue for Go too -> this is just a bug fix / consistency PR.
toddbaert
left a comment
This implementation looks good to me, nice work overall. One thing I think we need in addition: we should use the same FATAL codes in RPC mode, so I think you will have to add something similar to what you have done in pkg/service/rpc/service.go... do you agree? Since both modes have streams, I think it makes sense for both streams to use this rule.
With respect to @aepfli's and @guidobrei's comments about separating things... I can go either way, but you will need one more approval besides mine, and I think it might make it easier for you to debug the e2e CI failure.
Signed-off-by: Alexandra Oberaigner <[email protected]>
aepfli
left a comment
Thank you, looks good to me. One little nit, but nothing blocking this PR from getting merged.
[Q] Does someone have an idea why the gherkin tests for SYNC_PORT still fail even though I excluded them with ~@sync-port?
Signed-off-by: Todd Baert <[email protected]>
@alexandraoberaigner I pushed this.
This PR

This pull request introduces configurable retry and fatal error handling for the in-process gRPC sync provider in the flagd project. The main changes include adding new configuration options for retry backoff timing and fatal status codes, refactoring environment variable parsing, and updating service initialization and error handling logic.

Related Issues

Changes

Retry and fatal error configuration (most important):
- Added new configuration options (RetryBackoffMs, RetryBackoffMaxMs, FatalStatusCodes) to ProviderConfiguration, with environment variable support and helper functions for parsing integer values from environment variables.

Refactoring and code quality:
- grpc_config.go: making the code more modular and testable.

Error handling improvements:
- grpc_config_test.go.