
fix(flagd): do not retry for certain status codes (#756)#799

Merged
toddbaert merged 6 commits into open-feature:main from open-feature-forking:fix/inifinite-loop-error
Dec 5, 2025

Conversation

@alexandraoberaigner
Contributor

This PR

This pull request introduces configurable retry and fatal error handling for the in-process gRPC sync provider in the flagd project. The main changes include adding new configuration options for retry backoff timing and fatal status codes, refactoring environment variable parsing, and updating service initialization and error handling logic.

Related Issues

Changes

Retry and fatal error configuration (most important):

  • Added new configuration options (RetryBackoffMs, RetryBackoffMaxMs, FatalStatusCodes) to ProviderConfiguration, with environment variable support and helper functions for parsing integer values from environment variables.
  • Updated provider and service initialization to pass the new retry and fatal error configuration fields, enabling customization of retry timing and fatal error handling for sync streams.

Refactoring and code quality:

  • Refactored environment variable parsing for integer values using a new helper function, simplifying and unifying logic for multiple configuration fields.
  • Moved gRPC retry policy construction and fatal status code normalization to a new file, grpc_config.go, making the code more modular and testable.

Error handling improvements:

  • Enhanced sync error handling to detect fatal gRPC status codes and transition the provider to a fatal state, preventing endless retries on unrecoverable errors.
  • Updated test coverage for retry policy construction, fatal status code normalization, and camel-case conversion logic in grpc_config_test.go.
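The fatal-code detection and the camel-case normalization mentioned above can be sketched with stdlib-only code. The function names `normalizeCode` and `isFatal` and the exact normalization rules are assumptions for illustration; the real helpers in grpc_config.go may differ:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeCode maps user-supplied spellings such as "InvalidArgument"
// or "invalid_argument" to the canonical upper-snake form
// "INVALID_ARGUMENT" by uppercasing and inserting an underscore at
// each lower-to-upper camel-case boundary.
func normalizeCode(in string) string {
	var b strings.Builder
	prev := rune(0)
	for _, r := range in {
		if unicode.IsUpper(r) && unicode.IsLower(prev) {
			b.WriteRune('_')
		}
		b.WriteRune(unicode.ToUpper(r))
		prev = r
	}
	return b.String()
}

// isFatal reports whether a status code is in the configured fatal
// set, comparing normalized forms so configuration spelling does not
// matter. A fatal match lets the sync loop stop instead of retrying.
func isFatal(code string, fatal []string) bool {
	n := normalizeCode(code)
	for _, f := range fatal {
		if normalizeCode(f) == n {
			return true
		}
	}
	return false
}

func main() {
	fatal := []string{"InvalidArgument", "UNAUTHENTICATED"}
	fmt.Println(normalizeCode("InvalidArgument")) // INVALID_ARGUMENT
	fmt.Println(isFatal("unauthenticated", fatal)) // true
	fmt.Println(isFatal("UNAVAILABLE", fatal))     // false
}
```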

@alexandraoberaigner alexandraoberaigner force-pushed the fix/inifinite-loop-error branch 4 times, most recently from a2312a3 to a25f7be on November 14, 2025 07:39
@alexandraoberaigner alexandraoberaigner marked this pull request as ready for review November 14, 2025 07:45
@alexandraoberaigner alexandraoberaigner requested review from a team as code owners November 14, 2025 07:45
@aepfli
Member

aepfli commented Nov 14, 2025

As we are adding new config options, we should wait for open-feature/flagd-testbed#311 to be merged to ensure property names are consistent across all providers, based on the docs.

@alexandraoberaigner

This comment was marked as resolved.

@aepfli
Member

aepfli commented Nov 26, 2025

[Q] does someone have an idea what's wrong with the DCO?

Commit sha: 2eeebc3, Author: Alexandra Oberaigner, Committer: alexandraoberaigner; Expected "Alexandra Oberaigner alexandra.oberaigner@dynatrace.com", but got "Alexandra Oberaigner alexandra.oberaigner@dynatrace.com".

I am not sure, but worst case the how-to-fix section in https://github.com/open-feature/go-sdk-contrib/pull/799/checks?check_run_id=56445883813 can be helpful and should fix this ;)

@aepfli aepfli left a comment (Member)

I think this pull request merges two features, fatalErrorCodes and backoff. To keep the changes distinct, I suggest separating them into two different pull requests (which does not mean they will not be released within one changeset), as they can also be delivered separately.

Furthermore we should rethink our sleeps as I think this is not good practice and there are alternatives, I also created an improvement for the java provider for this.

}

// Backoff before retrying
time.Sleep(time.Duration(g.RetryBackOffMaxMs) * time.Millisecond)
Member

I am not a big fan of our blocking sleeps, as they clearly have some disadvantages. Should we maybe stick to a timer for this kind of logic? Like:

select {
case <-time.After(time.Duration(g.RetryBackOffMaxMs) * time.Millisecond):
    // ... code here ...
case <-ctx.Done():
    return // Allows cancellation
}

Member

There are disadvantages to sleep, but I think right now this is better than nothing, because nothing means a tight loop in some cases, and that is a serious bug in some situations.

@alexandraoberaigner
Contributor Author

I think this pull request merges two features, fatalErrorCodes and backoff. To keep the changes distinct, I suggest separating them into two different pull requests (which does not mean they will not be released within one changeset), as they can also be delivered separately.

Pls consider my comment above :)

Furthermore we should rethink our sleeps as I think this is not good practice and there are alternatives, I also created an improvement for the java provider for this.

We can do an improvement issue for golang too -> this is just a bug fix / consistency PR

@toddbaert toddbaert left a comment (Member)

This implementation looks good to me, nice work overall. One thing I think we need in addition is that we should use the same FATAL codes in RPC mode - so I think you will have to add something similar to what you have done in pkg/service/rpc/service.go... do you agree? Since both modes have streams, I think it makes sense for both streams to use this rule.

With respect to @aepfli's and @guidobrei's comments about separating things... I can go either way, but you will need one more approval besides mine, and I think it might make it easier for you to debug the e2e CI failure.

Signed-off-by: Alexandra Oberaigner <alexandra.oberaigner@dynatrace.com>
@aepfli aepfli left a comment (Member)

Thank you, looks good to me. One little nit, but nothing blocking this PR from getting merged.

Signed-off-by: Alexandra Oberaigner <alexandra.oberaigner@dynatrace.com>
@alexandraoberaigner
Contributor Author

[Q] does someone have an idea why the gherkin tests for SYNC_PORT still fail even though I excluded them with ~@sync-port

alexandraoberaigner and others added 2 commits November 28, 2025 13:07
Signed-off-by: Alexandra Oberaigner <alexandra.oberaigner@dynatrace.com>
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
@toddbaert toddbaert force-pushed the fix/inifinite-loop-error branch from 6e45d12 to 7d6fff6 on November 28, 2025 17:07
@toddbaert toddbaert self-requested a review November 28, 2025 17:19
@toddbaert
Member

[Q] does someone have an idea why the gherkin tests for SYNC_PORT still fail even though I excluded them with ~@sync-port

@alexandraoberaigner I pushed this.

Signed-off-by: Alexandra Oberaigner <alexandra.oberaigner@dynatrace.com>
@alexandraoberaigner
Contributor Author

The retryBackoff changes have been removed from this PR as suggested by @guidobrei and @aepfli; I will open a separate PR soon.

@toddbaert
Member

I'll merge tomorrow unless I hear objections cc @guidobrei

@guidobrei guidobrei left a comment (Member)

LGTM ❤️

@toddbaert toddbaert merged commit e01a99e into open-feature:main Dec 5, 2025
5 checks passed

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Infinite retry to establish connection to FlagSyncService in Flagd golang provider

6 participants