
Conversation

@alexandraoberaigner
Contributor

This PR

This pull request introduces configurable retry and fatal error handling for the in-process gRPC sync provider in the flagd project. The main changes include adding new configuration options for retry backoff timing and fatal status codes, refactoring environment variable parsing, and updating service initialization and error handling logic.

Related Issues

Changes

Retry and fatal error configuration (most important):

  • Added new configuration options (RetryBackoffMs, RetryBackoffMaxMs, FatalStatusCodes) to ProviderConfiguration, with environment variable support and helper functions for parsing integer values from environment variables.
  • Updated provider and service initialization to pass the new retry and fatal error configuration fields, enabling customization of retry timing and fatal error handling for sync streams.

Refactoring and code quality:

  • Refactored environment variable parsing for integer values using a new helper function, simplifying and unifying logic for multiple configuration fields.
  • Moved gRPC retry policy construction and fatal status code normalization to a new file grpc_config.go, making the code more modular and testable.

Error handling improvements:

  • Enhanced sync error handling to detect fatal gRPC status codes and transition the provider to a fatal state, preventing endless retries on unrecoverable errors.
  • Updated test coverage for retry policy construction, fatal status code normalization, and camel-case conversion logic in grpc_config_test.go.

@alexandraoberaigner alexandraoberaigner force-pushed the fix/inifinite-loop-error branch 4 times, most recently from a2312a3 to a25f7be Compare November 14, 2025 07:39
@alexandraoberaigner alexandraoberaigner marked this pull request as ready for review November 14, 2025 07:45
@alexandraoberaigner alexandraoberaigner requested review from a team as code owners November 14, 2025 07:45
@aepfli
Member

aepfli commented Nov 14, 2025

As we are adding new config options, we should wait for open-feature/flagd-testbed#311 to be merged to ensure property names are consistent across all providers, based on the docs.

}
return reflect.ValueOf(longVal).Convert(fieldType)
case "StringList":
arrayVal := strings.Split(value, ",")
Member

do we also need to trim here?

Contributor Author

@alexandraoberaigner alexandraoberaigner Nov 20, 2025

yes, good catch! will update this

func (g *Sync) initNonRetryableStatusCodesSet() {
nonRetryableCodes = make(map[string]struct{})
for _, code := range g.FatalStatusCodes {
normalized := toCamelCase(code)
Member

Why are the codes camelCase? In the retry policy up in L35 they are upper case.

Contributor Author

@alexandraoberaigner alexandraoberaigner Nov 26, 2025

Because gRPC doesn't export a function to get the upper-case representation it uses internally. However, I changed the implementation to use codes.Code for lookup instead of the string, which is cleaner and removes the need for the case conversion.

func (g *Sync) Sync(ctx context.Context, dataSync chan<- sync.DataSync) error {
g.Logger.Info("starting continuous flag synchronization")

time.Sleep(500 * time.Millisecond)
Member

Is this a debug leftover?

}

// Backoff before retrying
time.Sleep(time.Duration(g.RetryBackOffMs) * time.Millisecond)
Member

Suggested change:
- time.Sleep(time.Duration(g.RetryBackOffMs) * time.Millisecond)
+ time.Sleep(time.Duration(g.RetryBackOffMaxMs) * time.Millisecond)

Member

This is a fix of a different problem. Can be moved to a separate PR. Not part of the non-retryable status codes.

Contributor Author

@alexandraoberaigner alexandraoberaigner Nov 26, 2025

is this part of the retry discussion? or are you suggesting to open a new one to discuss this?

Member

As we are always mixing features and functionalities, I would also love to see this as two separate PRs. It allows focusing on one implementation with its tests, and makes the changes clearer.

Contributor Author

@alexandraoberaigner alexandraoberaigner Nov 26, 2025

I'm reluctant to put this change in a separate PR, since Todd's idea was to unify the implementations (Java and Go) as feedback on the initial PR -> see comment here.
Based on this I applied @guidobrei's suggestion. I noticed just now that it was actually a mistake I made due to different variable naming in Java.

Member

I don't feel very strongly about splitting this PR up, but @aepfli and @guidobrei seem to disagree. I think this change might also be causing the e2e test failure now, though I'm not 100% sure about that; an option might need updating, so splitting this up might actually make things easier for you.

CustomSyncProviderUri: provider.providerConfiguration.CustomSyncProviderUri,
GrpcDialOptionsOverride: provider.providerConfiguration.GrpcDialOptionsOverride,
RetryGracePeriod: provider.providerConfiguration.RetryGracePeriod,
RetryBackOffMs: provider.providerConfiguration.RetryBackoffMs,
Member

Separate issue

Comment on lines -251 to -255
// Mark as ready on first successful stream
g.initializer.Do(func() {
g.ready = true
g.Logger.Info("sync service is now ready")
})
Contributor Author

Note: This moved to lines 275-279 to set ready = true only after the first stream cycle was successful, not during the cycle.

Member

Yes I think this was actually a small bug. 👍

@alexandraoberaigner

This comment was marked as resolved.

@aepfli
Copy link
Member

aepfli commented Nov 26, 2025

[Q] does someone have an idea what's wrong with the DCO?

Commit sha: 2eeebc3, Author: Alexandra Oberaigner, Committer: alexandraoberaigner; Expected "Alexandra Oberaigner [email protected]", but got "Alexandra Oberaigner [email protected]".

I am not sure, but worst case, the how-to-fix section in https://github.com/open-feature/go-sdk-contrib/pull/799/checks?check_run_id=56445883813 can be helpful and should fix this ;)

Member

@aepfli aepfli left a comment

I think this pull request merges two features, fatalErrorCodes and backoff. To keep the changes distinct, I suggest separating them into two different pull requests (this does not mean they will not be released within one changeset), as they can also be delivered separately.

Furthermore, we should rethink our sleeps, as I think this is not good practice and there are alternatives; I also created an improvement for the Java provider for this.

}

// Backoff before retrying
time.Sleep(time.Duration(g.RetryBackOffMaxMs) * time.Millisecond)
Member

I am not a big fan of our blocking sleeps, as they clearly have some disadvantages. Should we maybe use a timer for this kind of logic? Something like:

select {
case <-time.After(time.Duration(g.RetryBackOffMaxMs) * time.Millisecond):
    // ... code here ...
case <-ctx.Done():
    return // Allows cancellation
}

Member

There are disadvantages to sleep, but I think right now this is better than nothing, because nothing means a tight loop in some cases, and that is a serious bug in some situations.

@alexandraoberaigner
Contributor Author

I think this pull request merges two features, fatalErrorCodes and backoff. To keep the changes distinct, I suggest separating them into two different pull requests (this does not mean they will not be released within one changeset), as they can also be delivered separately.

Pls consider my comment above :)

Furthermore, we should rethink our sleeps, as I think this is not good practice and there are alternatives; I also created an improvement for the Java provider for this.

We can do an improvement issue for golang too -> this is just a bug fix / consistency PR

Member

@toddbaert toddbaert left a comment

This implementation looks good to me, nice work overall. One thing I think we need in addition is that we should use the same FATAL codes in RPC mode - so I think you will have to add something similar to what you have done in pkg/service/rpc/service.go... do you agree? Since both modes have streams, I think it makes sense for both streams to use this rule.

With respect to @aepfli's and @guidobrei's comments about separating things... I can go either way, but you will need one more approval besides mine, and I think it might make it easier for you to debug the e2e CI failure.

Member

@aepfli aepfli left a comment

Thank you, looks good to me. One little nit, but nothing blocking this PR from getting merged.

Signed-off-by: Alexandra Oberaigner <[email protected]>
@alexandraoberaigner
Contributor Author

[Q] does someone have an idea why the gherkin tests for SYNC_PORT still fail even though I excluded them with ~@sync-port

alexandraoberaigner and others added 2 commits November 28, 2025 13:07
@toddbaert toddbaert force-pushed the fix/inifinite-loop-error branch from 6e45d12 to 7d6fff6 Compare November 28, 2025 17:07
@toddbaert toddbaert self-requested a review November 28, 2025 17:19
@toddbaert
Member

[Q] does someone have an idea why the gherkin tests for SYNC_PORT still fail even though I excluded them with ~@sync-port

@alexandraoberaigner I pushed this.


Development

Successfully merging this pull request may close these issues.

Infinite retry to establish connection to FlagSyncService in Flagd golang provider

6 participants