feat: use informer cache for Get method in Kubernetes backend #525
base: main
Conversation
Thanks @juanxiu for this PR.
I have a comment requiring some more discussion, PTAL.
```go
if !ok {
	return nil, fmt.Errorf("object is not an Application: %T", obj)
}
return app, nil
```
Items returned from the cache need to be treated as read-only. I believe that's why unrelated unit tests are failing and panicking.
There are two options imho:
- We clearly document that objects returned by this function are to be treated as read-only, because they are retrieved directly from the cache, and the caller needs to make a copy if they want to modify it, or
- We return a copy of the object retrieved from the cache, to take the burden off the caller.
The first option puts more responsibility on the caller but is resource-efficient. The second option would ensure that callers can treat the objects freely, but at the cost of increased memory consumption.
I have not yet made up my mind with regards to which solution I'd prefer. Which one do you think makes more sense?
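The read-only contract can be illustrated with a stdlib-only sketch; the `App` type, its `DeepCopy` method, and the `cache` map below are stand-ins for the generated Application type and the informer's shared store, not the project's actual code:

```go
package main

import "fmt"

// App is a stand-in for the generated Application type.
type App struct {
	Name   string
	Labels map[string]string
}

// DeepCopy mirrors a generated deepcopy helper: it copies the
// struct and every reference-typed field.
func (a *App) DeepCopy() *App {
	out := &App{Name: a.Name, Labels: map[string]string{}}
	for k, v := range a.Labels {
		out.Labels[k] = v
	}
	return out
}

// cache simulates the informer's shared store.
var cache = map[string]*App{
	"guestbook": {Name: "guestbook", Labels: map[string]string{"env": "prod"}},
}

// GetUnsafe returns the cached pointer directly (option 1: the caller
// must treat the result as read-only).
func GetUnsafe(name string) *App { return cache[name] }

// GetCopy returns a deep copy (option 2: safe to mutate, costs an allocation).
func GetCopy(name string) *App { return cache[name].DeepCopy() }

func main() {
	a := GetCopy("guestbook")
	a.Labels["env"] = "dev" // safe: only the copy changes
	fmt.Println(cache["guestbook"].Labels["env"]) // → prod
}
```

With `GetUnsafe`, the same mutation would silently corrupt the shared cache for every other reader, which is the class of bug behind the failing tests.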
In option 1, when objects are retrieved through multiple informers, there is a risk that callers may forget to use DeepCopy(). Such oversights can make it difficult to trace and resolve bugs. On the other hand, callers can use resources efficiently and have the flexibility to decide when to create copies.
Option 2 always returns a copy of the object from the function, so callers cannot control when the copy is made. However, callers can still customize behavior by registering event handlers with the informer. Returning a copy from the function enforces consistent object usage and prevents inconsistent handling of copying across different callers.
Personally, I chose option 2 because I believe minimizing the potential for bugs and ensuring consistent and stable behavior throughout the codebase is important.
For these reasons, I have modified the code to return `app.DeepCopy(), nil`.
Additionally, I plan to update the List methods in the Kubernetes backend for the Application and AppProject resources to use the informer cache as well. I would appreciate it if we could merge this after those changes are complete.
It seems there are other problems with this approach, at least regarding the tests. Some of them are still failing.
There is no issue with loading the informer cache itself. However, in the current unit tests, a different problem arises. To load the cache, informer.Run must be executed, which requires calling the Start method of either the manager or the server. Until now, the code has been written without assuming that Start would be called in the test code. As a result, timing issues related to goroutine execution occur during test runs. How can we resolve this situation?
```go
wq.On("Get").Return(&ev, false)
wq.On("Done", &ev)
s, err := NewServer(context.Background(), fac, "argocd", WithGeneratedTokenSigningKey(), WithAutoNamespaceCreate(true, "", nil))
s.Start(context.Background(), make(chan error))
```
When testing this way, a Start call is required to load the informer cache.
Signed-off-by: yeonsoo <[email protected]>
Add configurable informer sync timeout and proper context handling to resolve intermittent test failures. Signed-off-by: yeonsoo <[email protected]>
Codecov Report: ❌ Patch coverage is

```
@@            Coverage Diff             @@
##             main     #525      +/-   ##
==========================================
+ Coverage   45.62%   45.73%   +0.11%
==========================================
  Files          90       90
  Lines       12029    12070      +41
==========================================
+ Hits         5488     5520      +32
- Misses       6096     6105       +9
  Partials      445      445
```
Add configurable informer timeout and Redis proxy disable for tests. Signed-off-by: yeonsoo <[email protected]>
Add configurable informer timeout and Redis proxy disable for tests. Implement cache-first with API server fallback in Get method. Signed-off-by: yeonsoo <[email protected]>
- Get() uses the cache by default, the API server when forUpdate=true
- update() sets the forUpdate context to ensure the latest ResourceVersion
- Prevents update conflicts while improving read performance

Signed-off-by: yeonsoo <[email protected]>
Hi, @jannfis! Kindly requesting your review. All tests have passed successfully. Regular Get() calls now use the informer cache, while Get() calls within the update retry loop are conditionally routed directly to the API server via a context flag.
Thank you @juanxiu! During review, I noticed that this PR actually contains three features instead of one: the ability to disable the Redis proxy, the ability to specify an informer sync timeout, and the use of the informer cache. It would be great if you could untangle these and submit them as separate PRs. That way, we can keep better track of things and independently revert changes, should it become necessary. Thank you.
Hi, @jannfis. The issues arose during testing of this feature:
Ultimately, the informer cache is the main feature, and the Redis proxy disable option is needed to test it properly. If they were separated, intermediate states without test coverage, or with unstable tests, would be merged. Thanks
Signed-off-by: Yeonsoo Kim <[email protected]>
Hi, @jannfis. I pulled the latest changes from the main branch and resolved the conflicts.
```go
forUpdate, _ := ctx.Value(backend.ForUpdateContextKey).(bool)

if !forUpdate && be.appLister != nil && be.appInformer != nil && be.appInformer.HasSynced() {
```
Can you please explain this construct?
Could you please take a look at my explanation below?
I had some comments. Also, if you could extract the disable-Redis-proxy code into a separate PR (just like with the informer cache), I'll merge it first, you can rebase this PR, and the functionality will be available to this PR too.
internal/informer/informer.go
```go
i.groupResource = schema.GroupResource{
	Group:    "argoproj.io",
	Resource: "applications",
}
```
With this hard-coded, every informer would only ever have a lister for Applications. I think the groupResource must be set from the outside, potentially at initialization, depending on the concrete type of the informer.
Thanks for the feedback. You're right; I removed the hard-coded groupResource. Instead, I added WithGroupResource[T](group, resource string) as a new InformerOption, which allows each informer to be configured with the correct groupResource at initialization time based on its concrete type.
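A generic functional option of that shape could look like the following self-contained sketch; the simplified `Informer` and `GroupResource` types are stand-ins for the project's informer and `schema.GroupResource`, so only the option pattern itself is the point:

```go
package main

import "fmt"

// GroupResource is a minimal stand-in for schema.GroupResource.
type GroupResource struct{ Group, Resource string }

// Informer is a simplified generic informer carrying its GroupResource.
type Informer[T any] struct {
	groupResource GroupResource
}

// InformerOption configures an Informer at construction time.
type InformerOption[T any] func(*Informer[T])

// WithGroupResource sets the GroupResource per concrete type instead
// of hard-coding it inside the informer.
func WithGroupResource[T any](group, resource string) InformerOption[T] {
	return func(i *Informer[T]) {
		i.groupResource = GroupResource{Group: group, Resource: resource}
	}
}

// NewInformer applies all options to a fresh informer.
func NewInformer[T any](opts ...InformerOption[T]) *Informer[T] {
	i := &Informer[T]{}
	for _, o := range opts {
		o(i)
	}
	return i
}

type Application struct{}
type AppProject struct{}

func main() {
	apps := NewInformer(WithGroupResource[Application]("argoproj.io", "applications"))
	projs := NewInformer(WithGroupResource[AppProject]("argoproj.io", "appprojects"))
	fmt.Println(apps.groupResource.Resource, projs.groupResource.Resource) // → applications appprojects
}
```

Because the option is parameterized on T, an Application informer and an AppProject informer each get their own correctly typed configuration.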
```diff
 err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
-	existing, ierr := m.applicationBackend.Get(ctx, incoming.Name, incoming.Namespace)
+	existing, ierr := m.applicationBackend.Get(ctxForUpdate, incoming.Name, incoming.Namespace)
```
The reason why `forUpdate` is necessary lies here. Kubernetes implements optimistic concurrency control through `resourceVersion`. When an Update request is made, the API server compares the `resourceVersion` in the request with the one currently stored, and if they differ, it returns a Conflict.

The issue arises when using the informer cache during updates. Although the informer is synchronized with the API server, there is a slight lag, so the cache may return a stale `resourceVersion`. As a result, the Update request fails, and even when retrying with `retry.RetryOnConflict()`, the same stale `resourceVersion` keeps being read from the cache, causing repeated failures.

In contrast, when `forUpdate=true`, the object is fetched directly from the API server, guaranteeing the latest `resourceVersion`. This ensures that retries see the latest version, allowing the Update to succeed.

Therefore, we use the informer cache for regular reads to optimize performance, and direct API reads during updates to ensure correctness.
What does this PR do / why we need it:
This PR enhances the KubernetesBackend by modifying the Get method to use the informer cache instead of querying the Kubernetes API server directly when retrieving Application resources. It leverages the generic Informer's Lister() method to efficiently access cached resources, thereby reducing load on the API server and improving performance. Corresponding unit tests have been added to verify correct informer cache usage. This change creates a foundation for more efficient resource management via the informer caching mechanism.
Which issue(s) this PR fixes:
Fixes #251
How to test changes / Special notes to the reviewer:
Checklist