
Conversation

@juanxiu juanxiu commented Aug 20, 2025

What does this PR do / why we need it:
This PR enhances the KubernetesBackend by modifying the Get method to use the informer cache instead of querying the Kubernetes API server directly when retrieving Application resources. It leverages the generic Informer's Lister() method to efficiently access cached resources, thereby reducing load on the API server and improving performance. Corresponding unit tests have been added to verify correct informer cache usage. This change creates a foundation for more efficient resource management via the informer caching mechanism.

Which issue(s) this PR fixes:

Fixes #251

How to test changes / Special notes to the reviewer:

  • The Get() method now uses the informer cache by default (see the sketch below).
  • Update-related Get() calls are distinguished from regular Get() calls via a ForUpdateContextKey context flag.
  • Added informer synchronization checks and ensured nil safety.
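
For reviewers, the resulting read path looks roughly like the sketch below. This is a minimal, self-contained approximation assembled from the snippets quoted later in this conversation; the typed clientset and lister wiring, the struct field names, and the local ForUpdateContextKey stand-in are assumptions for illustration, not the exact code in this PR (which uses the repository's generic informer).

package application

import (
	"context"

	"github.com/argoproj/argo-cd/v2/pkg/apis/application/v1alpha1"
	appclientset "github.com/argoproj/argo-cd/v2/pkg/client/clientset/versioned"
	applisters "github.com/argoproj/argo-cd/v2/pkg/client/listers/application/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

type contextKey string

// ForUpdateContextKey stands in for the real key in the backend package.
const ForUpdateContextKey contextKey = "forUpdate"

// KubernetesBackend is reduced here to the fields relevant to Get().
type KubernetesBackend struct {
	client      appclientset.Interface
	appLister   applisters.ApplicationLister
	appInformer cache.SharedIndexInformer
}

func (be *KubernetesBackend) Get(ctx context.Context, name, namespace string) (*v1alpha1.Application, error) {
	// Update paths set this flag so their reads bypass the cache.
	forUpdate, _ := ctx.Value(ForUpdateContextKey).(bool)

	// Serve from the informer cache only when it exists and has synced.
	if !forUpdate && be.appLister != nil && be.appInformer != nil && be.appInformer.HasSynced() {
		if app, err := be.appLister.Applications(namespace).Get(name); err == nil {
			// Cached objects are shared; return a copy the caller may mutate.
			return app.DeepCopy(), nil
		}
		// Cache miss: fall through to a direct read.
	}

	// Direct API read: guarantees the latest resourceVersion.
	return be.client.ArgoprojV1alpha1().Applications(namespace).Get(ctx, name, metav1.GetOptions{})
}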

Checklist

  • Documentation update is required by this PR (and has been updated) OR no documentation update is required.

@jannfis jannfis left a comment

Thanks @juanxiu for this PR.

I have a comment requiring some more discussion, PTAL.

app, ok := obj.(*v1alpha1.Application)
if !ok {
	return nil, fmt.Errorf("object is not an Application: %T", obj)
}
return app, nil
jannfis (Collaborator):

Items returned from the cache need to be treated as read-only. I believe that's why unrelated unit tests are failing and panicking.

There are two options imho:

  1. We clearly document that objects returned by this function are to be treated as read-only, because they are retrieved directly from the cache, and the caller needs to make a copy if they want to modify it, or

  2. We return a copy of the object retrieved from the cache, to take the burden off the caller.

The first option puts more responsibility on the caller but is resource-efficient. The second option would let the caller handle the objects freely, but at the cost of increased memory consumption.

I have not yet made up my mind about which solution I'd prefer. Which one do you think makes more sense?
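
Concretely, the two options differ only at the return site. An illustrative Go fragment (assuming app came from the lister):

// Option 1: document the contract; callers must copy before mutating.
return app, nil // caller is responsible for app.DeepCopy() before any write

// Option 2: copy inside the backend; callers own the result.
return app.DeepCopy(), nil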

juanxiu (Contributor Author):

In option 1, when objects are retrieved through multiple informers, there is a risk that callers forget to call DeepCopy(). Such oversights can make bugs difficult to trace and resolve. On the other hand, callers can use resources efficiently and retain the flexibility to decide when to create copies.

Option 2 always returns a copy of the object from the function, so callers cannot control when the copy is made. However, callers can still customize behavior by registering event handlers with the informer. Returning a copy from the function enforces consistent object usage and prevents inconsistent handling of copying across different callers.

Personally, I chose option 2 because I believe minimizing the potential for bugs and ensuring consistent and stable behavior throughout the codebase is important.

For these reasons, I have modified the code to return app.DeepCopy(), nil.
Additionally, I plan to update the List methods in the Kubernetes backend for Application and AppProject resources to use the informer cache as well. I would appreciate it if we could merge this after those changes are completed.

jannfis (Collaborator):

It seems there are other problems with this approach, at least regarding the tests. Some of them are still failing.

juanxiu (Contributor Author):

Loading the informer cache itself is not a problem. However, a different problem arises in the current unit tests: to load the cache, informer.Run must be executed, which requires calling the Start method of either the manager or the server. Until now, the code was written without assuming that Start would be called in test code. As a result, timing issues related to goroutine execution occur during test runs. How can we resolve this?

wq.On("Get").Return(&ev, false)
wq.On("Done", &ev)
s, err := NewServer(context.Background(), fac, "argocd", WithGeneratedTokenSigningKey(), WithAutoNamespaceCreate(true, "", nil))
s.Start(context.Background(), make(chan error))
juanxiu (Contributor Author):

When testing this way, a Start call is required to populate the informer cache.
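
One way to avoid the goroutine timing issues is to block until the cache has synced before asserting, rather than relying on sleeps. A sketch using client-go's standard helper (the appInformer accessor and the test variable names are assumptions):

// After s.Start(...), wait for the informer cache to fill.
syncCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

if !cache.WaitForCacheSync(syncCtx.Done(), appInformer.HasSynced) {
	t.Fatal("timed out waiting for informer cache to sync")
}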

Add configurable informer sync timeout and proper context handling
to resolve intermittent test failures.

Signed-off-by: yeonsoo <[email protected]>

codecov-commenter commented Oct 7, 2025

Codecov Report

❌ Patch coverage is 76.59574% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 45.73%. Comparing base (f39cfdc) to head (693ab36).

Files with missing lines                     | Patch %  | Lines
internal/informer/options.go                 | 0.00%    | 8 Missing ⚠️
internal/manager/application/application.go  | 57.14%   | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #525      +/-   ##
==========================================
+ Coverage   45.62%   45.73%   +0.11%     
==========================================
  Files          90       90              
  Lines       12029    12070      +41     
==========================================
+ Hits         5488     5520      +32     
- Misses       6096     6105       +9     
  Partials      445      445              


Add configurable informer timeout and Redis proxy disable for tests.

Signed-off-by: yeonsoo <[email protected]>
Add configurable informer timeout and Redis proxy disable for tests.
Implement cache-first with API server fallback in Get method.

Signed-off-by: yeonsoo <[email protected]>
- Get() uses cache by default, API server when forUpdate=true
- update() sets forUpdate context to ensure latest ResourceVersion
- Prevents update conflicts while improving read performance

Signed-off-by: yeonsoo <[email protected]>
@juanxiu juanxiu requested a review from jannfis October 8, 2025 09:59

juanxiu commented Oct 8, 2025

Hi, @jannfis ! Kindly requesting your review.

All tests have passed successfully. Regular Get() calls now use the informer cache, while Get() calls within the update retry loop are conditionally routed to directly query the API server via a context flag.


jannfis commented Oct 8, 2025

Thank you @juanxiu ! During review, I noticed that this PR actually contains three features instead of one: the ability to disable the Redis proxy, the ability to specify an informer sync timeout, and the use of the informer cache.

It would be great if you could untangle these, and submit as separate PRs. That way, we can keep better track of things and independently revert changes, should it be necessary. Thank you.


juanxiu commented Oct 8, 2025

Hi, @jannfis
Thank you for the review. I would like to explain the relationship between the three features.
The core feature is the addition of the informer cache. The goal was to improve performance by reading from the local cache instead of the API server in KubernetesBackend.Get().

The issues arose during testing of this feature:

  1. Redis proxy disable option:
    To initialize Principal, NewServer() must be called, which by default starts the Redis proxy. Since there was no Redis server in the test environment, the Principal itself could not be created. (All existing tests in principal/server_test.go also use WithRedisProxyDisabled().) Could I keep this change in this PR?

  2. Informer sync timeout setting:
    I submitted a separate PR for this: feat: informer sync timeout configuration for principal #603. That PR requires the Redis proxy disable setting; after this PR is merged, the related PR should also pass its tests.

Ultimately, the informer cache is the main feature, and the Redis proxy disable option is needed for its proper testing. If they were separated, intermediate states lacking test coverage, or with unstable tests, would end up merged.
Is there a better approach?

Thanks

juanxiu added a commit to juanxiu/argocd-agent that referenced this pull request Oct 8, 2025

juanxiu commented Oct 9, 2025

Hi, @jannfis

I pulled the latest changes from the main branch and resolved the conflicts.
If there are no further issues, could we proceed with merging this PR?

Comment on lines +92 to +94
forUpdate, _ := ctx.Value(backend.ForUpdateContextKey).(bool)

if !forUpdate && be.appLister != nil && be.appInformer != nil && be.appInformer.HasSynced() {
jannfis (Collaborator):

Can you please explain this construct?

juanxiu (Contributor Author):

Could you please take a look at my explanation below?

@jannfis jannfis left a comment

I had some comments. Also, if you could extract the disable-Redis-proxy code into a separate PR (just like with the informer cache), I'll merge it first, you rebase this PR, and the functionality will be available to this PR too.

Comment on lines 117 to 120
i.groupResource = schema.GroupResource{
Group: "argoproj.io",
Resource: "applications",
}
jannfis (Collaborator):

With this hard-coded, every informer would only ever have a lister for Applications. I think the groupResource must be set from the outside, potentially at initialization, depending on the concrete type of the informer.

juanxiu (Contributor Author):

Thanks for the feedback. You're right; I removed the hard-coded groupResource.
Instead, I added WithGroupResource[T](group, resource string) as a new InformerOption, which allows each informer to be configured with the correct groupResource at initialization time based on its concrete type.
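
A rough sketch of what such an option could look like (the real Informer[T] type and option signature live in internal/informer and may differ):

package informer

import "k8s.io/apimachinery/pkg/runtime/schema"

// Informer is reduced here to the one field that matters for the sketch.
type Informer[T any] struct {
	groupResource schema.GroupResource
}

// InformerOption configures an Informer at construction time.
type InformerOption[T any] func(i *Informer[T]) error

// WithGroupResource sets the GroupResource matching the informer's
// concrete type, replacing the previously hard-coded value.
func WithGroupResource[T any](group, resource string) InformerOption[T] {
	return func(i *Informer[T]) error {
		i.groupResource = schema.GroupResource{Group: group, Resource: resource}
		return nil
	}
}

An Application informer would then be constructed with WithGroupResource[v1alpha1.Application]("argoproj.io", "applications"), and an AppProject informer with its own pair.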


 err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
-	existing, ierr := m.applicationBackend.Get(ctx, incoming.Name, incoming.Namespace)
+	existing, ierr := m.applicationBackend.Get(ctxForUpdate, incoming.Name, incoming.Namespace)
@juanxiu juanxiu commented Oct 9, 2025

@jannfis

The reason why forUpdate is necessary lies here. Kubernetes implements Optimistic Concurrency Control through resourceVersion. When an Update request is made, the API server compares the resourceVersion in the request with the one currently stored, and if they differ, it returns a Conflict.

The issue arises when using the informer cache during updates. Although the informer is synchronized with the API server, there is a slight lag, meaning that the cache may return a stale resourceVersion. As a result, the Update request fails, and even when using retry.RetryOnConflict() to retry, it keeps reading the same stale resourceVersion from the cache, causing repeated failures.

In contrast, when forUpdate=true, the object is fetched directly from the API server, guaranteeing the latest resourceVersion. This ensures that during retries, the latest version is retrieved, allowing the Update to succeed.

Therefore, we use the informer cache for regular reads to optimize performance, and direct API reads during updates to ensure correctness.
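
Putting it together, the update path could read as follows. A hedged sketch: ForUpdateContextKey and retry.RetryOnConflict are real identifiers from the PR and client-go respectively, while the backend's Update signature and the spec copy are illustrative.

// Force API-server reads inside the retry loop.
ctxForUpdate := context.WithValue(ctx, backend.ForUpdateContextKey, true)

err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
	// Bypasses the informer cache, so every retry observes the
	// latest resourceVersion from the API server.
	existing, ierr := m.applicationBackend.Get(ctxForUpdate, incoming.Name, incoming.Namespace)
	if ierr != nil {
		return ierr
	}
	existing.Spec = incoming.Spec
	_, ierr = m.applicationBackend.Update(ctx, existing)
	return ierr
})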

Successfully merging this pull request may close these issues:

Kubernetes backend should use informer cache for retrieving resources