
Conversation

@juanxiu juanxiu commented Aug 20, 2025

What does this PR do / why we need it:
This PR enhances the KubernetesBackend by modifying the Get method to use the informer cache instead of querying the Kubernetes API server directly when retrieving Application resources. It leverages the generic Informer's Lister() method to efficiently access cached resources, thereby reducing load on the API server and improving performance. Corresponding unit tests have been added to verify correct informer cache usage. This change creates a foundation for more efficient resource management via the informer caching mechanism.

Which issue(s) this PR fixes:

Fixes #251

How to test changes / Special notes to the reviewer:

  • The Get() method now uses the informer cache by default (see the sketch below).
  • Update-related Get() calls are distinguished from regular Get() calls via a ForUpdateContextKey context flag.
  • Added informer synchronization checks and ensured nil safety.
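
For reviewers, the resulting read path looks roughly like the sketch below. This is a minimal, self-contained approximation assembled from the snippets quoted later in this conversation; the typed clientset and lister wiring, the struct field names, and the local ForUpdateContextKey stand-in are assumptions for illustration, not the exact code in this PR (which uses the repository's generic informer).

package application

import (
	"context"

	"github.com/argoproj/argo-cd/v2/pkg/apis/application/v1alpha1"
	appclientset "github.com/argoproj/argo-cd/v2/pkg/client/clientset/versioned"
	applisters "github.com/argoproj/argo-cd/v2/pkg/client/listers/application/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

type contextKey string

// ForUpdateContextKey stands in for the real key in the backend package.
const ForUpdateContextKey contextKey = "forUpdate"

// KubernetesBackend is reduced here to the fields relevant to Get().
type KubernetesBackend struct {
	client      appclientset.Interface
	appLister   applisters.ApplicationLister
	appInformer cache.SharedIndexInformer
}

func (be *KubernetesBackend) Get(ctx context.Context, name, namespace string) (*v1alpha1.Application, error) {
	// Update paths set this flag so their reads bypass the cache.
	forUpdate, _ := ctx.Value(ForUpdateContextKey).(bool)

	// Serve from the informer cache only when it exists and has synced.
	if !forUpdate && be.appLister != nil && be.appInformer != nil && be.appInformer.HasSynced() {
		if app, err := be.appLister.Applications(namespace).Get(name); err == nil {
			// Cached objects are shared; return a copy the caller may mutate.
			return app.DeepCopy(), nil
		}
		// Cache miss: fall through to a direct read.
	}

	// Direct API read: guarantees the latest resourceVersion.
	return be.client.ArgoprojV1alpha1().Applications(namespace).Get(ctx, name, metav1.GetOptions{})
}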

Checklist

  • Documentation update is required by this PR (and has been updated) OR no documentation update is required.

@jannfis jannfis left a comment

Thanks @juanxiu for this PR.

I have a comment requiring some more discussion, PTAL.

app, ok := obj.(*v1alpha1.Application)
if !ok {
	return nil, fmt.Errorf("object is not an Application: %T", obj)
}
return app, nil
jannfis (Collaborator):

Items returned from the cache need to be treated as read-only. I believe that's why unrelated unit tests are failing and panicking.

There are two options imho:

  1. We clearly document that objects returned by this function are to be treated as read-only, because they are retrieved directly from the cache, and the caller needs to make a copy if they want to modify it, or

  2. We return a copy of the object retrieved from the cache, to take the burden off the caller.

The first option puts more responsibility on the caller but is resource-efficient. The second option would let the caller handle the objects freely, but at the cost of increased memory consumption.

I have not yet made up my mind about which solution I'd prefer. Which one do you think makes more sense?
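
Concretely, the two options differ only at the return site. An illustrative Go fragment (assuming app came from the lister):

// Option 1: document the contract; callers must copy before mutating.
return app, nil // caller is responsible for app.DeepCopy() before any write

// Option 2: copy inside the backend; callers own the result.
return app.DeepCopy(), nil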

juanxiu (Contributor Author):

In option 1, when objects are retrieved through multiple informers, there is a risk that callers forget to call DeepCopy(). Such oversights can make bugs difficult to trace and resolve. On the other hand, callers can use resources efficiently and retain the flexibility to decide when to create copies.

Option 2 always returns a copy of the object from the function, so callers cannot control when the copy is made. However, callers can still customize behavior by registering event handlers with the informer. Returning a copy from the function enforces consistent object usage and prevents inconsistent handling of copying across different callers.

Personally, I chose option 2 because I believe minimizing the potential for bugs and ensuring consistent and stable behavior throughout the codebase is important.

For these reasons, I have modified the code to return app.DeepCopy(), nil.
Additionally, I plan to update the List methods in the Kubernetes backend for Application and AppProject resources to use the informer cache as well. I would appreciate it if we could merge this after those changes are completed.

jannfis (Collaborator):

It seems there are other problems with this approach, at least regarding the tests. Some of them are still failing.

juanxiu (Contributor Author):

Loading the informer cache itself is not a problem. However, a different problem arises in the current unit tests: to load the cache, informer.Run must be executed, which requires calling the Start method of either the manager or the server. Until now, the code was written without assuming that Start would be called in test code. As a result, timing issues related to goroutine execution occur during test runs. How can we resolve this?

wq.On("Get").Return(&ev, false)
wq.On("Done", &ev)
s, err := NewServer(context.Background(), fac, "argocd", WithGeneratedTokenSigningKey(), WithAutoNamespaceCreate(true, "", nil))
s.Start(context.Background(), make(chan error))
juanxiu (Contributor Author):

When testing this way, a Start call is required to populate the informer cache.
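
One way to avoid the goroutine timing issues is to block until the cache has synced before asserting, rather than relying on sleeps. A sketch using client-go's standard helper (the appInformer accessor and the test variable names are assumptions):

// After s.Start(...), wait for the informer cache to fill.
syncCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

if !cache.WaitForCacheSync(syncCtx.Done(), appInformer.HasSynced) {
	t.Fatal("timed out waiting for informer cache to sync")
}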

Add configurable informer sync timeout and proper context handling
to resolve intermittent test failures.

Signed-off-by: yeonsoo <[email protected]>

codecov-commenter commented Oct 7, 2025

Codecov Report

❌ Patch coverage is 76.59574% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 45.73%. Comparing base (f39cfdc) to head (693ab36).

Files with missing lines                     | Patch %  | Lines
internal/informer/options.go                 | 0.00%    | 8 Missing ⚠️
internal/manager/application/application.go  | 57.14%   | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #525      +/-   ##
==========================================
+ Coverage   45.62%   45.73%   +0.11%     
==========================================
  Files          90       90              
  Lines       12029    12070      +41     
==========================================
+ Hits         5488     5520      +32     
- Misses       6096     6105       +9     
  Partials      445      445              


Add configurable informer timeout and Redis proxy disable for tests.

Signed-off-by: yeonsoo <[email protected]>
Add configurable informer timeout and Redis proxy disable for tests.
Implement cache-first with API server fallback in Get method.

Signed-off-by: yeonsoo <[email protected]>
- Get() uses cache by default, API server when forUpdate=true
- update() sets forUpdate context to ensure latest ResourceVersion
- Prevents update conflicts while improving read performance

Signed-off-by: yeonsoo <[email protected]>
@juanxiu juanxiu requested a review from jannfis October 8, 2025 09:59

juanxiu commented Oct 8, 2025

Hi, @jannfis ! Kindly requesting your review.

All tests have passed successfully. Regular Get() calls now use the informer cache, while Get() calls within the update retry loop are conditionally routed to directly query the API server via a context flag.


jannfis commented Oct 8, 2025

Thank you @juanxiu ! During review, I noticed that this PR actually contains three features instead of one: the ability to disable the Redis proxy, the ability to specify an informer sync timeout, and the use of the informer cache.

It would be great if you could untangle these, and submit as separate PRs. That way, we can keep better track of things and independently revert changes, should it be necessary. Thank you.


juanxiu commented Oct 8, 2025

Hi, @jannfis
Thank you for the review. I would like to explain the relationship between the three features.
The core feature is the addition of the informer cache. The goal was to improve performance by reading from the local cache instead of the API server in KubernetesBackend.Get().

The issues arose during testing of this feature:

  1. Redis proxy disable option:
    To initialize Principal, NewServer() must be called, which by default starts the Redis proxy. Since there was no Redis server in the test environment, the Principal itself could not be created. (All existing tests in principal/server_test.go also use WithRedisProxyDisabled().) Could I keep this change in this PR?

  2. Informer sync timeout setting:
    I submitted a separate PR for this: feat: informer sync timeout configuration for principal #603. That PR requires the Redis proxy disable setting; after this PR is merged, the related PR should also pass its tests.

Ultimately, the informer cache is the main feature, and the Redis proxy disable option is needed for its proper testing. If they were separated, intermediate states lacking test coverage, or with unstable tests, would end up merged.
Is there a better approach?

Thanks

juanxiu added a commit to juanxiu/argocd-agent that referenced this pull request Oct 8, 2025

juanxiu commented Oct 9, 2025

Hi, @jannfis

I pulled the latest changes from the main branch and resolved the conflicts.
If there are no further issues, could we proceed with merging this PR?

Comment on lines +92 to +94
forUpdate, _ := ctx.Value(backend.ForUpdateContextKey).(bool)

if !forUpdate && be.appLister != nil && be.appInformer != nil && be.appInformer.HasSynced() {
jannfis (Collaborator):

Can you please explain this construct?

juanxiu (Contributor Author):

Could you please take a look at my explanation below?

@jannfis jannfis left a comment

I had some comments. Also, if you could extract the disable-Redis-proxy code into a separate PR (just like with the informer cache), I'll merge it first, you rebase this PR, and the functionality will be available to this PR too.

Comment on lines 117 to 120
i.groupResource = schema.GroupResource{
Group: "argoproj.io",
Resource: "applications",
}
jannfis (Collaborator):

With this hard-coded, every informer would only ever have a lister for Applications. I think the groupResource must be set from the outside, potentially at initialization, depending on the concrete type of the informer.

juanxiu (Contributor Author):

Thanks for the feedback. You're right; I removed the hard-coded groupResource.
Instead, I added WithGroupResource[T](group, resource string) as a new InformerOption, which allows each informer to be configured with the correct groupResource at initialization time based on its concrete type.
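
A rough sketch of what such an option could look like (the real Informer[T] type and option signature live in internal/informer and may differ):

package informer

import "k8s.io/apimachinery/pkg/runtime/schema"

// Informer is reduced here to the one field that matters for the sketch.
type Informer[T any] struct {
	groupResource schema.GroupResource
}

// InformerOption configures an Informer at construction time.
type InformerOption[T any] func(i *Informer[T]) error

// WithGroupResource sets the GroupResource matching the informer's
// concrete type, replacing the previously hard-coded value.
func WithGroupResource[T any](group, resource string) InformerOption[T] {
	return func(i *Informer[T]) error {
		i.groupResource = schema.GroupResource{Group: group, Resource: resource}
		return nil
	}
}

An Application informer would then be constructed with WithGroupResource[v1alpha1.Application]("argoproj.io", "applications"), and an AppProject informer with its own pair.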


 err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
-	existing, ierr := m.applicationBackend.Get(ctx, incoming.Name, incoming.Namespace)
+	existing, ierr := m.applicationBackend.Get(ctxForUpdate, incoming.Name, incoming.Namespace)
@juanxiu juanxiu commented Oct 9, 2025

@jannfis

The reason why forUpdate is necessary lies here. Kubernetes implements Optimistic Concurrency Control through resourceVersion. When an Update request is made, the API server compares the resourceVersion in the request with the one currently stored, and if they differ, it returns a Conflict.

The issue arises when using the informer cache during updates. Although the informer is synchronized with the API server, there is a slight lag, meaning that the cache may return a stale resourceVersion. As a result, the Update request fails, and even when using retry.RetryOnConflict() to retry, it keeps reading the same stale resourceVersion from the cache, causing repeated failures.

In contrast, when forUpdate=true, the object is fetched directly from the API server, guaranteeing the latest resourceVersion. This ensures that during retries, the latest version is retrieved, allowing the Update to succeed.

Therefore, we use the informer cache for regular reads to optimize performance, and direct API reads during updates to ensure correctness.
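
Putting it together, the update path could read as follows. A hedged sketch: ForUpdateContextKey and retry.RetryOnConflict are real identifiers from the PR and client-go respectively, while the backend's Update signature and the spec copy are illustrative.

// Force API-server reads inside the retry loop.
ctxForUpdate := context.WithValue(ctx, backend.ForUpdateContextKey, true)

err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
	// Bypasses the informer cache, so every retry observes the
	// latest resourceVersion from the API server.
	existing, ierr := m.applicationBackend.Get(ctxForUpdate, incoming.Name, incoming.Namespace)
	if ierr != nil {
		return ierr
	}
	existing.Spec = incoming.Spec
	_, ierr = m.applicationBackend.Update(ctx, existing)
	return ierr
})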

Successfully merging this pull request may close these issues:

Kubernetes backend should use informer cache for retrieving resources