🌱 Refactor `catalogmetadata` client and cache #1318

m1kola · 2024-09-29T18:14:52Z

Description

It is now possible to cache errors occured on the cache population. This will be useful when we start proactively populating cache from ClusterCatalogReconciler instead of doing this on-demand.

This refactoring also moves the responsibility of performing network requests out of the cache backend into the client.

This is done in preparation for #1284

Reviewer Checklist

API Go Documentation
Tests: Unit Tests (and E2E Tests, if appropriate)
Comprehensive Commit Messages
Links to related GitHub Issue(s)

netlify · 2024-09-29T18:15:19Z

✅ Deploy Preview for olmv1 ready!

Name	Link
🔨 Latest commit	`c6b3b5d`
🔍 Latest deploy log	https://app.netlify.com/sites/olmv1/deploys/66fe674202b9890008494087
😎 Deploy Preview	https://deploy-preview-1318--olmv1.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

codecov · 2024-09-29T18:24:00Z

Codecov Report

Attention: Patch coverage is 89.18919% with 8 lines in your changes missing coverage. Please review.

Project coverage is 76.37%. Comparing base (801388a) to head (c6b3b5d).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/catalogmetadata/client/client.go	81.39%	4 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1318      +/-   ##
==========================================
+ Coverage   76.17%   76.37%   +0.20%     
==========================================
  Files          41       41              
  Lines        2409     2438      +29     
==========================================
+ Hits         1835     1862      +27     
- Misses        401      402       +1     
- Partials      173      174       +1

Flag	Coverage Δ
e2e	`58.44% <64.86%> (+0.12%)`	⬆️
unit	`53.65% <85.13%> (+0.47%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

everettraven · 2024-10-02T18:07:48Z

internal/catalogmetadata/cache/cache_test.go

+	assert.NoError(t, equalFilesystems(defaultFS(), actualFSPut))
+	assert.NoError(t, equalFilesystems(actualFSPut, actualFSGet))
+
+	t.Log("Put v2 error into cache")


Should we add a test after populating something for a newly resolved ref that doing a GET for the old resolved ref returns a nil cache and nil error (IIUC that is the intended behavior)?

To clarify the scenario I am thinking of using terminology in the test log output here:

Put v1 into cache

Put v2 into cache (actual OR error)

Get v1 from cache (expect nil here since this is an old revision!)

@everettraven there is no implicit cache invalidation on Put or Get: Remove has to be called explicitly. On a higher level - we have this controller which watches catalogs and does cache removal (introduced in #1207).

So in your example "Get v1 from cache" will return old cache unless we do Remove before get.

Actually you are right - "Get v1 from cache" will return nil because we map by catalog name and override on Put. This is a bug: I want Remove to work even if we do a Put of another version before Remove.

I think #1207 (and main) has the same issue...

Another actually - I confused myself, but you are still right :)

The idea was that we only remove cache on catalog deletion. Anything else would require a finaliser on a catalog which we decided not to have. We are not keeping older versions of cache on the FS: on put we implicitly delete FS cache and basically replace it with a new version. It worked like this before refactoring too.

So yes, "Get v1 from cache" after "Put v2 into cache" should return nil. I'll update the PR with a relevant test.

everettraven

Had a comment about potentially adding another test case, but overall this refactor looks really good!

It is now possible to cache errors occured on the cache population. This will be useful when we start proactively populating cache from `ClusterCatalogReconciler` instead of doing this on-demand. This refactoring also moves the responsibility of performing network requests out of the cache backend into the client. Signed-off-by: Mikalai Radchuk <[email protected]>

everettraven · 2024-10-03T15:06:24Z

internal/catalogmetadata/cache/cache.go

+	// the cached contents.
+	if cache, err := fsc.get(catalogName, resolvedRef); err == nil && cache != nil {
+		// We only return here if the was no error during
+		// the previous (likely concurrent) cache population attempt.
+		// If there was an error - we want to try and populate the cache again.
+		return cache, nil
+	}


Apologies for missing this the first time around, but do we actually still need this?

I think this was put in place with more complex thread safety logic than we require now since we are doing thread-safe "Put" and "Get" operations. Refreshing my memory on the previous logic, it was done for the specific scenario of:

We do a read-only lock to verify if the content has been cached before

We release the read-only lock once we have verified that either we have or have not cached this content before

Some other thread could now potentially get a write lock and write the contents for this catalog to the cache

When we go to write content for the catalog to the cache, we get a write lock and re-verify whether or not it was written by some other process. We let it win because we don't know whether our revision of the content is newer or not.

I don't think the same scenario is possible here since we are doing a lock for the entire duration of the Put method and as such we can probably discard this initial get and ensure Put has the single responsibility of always writing the new cache.

That being said, it may still be valid to say "we have successfully cached this revision before so we will ignore a failed put for the same revision". I would lean more towards putting the responsibility of this verification to the caller though to simplify the responsibility of these cache functions.

@everettraven it is still possible that two callers still do the following concurrently:

Get cache - it is missing - we lock for reading

Do an HTTP request - we do not lock this

Write into FS - we lock this

So if this happens we have and we remove the get from Put we have two redundant things: 1) network request 2) FS IO.

So if we remove this check we are risking two sequential writes to FS which will make reads (from all catalogs) blocked as well. With this check we ensure that only one FS write happens even if there was multiple concurrent attempts to fetch cache.

Right. I get that it would still be possible for that scenario to occur, but I'm questioning whether it should be the responsibility of the Put method to handle that or the caller. I would think the caller should take that into consideration as opposed to the Put method.

@everettraven if we make this responsibility of the caller then the caller will have to be responsible for locking of the mutex as well. And we probably don't want to make locking the caller's responsibility.

Or am I still not getting the idea?

everettraven

Had one comment about the responsibility of the Put method, but I don't think that it should hold this PR because:

It is an internal package, we can modify it as much as we want
It isn't any worse than what we had before

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 29, 2024

m1kola mentioned this pull request Sep 29, 2024

✨ Populate/update cache on ClusterCatalog reconcile #1284

Merged

4 tasks

m1kola force-pushed the refactor_caching branch from aad81dd to 2c2019c Compare September 29, 2024 18:17

m1kola marked this pull request as ready for review September 29, 2024 18:27

m1kola requested a review from a team as a code owner September 29, 2024 18:27

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 29, 2024

everettraven reviewed Oct 2, 2024

View reviewed changes

everettraven previously approved these changes Oct 2, 2024

View reviewed changes

m1kola marked this pull request as draft October 3, 2024 08:26

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2024

m1kola dismissed everettraven’s stale review via 718473c October 3, 2024 08:35

m1kola force-pushed the refactor_caching branch 2 times, most recently from 718473c to d548612 Compare October 3, 2024 09:14

m1kola force-pushed the refactor_caching branch from d548612 to c6b3b5d Compare October 3, 2024 09:43

m1kola marked this pull request as ready for review October 3, 2024 10:09

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2024

m1kola requested a review from everettraven October 3, 2024 10:09

everettraven reviewed Oct 3, 2024

View reviewed changes

everettraven approved these changes Oct 3, 2024

View reviewed changes

everettraven added this pull request to the merge queue Oct 3, 2024

Merged via the queue into operator-framework:main with commit 117bece Oct 3, 2024
18 checks passed

m1kola deleted the refactor_caching branch October 3, 2024 19:10

m1kola mentioned this pull request Nov 26, 2024

🐛 catalog cache: retry cache population when cache contains an error #1489

Merged

4 tasks

🌱 Refactor catalogmetadata client and cache #1318

🌱 Refactor catalogmetadata client and cache #1318

Uh oh!

Conversation

m1kola commented Sep 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Reviewer Checklist

Uh oh!

netlify bot commented Sep 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for olmv1 ready!

Uh oh!

codecov bot commented Sep 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

everettraven Oct 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

m1kola Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

m1kola Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

m1kola Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

everettraven left a comment

Choose a reason for hiding this comment

Uh oh!

everettraven Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

everettraven Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

m1kola Oct 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

everettraven Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

m1kola Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

everettraven left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

🌱 Refactor `catalogmetadata` client and cache #1318

🌱 Refactor `catalogmetadata` client and cache #1318

m1kola commented Sep 29, 2024 •

edited

Loading

netlify bot commented Sep 29, 2024 •

edited

Loading

codecov bot commented Sep 29, 2024 •

edited

Loading

everettraven Oct 2, 2024 •

edited

Loading

m1kola Oct 3, 2024 •

edited

Loading