Skip to content

Conversation

m1kola
Copy link
Member

@m1kola m1kola commented Sep 29, 2024

Description

It is now possible to cache errors occured on the cache population. This will be useful when we start proactively populating cache from ClusterCatalogReconciler instead of doing this on-demand.

This refactoring also moves the responsibility of performing network requests out of the cache backend into the client.

This is done in preparation for #1284

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 29, 2024
@netlify
Copy link

netlify bot commented Sep 29, 2024

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit c6b3b5d
🔍 Latest deploy log https://app.netlify.com/sites/olmv1/deploys/66fe674202b9890008494087
😎 Deploy Preview https://deploy-preview-1318--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@codecov
Copy link

codecov bot commented Sep 29, 2024

Codecov Report

Attention: Patch coverage is 89.18919% with 8 lines in your changes missing coverage. Please review.

Project coverage is 76.37%. Comparing base (801388a) to head (c6b3b5d).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
internal/catalogmetadata/client/client.go 81.39% 4 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1318      +/-   ##
==========================================
+ Coverage   76.17%   76.37%   +0.20%     
==========================================
  Files          41       41              
  Lines        2409     2438      +29     
==========================================
+ Hits         1835     1862      +27     
- Misses        401      402       +1     
- Partials      173      174       +1     
Flag Coverage Δ
e2e 58.44% <64.86%> (+0.12%) ⬆️
unit 53.65% <85.13%> (+0.47%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@m1kola m1kola marked this pull request as ready for review September 29, 2024 18:27
@m1kola m1kola requested a review from a team as a code owner September 29, 2024 18:27
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 29, 2024
assert.NoError(t, equalFilesystems(defaultFS(), actualFSPut))
assert.NoError(t, equalFilesystems(actualFSPut, actualFSGet))

t.Log("Put v2 error into cache")
Copy link
Contributor

@everettraven everettraven Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a test after populating something for a newly resolved ref that doing a GET for the old resolved ref returns a nil cache and nil error (IIUC that is the intended behavior)?

To clarify the scenario I am thinking of using terminology in the test log output here:

  • Put v1 into cache
  • Put v2 into cache (actual OR error)
  • Get v1 from cache (expect nil here since this is an old revision!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@everettraven there is no implicit cache invalidation on Put or Get: Remove has to be called explicitly. On a higher level - we have this controller which watches catalogs and does cache removal (introduced in #1207).

So in your example "Get v1 from cache" will return old cache unless we do Remove before get.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually you are right - "Get v1 from cache" will return nil because we map by catalog name and override on Put. This is a bug: I want Remove to work even if we do a Put of another version before Remove.

I think #1207 (and main) has the same issue...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another actually - I confused myself, but you are still right :)

The idea was that we only remove cache on catalog deletion. Anything else would require a finaliser on a catalog which we decided not to have. We are not keeping older versions of cache on the FS: on put we implicitly delete FS cache and basically replace it with a new version. It worked like this before refactoring too.

So yes, "Get v1 from cache" after "Put v2 into cache" should return nil. I'll update the PR with a relevant test.

everettraven
everettraven previously approved these changes Oct 2, 2024
Copy link
Contributor

@everettraven everettraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a comment about potentially adding another test case, but overall this refactor looks really good!

@m1kola m1kola marked this pull request as draft October 3, 2024 08:26
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2024
@m1kola m1kola force-pushed the refactor_caching branch 2 times, most recently from 718473c to d548612 Compare October 3, 2024 09:14
It is now possible to cache errors occured on the cache population.
This will be useful when we start proactively populating cache
from `ClusterCatalogReconciler` instead of doing this on-demand.

This refactoring also moves the responsibility of performing
network requests out of the cache backend into the client.

Signed-off-by: Mikalai Radchuk <[email protected]>
@m1kola m1kola marked this pull request as ready for review October 3, 2024 10:09
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2024
@m1kola m1kola requested a review from everettraven October 3, 2024 10:09
Comment on lines +68 to +74
// the cached contents.
if cache, err := fsc.get(catalogName, resolvedRef); err == nil && cache != nil {
// We only return here if the was no error during
// the previous (likely concurrent) cache population attempt.
// If there was an error - we want to try and populate the cache again.
return cache, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for missing this the first time around, but do we actually still need this?

I think this was put in place with more complex thread safety logic than we require now since we are doing thread-safe "Put" and "Get" operations. Refreshing my memory on the previous logic, it was done for the specific scenario of:

  • We do a read-only lock to verify if the content has been cached before
  • We release the read-only lock once we have verified that either we have or have not cached this content before
  • Some other thread could now potentially get a write lock and write the contents for this catalog to the cache
  • When we go to write content for the catalog to the cache, we get a write lock and re-verify whether or not it was written by some other process. We let it win because we don't know whether our revision of the content is newer or not.

I don't think the same scenario is possible here since we are doing a lock for the entire duration of the Put method and as such we can probably discard this initial get and ensure Put has the single responsibility of always writing the new cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, it may still be valid to say "we have successfully cached this revision before so we will ignore a failed put for the same revision". I would lean more towards putting the responsibility of this verification to the caller though to simplify the responsibility of these cache functions.

Copy link
Member Author

@m1kola m1kola Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@everettraven it is still possible that two callers still do the following concurrently:

  1. Get cache - it is missing - we lock for reading
  2. Do an HTTP request - we do not lock this
  3. Write into FS - we lock this

So if this happens we have and we remove the get from Put we have two redundant things: 1) network request 2) FS IO.

So if we remove this check we are risking two sequential writes to FS which will make reads (from all catalogs) blocked as well. With this check we ensure that only one FS write happens even if there was multiple concurrent attempts to fetch cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I get that it would still be possible for that scenario to occur, but I'm questioning whether it should be the responsibility of the Put method to handle that or the caller. I would think the caller should take that into consideration as opposed to the Put method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@everettraven if we make this responsibility of the caller then the caller will have to be responsible for locking of the mutex as well. And we probably don't want to make locking the caller's responsibility.

Or am I still not getting the idea?

Copy link
Contributor

@everettraven everettraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had one comment about the responsibility of the Put method, but I don't think that it should hold this PR because:

  • It is an internal package, we can modify it as much as we want
  • It isn't any worse than what we had before

@everettraven everettraven added this pull request to the merge queue Oct 3, 2024
Merged via the queue into operator-framework:main with commit 117bece Oct 3, 2024
18 checks passed
@m1kola m1kola deleted the refactor_caching branch October 3, 2024 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants