🌱 Refactor catalogmetadata client and cache #1318
Conversation
✅ Deploy Preview for olmv1 ready!
Force-pushed from aad81dd to 2c2019c (Compare)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1318 +/- ##
==========================================
+ Coverage 76.17% 76.37% +0.20%
==========================================
Files 41 41
Lines 2409 2438 +29
==========================================
+ Hits 1835 1862 +27
- Misses 401 402 +1
- Partials 173 174 +1
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
assert.NoError(t, equalFilesystems(defaultFS(), actualFSPut))
assert.NoError(t, equalFilesystems(actualFSPut, actualFSGet))

t.Log("Put v2 error into cache")
Should we add a test that, after populating the cache for a newly resolved ref, a Get for the old resolved ref returns a nil cache and nil error (IIUC that is the intended behavior)?
To clarify the scenario I am thinking of, using the terminology from the test log output here:
- Put v1 into cache
- Put v2 into cache (actual OR error)
- Get v1 from cache (expect nil here since this is an old revision!)
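A rough, self-contained sketch of that test case. The fakeCache below is a toy stand-in that only mirrors the semantics under discussion (entries keyed by catalog name, a Put of a new resolved ref replacing the old one); its names and signatures are assumptions, not the cache API from this PR.

package cache_test

import (
	"io"
	"io/fs"
	"strings"
	"testing"
	"testing/fstest"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// fakeCache is a toy in-memory stand-in, only here to make the sketch runnable.
type fakeCache struct {
	refs     map[string]string
	contents map[string]fs.FS
}

func (c *fakeCache) Put(catalogName, resolvedRef string, data io.Reader) (fs.FS, error) {
	b, err := io.ReadAll(data)
	if err != nil {
		return nil, err
	}
	fsys := fstest.MapFS{"catalog.json": &fstest.MapFile{Data: b}}
	c.refs[catalogName] = resolvedRef
	c.contents[catalogName] = fsys
	return fsys, nil
}

func (c *fakeCache) Get(catalogName, resolvedRef string) (fs.FS, error) {
	// Old revisions are not kept around: a ref mismatch is a cache miss.
	if c.refs[catalogName] != resolvedRef {
		return nil, nil
	}
	return c.contents[catalogName], nil
}

func TestGetOldRefAfterNewPut(t *testing.T) {
	c := &fakeCache{refs: map[string]string{}, contents: map[string]fs.FS{}}

	t.Log("Put v1 into cache")
	_, err := c.Put("my-catalog", "fake/catalog@sha256:v1", strings.NewReader("v1"))
	require.NoError(t, err)

	t.Log("Put v2 into cache")
	_, err = c.Put("my-catalog", "fake/catalog@sha256:v2", strings.NewReader("v2"))
	require.NoError(t, err)

	t.Log("Get v1 from cache - expect nil cache and nil error since v1 is an old revision")
	fsys, err := c.Get("my-catalog", "fake/catalog@sha256:v1")
	assert.NoError(t, err)
	assert.Nil(t, fsys)
}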
@everettraven there is no implicit cache invalidation on `Put` or `Get`: `Remove` has to be called explicitly. On a higher level, we have a controller which watches catalogs and does cache removal (introduced in #1207).
So in your example "Get v1 from cache" will return the old cache unless we do `Remove` before the get.
Actually you are right - "Get v1 from cache" will return `nil` because we map by catalog name and overwrite on `Put`. This is a bug: I want `Remove` to work even if we do a `Put` of another version before `Remove`.
I think #1207 (and `main`) has the same issue...
Another actually - I confused myself, but you are still right :)
The idea was that we only remove cache on catalog deletion. Anything else would require a finaliser on a catalog, which we decided not to have. We are not keeping older versions of the cache on the FS: on put we implicitly delete the FS cache and basically replace it with a new version. It worked like this before the refactoring too.
So yes, "Get v1 from cache" after "Put v2 into cache" should return `nil`. I'll update the PR with a relevant test.
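A minimal sketch of that replace-on-put behaviour, assuming a `<cacheDir>/<catalogName>/<resolvedRef>` directory layout; the type, field, and file names here are illustrative assumptions, not the exact code from this PR.

package cache

import (
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// filesystemCache is a stripped-down stand-in; the real type has more fields
// (mutex, error caching, etc.).
type filesystemCache struct {
	cacheDir string
}

// put replaces whatever is cached for catalogName with the new resolvedRef.
func (c *filesystemCache) put(catalogName, resolvedRef string, contents io.Reader) (fs.FS, error) {
	catalogDir := filepath.Join(c.cacheDir, catalogName)

	// Replace, don't accumulate: removing the whole catalog directory
	// implicitly drops any previously cached revision (v1) before the
	// new revision (v2) is written.
	if err := os.RemoveAll(catalogDir); err != nil {
		return nil, err
	}

	// Note: a real implementation would derive a filesystem-safe directory
	// name from resolvedRef.
	refDir := filepath.Join(catalogDir, resolvedRef)
	if err := os.MkdirAll(refDir, 0700); err != nil {
		return nil, err
	}

	// Illustration only: the real implementation unpacks the catalog
	// contents into multiple files.
	f, err := os.Create(filepath.Join(refDir, "catalog.json"))
	if err != nil {
		return nil, err
	}
	defer f.Close()
	if _, err := io.Copy(f, contents); err != nil {
		return nil, err
	}
	return os.DirFS(refDir), nil
}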
Had a comment about potentially adding another test case, but overall this refactor looks really good!
Force-pushed from 718473c to d548612 (Compare)
It is now possible to cache errors that occurred during cache population. This will be useful when we start proactively populating the cache from `ClusterCatalogReconciler` instead of doing this on-demand. This refactoring also moves the responsibility of performing network requests out of the cache backend into the client. Signed-off-by: Mikalai Radchuk <[email protected]>
Force-pushed from d548612 to c6b3b5d (Compare)
// the cached contents.
if cache, err := fsc.get(catalogName, resolvedRef); err == nil && cache != nil {
	// We only return here if there was no error during
	// the previous (likely concurrent) cache population attempt.
	// If there was an error - we want to try and populate the cache again.
	return cache, nil
}
Apologies for missing this the first time around, but do we actually still need this?
I think this was put in place for thread-safety logic that was more complex than we require now, since we are now doing thread-safe "Put" and "Get" operations. Refreshing my memory on the previous logic, it was done for the specific scenario of:
- We do a read-only lock to verify if the content has been cached before
- We release the read-only lock once we have verified that either we have or have not cached this content before
- Some other thread could now potentially get a write lock and write the contents for this catalog to the cache
- When we go to write content for the catalog to the cache, we get a write lock and re-verify whether or not it was written by some other process. We let it win because we don't know whether our revision of the content is newer or not.
I don't think the same scenario is possible here since we hold the lock for the entire duration of the `Put` method, and as such we can probably discard this initial get and ensure `Put` has the single responsibility of always writing the new cache.
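For reference, a generic sketch of the double-checked locking pattern that scenario describes: check under a read lock, release it, then re-verify after taking the write lock, because another goroutine may have populated the entry in between. This is an in-memory toy, not the code being discussed.

package cache

import "sync"

type memCache struct {
	mu      sync.RWMutex
	entries map[string][]byte
}

func (c *memCache) getOrFill(key string, fill func() ([]byte, error)) ([]byte, error) {
	// Fast path: check under a read lock and release it immediately.
	c.mu.RLock()
	v, ok := c.entries[key]
	c.mu.RUnlock()
	if ok {
		return v, nil
	}

	// Slow path: another goroutine may have written the entry between the
	// RUnlock above and the Lock below, so re-verify and let the earlier
	// writer win before doing our own (possibly redundant) fill.
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[key]; ok {
		return v, nil
	}
	v, err := fill()
	if err != nil {
		return nil, err
	}
	c.entries[key] = v
	return v, nil
}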
That being said, it may still be valid to say "we have successfully cached this revision before, so we will ignore a failed put for the same revision". I would lean more towards putting the responsibility for this verification on the caller though, to simplify the responsibility of these cache functions.
@everettraven it is still possible for two callers to do the following concurrently:
- Get cache - it is missing - we lock for reading
- Do an HTTP request - we do not lock this
- Write into FS - we lock this
So if this happens and we remove the get from `Put`, we end up with two redundant things: 1) the network request and 2) the FS I/O.
So if we remove this check we risk two sequential writes to the FS, which will block reads (from all catalogs) as well. With this check we ensure that only one FS write happens even if there were multiple concurrent attempts to fetch the cache.
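Roughly the flow being described, as a sketch; the interface and method signatures here are assumptions, but the comments mark where locking does and does not happen.

package catalogclient

import (
	"context"
	"io"
	"io/fs"
)

// Cache is a stand-in for the FS-backed cache; the real signatures may differ.
type Cache interface {
	Get(catalogName, resolvedRef string) (fs.FS, error)
	Put(catalogName, resolvedRef string, contents io.Reader) (fs.FS, error)
}

type client struct {
	cache Cache
	fetch func(ctx context.Context, resolvedRef string) (io.Reader, error)
}

func (c *client) getCatalogContents(ctx context.Context, catalogName, resolvedRef string) (fs.FS, error) {
	// 1. Cache lookup (read lock inside Get). Several callers can miss here
	//    at the same time.
	if fsys, err := c.cache.Get(catalogName, resolvedRef); err == nil && fsys != nil {
		return fsys, nil
	}

	// 2. Network fetch, intentionally done outside the cache lock, so
	//    concurrent callers may each perform a request.
	contents, err := c.fetch(ctx, resolvedRef)
	if err != nil {
		return nil, err
	}

	// 3. Put (as discussed above) holds the write lock for its whole duration
	//    and re-checks the cache first, so only the first concurrent writer
	//    actually performs the FS write.
	return c.cache.Put(catalogName, resolvedRef, contents)
}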
Right. I get that it would still be possible for that scenario to occur, but I'm questioning whether it should be the responsibility of the `Put` method to handle that, or the caller's. I would think the caller should take that into consideration as opposed to the `Put` method.
@everettraven if we make this the caller's responsibility then the caller will have to be responsible for locking the mutex as well. And we probably don't want to make locking the caller's responsibility.
Or am I still not getting the idea?
Had one comment about the responsibility of the `Put` method, but I don't think it should hold up this PR because:
- It is an internal package, we can modify it as much as we want
- It isn't any worse than what we had before
Description
It is now possible to cache errors that occurred during cache population. This will be useful when we start proactively populating the cache from `ClusterCatalogReconciler` instead of doing this on-demand.
This refactoring also moves the responsibility of performing network requests out of the cache backend into the client.
This is done in preparation for #1284
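A minimal sketch of what caching a population error could look like; the entry shape and field names are assumptions, not the exact types from this PR.

package cache

import "io/fs"

// cacheEntry records the outcome of the last population attempt for a catalog:
// either the unpacked catalog FS or the error that occurred while populating.
type cacheEntry struct {
	resolvedRef string
	fsys        fs.FS
	err         error // non-nil if the last population attempt failed
}

// Get returns whichever outcome was stored last, so a reconciler populating
// the cache proactively can surface the same error to later readers instead
// of every reader re-attempting the fetch.
func (e *cacheEntry) Get() (fs.FS, error) {
	return e.fsys, e.err
}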
Reviewer Checklist