Implement DfsBlockCache with Caffeine Cache #2
jiahuijiang wants to merge 5 commits into jj/new-cache
Conversation
jhoch-palantir left a comment:
Can you walk me through the code tomorrow? It's not super easy to follow. Also, I want to brainstorm a bit about size tracking.
packFileCache = Caffeine.newBuilder()
        .removalListener((DfsPackDescription description, DfsPackFile packFile, RemovalCause cause) ->
                packFile.close())
Technically the key and value are @Nullable. We're not using soft keys/values, but I would prefer to be null-safe. You may also want to consider logging to track removal with the cause and pack file.
Actually, packFile.close() is not needed; I'll remove it.
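For reference, a minimal sketch of a null-safe listener with removal logging (the `log` field and message are assumptions; a later revision in this PR adds something similar):

```java
// Sketch only: null-safe removal listener that logs the cause.
// Key and value are @Nullable in removal listeners.
packFileCache = Caffeine.newBuilder()
        .removalListener((DfsPackDescription description, DfsPackFile packFile, RemovalCause cause) -> {
            if (packFile != null) {
                log.debug("Pack file {} removed, cause: {}", packFile, cause);
            }
        })
        .build();
```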
    dfsBlockCache.invalidateAll();
}

private static final class DfsPackKeyWithPosition {
Your key doesn't have an equals or hashCode, which could cause problems if you try to look up a value with different (but equivalent) key instances.
Hmm, I think two objects of this class will never be equal to each other, since the pack key has an AtomicLong field. @ben-manes What do you mean by a similar key instance?
A hash map uses the key's equals and hashCode to locate and store an entry. If two equivalent but not equal keys are used, the map will treat them as distinct entries. A cache is built on a hash map, so these assertions should hold:

DfsPackKeyWithPosition key1 = new DfsPackKeyWithPosition(packKey, 100);
DfsPackKeyWithPosition key2 = new DfsPackKeyWithPosition(packKey, 100);
assert key1.equals(key2);
assert key1.hashCode() == key2.hashCode();
Ahh, thanks for the clarification!
FYI, DfsPackKey doesn't implement equals/hashCode either.
Yeah, I think that's because of the AtomicLong it contains, so here it has to be the same DfsPackKey object.
void cleanUp() {
    packFileCache.invalidateAll();
    dfsBlockCache.invalidateAll();
Might want to invoke cleanUp() on each of these caches as well, to free up resources immediately rather than on later cache accesses? See https://github.com/ben-manes/caffeine/wiki/Cleanup for context.
Ohh! Good to know! Updated.
invalidateAll does a clean-up, since there's nothing remaining in the cache.
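A sketch of the updated method under discussion; per the last comment, the explicit cleanUp() calls are redundant after invalidateAll(), but shown here for illustration:

```java
void cleanUp() {
    packFileCache.invalidateAll();
    dfsBlockCache.invalidateAll();
    // Optional: force any deferred maintenance (e.g. pending removal
    // notifications) to run now instead of on a later cache access.
    packFileCache.cleanUp();
    dfsBlockCache.cleanUp();
}
```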
        packFile.close())
    .maximumSize(cacheEntrySize)
    .expireAfterAccess(cacheConfig.getPackFileExpireSeconds(), TimeUnit.SECONDS)
    .recordStats()
If we're recording stats, we probably want to expose the recorded stats via an accessor method.
I think we want to use Tritium to track it, but it's not open-sourced yet? Will add a TODO to add it when we move this internally.
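A minimal sketch of such an accessor, assuming callers just want Caffeine's CacheStats snapshot (the method names are made up here):

```java
// Sketch: expose the recorded stats. stats() returns an immutable snapshot.
CacheStats packFileCacheStats() {
    return packFileCache.stats();
}

CacheStats dfsBlockCacheStats() {
    return dfsBlockCache.stats();
}
```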
dfsBlockCache = Caffeine.newBuilder()
    .maximumSize(cacheEntrySize)
    .expireAfterAccess(cacheConfig.getPackFileExpireSeconds(), TimeUnit.SECONDS)
    .recordStats()
Same here: if we're recording stats, we probably want to expose them via an accessor method.
 * <p>
 * The value for blockSize must be a power of 2.
 */
private final int blockSize;
We probably want to either check that blockSize is a power of 2 or bump it up to the next largest power of 2.
Added a check in the config. Will switch to Immutable + checkArgument when we move the code >_<
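A sketch of what such a check could look like, assuming it lives in the config's validation path (the method name and message are assumptions):

```java
// Sketch: reject any blockSize that is not a positive power of two.
// A positive power of two has exactly one bit set.
static int validateBlockSize(int blockSize) {
    if (blockSize <= 0 || Integer.bitCount(blockSize) != 1) {
        throw new IllegalArgumentException(
                "blockSize must be a positive power of 2, got: " + blockSize);
    }
    return blockSize;
}
```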
private final long maxStreamThroughCache;

/**
 * Suggested block size to read from pack files in.
}

DfsPackFile newPackFile = new DfsPackFile(this, description, key != null ? key : new DfsPackKey());
packFileCache.put(description, newPackFile);
Does it matter if multiple threads concurrently load the same description -> pack file?
Yeah... they may get different results if the packKey is different :/ Let me fix it.
I don't think this is fixed yet. Consider this sequence of line executions:
Thread A: 76
Thread B: 76
Thread A: 77, 81, 82, 83, 76, 77, 78
Thread B: 77, 81, 82, 83, 76, 77, 78
Can we use get(key, mappingFunction) instead? https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/Cache.java#L82
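A sketch of the suggested fix: letting Caffeine run the loader atomically so concurrent callers observe the same DfsPackFile (signature based on the getOrCreate method shown elsewhere in this PR):

```java
// Sketch: Cache.get(key, mappingFunction) computes the value at most once
// per key, so concurrent loads of the same description share one pack file.
DfsPackFile getOrCreate(DfsPackDescription description, DfsPackKey key) {
    return packFileCache.get(description, desc ->
            new DfsPackFile(this, desc, key != null ? key : new DfsPackKey()));
}
```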
Force-pushed from 8b1132e to 2f2ed77.
// weight is static after creation and update, so here we are relying on dfsBlockCache's removal
// listener to make sure the retained size of packFile won't exceed the given memory
long estimatedSize = 2048 + blockSize;
return estimatedSize > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) estimatedSize;
If we need to return an int here, let's make it so blockSize has to be <= Integer.MAX_VALUE/2?
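A sketch of that bound, assuming it's enforced at construction time (names are assumptions):

```java
// Sketch: cap blockSize so the weigher's arithmetic always fits in an int,
// which would make the saturating cast above unnecessary.
static int checkedBlockSize(int blockSize) {
    if (blockSize > Integer.MAX_VALUE / 2) {
        throw new IllegalArgumentException("blockSize too large: " + blockSize);
    }
    return blockSize;
}
```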
    .build();

dfsBlockCache = Caffeine.newBuilder()
    .removalListener((DfsPackKeyWithPosition keyWithPosition, Ref ref, RemovalCause cause) -> ref = null)
What references are we trying to free here? I don't think ref = null does anything if it's already being removed from the cache, right?
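(In Java the lambda parameter is a local copy of the reference, so `ref = null` only reassigns that local; it frees nothing. A sketch of a listener that does real work, mirroring the cachedSize bookkeeping discussed later in this thread:)

```java
// Sketch only: undo the size accounting done when the block was loaded.
// Key and value are @Nullable in removal listeners.
dfsBlockCache = Caffeine.newBuilder()
        .removalListener((DfsPackKeyWithPosition keyWithPosition, Ref ref, RemovalCause cause) -> {
            if (keyWithPosition != null && ref != null) {
                keyWithPosition.getDfsPackKey().cachedSize.getAndAdd(-ref.getSize());
            }
        })
        .build();
```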
.removalListener((DfsPackDescription description, DfsPackFile packFile, RemovalCause cause) -> {
    if (packFile != null) {
        log.debug("PackFile {} is removed because it {}", packFile.getPackName(), cause);
        packFile.key.cachedSize.set(0);
Is cachedSize used after the packFile is removed? It feels weird that we're setting it here but not using it in the weigher.
Updated with the one-cache + two-maps approach.
dfsBlockCache = Caffeine.newBuilder()
    .removalListener((DfsPackKeyWithPosition keyWithPosition, Ref ref, RemovalCause cause) -> ref = null)
    .maximumWeight(cacheConfig.getCacheMaximumSize() / 2)
    .weigher((DfsPackKeyWithPosition keyWithPosition, Ref ref) -> ref == null ? 48 : 48 + ref.getSize())
With this and the earlier line, 48 and 2048 look like magic constants. Can we make this clearer?
Updated with comments.
dfsBlockCache = Caffeine.newBuilder()
    .removalListener((DfsPackKeyWithPosition keyWithPosition, Ref ref, RemovalCause cause) -> ref = null)
    .maximumWeight(cacheConfig.getCacheMaximumSize() / 2)
Are the two caches going to be the same size? It seems weird to split 50/50.
private static final class DfsPackKeyWithPosition {
    private DfsPackKey dfsPackKey;
    private long position;
These should be final (especially if used in hashCode/equals).
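A sketch of the key class with both review threads applied: final fields plus equals/hashCode. Since DfsPackKey has no equals/hashCode of its own, identity comparison on it is intentional per the earlier discussion:

```java
private static final class DfsPackKeyWithPosition {
    private final DfsPackKey dfsPackKey;
    private final long position;

    DfsPackKeyWithPosition(DfsPackKey dfsPackKey, long position) {
        this.dfsPackKey = dfsPackKey;
        this.position = position;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof DfsPackKeyWithPosition)) {
            return false;
        }
        DfsPackKeyWithPosition other = (DfsPackKeyWithPosition) obj;
        // Identity comparison: DfsPackKey has no equals/hashCode, so two
        // keys match only if they wrap the same DfsPackKey instance.
        return dfsPackKey == other.dfsPackKey && position == other.position;
    }

    @Override
    public int hashCode() {
        return 31 * System.identityHashCode(dfsPackKey) + Long.hashCode(position);
    }
}
```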
Ref<DfsBlock> loadedBlockRef = dfsBlockCache.get(new DfsPackKeyWithPosition(key, position), keyWithPosition -> {
    try {
        DfsBlock loadedBlock = pack.readOneBlock(keyWithPosition.getPosition(), dfsReader);
        key.cachedSize.getAndAdd(loadedBlock.size());
Does the weigher get called here? If not, is there any way to trigger it?
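For what it's worth, Caffeine evaluates the weigher when the computed value is inserted, so the weight reflects ref.getSize() at load time; it is re-evaluated only if the entry is written again. A sketch (updatedRef is hypothetical):

```java
// Sketch: writing a new value for an existing key causes Caffeine to
// re-run the weigher for that entry.
dfsBlockCache.asMap().replace(keyWithPosition, updatedRef);
```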
@@ -0,0 +1,206 @@
package org.eclipse.jgit.internal.storage.dfs;

import com.github.benmanes.caffeine.cache.*;
Discussed with @jhoch-palantir offline.
Force-pushed from 8d0ef5a to e642e73.
DO NOT MERGE. Note: this should live in Stemma; it's in jgit for now to make it easier to review.
blockSize = cacheConfig.getBlockSize();

packFileCache = new ConcurrentHashMap<>(16, 0.75f, 1);
reversePackDescriptionIndex = new ConcurrentHashMap<>(16, 0.75f, 1);
I think we can use the default constructor here.
    // key, value reference 8 * 2 bytes
    .weigher((DfsPackKeyWithPosition keyWithPosition, Ref ref) -> ref == null ? 60 : 60 + ref.getSize())
    .recordStats()
    .build();
Let's move the 60 into a constant and move the documentation there.
I really like the formatting on this comment...
I think we should make some effort to account for, or at least document, the memory taken up by the other two maps, e.g. this cache will take ~<= cacheConfig.getCacheMaximumSize() + X MB.
(Even if X is in terms of the number of pack files, that's valuable; then later we can reason about things in terms of pack files, not bytes.)
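A sketch combining both suggestions: name the overhead constant and document the overall memory bound next to it (the constant name and doc wording are assumptions):

```java
/**
 * Estimated fixed overhead per cache entry: key and value references
 * (8 * 2 bytes) plus per-entry bookkeeping. With the weigher above, the
 * cache holds at most ~cacheConfig.getCacheMaximumSize() bytes of blocks,
 * plus a small per-pack-file cost for packFileCache and
 * reversePackDescriptionIndex.
 */
private static final int ENTRY_OVERHEAD_BYTES = 60;
```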
DfsPackKey key = keyWithPosition.getDfsPackKey();
long position = keyWithPosition.getPosition();

if (position < 0) {
private final Map<DfsPackDescription, DfsPackFile> packFileCache;

/** Reverse index from DfsPackKey to the DfsPackDescription. */
private final Map<DfsPackKey, DfsPackDescription> reversePackDescriptionIndex;
This is 1:1? Do we need any invariant checks for this?
private final int blockSize;

/** Cache of pack files, indexed by description. */
private final Map<DfsPackDescription, DfsPackFile> packFileCache;
    return blockSize;
}

// do something when the block is invalid
Out of date; deleting.
key.cachedSize.set(0);
}
// TODO: release all the blocks cached for this pack file too
// right now those refs are not accessible anymore and will be evicted by caffeine cache eventually
Did you mean to implement this?
I don't see this causing a big problem... but it would be nice to have as an improvement soon.
    return length <= maxStreamThroughCache;
}

DfsPackFile getOrCreate(DfsPackDescription description, DfsPackKey key) {
I feel like there's an edge case where I could just call getOrCreate(...) a bunch of times and never actually load anything into the cache, and the two maps would never get cleared out
These entries should be tiny (<1 KB each), and if we clear the whole cache object periodically it shouldn't be a problem. But we should still add that as a TODO at least...
Ref<DfsBlock> loadedBlockRef = dfsBlockAndIndicesCache.get(new DfsPackKeyWithPosition(key, position), keyWithPosition -> {
    try {
        DfsBlock loadedBlock = pack.readOneBlock(keyWithPosition.getPosition(), dfsReader);
        key.cachedSize.getAndAdd(loadedBlock.size());
Do we need to "update" the cache here because the weight has changed?
The cache entry won't get "reloaded", I believe. Here cachedSize is used to keep track of whether all the loaded blocks have been evicted.
Let's make sure that this use of "cachedSize" is documented. I think the caching/memory strategy being laid out in a top-level class comment would be sensible. You + I have chatted offline about stuff a bunch and it would be good to make sure that's not lost :D
if (keyWithPosition.position >= 0) {
    keyWithPosition.getDfsPackKey().cachedSize.getAndAdd(size);
}
return new Ref(keyWithPosition.getDfsPackKey(), keyWithPosition.getPosition(), size, value);
Do we have guarantees that this method is only called if this wasn't present in the map? I'm worried about something getting put in the map twice and cachedSize getting incremented twice (same question with getOrLoad above)
Yes, this is guaranteed.
(And even if it's not: when the old value gets removed, it's treated as being evicted, and cachedSize will be decreased in the removalListener.)
if (pack != null) {
    DfsPackKey key = pack.key;
    cleanUpIndicesIfExists(key);
    key.cachedSize.set(0);
DO NOT MERGE. Note: this should live in the project that wants to pass this implementation in; it's here for now to make it easier to review.