This repository was archived by the owner on Jul 13, 2018. It is now read-only.

Implement DfsBlockCache with Caffeine Cache #2

Open

jiahuijiang wants to merge 5 commits into jj/new-cache from jj/caffeine-implementation

Conversation

jiahuijiang (Owner) commented Feb 17, 2017

DO NOT MERGE

Note: This should live in the project that wants to pass this implementation in; it's here for now to make it easier to review.

  • Used the Caffeine version as the default and passed unit tests

jhoch-palantir left a comment
Can you walk me through the code tomorrow? It's not super easy to follow. I also want to brainstorm a bit about size tracking.


packFileCache = Caffeine.newBuilder()
        .removalListener((DfsPackDescription description, DfsPackFile packFile, RemovalCause cause) ->
                packFile.close())


this line seems off to me


Technically the key and value are @Nullable. We're not using soft keys/values, but it would be better to be null-safe. We may also want to consider logging removals with the cause and pack file.
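For illustration, a null-safe listener with removal logging might look like this (a sketch based on the snippet above; the `log` field and the surrounding assignment are assumptions):

```java
packFileCache = Caffeine.newBuilder()
        .removalListener((DfsPackDescription description, DfsPackFile packFile, RemovalCause cause) -> {
            // Both key and value are @Nullable, so guard before dereferencing.
            if (packFile != null) {
                log.debug("Pack file for {} removed, cause: {}", description, cause);
                packFile.close();
            }
        })
        .build();
```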

jiahuijiang (Owner, Author)

Actually, packFile.close is not needed; I'll remove it.

dfsBlockCache.invalidateAll();
}

private static final class DfsPackKeyWithPosition {


Your key doesn't have an equals or hashCode, which could cause problems if you try to look up a value with different (but equivalent) key instances.

jiahuijiang (Owner, Author)

Hmmm, I think two objects of this class will never be equal to each other, since the pack key has an AtomicLong field.
@ben-manes What do you mean by a similar key instance?

ben-manes commented Feb 17, 2017

A hash map uses the key's equals and hashCode to locate and store an entry. If two equivalent but not equal keys are used, the map will treat them as pointing to distinct entries. A cache is built on a hash map.

DfsPackKeyWithPosition key1 = new DfsPackKeyWithPosition(packKey, 100);
DfsPackKeyWithPosition key2 = new DfsPackKeyWithPosition(packKey, 100);
assert key1.equals(key2);
assert key1.hashCode() == key2.hashCode();
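A minimal sketch of the missing equals/hashCode is below. Since DfsPackKey itself has no equals/hashCode, identity equality on the key is assumed here, and a plain Object stands in for DfsPackKey:

```java
import java.util.Objects;

// Sketch: value-based equality over (dfsPackKey, position).
// DfsPackKey is stood in for by Object, relying on identity equality.
final class DfsPackKeyWithPosition {
    private final Object dfsPackKey;
    private final long position;

    DfsPackKeyWithPosition(Object dfsPackKey, long position) {
        this.dfsPackKey = dfsPackKey;
        this.position = position;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof DfsPackKeyWithPosition)) {
            return false;
        }
        DfsPackKeyWithPosition other = (DfsPackKeyWithPosition) o;
        return position == other.position && dfsPackKey.equals(other.dfsPackKey);
    }

    @Override
    public int hashCode() {
        return Objects.hash(dfsPackKey, position);
    }
}
```

With this in place, two instances built from the same pack key and position satisfy the assertions in the comment above.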

jiahuijiang (Owner, Author)

Ahh, thanks for the clarification!


FYI, DfsPackKey doesn't implement equals/hashCode either.

jiahuijiang (Owner, Author) commented Feb 17, 2017

Yeah, I think that's because of the AtomicLong it contains. So here it has to be the same DfsPackKey object.


void cleanUp() {
packFileCache.invalidateAll();
dfsBlockCache.invalidateAll();


We might want to invoke cleanUp on each of these caches as well, to free up resources immediately rather than on later cache accesses. See https://github.com/ben-manes/caffeine/wiki/Cleanup for context.
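Under that suggestion, the method might become something like this (a sketch against the fields shown in this diff; as noted below, invalidateAll already performs a cleanup, so the extra calls may be redundant):

```java
void cleanUp() {
    packFileCache.invalidateAll();
    dfsBlockCache.invalidateAll();
    // Run any pending maintenance work now, rather than deferring it
    // to piggyback on later cache accesses.
    packFileCache.cleanUp();
    dfsBlockCache.cleanUp();
}
```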

jiahuijiang (Owner, Author)

Ohh, good to know! Updated.


invalidateAll does a clean-up, since there's nothing remaining in the cache.

packFile.close())
.maximumSize(cacheEntrySize)
.expireAfterAccess(cacheConfig.getPackFileExpireSeconds(), TimeUnit.SECONDS)
.recordStats()


If we're recording stats, we probably want to expose the recorded stats via an accessor method.
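One way to do that (a sketch; the method name is hypothetical):

```java
/** Snapshot of hit/miss/eviction statistics for the pack file cache. */
CacheStats getPackFileCacheStats() {
    // Caffeine's Cache.stats() returns an immutable snapshot,
    // available because recordStats() was set on the builder.
    return packFileCache.stats();
}
```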

jiahuijiang (Owner, Author)

I think we want to use Tritium to track it, but it's not open sourced yet. I'll add a TODO to add it when we move this internally.

dfsBlockCache = Caffeine.newBuilder()
.maximumSize(cacheEntrySize)
.expireAfterAccess(cacheConfig.getPackFileExpireSeconds(), TimeUnit.SECONDS)
.recordStats()


if we're recording stats, probably want to expose the recorded stats via an accessor method



* <p>
* The value for blockSize must be a power of 2.
*/
private final int blockSize;


We probably want to either check that blockSize is a power of 2 or bump it up to the next largest power of 2.
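The bump-up option is a one-liner; it could be sketched like this (a hypothetical helper, not from the PR):

```java
// Hypothetical helper: round a positive block size up to the next power of two.
class BlockSizes {
    static int ceilPowerOfTwo(int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        int highest = Integer.highestOneBit(blockSize);
        // Already a power of two? Keep it; otherwise double the highest set bit.
        return highest == blockSize ? blockSize : highest << 1;
    }
}
```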

jiahuijiang (Owner, Author)

Added a check in the config. I'll switch to Immutables + checkArgument when we move the code >_<

private final long maxStreamThroughCache;

/**
* Suggested block size to read from pack files in.


block size in bytes?

}

DfsPackFile newPackFile = new DfsPackFile(this, description, key != null ? key : new DfsPackKey());
packFileCache.put(description, newPackFile);


does it matter if multiple threads concurrently load the same description -> pack file?

jiahuijiang (Owner, Author)

Yeah... they may get different results if the packKey is different :/ Let me fix it.


I don't think this is fixed yet. Consider this sequence of line executions:

Thread A: 76
Thread B: 76
Thread A: 77, 81, 82, 83, 76, 77, 78
Thread B: 77, 81, 82, 83, 76, 77, 78

Can we use get(key, mappingFunction) instead? https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/Cache.java#L82
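With Cache.get(key, mappingFunction), the check-then-put race goes away, because Caffeine computes the mapping function at most once per absent key. A sketch against the method shown in this diff:

```java
DfsPackFile getOrCreate(DfsPackDescription description, DfsPackKey key) {
    // The mapping function runs atomically for an absent key, so
    // concurrent callers observe the same DfsPackFile instance.
    return packFileCache.get(description, desc ->
            new DfsPackFile(this, desc, key != null ? key : new DfsPackKey()));
}
```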

@jiahuijiang jiahuijiang force-pushed the jj/caffeine-implementation branch from 8b1132e to 2f2ed77 Compare February 21, 2017 02:40
// weight is static after creation and update, so here we are relying on dfsBlockCache's removal
// listener to make sure the retained size of packFile won't exceed the given memory
long estimatedSize = 2048 + blockSize;
return estimatedSize > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) estimatedSize;


If we need to return an int here, let's require that blockSize be <= Integer.MAX_VALUE / 2?

.build();

dfsBlockCache = Caffeine.newBuilder()
.removalListener((DfsPackKeyWithPosition keyWithPosition, Ref ref, RemovalCause cause) -> ref = null)


What references are we trying to free here? I don't think ref = null does anything if the entry is already being removed from the cache, right?

.removalListener((DfsPackDescription description, DfsPackFile packFile, RemovalCause cause) -> {
if (packFile != null) {
log.debug("PackFile {} is removed because it {}", packFile.getPackName(), cause);
packFile.key.cachedSize.set(0);


Is cachedSize used after the packFile is removed? It feels weird that we're setting it here but not using it in the weigher.

jiahuijiang (Owner, Author)

Updated with the one-cache-plus-two-maps approach.

dfsBlockCache = Caffeine.newBuilder()
.removalListener((DfsPackKeyWithPosition keyWithPosition, Ref ref, RemovalCause cause) -> ref = null)
.maximumWeight(cacheConfig.getCacheMaximumSize() / 2)
.weigher((DfsPackKeyWithPosition keyWithPosition, Ref ref) -> ref == null? 48 : 48 + ref.getSize())


With this and the line above, 48 and 2048 look like magic constants. Can we make this clearer?

jiahuijiang (Owner, Author)

Updated with comments.
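For reference, naming the estimated overheads might look like this (a sketch; the constant name and byte estimate are assumptions):

```java
/** Rough JVM overhead of a cached Ref entry (object headers, key, references). */
private static final int REF_ENTRY_OVERHEAD_BYTES = 48;

dfsBlockCache = Caffeine.newBuilder()
        .maximumWeight(cacheConfig.getCacheMaximumSize() / 2)
        .weigher((DfsPackKeyWithPosition keyWithPosition, Ref ref) ->
                ref == null ? REF_ENTRY_OVERHEAD_BYTES : REF_ENTRY_OVERHEAD_BYTES + ref.getSize())
        .build();
```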


dfsBlockCache = Caffeine.newBuilder()
.removalListener((DfsPackKeyWithPosition keyWithPosition, Ref ref, RemovalCause cause) -> ref = null)
.maximumWeight(cacheConfig.getCacheMaximumSize() / 2)


Are the two caches going to be the same size? A 50/50 split seems weird.


private static final class DfsPackKeyWithPosition {
private DfsPackKey dfsPackKey;
private long position;


these should be final

(especially if used in hashCode/equals)

Ref<DfsBlock> loadedBlockRef = dfsBlockCache.get(new DfsPackKeyWithPosition(key, position), keyWithPosition -> {
try {
DfsBlock loadedBlock = pack.readOneBlock(keyWithPosition.getPosition(), dfsReader);
key.cachedSize.getAndAdd(loadedBlock.size());


Does the weigher get called here? If not is there any way to trigger it?

@@ -0,0 +1,206 @@
package org.eclipse.jgit.internal.storage.dfs;

import com.github.benmanes.caffeine.cache.*;


Can we import explicitly?
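Replacing the wildcard might look like this (a sketch; the exact set of Caffeine types the class uses is an assumption):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;
```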


jiahuijiang (Owner, Author) commented:
Discussed with @jhoch-palantir offline.
For better size estimation, we are only using a Caffeine cache for the key+position -> Ref mapping. The weight is updated every time a ref is inserted, removed, or updated.
When a ref is removed, we use a reverse index to find the packFile that the ref belongs to. If the ref points to an index, the whole packFile is removed. If it points to a file block, cachedSize is decreased. When cachedSize reaches zero, the packFile is removed from our index maps.

@jiahuijiang jiahuijiang force-pushed the jj/caffeine-implementation branch 4 times, most recently from 8d0ef5a to e642e73 Compare February 22, 2017 19:53
blockSize = cacheConfig.getBlockSize();

packFileCache = new ConcurrentHashMap<>(16, 0.75f, 1);
reversePackDescriptionIndex = new ConcurrentHashMap<>(16, 0.75f, 1);


Think we can use the default constructor

// key, value reference 8 * 2 bytes
.weigher((DfsPackKeyWithPosition keyWithPosition, Ref ref) -> ref == null? 60 : 60 + ref.getSize())
.recordStats()
.build();


Let's move the 60 into a constant and move the documentation there.

I really like the formatting on this comment...


I think we should make some effort to account for, or at least document, the memory taken up by the other two maps, e.g. "this cache will take roughly <= cacheConfig.getCacheMaximumSize() + X MB".


(even if X is in terms of the number of pack files, that's valuable. Then later we can reason about things in terms of pack files, not bytes)

DfsPackKey key = keyWithPosition.getDfsPackKey();
long position = keyWithPosition.getPosition();

if (position < 0) {


how can this be less than 0?

jiahuijiang (Owner, Author)

That's for indices.

private final Map<DfsPackDescription, DfsPackFile> packFileCache;

/** Reverse index from DfsPackKey to the DfsPackDescription. */
private final Map<DfsPackKey, DfsPackDescription> reversePackDescriptionIndex;


this is 1-1? do we need any invariant checks for this?

jiahuijiang (Owner, Author)

Yep this is 1-1

private final int blockSize;

/** Cache of pack files, indexed by description. */
private final Map<DfsPackDescription, DfsPackFile> packFileCache;


29,30s/Cache/Map

return blockSize;
}

// do something when the block is invalid


is this outstanding?

jiahuijiang (Owner, Author)

Out of date; deleting.

key.cachedSize.set(0);
}
// TODO: release all the blocks cached for this pack file too
// right now those refs are not accessible anymore and will be evicted by caffeine cache eventually


Did you mean to implement this?

jiahuijiang (Owner, Author)

I don't see this causing a big problem... but it would be nice to have as an improvement soon.

return length <= maxStreamThroughCache;
}

DfsPackFile getOrCreate(DfsPackDescription description, DfsPackKey key) {


I feel like there's an edge case where I could just call getOrCreate(...) a bunch of times and never actually load anything into the cache, and the two maps would never get cleared out

jiahuijiang (Owner, Author)

These entries should be tiny (<1 KB per entry), and if we clear the whole cache object periodically it shouldn't be a problem. But we should still add that as a TODO at least...

Ref<DfsBlock> loadedBlockRef = dfsBlockAndIndicesCache.get(new DfsPackKeyWithPosition(key, position), keyWithPosition -> {
try {
DfsBlock loadedBlock = pack.readOneBlock(keyWithPosition.getPosition(), dfsReader);
key.cachedSize.getAndAdd(loadedBlock.size());


do we need to "update" the cache here because the weight has changed?

jiahuijiang (Owner, Author)

The cache entry won't get "reloaded", I believe. Here cachedSize is used to keep track of whether all the loaded blocks have been evicted.


Let's make sure this use of cachedSize is documented. Laying out the caching/memory strategy in a top-level class comment would be sensible. You and I have chatted offline about this a bunch, and it would be good to make sure that's not lost :D

if (keyWithPosition.position >= 0) {
keyWithPosition.getDfsPackKey().cachedSize.getAndAdd(size);
}
return new Ref(keyWithPosition.getDfsPackKey(), keyWithPosition.getPosition(), size, value);


Do we have guarantees that this method is only called if this wasn't present in the map? I'm worried about something getting put in the map twice and cachedSize getting incremented twice (same question with getOrLoad above)

jiahuijiang (Owner, Author)

Yes, this is guaranteed.
(And even if it weren't, when the old value gets removed it's treated as being evicted, and cachedSize is decreased in the removalListener.)

if (pack != null) {
DfsPackKey key = pack.key;
cleanUpIndicesIfExists(key);
key.cachedSize.set(0);


Switch 169 and 170?
