L2 Tagging hashset not always present. #552

jrlost · 2025-11-06T17:01:24Z

jrlost
Nov 6, 2025

A little background, we've been slowly transitioning our apps over towards using fusioncache and recently a few of them landed on production. One of the apps is a large multi-tenant solution where we rely on tagging for our tenant level cache invalidation. With this release, we've noticed an increase in calls to the database for things that appear to be in L2 and not stale; things I'd expect to have been used.

While digging into the issue, I spotted something strange, the __fc:t:TAGNAME object was not in Redis (what we're using for L2).
After triggering a RemoveByTag call on the tag, it then started showing (and, as expected, has stuck around). We are primarily using GetOrSetAsync when capturing these, and in each place we're actively passing in the tags.

Example weird state:
HGET "sf-v2:TENANTIDENTIFIER:siteredirects" data
"{\"Value\":[],\"Timestamp\":638980440162818275,\"LogicalExpirationTimestamp\":638980442562818275,\"Tags\":[\"TENANTIDENTIFIER\"],\"Metadata\":{\"IsStale\":false,\"EagerExpirationTimestamp\":638980441122818275,\"Size\":1}}"

HGET "sf-v2:__fc:t:TENANTIDENTIFIER" data
(nil)

We're running this application in multiple nodes and relying on the backplane to keep things in sync. I suspect that if this tag hashset is missing for L2 that the app instances will start to deviate and it's possible each instance could have a different source of truth with regards to the tag expiration.

Originally I had suspected this was an issue with our redis instance. We use volitile_lru for evictions, but when evaluating our cluster, INFO shows there had been zero evictions, suggesting this object just never made it to Redis at all.

Another thought I had was that perhaps there was an exception or timeout on write to L2, but logs (currently have log level set to WARNING) show no errors related to fusioncache.

I brought this up here instead of as an issue primarily because I have no means to replicate it so at this point, my evidence is just the handful of instances like the example above. It's all anecdotal at this time, but something has resulted in L2 losing (or never getting) the tag hashset. I will keep looking, but wanted to pick brains for other things that could have resulted in this state.

one more data point, this app is running in >12 instances and doing >1 million caching operations/sec, so it's definitely not outside the realm of possibility that it's a contention issue, memory issue, heap issue; we've seen other weird behaviors happen at this scale in the past.

jodydonetti · 2025-11-07T01:05:29Z

jodydonetti
Nov 7, 2025
Maintainer

Hi @jrlost , hashsets are not used at all in FusionCache.

For more on the particular design, take a look at the related docs.

This design basically scales to infinity so even if you have, say, 1M entries all tagged with a certain tag:

there will not be a hashset with 1M entries
when calling RemoveByTag() it will not need to massively delete 1M entries and the observable result will be instantaneous

Hope this helps, let me know.

4 replies

jrlost Nov 7, 2025
Author

Hi @jrlost , hashsets are not used at all in FusionCache.

For more on the particular design, take a look at the related docs.

This design basically scales to infinity so even if you have, say, 1M entries all tagged with a certain tag:

there will not be a hashset with 1M entries

when calling RemoveByTag() it will not need to massively delete 1M entries and the observable result will be instantaneous

Hope this helps, let me know.

I think you misinterpreted what I said. The hashsets I am referring to are Redis hashes (note the HGET above), which we're using for L2.

We ARE using FusionCache currently.
We are using L1 and L2 with Tagging.
We are using GetOrSetAsync for all of our caching.
We are seeing cases where stuff is tagged BUT the tagging object/entry __fc:t:TAGNAME is not present in L2. It's possible this is impacting ALL cached objects, but I'm only seeing it because these tag objects have such a large default TTL.
We are seeing ZERO evictions in our Redis cluster; meaning Redis didn't remove this with it's eviction policies (which are set to volitile-lru).
The million operations/sec are against L2. The per-instance L1 operations are orders of magnitude higher. I was offering this up as a data point because 1. thread limitations do exist 2. With Redis, even when using StackExchangeRedis, there's only so much the multiplexer can do and historically we've seen this very same library(StackExchangeRedis) queue work. I'm offering all of this as a data point; this is to say that perhaps there's a case where the L2 propagation isn't happening, is timing out (but not logging), is in an open circuit breaker, whatever....

I brought this up as a discussion in hopes of rubber ducking or bouncing ideas off of you (or others) to try to figure out

Why this might be happening.
What the impacts of it might be.

I can certainly scour the FusionCache L2 persistence code, I've dug through much of the tag related code, but didn't find anything that would indicate why this isn't making it to Redis; but I figured I'd drop this here in hopes that someone who is more intimately familiar with the code may have ideas of their own.

jodydonetti Nov 13, 2025
Maintainer

Hi @jrlost

I think you misinterpreted what I said. The hashsets I am referring to are Redis hashes (note the HGET above), which we're using for L2.

Ok, question: how are you using hashes on Redis? Are you work with them directly (for example via StackExchange.Redis) or in some way via FusionCache directly?

I'm trying to better understand the situation.

We ARE using FusionCache currently.

We are using L1 and L2 with Tagging.

We are using GetOrSetAsync for all of our caching.

Related to the question above: only GetOrSetAsync or also?

We are seeing cases where stuff is tagged BUT the tagging object/entry __fc:t:TAGNAME is not present in L2.

Yes, this is correct: when you save an entry with a tag, nothing is created for that tag, only the tagged entry itself.

The entry for the tag itself is created only when calling RemoveByTag("my-tag"), because the entry will store the timestamp of when the operation happened.

From there on, any "get" operation (including the "get" part of a GetOrSet call) will do a sort of high-pass filter, checking the potential timestamp of each tag.

This is why I asked if you read the Tagging docs that explains the design, it's pretty peculiar.

Maybe you were instead expecting that, when saving an entry tagged with the tag "my-tag", a corresponding entry would be created in Redis to keep track of all the entries tagged with that tag, but this is not the case with FusionCache.

The approach of having an entry for the tag which includes the list of cache keys tagged with that tag is a relatively common approach (I think the MS implementation of output caching for Redis is using it), but the problem is that this approach does not scale well (imho).

Mainly for 2 reasons:

a single entry (for the tag itself) may end up containing a ton of "references" to other entries (meaning: their cache keys). I worked on systems with a single distributed cache instance containing 5M+ entries, of which 500K+ were tagged with a certain tag: imagine having one entry containing a list of 500K cache keys
when calling RemoveByTag("my-tag") this approach would then read the entry for the tag (to get the list of cache keys tagged with it), and then foreach over them and execute a Remove(key) or something like that: imagine doing this for 500K entries, it would destroy your Redis instance and it would take forever. Also, there would probably be timing issues, because from T0 (call to RemoveByTag("my-tag")) to T1 (when it finishes removing the entries) some of the entries may have been very well updated already, leading to removal of fresh new data

Also, to avoid conflict with updating a "normal" entry (of type Redis STRING), this approach usually requires using a Redis SET, and that is not supported via the standard IDistributedCache abstraction, meaning we would lose the ability to use any IDistributedCache implementation for the L2.

Because of these (and other) reasons, I came up with a totally different approach/design for FusionCache, one that basically requires a single O(1) operation when calling RemoveByTag("my-tag"), and spread the remaining work in space and time, adaptively based on concrete usage of the related entries.

I also employed a bunch of other optimizations so that the work needed is as little (and fast) as possible.

It's possible this is impacting ALL cached objects, but I'm only seeing it because these tag objects have such a large default TTL.

Yes they use a large TTL because they are used to keep track of RemoveByTag() operations, which is useful when (later on) accessing data in the cache to check if that data is still valid or not (the "high-pass filter" part I mentioned above).

So they need to stay in the cache for a longer time, to be sure a future read of old data is not served, if its timestamp is lower than the RemoveByTag() operation.

The million operations/sec are against L2. The per-instance L1 operations are orders of magnitude higher.

Wow, that's brutal (in a positive way 🙂).

If you are up for a quick chat I'd like to know more about it (of course without violating privacy, NDAs, etc).

Let me know.

I was offering this up as a data point because 1. thread limitations do exist 2. With Redis, even when using StackExchangeRedis, there's only so much the multiplexer can do and historically we've seen this very same library(StackExchangeRedis) queue work.

Yup, been there, I know the pain (sometimes).

I'm offering all of this as a data point; this is to say that perhaps there's a case where the L2 propagation isn't happening, is timing out (but not logging), is in an open circuit breaker, whatever....

That would be strange, because (if the right log level is enabled) everything is logged.

I brought this up as a discussion in hopes of rubber ducking or bouncing ideas off of you (or others) to try to figure out

Why this might be happening.

What the impacts of it might be.

You did right, and thanks for sharing!

Again, if you like I'd have a chat to better figure things out.

I can certainly scour the FusionCache L2 persistence code, I've dug through much of the tag related code, but didn't find anything that would indicate why this isn't making it to Redis; but I figured I'd drop this here in hopes that someone who is more intimately familiar with the code may have ideas of their own.

Well I wrote all the FusionCache code, so probably I'm the right person to try to figure it out 😀

Thanks, and let me know about the chat.

jrlost Nov 13, 2025
Author

Wow, didn't have "Jody writing me a book" on my bingo card for the day. 💯

I'll start by adding a little more context, then I'll dive into some of your questions.

When we transitioned over from our old caching solution (I'll cover that a bit more later) to FusionCache, there was a lot of scrutiny placed on the application performance when we deployed this code into production. As mentioned earlier, this is a massive multi-tenant application servicing ~50-60k tenants so we take changes to performance seriously. When we deployed to production, the first thing we started to notice was certain things that were behind the factory being passed into GetOrSetAsync were occasionally getting hit when we'd expect the object to be in both L1 and L2. So this started me down the road of trying to figure out why some objects are not using cache.

At the time of deployment, we had the OTEL activities getting captured in our Elastic APM cluster and could see the attempt to pull the object from L1 with the expected cache key, only for it to not capture the object and instead jump into the factory. The object in question was a frequently used object in the request pipeline, so when the object was requested later on in a future request (milliseconds to seconds later) and it still jumped into the factory, we started digging even deeper. The first things we noticed was the missing tag entry in L2 and given what you've suggested, it sounds like this was a red herring. What's curious is, as soon as we did call RemoveByTag, suddenly all of the objects for this tenant began getting loaded into L2 and the app stopped jumping into the factory on each GetOrSetAsync call.
Note, I keep saying this object/the object, it was nearly all objects for the tenant showing this behavior; the frequently used object referenced though is equivalent to the tenant "configuration". Also, given this is a large scale application, we don't run with info/debug logs on, so all I can say is that at the time, we had no related exceptions coming from the FusionCache logger.

I apologize for the word salad above; I just thought it might help to put a bit more perspective around this. On to trying to answer some of your questions.

Ok, question: how are you using hashes on Redis? Are you work with them directly (for example via StackExchange.Redis) or in some way via FusionCache directly?

With the update to FusionCache, we're only using FusionCache to read from/write to Redis via the RedisCache distributed cache provider w/ backplane; we are not creating these hashes ourselves nor are we managing Redis objects outside of FusionCache. We do query Redis periodically when diagnosing issues, like the HGETs mentioned above.

Related to the question above: only GetOrSetAsync or also?

We are using GetOrSetAsync in all but one case, that one case is using GetOrDefaultAsync with a separate SetAsync. I will add that, when we were digging into the missing objects in L2, the only objects that were present for these tenants at the time were the ones created via this GetOrDefaultAsync/SetAsync pair.

Yes, this is correct: when you save an entry with a tag, nothing is created for that tag, only the tagged entry itself.

The entry for the tag itself is created only when calling RemoveByTag("my-tag"), because the entry will store the timestamp of when the operation happened.

From there on, any "get" operation (including the "get" part of a GetOrSet call) will do a sort of high-pass filter, checking the potential timestamp of each tag.

This is why I asked if you read the Tagging docs that explains the design, it's pretty peculiar.

Maybe you were instead expecting that, when saving an entry tagged with the tag "my-tag", a corresponding entry would be created in Redis to keep track of all the entries tagged with that tag, but this is not the case with FusionCache.

I've read the documentation several times and can say that I must have missed:

The entry for the tag itself is created only when calling RemoveByTag("my-tag"), because the entry will store the timestamp of when the operation happened.

Regardless, I can honestly say that my perspective may have been skewed by our previous cache implementation.
We've struggled with cache invalidation at scale for many years and the solution we came up with many many years ago was very similar to your approach. Effectively, we had maintained a hash for each tenant where stored a guid. We'd store this hashset (guid for each tenant) in both L1 and L2 and we would use the guid as part of our cache key. When invalidating the tenant, we'd simply change the hashed guid, now all of the previously cached objects were invalid and would fall out of cache organically. Obviously, when we saw the FusionCache approach to caching with support for things like fallbacks, eager refreshes, etc... we jumped at it; because our implementation suffered one obvious failure, when the cache was invalidated it had to be rebuilt on request.
So.. when I saw this tagging solution, with the "high-pass filter" approach, I was picturing that the initial cache entry check/filter would be comparing against the persisted tag timestamp.

Maybe you were instead expecting that, when saving an entry tagged with the tag "my-tag", a corresponding entry would be created in Redis to keep track of all the entries tagged with that tag, but this is not the case with FusionCache.

Nah, I understood that there was a single object per-tag controlling the state; just was unaware that it didn't exist until a RemoveByTag was triggered.
And to be fair, I did do some diving into the FusionCache code, which landed me here:
https://github.com/ZiggyCreatures/FusionCache/blob/65f545d6ca5642b3f3da136fee666657d8294b09/src/ZiggyCreatures.FusionCache/FusionCache_Async.cs#L913C97-L913C132

Which then landed me here:

FusionCache/src/ZiggyCreatures.FusionCache/Internals/FusionCacheInternalUtils.cs

Line 548 in 65f545d

ctx.Options.SkipDistributedCacheWrite = true;

Which, now that you stated that "it not showing until after the RemoveByTag is called", makes complete sense. It doesn't explain why the other objects never made it into cache, but at least explains that nuance.

That would be strange, because (if the right log level is enabled) everything is logged.

Yeah, sadly with this volume, turning on info or debug logging isn't much of an option; we do have the ability to trace a percentage of transactions but had to disable the OTEL bridge due to a bug in the Elastic APM agent where it was effectively collecting all of the Activities FusionCache was publishing; it was ignoring all capture rates (we collect ~0.5% of transactions and with this bug, it was capturing 100% of the Activities as their own transactions, which got a little out of hand and managed to take our APM cluster offline).

Anyways, I'll keep digging around; I will say that once the tag object existed we now no longer see the factories being hit like they had been. So, the issue I had been seeing could now be gone. When I get some time in the next week or so, I'll see if I can get a minimal repro written; it's possible that how we're using tagging or our configuration is responsible for this.

On a positive note, since moving to FusionCache, our latencies, in the 99th percentile, have been static and our throughput has increased by nearly 40%.

Thank you for taking the time to read all of this and taking the time to reply. I do really appreciate it.

EDIT: Updated the symptoms to better clarify the original problem.

jodydonetti Nov 16, 2025
Maintainer

Wow, didn't have "Jody writing me a book" on my bingo card for the day. 💯

Ahah, fair 🤣

When we transitioned over from our old caching solution (I'll cover that a bit more later) to FusionCache, there was a lot of scrutiny placed on the application performance when we deployed this code into production.

As it should be.

As mentioned earlier, this is a massive multi-tenant application servicing ~50-60k tenants
[...]
At the time of deployment, we had the OTEL activities getting captured in our Elastic APM cluster and could see the attempt to pull the object from L1 with the expected cache key, only for it to not capture the object and instead jump into the factory. The object in question was a frequently used object in the request pipeline, so when the object was requested later on in a future request (milliseconds to seconds later) and it still jumped into the factory, we started digging even deeper.

I don't know the entry options used, but it feels like something related to them: what follows is a list of possible examples.

If the object is not used from L1 even though is there, I can think from the top of my head about a couple of potential reasons:

you are using Auto-Clone (see here), and there are deserialization issues such that it is not able to deserialize the object in L1 correctly (you should see something in the logs, but maybe OTEL sampling is hiding that)
you are using Fail-Safe with a duration (maybe tweaked on-the-fly via Adaptive Caching) that is too low: this will make it so that the entry is in fact physically there but it's already logically expired

If instead the issue is with L2, maybe somewhere you are setting SkipDistributedCacheWrite or SkipDistributedCacheRead?

Or maybe you are using distributed cache timeouts (see here) with a very low value? If, for example, reading from L2 takes normally 10ms and you set the timeout to even just 9ms, this will lead to the cache basically not being used on read because you are not giving it enough time to do it.
Contrived and simplistic example, I know, but to drive the point home.

I apologize for the word salad above; I just thought it might help to put a bit more perspective around this. On to trying to answer some of your questions.

Totally the opposite, it has been very helpful in better understanding the scenario.

We are using GetOrSetAsync in all but one case, that one case is using GetOrDefaultAsync with a separate SetAsync.

Just out of curiosity: is this a case where it is not possible to use GetOrSetAsync insted? I'm asking because when using 2 separate calls you are basically not giving the cache a chance to coordinate the separate calls (both to/from the cache + the factory), meaning no stampede protection there. Just so you're aware of that.

I will add that, when we were digging into the missing objects in L2, the only objects that were present for these tenants at the time were the ones created via this GetOrDefaultAsync/SetAsync pair.

Mmmh, very very interesting: as stated above, maybe in the GetOrSet call you are tweaking the entry options on the fly viaAdaptive Caching, and something is going on there?

I've read the documentation several times and can say that I must have missed:

The entry for the tag itself is created only when calling RemoveByTag("my-tag"), because the entry will store the timestamp of when the operation happened.

I'll try to read it again later, and in case I'll clarify this better, thanks for pointing this out.

Which then landed me here:

FusionCache/src/ZiggyCreatures.FusionCache/Internals/FusionCacheInternalUtils.cs

Line 548 in 65f545d

ctx.Options.SkipDistributedCacheWrite = true;

Which, now that you stated that "it not showing until after the RemoveByTag is called", makes complete sense.

Yup, it's an optimization I made to avoid having the cache full of entries with "zero" everywhere, which seemed like a waste and would create confusion when observed by a user.

It doesn't explain why the other objects never made it into cache, but at least explains that nuance.

Correct, I'm still baffled by it 🤔

Yeah, sadly with this volume, turning on info or debug logging isn't much of an option; we do have the ability to trace a percentage of transactions but had to disable the OTEL bridge due to a bug in the Elastic APM agent where it was effectively collecting all of the Activities FusionCache was publishing; it was ignoring all capture rates (we collect ~0.5% of transactions and with this bug, it was capturing 100% of the Activities as their own transactions, which got a little out of hand and managed to take our APM cluster offline).

Ouch. Been there too with something different but similar, not fun.

Anyways, I'll keep digging around; I will say that once the tag object existed we now no longer see the factories being hit like they had been. So, the issue I had been seeing could now be gone. When I get some time in the next week or so, I'll see if I can get a minimal repro written; it's possible that how we're using tagging or our configuration is responsible for this.

Ok, let me know!

On a positive note, since moving to FusionCache, our latencies, in the 99th percentile, have been static and our throughput has increased by nearly 40%.

Thanks for sharing, this is really great to know and made my day 🙂

Thank you for taking the time to read all of this and taking the time to reply. I do really appreciate it.

Thank for sharing, it's always interesting getting to know different scenarios and how FusionCache is being used out in the wild.

One last thing.

If you like and have time, can you go here:

and follow the link?
Share whatever you want, it's the only way for me to get to know how people are using FusionCache.

Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

L2 Tagging hashset not always present. #552

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 4 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

L2 Tagging hashset not always present. #552

Uh oh!

jrlost Nov 6, 2025

Replies: 1 comment · 4 replies

Uh oh!

Uh oh!

jodydonetti Nov 7, 2025 Maintainer

Uh oh!

Uh oh!

jrlost Nov 7, 2025 Author

Uh oh!

jodydonetti Nov 13, 2025 Maintainer

Uh oh!

Uh oh!

jrlost Nov 13, 2025 Author

Uh oh!

jodydonetti Nov 16, 2025 Maintainer

jrlost
Nov 6, 2025

Replies: 1 comment 4 replies

jodydonetti
Nov 7, 2025
Maintainer

jrlost Nov 7, 2025
Author

jodydonetti Nov 13, 2025
Maintainer

jrlost Nov 13, 2025
Author

jodydonetti Nov 16, 2025
Maintainer