-
Hi @jrlost, hashsets are not used at all in FusionCache. For more on the particular design, take a look at the related docs. This design basically scales to infinity, so even if you have, say, 1M entries all tagged with a certain tag, invalidating that tag is still a single O(1) operation.
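As a rough illustration of that claim (a minimal sketch with made-up keys and tag names, assuming the v2 tagging overloads, that is a tags parameter on SetAsync plus RemoveByTagAsync):

```csharp
// Sketch only: hypothetical keys/values; assumes the v2 tagging API
// (a "tags" parameter on SetAsync, plus RemoveByTagAsync).
using ZiggyCreatures.Caching.Fusion;

var cache = new FusionCache(new FusionCacheOptions());

// Imagine a huge number of entries all sharing the same tag...
const int taggedEntries = 1_000_000;
for (var i = 0; i < taggedEntries; i++)
{
    await cache.SetAsync($"product:{i}", i, tags: new[] { "tenant-123" });
}

// ...invalidating the tag is still a single O(1) write of the tag entry,
// not a scan or removal of the 1M tagged entries themselves.
await cache.RemoveByTagAsync("tenant-123");
```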
Hope this helps, let me know.
-
I think you misinterpreted what I said. The "hashset" I was referring to is the tag entry as it's stored in Redis (L2), not anything inside FusionCache itself.
I brought this up as a discussion in hopes of rubber ducking or bouncing ideas off of you (or others) to try to figure out why the tag entry isn't making it to Redis.
I can certainly scour the FusionCache L2 persistence code; I've already dug through much of the tag-related code but didn't find anything that would indicate why this isn't making it to Redis. I figured I'd drop this here in hopes that someone more intimately familiar with the code may have ideas of their own.
-
Hi @jrlost
Ok, question: how are you using hashes on Redis? Are you working with them directly (for example via StackExchange.Redis), or only indirectly via FusionCache? I'm trying to better understand the situation.
Related to the question above: only
Yes, this is correct: when you save an entry with a tag, nothing is created for that tag, only the tagged entry itself. The entry for the tag itself is created only when calling RemoveByTag. From there on, any "get" operation (including the "get" part of a GetOrSet call) checks the tags of the entry being read against those tag entries, so stale tagged data is never served.

This is why I asked if you read the Tagging docs that explain the design, it's pretty peculiar. Maybe you were instead expecting that, when saving an entry tagged with a certain tag, an entry for that tag (containing the list of tagged cache keys) would also be created right away?

The approach of having an entry for the tag which includes the list of cache keys tagged with that tag is a relatively common approach (I think the MS implementation of output caching for Redis is using it), but the problem is that this approach does not scale well (imho), mainly for 2 reasons: the tag entry grows with the number of tagged cache keys (with, say, 1M tagged entries it would contain 1M keys), and every save or removal of a tagged entry would also need to update that shared list.
Also, to avoid conflicts with updating a "normal" entry (of type Redis STRING), this approach usually requires using a Redis SET, and that is not supported via the standard IDistributedCache abstraction.

Because of these (and other) reasons, I came up with a totally different approach/design for FusionCache, one that basically requires a single O(1) operation when calling RemoveByTag. I also employed a bunch of other optimizations so that the work needed is as little (and as fast) as possible.
Yes, they use a large TTL because they are used to keep track of when each tag was last invalidated. So they need to stay in the cache for a longer time, to be sure a future read of old data is not served if its timestamp is lower than the tag's expiration timestamp.
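To make that concrete, here is an illustrative sketch of the check being described (this is not the actual FusionCache internals; the names and shape are hypothetical):

```csharp
// Illustrative only: a hand-rolled version of the check described above, with
// hypothetical names. An entry is served only if it was written AFTER the most
// recent RemoveByTag timestamp of every tag it carries.
using System.Collections.Generic;
using System.Linq;

static class TagCheckSketch
{
    public static bool IsStillValid(
        long entryTimestamp,
        IEnumerable<string> entryTags,
        IReadOnlyDictionary<string, long> tagExpirationTimestamps)
    {
        // No tag entry yet (it is created only by RemoveByTag) means there is
        // nothing to compare against, so the entry is considered valid.
        return entryTags.All(tag =>
            !tagExpirationTimestamps.TryGetValue(tag, out var tagTimestamp)
            || entryTimestamp > tagTimestamp);
    }
}
```

This is also why the tag entries need a long TTL: they have to outlive any tagged data that could still be read from L1/L2.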
Wow, that's brutal (in a positive way 🙂). If you are up for a quick chat I'd like to know more about it (of course without violating privacy, NDAs, etc). Let me know.
Yup, been there, I know the pain (sometimes).
That would be strange, because (if the right log level is enabled) everything is logged.
You did right, and thanks for sharing! Again, if you like I'd have a chat to better figure things out.
Well, I wrote all the FusionCache code, so I'm probably the right person to try to figure it out 😀 Thanks, and let me know about the chat.
-
Wow, didn't have "Jody writing me a book" on my bingo card for the day. 💯 I'll start by adding a little more context, then I'll dive into some of your questions.

When we transitioned from our old caching solution (I'll cover that a bit more later) to FusionCache, there was a lot of scrutiny placed on application performance when we deployed this code into production. As mentioned earlier, this is a massive multi-tenant application servicing ~50-60k tenants, so we take changes to performance seriously. When we deployed to production, the first thing we started to notice was that certain things behind the factory passed into GetOrSetAsync were being fetched from the database more often than expected.

At the time of deployment, we had the OTEL activities being captured in our Elastic APM cluster and could see the attempt to pull the object from L1 with the expected cache key, only for it to not find the object and instead jump into the factory. The object in question was a frequently used object in the request pipeline, so when the object was requested later on in a future request (milliseconds to seconds later) and it still jumped into the factory, we started digging even deeper.

The first thing we noticed was the missing tag entry in L2, and given what you've suggested, it sounds like this was a red herring. What's curious is that as soon as we did call RemoveByTag and the tag entry appeared, the factories stopped being hit the way they had been.

I apologize for the word salad above; I just thought it might help to put a bit more perspective around this. On to trying to answer some of your questions.
With the update to FusionCache, we're only using FusionCache to read from/write to Redis via the RedisCache distributed cache provider w/ backplane; we are not creating these hashes ourselves, nor are we managing Redis objects outside of FusionCache. We do query Redis directly periodically when diagnosing issues, like the HGET example in the original post.
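For reference, the wiring is roughly the following (a simplified sketch with placeholder connection strings; treat the exact prefix and option values as illustrative rather than our real config):

```csharp
// Simplified sketch of the registration (placeholder connection strings; values are
// illustrative). Uses the Microsoft Redis IDistributedCache provider plus the
// FusionCache Redis backplane and System.Text.Json serializer packages.
using Microsoft.Extensions.Caching.StackExchangeRedis;
using Microsoft.Extensions.DependencyInjection;
using ZiggyCreatures.Caching.Fusion;
using ZiggyCreatures.Caching.Fusion.Backplane.StackExchangeRedis;
using ZiggyCreatures.Caching.Fusion.Serialization.SystemTextJson;

var services = new ServiceCollection();

services.AddFusionCache()
    // Prefix inferred from the key names shown in the original post (e.g. "sf-v2:...").
    .WithOptions(o => o.CacheKeyPrefix = "sf-v2:")
    .WithSerializer(new FusionCacheSystemTextJsonSerializer())
    .WithDistributedCache(new RedisCache(new RedisCacheOptions
    {
        Configuration = "REDIS_CONNECTION_STRING"
    }))
    .WithBackplane(new RedisBackplane(new RedisBackplaneOptions
    {
        Configuration = "REDIS_CONNECTION_STRING"
    }));
```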
We are using
I've read the documentation several times and can say that I must have missed that part.
Regardless, I can honestly say that my perspective may have been skewed by our previous cache implementation.
Nah, I understood that there was a single object per tag controlling the state; I just was unaware that it didn't exist until a RemoveByTag call. Which then landed me here: FusionCache/src/ZiggyCreatures.FusionCache/Internals/FusionCacheInternalUtils.cs, line 548 in 65f545d.
Which, now that you've stated that it won't show up until after the RemoveByTag call, makes sense.
Yeah, sadly with this volume, turning on a more verbose log level isn't really feasible.

Anyways, I'll keep digging around; I will say that once the tag object existed, we no longer saw the factories being hit like they had been. So, the issue I had been seeing could now be gone. When I get some time in the next week or so, I'll see if I can get a minimal repro written; it's possible that how we're using tagging, or our configuration, is responsible for this.

On a positive note, since moving to FusionCache our 99th-percentile latencies have held steady and our throughput has increased by nearly 40%. Thank you for taking the time to read all of this and to reply. I really do appreciate it.

EDIT: Updated the symptoms to better clarify the original problem.
-
Ahah, fair 🤣
As it should be.
I don't know the entry options used, but it feels like something related to them; here are some possible examples. If the object is not used from L1 even though it's there, I can think, off the top of my head, of a couple of potential reasons.
If instead the issue is with L2, maybe somewhere you are setting an option that skips L2 reads? Or maybe you are using distributed cache timeouts (see here) with a very low value? If, for example, reading from L2 normally takes longer than the configured timeout, the distributed read would be cut short and the factory executed instead.
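As a purely hypothetical example of the kind of options that could produce this behavior (made-up values, just to show what to look for):

```csharp
// Hypothetical entry options that could make "L2 has the data but the factory still
// runs" happen: a very short duration and very aggressive distributed cache timeouts.
// The values are made up for illustration only.
using System;
using ZiggyCreatures.Caching.Fusion;

var options = new FusionCacheEntryOptions
{
    // If this is much shorter than intended, entries "disappear" very quickly.
    Duration = TimeSpan.FromSeconds(1),

    // If reading from L2 normally takes longer than these, the distributed read
    // can be abandoned and the factory executed instead.
    DistributedCacheSoftTimeout = TimeSpan.FromMilliseconds(5),
    DistributedCacheHardTimeout = TimeSpan.FromMilliseconds(20),
};
```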
Totally the opposite, it has been very helpful in better understanding the scenario.
Just out of curiosity: is this a case where it is not possible to use
Mmmh, very very interesting: as stated above, maybe there's something in the entry options being used.
I'll try to read it again later and, if needed, I'll clarify it better; thanks for pointing this out.
Yup, it's an optimization I made to avoid having the cache full of entries with "zero" everywhere, which seemed like a waste and would create confusion when observed by a user.
Correct, I'm still baffled by it 🤔
Ouch. Been there too with something different but similar, not fun.
Ok, let me know!
Thanks for sharing, this is really great to know and made my day 🙂
Thanks for sharing, it's always interesting to get to know different scenarios and how FusionCache is being used out in the wild. One last thing: if you like and have time, can you go here:
and follow the link? Thanks!
-
A little background: we've been slowly transitioning our apps over to FusionCache, and recently a few of them landed in production. One of the apps is a large multi-tenant solution where we rely on tagging for our tenant-level cache invalidation. With this release, we've noticed an increase in calls to the database for things that appear to be in L2 and not stale; things I'd expect to have been used.
While digging into the issue, I spotted something strange: the `__fc:t:TAGNAME` object was not in Redis (which is what we're using for L2). After triggering a `RemoveByTag` call on the tag, it then started showing up (and, as expected, has stuck around). We are primarily using `GetOrSetAsync` when capturing these, and in each place we're actively passing in the tags.

Example weird state:

```
HGET "sf-v2:TENANTIDENTIFIER:siteredirects" data
"{\"Value\":[],\"Timestamp\":638980440162818275,\"LogicalExpirationTimestamp\":638980442562818275,\"Tags\":[\"TENANTIDENTIFIER\"],\"Metadata\":{\"IsStale\":false,\"EagerExpirationTimestamp\":638980441122818275,\"Size\":1}}"

HGET "sf-v2:__fc:t:TENANTIDENTIFIER" data
(nil)
```

We're running this application on multiple nodes and relying on the backplane to keep things in sync. I suspect that if this tag hashset is missing from L2, the app instances will start to deviate, and it's possible each instance could have a different source of truth with regard to the tag expiration.
Originally I had suspected this was an issue with our Redis instance. We use volatile-lru for evictions, but when evaluating our cluster, INFO showed there had been zero evictions, suggesting this object just never made it to Redis at all.
Another thought I had was that perhaps there was an exception or timeout on the write to L2, but our logs (log level currently set to WARNING) show no errors related to FusionCache.
I brought this up here instead of as an issue primarily because I have no means to replicate it; at this point, my evidence is just the handful of occurrences like the example above. It's all anecdotal at this time, but something has resulted in L2 losing (or never getting) the tag hashset. I will keep looking, but I wanted to pick your brains for other things that could have resulted in this state.
One more data point: this app is running across >12 instances and doing >1 million caching operations/sec, so it's definitely not outside the realm of possibility that it's a contention, memory, or heap issue; we've seen other weird behaviors happen at this scale in the past.