Local Cache backend - Request for feedback #231
Replies: 3 comments 8 replies
Hello! I'm the original author. On a high level your ideas make a lot of sense, and I'm very impressed. I also think your use case would benefit greatly from your idea, be that using rdedup or some other chunking dedup project. I designed rdedup specifically to keep it a general-purpose deduplication engine, with the separate library and such. The project should work fine, but a while ago I just didn't have any more time to invest in it and lost interest, especially since no other developers/users materialized. I'm not aware of anyone really using it for anything, and there has not been any activity over the last few years, which is a shame because AFAIK it's all well thought through and put together. The code is probably not up to date with the latest state of the Rust language and ecosystem. IMO you should not worry too much: fork the thing and adapt it to your needs as you please. Asking me what is OK and not OK is just going to slow you down, and I myself am not even using this project personally. Everything should be structured and abstracted quite well, so changing the design should not hurt.
IIRC the current implementation just naively reads stuff one by one as needed, so IO latency will greatly slow it down, and there's no way the device bandwidth is well utilized. Some parallelization of reading IO (thread pools?) should be relatively easy to do and could easily lead to 10x performance on the reading side, as loading is easily parallelizable. Make sure you've read the few wiki pages: https://github.com/dpc/rdedup/wiki . Also, there is a GC mechanism built in, where "generations" are tracked and all stored content is moved over to another directory, to detect chunks that can be deleted. It might be useful or get in the way of your plans. I'm happy to answer some more specific questions, but I don't think I've worked on the project in a significant way in 6 years, so I'll have to look at the old code myself and see how bad it is. :D
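The parallel-read idea above could look roughly like this: a small worker pool pulls chunk digests off a shared queue and sends results back over a channel, which are then reassembled in order. This is only a sketch with a stand-in `read_chunk` function, not rdedup's actual API:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Stand-in for a per-chunk backend read (IO-bound in the real project).
fn read_chunk(digest: u64) -> Vec<u8> {
    vec![(digest % 256) as u8; 4]
}

// Read all chunks using `workers` threads, preserving input order.
fn parallel_read(digests: Vec<u64>, workers: usize) -> Vec<Vec<u8>> {
    let n = digests.len();
    // Shared job queue of (original index, digest) pairs.
    let jobs = Arc::new(Mutex::new(
        digests.into_iter().enumerate().collect::<Vec<_>>(),
    ));
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for _ in 0..workers {
        let jobs = Arc::clone(&jobs);
        let tx = tx.clone();
        handles.push(thread::spawn(move || loop {
            // Pop a job; the mutex guard is dropped before the (slow) read.
            let job = jobs.lock().unwrap().pop();
            match job {
                Some((idx, digest)) => tx.send((idx, read_chunk(digest))).unwrap(),
                None => break,
            }
        }));
    }
    drop(tx); // channel closes once every worker has finished
    let mut out = vec![Vec::new(); n];
    for (idx, data) in rx {
        out[idx] = data;
    }
    for h in handles {
        h.join().unwrap();
    }
    out
}
```

In real code a thread-pool crate like rayon would be more idiomatic, but the shape is the same: fan the per-chunk IO out, then restore the original order before reassembling the stream.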
Hi @dpc ! I do have a specific question. I'm considering exposing the repo I have on my Raspberry Pi over HTTP using this project. But the one thing I am realising is that this offers me no way to represent a shared lock being active; therefore I do not see at the moment a way to prevent write operations from starting whilst a read is ongoing (or vice versa, though I guess, since in my case the write would be done with access to the filesystem, the lock file could be changed to represent an existing exclusive lock, for instance by making the file have a size of 1 or something). Now, I am thinking I could have some mechanism over HTTP to take a lock; maybe a custom endpoint that would, on the server side, be implemented to call a special new

But my question is: what is the purpose of the write protection in the first place? I must be missing something entirely because I am sure it has one 😄 but my understanding was that once a chunk is computed and written to disk at its address, if a new

Thanks for your answer. And if you get an idea on how I could solve this problem, let me know! 😁
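For what it's worth, the "lock file of size 1" convention floated above could be sketched like this. This is purely hypothetical, not how rdedup's locking actually works, and note that a check-then-act scheme like this is racy without real advisory locks (e.g. `flock`) underneath:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical convention from the comment above: an empty lock file
// means only shared locks may be active, while a non-empty one
// (e.g. size 1) signals an exclusive (write) lock is held.
fn exclusive_lock_held(lock_path: &Path) -> io::Result<bool> {
    Ok(fs::metadata(lock_path)?.len() > 0)
}

// Grow the lock file to size 1 to advertise an exclusive lock.
fn mark_exclusive(lock_path: &Path) -> io::Result<()> {
    fs::write(lock_path, b"x")
}

// Truncate back to size 0 to release it.
fn clear_exclusive(lock_path: &Path) -> io::Result<()> {
    fs::write(lock_path, b"")
}
```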
Hi again @dpc ! Just thought I would share with you that I had fun creating a "read-only HTTP backend", in case you are curious! You can see the main commit here, if you feel like it. It works really well... when the repository is not using encryption 😅 When it is using encryption, it works without issue on Linux but not on Windows! Which is a bit mysterious. If you happen to have a gut feeling as to where that could come from, let me know, but I'm not going to ask you to debug my experiments 😁 I'm only sharing in case you find it interesting! I will probably switch my focus to using

I'm really enjoying Rust, I have to say! And working on

Hello everyone!
Context
I found rdedup when searching around for a data deduplication library, engine, or tool. I am investigating relying on one in our CI/CD pipeline at my game studio, so as to be able to sync binary data (quite heavy, often several dozen gigabytes at a time) to everybody working on the project.
I think it would already yield significant time savings for people working on our LAN, where the data is on network drives, but we also have a few people working remotely accessing the data over a VPN, which of course can take quite a long time when our tool naively makes them pull the whole new updated game revision across the Atlantic Ocean 🙃
And as game binaries grow quite iteratively, it's probably a good use-case for deduplication :)
Considering I have been wanting to properly learn Rust for a few years now (I only ever got around to reading the book but never really wrote any "real" Rust), and since `rdedup` seems quite performant and reliable, I took a keen interest in the project! So thanks for making it, first of all 😄
I think I will eventually use the library to make an in-house synchronisation tool dedicated to our use case, the day I can afford to dedicate time to this. But in the meantime I am toying around with the binary in my spare time 🙂
The first test I did was to `tar` up 10 successive game editor builds, totalling 40GB of binaries, and feed them to `rdedup`, which created a repository totalling only 4.7GB (including all the metadata) 😁 a good indication that data deduplication might be a good investment for us!

Tip
Actually I only played with the game modules of our editor, not the engine modules or the modules of the editor itself! In other words, the data I played with is already the data that mutates the most in our setup. Engine and editor binaries are orders of magnitude larger but mutate fairly rarely in our studio. So there might be even more gains to be had for us 😉
Idea
Because of my use case (not so much storage-efficient backups but rather efficient data transfers where a lot of the data is redundant over time), I am reluctant to use `rdedup` in a way where all users constantly synchronise/copy the whole data store onto their local machines, as it might still be a lot of data going around, and a lot of it might be irrelevant until they need a specific build of the game (for instance). I especially want to avoid people using a VPN having to pull unnecessary data.

So, my idea was to create a new `Backend` type, which I called `LocalCache`, which would keep some of the chunks locally. Reading the codebase a bit, I saw an opportunity to try something a bit cheeky, and I implemented `Backend` for `LocalCache` by making it a sort of meta-backend which actually holds two underlying backends: the "real" one, called `remote`, which can be any concrete type of `Backend`; and a `local` one of type `Local`, which is the actual cache and always filesystem-based.

Then, upon reading, it first tries to delegate the read to the `local` backend, but upon encountering an error, it assumes it was a cache miss and delegates it to the remote instead. Before returning the result from the remote, it calls a `write` operation for that chunk on the `local` repository, which is how a chunk might already be present next time a `load` operation is triggered.

Proof of concept implementation
Here is a quick implementation I made on my forked repository of `rdedup`. And before I mention the outstanding issues I can think of with it, here are a few notes.
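To make the read-path logic described above concrete, here is a minimal sketch of the meta-backend idea. The `Backend` trait here is a simplified stand-in with only the two calls the cache logic needs, not rdedup's real trait, and `MemBackend` is just an in-memory stand-in for the concrete backends:

```rust
use std::collections::HashMap;

// Simplified stand-in for rdedup's Backend trait.
trait Backend {
    fn load(&mut self, digest: &str) -> Result<Vec<u8>, String>;
    fn write(&mut self, digest: &str, data: &[u8]) -> Result<(), String>;
}

// In-memory stand-in for a concrete backend (Local, HTTP, ...).
struct MemBackend(HashMap<String, Vec<u8>>);

impl Backend for MemBackend {
    fn load(&mut self, digest: &str) -> Result<Vec<u8>, String> {
        self.0.get(digest).cloned().ok_or_else(|| "not found".to_string())
    }
    fn write(&mut self, digest: &str, data: &[u8]) -> Result<(), String> {
        self.0.insert(digest.to_string(), data.to_vec());
        Ok(())
    }
}

// The meta-backend: two underlying backends, cache plus "real" store.
struct LocalCache<L: Backend, R: Backend> {
    local: L,   // filesystem-based cache in the real implementation
    remote: R,  // the "real" backend
}

impl<L: Backend, R: Backend> Backend for LocalCache<L, R> {
    // Try the cache first; treat a local error as a cache miss, fall
    // back to the remote, and populate the cache before returning.
    fn load(&mut self, digest: &str) -> Result<Vec<u8>, String> {
        if let Ok(data) = self.local.load(digest) {
            return Ok(data);
        }
        let data = self.remote.load(digest)?;
        let _ = self.local.write(digest, &data); // cache-population failure is non-fatal here
        Ok(data)
    }

    // Writes go to the remote and are mirrored into the cache.
    fn write(&mut self, digest: &str, data: &[u8]) -> Result<(), String> {
        self.remote.write(digest, data)?;
        self.local.write(digest, data)
    }
}
```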
Disabled `with-xz2` default feature

Please ignore the disabling of the default `with-xz2` feature. To my own despair I mostly work in a Windows environment (hard not to in the games industry!) and the library `rdedup` relies on for LZMA bindings misbehaves on Windows. I do also have a Linux machine where simply installing `liblzma-dev` fixed the same issue, but I have no clue how to fix it on Windows and didn't care to figure it out for now 😌 If I were ever to make a pull request (I do not think this stuff is ready for it anyway), I'd hopefully remember to re-enable the feature in my fork 🙃

`backend_from_url` changes

As I needed a code path to create a third type of backend, I started to see some flaws (in my humble opinion) in the whole URL-driven approach to creating backends. To me it feels like that doesn't really belong in the library part of the codebase; it feels outside the scope of its responsibilities to have such a high-level but also very specific way to resolve a backend. I feel like it makes more sense to expect a backend as a dependency of the library; and achieving that using a closure (as was almost the case before, it was a function pointer) is a great way to do it in a deferred manner, which can be useful.
Now, this is kind of based on gut feel, but I think that argument is backed by the fact that the `Repo` struct existed and was carried through fairly low levels of the codebase solely in order to be given as a dependency to the backend factory function that is used whenever a backend thread needs to be created again. This felt wrong, and felt to me like a leaky abstraction where a design decision to use URLs (which in and of itself I do like, mind you!) was making its way into completely unrelated parts of the system.

To combat that, I changed the `BackendSelectFn` to be a closure instead of a function pointer. That way, the URL data is captured in the closure in the parts of the codebase that want to work that way, whilst other backends may be created using completely different dependencies.

Because, again in my humble opinion, I believe that the `Repo` `init` and `open` functions should be agnostic to the backend creation scheme, I renamed `init_custom` and `open_custom` to `init` and `open` respectively; whilst the original `init` and `open` functions are now respectively named `init_from_url` and `open_from_url`.

This all meant that the URL data could disappear from the `Repo` struct, which feels right to me 🙂

I am not certain how I feel about the URL-driven approach even being part of the lib still. Again, I like the idea, it just feels out of scope, but maybe not. Maybe the changes I made are enough and the URL approach is convenient to have? Maybe moving the URL stuff to either another, optional lib, or an optional feature of the main lib, or simply a different module just for separation of concerns' sake would be good?
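As an illustration of the closure change, here is a sketch with simplified stand-in names (this is not the exact signature from my fork):

```rust
// Minimal stand-in Backend trait for the sketch.
trait Backend {
    fn describe(&self) -> String;
}

struct LocalBackend {
    path: String,
}

impl Backend for LocalBackend {
    fn describe(&self) -> String {
        format!("local:{}", self.path)
    }
}

// Previously a plain `fn` pointer; as a boxed closure it can own
// whatever state (URL or otherwise) the caller wants to capture.
type BackendSelectFn = Box<dyn Fn() -> Box<dyn Backend> + Send + Sync>;

// The URL-driven convenience path: the URL is captured by the closure
// instead of being carried around inside the Repo struct.
fn backend_from_url(url: String) -> BackendSelectFn {
    Box::new(move || -> Box<dyn Backend> {
        Box::new(LocalBackend {
            path: url.trim_start_matches("file://").to_string(),
        })
    })
}
```

Callers that don't care about URLs just build their own closure capturing whatever dependencies they need, which is the point of the change.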
I am quite curious to hear anyone's thoughts on those changes 😄
Simple test
In order to test my changes, I somewhat crudely set up the "true" (remote) repository on my HDD, and the "cache" (local) repository on my NVMe SSD, simulating slower access to the `remote` repo.

To create the cache repository, I copied the remote repository after it was initialised but before I performed any `store` operation on it. Then I `store`d all the test tarballs in the remote repo.

When `load`ing a specific `name` (weighing 2.8GB) and outputting it to a file on my SSD with the cache completely empty, it took 3 minutes and 44 seconds to complete.

When `load`ing the same `name` again, this time with the cache having been populated, it took 1 minute and 22 seconds.

This is of course both a contrived and ideal scenario, since 100% of the chunks will be a cache hit. So not only is all of the data already available locally, we also avoid the underlying cache `write` that a cache miss would incur on top of the longer `read` all the way from the remote repo.

That being said, considering how much data was deduplicated in my use case (remember, 40GB down to 4.7GB), I imagine a significant amount of the data will be cached even in (my) real-world use cases. So this seems to be worth it!
Issues with this first iteration
Cache creation not supported
For now, I created the cache manually by literally recursively copying the repository I considered "remote", before I `store`d any data into it. This is a bit crude, relies on external user action, and is of course not documented! This probably should be addressed.

Error handling
Errors encountered, such as during a `read` from `local`, might need to be handled in a more granular way. There may need to be a specific error to look out for, indicating that a chunk read triggered a cache miss; whilst other errors shouldn't be ignored and should be dealt with properly.

No cache eviction
There's currently no cache eviction whatsoever. If the cache backend is truly thought of as a cache that people might not want to see grow too much, there could be an eviction policy of sorts that deletes chunks that were not read in a long time. But I haven't given any thought yet to how to implement that, as it'd need new metadata somewhere to keep track of reads, and also some thought on when cache eviction may happen. Is the tool invoked with a new `cache` command that could run a `clean` or `evict` subcommand? Or does it work "on the fly" somehow (at the cost of a slight performance hit during a `load`, since cache eviction would be performed somewhere)? The former is maybe "cleaner", but the latter keeps the cache working seamlessly without needing to think of it as a bigger "feature" in the tool... I am not sure what to think of all of that yet and welcome ideas 🙂

Old names are left behind
Of course, because the cache directory is not meant to be live-synced with the remote "real" data store, it might have old names that were since removed from the central repository. Similarly to the cache eviction issue, a scheme would need to be devised to deal with that. Would the `cache` command have a `sync_remote` (or some better name) subcommand, at the cost of making the cache system too "big" or "present" in the `rdedup` feature set? Or is there a way to also perform this gradually during operations, so caching stays a more seamlessly integrated feature?

At least I think an easy way to implement it, regardless of the exact UX to get there, would be to run the `list` command on both the `local` (let's call its set of names L) and `remote` (with its set of names called R) repositories, compute the set L ∖ R, and for all names in that set, remove them from the local repository before finally running a garbage collection pass. Thoughts on that welcome, but I think it should hold up 😄

Locking issues
Conceptually, a `write` to `local` in a `read` operation feels a bit wrong, but maybe that's fine! Much more importantly, it is probably dangerous as of the current implementation, as no exclusive lock is held at that point. From what I understand, a shared lock will be held. I actually didn't have time to study that part of the codebase yet, and I'm just basing that on what I read in some comments, documentation, and some of the code. But my understanding is that this is possibly a bad idea as it is 😬
Of course, in the use case I project using all of this for, I would use the cache locally, not concurrently, and although I implemented the `write` operation too (which basically forwards the write to the remote but also writes locally), I would actually never do a write this way (our CI/CD would write to the remote repo and not use any caching backend).

But because I am toying with the idea of the cache being a full-fledged backend, the way I do it now is, I think, flawed, and creates new opportunities for a failure of at least the local repository.
Note
As I am writing this, I realise that I do have one place in the current data flow where I could actually do something about this. I guess I can change my implementation of `lock_shared` in `impl Backend for LocalCache` and always create an exclusive lock for the local cache! This will of course make it so that a cache repo cannot be concurrently read from, but that kind of makes sense in a way. What do people think of that idea? It would bring a limitation to the system when using the cache, but I think it would make it safe to use?

EDIT: I did that change now
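The policy from that edit can be summed up in a tiny sketch (the types here are stand-ins, not rdedup's lock API): a shared lock request on the LocalCache backend translates to shared on the remote but exclusive on the local cache, because the read path may write chunks into the cache.

```rust
// Stand-in lock type; rdedup's real locking API looks different.
#[derive(Debug, PartialEq)]
enum Lock {
    Shared,
    Exclusive,
}

struct CacheLocks {
    remote: Lock,
    local: Lock,
}

// What `lock_shared` on the LocalCache meta-backend now does, per the
// edit above: readers stay concurrent on the remote, but the local
// cache is locked exclusively, since cache population mutates it.
fn lock_shared_for_local_cache() -> CacheLocks {
    CacheLocks {
        remote: Lock::Shared,
        local: Lock::Exclusive,
    }
}
```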
Encryption
I am testing everything without any encryption. My use-case is all about getting data as fast as possible to end users in a controlled, sandboxed environment so we would likely keep it disabled, and as such I did not do any test with it.
The one thing I know is currently going to be odd with encryption is that I did not refactor the code path asking for a `passphrase`, and so I am pretty sure that the fun idea of having two live backends under the hood will now result in the passphrase being asked for twice 😅 Maybe that is not a big deal, as the backends could technically have different passphrases and everything would work! But it feels like that will probably be annoying for users, and currently the CLI wouldn't even specify which backend's passphrase the user is being prompted for first. So, there's definitely room for improvement there 🙃

Feedback request
I hope this wasn't too much of a long read! If anyone would be so kind, here are the key points I would love to get feedback on, or simply open a discussion about:

- The `LocalCache` backend. Does it make sense to others? Do we think this can work, and is it worth finding ways to solve the current shortcomings with an approach like this? Or is this a naive approach, and would we be better suited implementing a bespoke cache feature, weaved into the different abstraction layers of the codebase?

Final question and thoughts
I read somewhere the original author mentioning that, as his use case was always about creating storage-efficient backups, the `load` operation was never worked on as hard as the `store`, and so he assumed there was room for performance improvements there. Does anyone have any insights on that? If I ever follow through with my current designs to use this as a data-sharing tool, the `load` code path is actually much more critical to me than the `store` one 😄 If I need to, I am sure I can someday dive in and figure some stuff out, but any insights anyone has on where the current bottlenecks are and what could be done about them would be appreciated 🙂

I'm also interested in any high-level/conceptual overview of `rdedup`'s architecture and design, especially if anyone has more experience with it than I do and thinks they can highlight key areas that I did not seem to know about and that could help me in my goals 🙂

Though, I have to say, between it being implemented in Rust, the great architectural work that has been done, and the really interesting blog post about the "fearless concurrency" refactor by the original author, it has been so far quite comfortable to tinker within `rdedup`, which I think is quite an achievement!
I read the original author wasn't really working on this anymore and I do not know if anyone else is; so if no one ends up interacting with this probably-too-long writeup, it's OK! My thoughts needed to go somewhere and if anything this will serve as my personal notes for this project 😄