Local Cache backend - Request for feedback #231
Replies: 3 comments 8 replies
Hello! I'm the original author. On a high level your ideas make a lot of sense, and I'm very impressed. I also think your use case would benefit greatly from your idea, be that using rdedup or some other chunking dedup project. I designed rdedup specifically to keep it a general-purpose deduplication engine, with the separate library and such. The project should work fine, but a while ago I just didn't have any more time to invest in it and lost interest, especially since no other developers/users materialized. I'm not aware of anyone really using it for anything, and there has not been any activity over the last few years, which is a shame because AFAIK it's all well thought through and put together. The code is probably not up to date with the latest state of the Rust language and ecosystem. IMO you should not worry too much: fork the thing and adapt it to your needs as you please. Asking me what is OK and not OK is just going to slow you down, and I myself am not even using this project personally. Everything should be structured and abstracted quite well, so changing the design should not hurt.
IIRC the current implementation just naively reads stuff one by one as needed, so IO latency will greatly slow it down, and there's no way the device bandwidth is well utilized. Some parallelization of reading IO (thread pools?) should be relatively easy to do and could easily lead to 10x performance on the reading side, as loading is easily parallelizable. Make sure you've read the few wiki pages: https://github.com/dpc/rdedup/wiki . Also, there is a GC mechanism built in, where "generations" are tracked and all stored content is moved over to another directory, to detect chunks that can be deleted. It might be useful or get in the way of your plans. I'm happy to answer some more specific questions, but I don't think I've worked on the project in a significant way in 6 years, so I'll have to look at the old code myself and see how bad it is. :D
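The parallel-read idea above could look roughly like this: a small worker pool pulls chunk digests off a shared queue and sends results back over a channel, which are then reassembled in order. This is only a sketch with a stand-in `read_chunk` function, not rdedup's actual API:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Stand-in for a per-chunk backend read (IO-bound in the real project).
fn read_chunk(digest: u64) -> Vec<u8> {
    vec![(digest % 256) as u8; 4]
}

// Read all chunks using `workers` threads, preserving input order.
fn parallel_read(digests: Vec<u64>, workers: usize) -> Vec<Vec<u8>> {
    let n = digests.len();
    // Shared job queue of (original index, digest) pairs.
    let jobs = Arc::new(Mutex::new(
        digests.into_iter().enumerate().collect::<Vec<_>>(),
    ));
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for _ in 0..workers {
        let jobs = Arc::clone(&jobs);
        let tx = tx.clone();
        handles.push(thread::spawn(move || loop {
            // Pop a job; the mutex guard is dropped before the (slow) read.
            let job = jobs.lock().unwrap().pop();
            match job {
                Some((idx, digest)) => tx.send((idx, read_chunk(digest))).unwrap(),
                None => break,
            }
        }));
    }
    drop(tx); // channel closes once every worker has finished
    let mut out = vec![Vec::new(); n];
    for (idx, data) in rx {
        out[idx] = data;
    }
    for h in handles {
        h.join().unwrap();
    }
    out
}
```

In real code a thread-pool crate like rayon would be more idiomatic, but the shape is the same: fan the per-chunk IO out, then restore the original order before reassembling the stream.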
Hi @dpc ! I do have a specific question. I'm considering exposing the repo I have on my Raspberry Pi over HTTP using this project. But the one thing I am realising is that this offers me no way to represent a shared lock being active; therefore I do not see at the moment a way to prevent write operations from starting whilst a read is ongoing (or vice versa, though I guess, since in my case the write would be done with access to the filesystem, the lock file could be changed to represent an existing exclusive lock, for instance by making the file have a size of 1 or something). Now, I am thinking I could have some mechanism over HTTP to take a lock; maybe a custom endpoint that would, on the server side, be implemented to call a special new

But my question is: what is the purpose of the write protection in the first place? I must be missing something entirely because I am sure it has one 😄 but my understanding was that once a chunk is computed and written to disk at its address, if a new

Thanks for your answer. And if you get an idea on how I could solve this problem, let me know! 😁
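For what it's worth, the "lock file of size 1" convention floated above could be sketched like this. This is purely hypothetical, not how rdedup's locking actually works, and note that a check-then-act scheme like this is racy without real advisory locks (e.g. `flock`) underneath:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical convention from the comment above: an empty lock file
// means only shared locks may be active, while a non-empty one
// (e.g. size 1) signals an exclusive (write) lock is held.
fn exclusive_lock_held(lock_path: &Path) -> io::Result<bool> {
    Ok(fs::metadata(lock_path)?.len() > 0)
}

// Grow the lock file to size 1 to advertise an exclusive lock.
fn mark_exclusive(lock_path: &Path) -> io::Result<()> {
    fs::write(lock_path, b"x")
}

// Truncate back to size 0 to release it.
fn clear_exclusive(lock_path: &Path) -> io::Result<()> {
    fs::write(lock_path, b"")
}
```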
Hi again @dpc ! Just thought I would share with you that I had fun creating a "read-only HTTP backend", in case you are curious! You can see the main commit here, if you feel like it. It works really well... when the repository is not using encryption 😅 When it is using encryption, it works without issue on Linux but not on Windows! Which is a bit mysterious. If you happen to have a gut feeling as to where that could come from, let me know, but I'm not going to ask you to debug my experiments 😁 I'm only sharing in case you find it interesting! I will probably switch my focus to using

I'm really enjoying Rust, I have to say! And working on

Hello everyone!
Context
I found rdedup when searching around for a data deduplication library, engine, or tool. I am investigating relying on one in our CI/CD pipeline at my game studio, so as to be able to sync binary data (quite heavy, often several dozen gigabytes at a time) to everybody working on the project.
I think it would already yield significant time savings for people working on our LAN, where the data is on network drives, but we also have a few people working remotely accessing the data over a VPN, which of course can take quite a long time when our tool naively makes them pull the whole new updated game revision across the Atlantic Ocean 🙃
And as game binaries grow quite iteratively, it's probably a good use-case for deduplication :)
Considering I have been wanting to properly learn Rust for a few years now (I only ever got around to reading the book but never really wrote any "real" Rust), and since `rdedup` seems quite performant and reliable, I took a keen interest in the project! So thanks for making it, first of all 😄
I think I will eventually use the library to make an in-house synchronisation tool dedicated to our use case, the day I can afford to dedicate time to this. But in the meantime I am toying around with the binary in my spare time 🙂
The first test I did was to `tar` up 10 successive game editor builds, totalling 40GB of binaries, and feed them to `rdedup`, which created a repository totalling only 4.7GB (including all the metadata) 😁 a good indication that data deduplication might be a good investment for us!

Tip
Actually I only played with the game modules of our editor, not the engine modules or the modules of the editor itself! In other words, the data I played with is already the data that mutates the most in our setup. Engine and editor binaries are orders of magnitude larger but mutate fairly rarely in our studio. So there might be even more gains to be had for us 😉
Idea
Because of my use case (not so much storage-efficient backups but rather efficient data transfers where a lot of the data is redundant over time), I am reluctant to use `rdedup` in a way where all users constantly synchronise/copy the whole data store onto their local machines, as it might still be a lot of data going around, and a lot of it might be irrelevant until they need a specific build of the game (for instance). I especially want to avoid people using a VPN having to pull unnecessary data.

So, my idea was to create a new `Backend` type, which I called `LocalCache`, which would keep some of the chunks locally. Reading the codebase a bit, I saw an opportunity to try something a bit cheeky, and I implemented `Backend` for `LocalCache` by making it a sort of meta-backend which actually holds two underlying backends: the "real" one, called `remote`, which can be any concrete type of `Backend`; and a `local` one of type `Local`, which is the actual cache and always filesystem-based.

Then, upon reading, it first tries to delegate the read to the `local` backend, but upon encountering an error, it assumes it was a cache miss and delegates it to the remote instead. Before returning the result from the remote, it calls a `write` operation for that chunk on the `local` repository, which is how a chunk might already be present next time a `load` operation is triggered.

Proof of concept implementation
Here is a quick implementation I made on my forked repository of `rdedup`. And before I mention the outstanding issues I can think of with it, here are a few notes.
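To make the read-path logic described above concrete, here is a minimal sketch of the meta-backend idea. The `Backend` trait here is a simplified stand-in with only the two calls the cache logic needs, not rdedup's real trait, and `MemBackend` is just an in-memory stand-in for the concrete backends:

```rust
use std::collections::HashMap;

// Simplified stand-in for rdedup's Backend trait.
trait Backend {
    fn load(&mut self, digest: &str) -> Result<Vec<u8>, String>;
    fn write(&mut self, digest: &str, data: &[u8]) -> Result<(), String>;
}

// In-memory stand-in for a concrete backend (Local, HTTP, ...).
struct MemBackend(HashMap<String, Vec<u8>>);

impl Backend for MemBackend {
    fn load(&mut self, digest: &str) -> Result<Vec<u8>, String> {
        self.0.get(digest).cloned().ok_or_else(|| "not found".to_string())
    }
    fn write(&mut self, digest: &str, data: &[u8]) -> Result<(), String> {
        self.0.insert(digest.to_string(), data.to_vec());
        Ok(())
    }
}

// The meta-backend: two underlying backends, cache plus "real" store.
struct LocalCache<L: Backend, R: Backend> {
    local: L,   // filesystem-based cache in the real implementation
    remote: R,  // the "real" backend
}

impl<L: Backend, R: Backend> Backend for LocalCache<L, R> {
    // Try the cache first; treat a local error as a cache miss, fall
    // back to the remote, and populate the cache before returning.
    fn load(&mut self, digest: &str) -> Result<Vec<u8>, String> {
        if let Ok(data) = self.local.load(digest) {
            return Ok(data);
        }
        let data = self.remote.load(digest)?;
        let _ = self.local.write(digest, &data); // cache-population failure is non-fatal here
        Ok(data)
    }

    // Writes go to the remote and are mirrored into the cache.
    fn write(&mut self, digest: &str, data: &[u8]) -> Result<(), String> {
        self.remote.write(digest, data)?;
        self.local.write(digest, data)
    }
}
```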
Disabled `with-xz2` default feature

Please ignore the disabling of the default `with-xz2` feature. To my own despair I mostly work in a Windows environment (hard not to in the games industry!) and the library `rdedup` relies on for LZMA bindings misbehaves on Windows. I do also have a Linux machine where simply installing `liblzma-dev` fixed the same issue, but I have no clue how to fix it on Windows and didn't care to figure it out for now 😌 If I were ever to make a pull request (I do not think this stuff is ready for it anyway), I'd hopefully remember to re-enable the feature in my fork 🙃

`backend_from_url` changes

As I needed a code path to create a third type of backend, I started to see some flaws (in my humble opinion) in the whole URL-driven approach to creating backends. To me it feels like that doesn't really belong in the library part of the codebase; it feels outside the scope of its responsibilities to have such a high-level but also very specific way to resolve a backend. I feel like it makes more sense to expect a backend as a dependency of the library; and achieving that using a closure (as was almost the case before, it was a function pointer) is a great way to do it in a deferred manner, which can be useful.
Now, this is kind of based on gut feel, but I think that argument is backed by the fact that the `Repo` struct existed and was carried through fairly low levels of the codebase solely in order to be given as a dependency to the backend factory function that is used whenever a backend thread needs to be created again. This felt wrong, and felt to me like a leaky abstraction where a design decision to use URLs (which in and of itself I do like, mind you!) was making its way into completely unrelated parts of the system.

To combat that, I changed the `BackendSelectFn` to be a closure instead of a function pointer. That way, the URL data is captured in the closure in the parts of the codebase that want to work that way, whilst other backends may be created using completely different dependencies.

Because, again in my humble opinion, I believe that the `Repo` `init` and `open` functions should be agnostic to the backend creation scheme, I renamed `init_custom` and `open_custom` to `init` and `open` respectively; whilst the original `init` and `open` functions are now respectively named `init_from_url` and `open_from_url`.

This all meant that the URL data could disappear from the `Repo` struct, which feels right to me 🙂

I am not certain how I feel about the URL-driven approach even being part of the lib still. Again, I like the idea, it just feels out of scope, but maybe not. Maybe the changes I made are enough and the URL approach is convenient to have? Maybe moving the URL stuff to either another, optional lib, or an optional feature of the main lib, or simply a different module just for separation of concerns' sake would be good?
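As an illustration of the closure change, here is a sketch with simplified stand-in names (this is not the exact signature from my fork):

```rust
// Minimal stand-in Backend trait for the sketch.
trait Backend {
    fn describe(&self) -> String;
}

struct LocalBackend {
    path: String,
}

impl Backend for LocalBackend {
    fn describe(&self) -> String {
        format!("local:{}", self.path)
    }
}

// Previously a plain `fn` pointer; as a boxed closure it can own
// whatever state (URL or otherwise) the caller wants to capture.
type BackendSelectFn = Box<dyn Fn() -> Box<dyn Backend> + Send + Sync>;

// The URL-driven convenience path: the URL is captured by the closure
// instead of being carried around inside the Repo struct.
fn backend_from_url(url: String) -> BackendSelectFn {
    Box::new(move || -> Box<dyn Backend> {
        Box::new(LocalBackend {
            path: url.trim_start_matches("file://").to_string(),
        })
    })
}
```

Callers that don't care about URLs just build their own closure capturing whatever dependencies they need, which is the point of the change.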
I am quite curious to hear anyone's thoughts on those changes 😄
Simple test
In order to test my changes, I somewhat crudely set up the "true" (remote) repository on my HDD, and the "cache" (local) repository on my NVMe SSD, simulating slower access to the `remote` repo.

To create the cache repository, I copied the remote repository after it was initialised but before I performed any `store` operation on it. Then I `store`d all the test tarballs in the remote repo.

When `load`ing a specific `name` (weighing 2.8GB) and outputting it to a file on my SSD with the cache completely empty, it took 3 minutes and 44 seconds to complete.

When `load`ing the same `name` again, this time with the cache having been populated, it took 1 minute and 22 seconds.

This is of course both a contrived and ideal scenario, since 100% of the chunks will be a cache hit. So not only is all of the data already available locally, we also avoid the underlying cache `write` that a cache miss would incur on top of the longer `read` all the way from the remote repo.

That being said, considering how much data was deduplicated in my use case (remember, 40GB down to 4.7GB), I imagine a significant amount of the data will be cached even in (my) real-world use cases. So this seems to be worth it!
Issues with this first iteration
Cache creation not supported
For now, I created the cache manually by literally recursively copying the repository I considered "remote", before I `store`d any data into it. This is a bit crude, relies on external user action, and is of course not documented! This probably should be addressed.

Error handling
Errors encountered, such as during a `read` from `local`, might need to be handled in a more granular way. There may need to be a specific error to look out for, indicating that a chunk read triggered a cache miss; whilst other errors shouldn't be ignored and should be dealt with properly.

No cache eviction
There's currently no cache eviction whatsoever. If the cache backend is truly thought of as a cache that people might not want to see grow too much, there could be an eviction policy of sorts that deletes chunks that were not read in a long time. But I haven't given any thought yet to how to implement that, as it'd need new metadata somewhere to keep track of reads, and also some thought on when cache eviction may happen. Is the tool invoked with a new `cache` command that could run a `clean` or `evict` subcommand? Or does it work "on the fly" somehow (at the cost of a slight performance hit during a `load`, since cache eviction would be performed somewhere)? The former is maybe "cleaner", but the latter keeps the cache working seamlessly without needing to think of it as a bigger "feature" in the tool... I am not sure what to think of all of that yet and welcome ideas 🙂

Old names are left behind
Of course, because the cache directory is not meant to be live-synced with the remote "real" data store, it might have old names that were since removed from the central repository. Similarly to the cache eviction issue, a scheme would need to be devised to deal with that. Would the `cache` command have a `sync_remote` (or some better name) subcommand, at the cost of making the cache system too "big" or "present" in the `rdedup` feature set? Or is there a way to also perform this gradually during operations, so caching stays a more seamlessly integrated feature?

At least I think an easy way to implement it, regardless of the exact UX to get there, would be to run the `list` command on both the `local` (let's call its set of names L) and `remote` (with its set of names called R) repositories, compute the set L ∖ R, and for all names in that set, remove them from the local repository before finally running a garbage collection pass. Thoughts on that welcome, but I think it should hold up 😄

Locking issues
Conceptually, a `write` to `local` in a `read` operation feels a bit wrong, but maybe that's fine! Much more importantly, it is probably dangerous as of the current implementation, as no exclusive lock is held at that point. From what I understand, a shared lock will be held. I actually didn't have time to study that part of the codebase yet, and I'm just basing that on what I read in some comments, documentation, and some of the code. But my understanding is that this is possibly a bad idea as it is 😬
Of course, in the use case I project using all of this for, I would use the cache locally, not concurrently, and although I implemented the `write` operation too (which basically forwards the write to the remote but also writes locally), I would actually never do a write this way (our CI/CD would write to the remote repo and not use any caching backend).

But because I am toying with the idea of the cache being a full-fledged backend, the way I do it now is, I think, flawed, and creates new opportunities for a failure of at least the local repository.
Note
As I am writing this, I realise that I do have one place in the current data flow where I could actually do something about this. I guess I can change my implementation of `lock_shared` in `impl Backend for LocalCache` and always create an exclusive lock for the local cache! This will of course make it so that a cache repo cannot be concurrently read from, but that kind of makes sense in a way. What do people think of that idea? It would bring a limitation to the system when using the cache, but I think it would make it safe to use?

EDIT: I did that change now
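The policy from that edit can be summed up in a tiny sketch (the types here are stand-ins, not rdedup's lock API): a shared lock request on the LocalCache backend translates to shared on the remote but exclusive on the local cache, because the read path may write chunks into the cache.

```rust
// Stand-in lock type; rdedup's real locking API looks different.
#[derive(Debug, PartialEq)]
enum Lock {
    Shared,
    Exclusive,
}

struct CacheLocks {
    remote: Lock,
    local: Lock,
}

// What `lock_shared` on the LocalCache meta-backend now does, per the
// edit above: readers stay concurrent on the remote, but the local
// cache is locked exclusively, since cache population mutates it.
fn lock_shared_for_local_cache() -> CacheLocks {
    CacheLocks {
        remote: Lock::Shared,
        local: Lock::Exclusive,
    }
}
```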
Encryption
I am testing everything without any encryption. My use-case is all about getting data as fast as possible to end users in a controlled, sandboxed environment so we would likely keep it disabled, and as such I did not do any test with it.
The one thing I know is currently going to be odd with encryption is that I did not refactor the code path asking for a `passphrase`, and so I am pretty sure that the fun idea of having two live backends under the hood will now result in the passphrase being asked for twice 😅 Maybe that is not a big deal, as the backends could technically have different passphrases and everything would work! But it feels like that will probably be annoying for users, and currently the CLI wouldn't even specify which backend's passphrase the user is being prompted for first. So, there's definitely room for improvement there 🙃

Feedback request
I hope this wasn't too much of a long read! If anyone would be so kind, here are the key points I would love to get feedback on, or simply open a discussion about:

- The `LocalCache` backend. Does it make sense to others? Do we think this can work, and is it worth finding ways to solve the current shortcomings with an approach like this? Or is this a naive approach, and would we be better suited implementing a bespoke cache feature, weaved into the different abstraction layers of the codebase?

Final question and thoughts
I read somewhere the original author mentioning that, as his use case was always about creating storage-efficient backups, the `load` operation was never worked on as hard as the `store`, and so he assumed there was room for performance improvements there. Does anyone have any insights on that? If I ever follow through with my current designs to use this as a data-sharing tool, the `load` code path is actually much more critical to me than the `store` one 😄 If I need to, I am sure I can someday dive in and figure some stuff out, but any insights anyone has on where the current bottlenecks are and what could be done about them would be appreciated 🙂

I'm also interested in any high-level/conceptual overview of `rdedup`'s architecture and design, especially if anyone has more experience with it than I do and thinks they can highlight key areas that I did not seem to know about and that could help me in my goals 🙂

Though, I have to say, between it being implemented in Rust, the great architectural work that has been done, and the really interesting blog post about the "fearless concurrency" refactor by the original author, it has been so far quite comfortable to tinker within `rdedup`, which I think is quite an achievement!
I read the original author wasn't really working on this anymore and I do not know if anyone else is; so if no one ends up interacting with this probably-too-long writeup, it's OK! My thoughts needed to go somewhere and if anything this will serve as my personal notes for this project 😄