Using SaunaFS for VM storage (RAW images vs secondary storage) #663
Hi everyone 👋 I’ve recently come across SaunaFS and I’m really impressed by the direction this project is taking! I’m currently evaluating it as part of a virtualization stack, and wanted to ask a few questions before diving deeper. My main goal is to determine whether SaunaFS could serve as:
I’d love to hear your thoughts on whether either approach is viable and what potential performance or reliability trade-offs I should be aware of. A few more specific questions:

- HA and failover:
- Documentation source: As far as I understand, SaunaFS is a fork of LizardFS.
- Replication vs Erasure Coding:
- FreeBSD compatibility:
- Prometheus metrics:

Thanks a lot for your time and for the work you’re putting into SaunaFS 🙌
Hi @yaroslav-gwit! Prepare for a lot of text:
Interesting. I'm not sure how you would use etcd or how it would differ from uraft, but I would be very interested in the results. The manual way is:
There may be other things, like `saunafs-admin promote-shadow`, but that requires the `ha-cluster-managed` option. This doesn't include things like IP/host re-assignment etc., so clients need to be either restarted and pointed to the new host, or the IP/host needs to be changed so it points to the new master. You will have to figure that out on your own (the uraft source code has some of this stuff if you are interested).
I don't know if you are talking about software versions or metadata versions, so I'll cover both for clarity.

Regarding software version mismatches: we try to avoid any major compatibility issues within a minor version (the y in vx.y.z), so versions within the same major version should work with each other. The only exception currently is that an older shadow cannot connect to a newer master. The tested upgrade procedure is: 0) metaloggers, 1) shadows, 2) master, 3) chunkservers, and 4) clients.

Regarding the metadata version: this is indeed the most important value you need to check. You should promote the metadata server with the latest metadata version. Again, you may want to consult the uraft source code for the LizardFS attempt at this.
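The promotion rule above can be sketched as a tiny Python illustration. This is not SaunaFS code, and the field names are assumptions, not the actual `saunafs-admin` output format; it only shows the decision: promote whichever shadow has the highest metadata version.

```python
# Hypothetical illustration of manual failover logic: promote the shadow
# with the newest metadata version. Field names are made up.

def pick_shadow_to_promote(shadows):
    """Return the shadow whose metadata version is highest."""
    return max(shadows, key=lambda s: s["metadata_version"])

shadows = [
    {"host": "shadow-a", "metadata_version": 10452},
    {"host": "shadow-b", "metadata_version": 10460},  # most up to date
    {"host": "shadow-c", "metadata_version": 10458},
]
print(pick_shadow_to_promote(shadows)["host"])  # shadow-b
```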
This is potentially the biggest pain right now for us: currently, the master can respond to the client with success before the metadata is replicated to other shadows/metaloggers. The size of the time window where this happens can vary, but it can happen, for example, that a file is created, the master crashes, a shadow is promoted, and you can't write to the file because it doesn't exist. It's a potential race condition. We are thinking about perhaps adding an option to the master to make sure it synchronizes with N shadows, or all of them, before responding to the client. But that isn't currently implemented and may impact performance.
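A toy Python model of that window may help; it assumes nothing about the real protocol and just shows why acknowledging before replication can lose a file on failover, and how a hypothetical "sync to N shadows before acking" option would close the gap:

```python
# Toy model of the race described above (not SaunaFS code): the master
# acknowledges a metadata change before replicating it, then crashes.

class ToyMaster:
    def __init__(self, shadows, sync_replicas=0):
        self.shadows = shadows          # each shadow is just a list of entries
        self.sync_replicas = sync_replicas

    def create(self, name):
        # Replicate synchronously to only the first `sync_replicas` shadows
        # before acking; the rest would catch up asynchronously (never, if
        # the master crashes first).
        for shadow in self.shadows[: self.sync_replicas]:
            shadow.append(name)
        return "OK"  # client sees success here

# Current behavior: ack first, replicate later.
shadow = []
master = ToyMaster([shadow], sync_replicas=0)
assert master.create("/vm/disk0.raw") == "OK"
# Master crashes here; the promoted shadow never saw the file:
assert "/vm/disk0.raw" not in shadow

# Hypothetical "sync with N shadows before responding" option:
shadow2 = []
master2 = ToyMaster([shadow2], sync_replicas=1)
master2.create("/vm/disk1.raw")
assert "/vm/disk1.raw" in shadow2
```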
You can use
Mostly yes (commands have different names and a few new options were added, but we've left most other things untouched), but please mention it in our Slack or issues if you find anything missing.
Depends. Do you have multiple VMs reading the same image in different locations (different buildings, cities, etc.)? Then I'd consider replication. Otherwise, EC is the better choice.
EC is better for parallelized IO to multiple disks/servers. This is mostly because the clients do the computation, and the Intel ISA-L library also helps heavily in this regard. Even on a normal consumer laptop it's usually very fast, so the computation should not be an issue at all when writing. Reading data with missing chunks might be slower, though. Of course, this also depends heavily on your infrastructure and your file size. Fewer drives/chunkservers also means it's less effective.
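To make the client-side computation concrete, here is the simplest possible erasure-code illustration in Python: single XOR parity over two data parts (conceptually like EC(2,1)). SaunaFS actually uses Reed-Solomon codes via Intel ISA-L, which generalize this to more data and parity parts; this toy only shows why a lost part can be rebuilt from the others.

```python
# Minimal erasure-coding sketch: one XOR parity part over two data parts.
# Not the Reed-Solomon math SaunaFS uses, just the core idea.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

data1 = b"chunk-part-one!!"
data2 = b"chunk-part-two!!"
parity = xor_bytes(data1, data2)   # computed client-side on write

# If the chunkserver holding data1 is lost, the client can rebuild it
# from the parity part and the surviving data part:
recovered = xor_bytes(parity, data2)
assert recovered == data1
```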
On large files? Yes.
Not officially supported. But I'm not against fixing things on it, especially if someone else does it :^). In all seriousness, we don't compile and test for it, but we may address issues related to it, and we also accept PRs to fix the aforementioned issues. So if there is interest from the community to support it, we will accept and try to help in running it on BSD. Regarding current compatibility: I don't know. All I know is that it probably was working on LizardFS, and maybe it will still work on SaunaFS. You'll need to figure out the dependencies/libraries etc. all by yourself. Also, you may need to edit the code with #ifdefs. You can message on Slack if you need help with that.
There is already an experimental option on master to enable Prometheus metrics (for master), but it's experimental for good reason: there might be UB with the forking in the background. I've been working on it for the past year, but that has been a major blocker and we are trying to get rid of the forking somehow.

I've also been thinking about working around that problem. Currently we are trying to use a C++ Prometheus library, which embeds a small web server into master. But I think it may be easier and more performant to work with the current network protocol interface and create a Prometheus exporter on top of it. We recently introduced an API to access our monitoring (see here), and I'm thinking about extending that with a Prometheus exporter as well, using the same techniques. I don't know when I will be working on it, but hopefully very soon.
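The "exporter on top of the monitoring API" idea can be sketched like this: fetch metrics through the existing interface, then render them in the Prometheus text exposition format. The metric names and the shape of the metrics dict below are made up for illustration; only the output format follows the Prometheus convention.

```python
# Hedged sketch of an external Prometheus exporter: render metrics
# (assumed to come from the monitoring API) as Prometheus text format.
# Metric names are hypothetical.

def to_prometheus_text(metrics):
    """metrics: {name: (help_text, value)} -> exposition-format string."""
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

metrics = {
    "saunafs_chunks_total": ("Total number of chunks.", 123456),
    "saunafs_clients_connected": ("Connected mount clients.", 7),
}
print(to_prometheus_text(metrics))
```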
We consider the binary packages and repositories to be part of the paid support we provide. We actually want to encourage people to create their own packages (for example, NixOS users have made a package, and there's a package pending in Debian FTP by yours truly). However, we currently provide packages free of charge as long as you are willing to fill out a form. It should be linked on docs.saunafs.com, but apparently it isn't currently. I'll see that it's fixed soon.