-
Notifications
You must be signed in to change notification settings - Fork 14
Preparatory splitstream format changes for ostree support #185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3803dc3
to
9aedd96
Compare
|
||
use crate::{sha256_from_descriptor, sha256_from_digest, tar::split_async, ContentAndVerity}; | ||
|
||
pub const TAR_LAYER_CONTENT_TYPE: u64 = 0x2a037edfcae1ffea; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment for where these came from? I'm guessing random? If so just add a comment // Random unique ID
?
That said I wonder if it wouldn't be nicer to store (variable length) strings for this in the format? Maybe it could go all the way to literally suggested to be the mediaType from OCI (if applicable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. These are random.
But, I'd rather avoid having variable length things in the header. That makes parsing it much more tricky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make it a real uuid tho
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine keeping as u64 too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments here
crates/composefs/src/repository.rs
Outdated
pub fn has_named_stream(&self, name: &str) -> bool { | ||
let stream_path = format!("streams/refs/{}", name); | ||
|
||
readlinkat(&self.repository, &stream_path, []).is_ok() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like "swallowing" errors like this, I'd say call stat
instead and require it's S_IFLNK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I redid this with stat()
#[derive(Clone, Debug, FromBytes, Immutable, IntoBytes, KnownLayout)] | ||
#[repr(C)] | ||
pub struct MappingEntry { | ||
pub body: Sha256Digest, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're changing the format here...I think it'd be nice to make this one extensible.
However...bigger picture there's another consideration: There's obviously a metric ton of binary serialization formats out there. A custom one isn't wrong necessarily but...how about say CBOR ? It has some usage and a proper RFC etc.
I guess a dividing line is "do we care about mmap()"? Probably not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure its overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.
That said, I agree that we should make it at least a bit extensible. This MR adds a magic header, but also adding a version field and a few bytes of unused/unparsed space does seem quite useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed mmap, and the end result was, no, we don't want it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure its overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.
Yeah, though the nice thing about Rust here is that for stuff like this there's a lot of well-done crates.
It also makes it a lot more obvious and easy to parse from other languages too if we can say "it's just CBOR" (or whatever).
Anyways: I'm basically fine with this as is too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although what about the algorithm agility? There's been some thoughts that for post-quantum crypto we may need to get away from sha256 in theory as far as I understand things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the right thing is to add a header size field, and skip parts we don't understand. Then we can easily extend this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new extension_size field which we skip on read.
9aedd96
to
057121b
Compare
This changes the splitstream format a bit, with the goal of allowing splitstreams to support ostree files as well (see containers#144) The primary differences are: * The header is not compressed * All referenced fs-verity objects are stored in the header, including external chunks, mapped splitstreams and (a new feature) references that are not used in chunks. * The mapping table is separate from the reference table (and generally smaller), and indexes into it. * There is a magic value to detect the file format. * There is a magic content type to detect the type wrapped in the stream. * We store a tag for what ObjectID format is used * The total size of the stream is stored in the header. The ability to reference file objects in the repo even if they are not part of the splitstream "content" will be useful for the ostree support to reference file content objects. This change also allows more efficient GC enumeration, because we don't have to parse the entire splitstream to find the referenced objects. Signed-off-by: Alexander Larsson <[email protected]>
057121b
to
bed66dc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's talk about this. I might have a bit of bandwidth to work on this if you like.
struct SplitstreamHeader { | ||
magic: [u8; 7], // Contains SPLITSTREAM_MAGIC | ||
algorithm: u8, // The fs-verity algorithm used, 1 == sha256, 2 == sha512 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also do fs-verity block size here. That's usually expressed as a bit-shift count, so 12 or 16...
We could also write it like "fsverity-sha256-12" or so as a string... some relevant discussion in #181.
n_refs: u64, | ||
n_mappings: u64, | ||
refs: [ObjectID; n_refs] // sorted | ||
mappings: [MappingEntry; n_mappings] // sorted by body |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's so sketch that we hardcode sha256 here... I think that's probably OK, but maybe we'd add an extension mechanism so we could add new types of mappings tables...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this from the start:
- magic
- n_sections
- n_sections * (
- section_start
- section_size_in_bytes
- )
We could name the sections but I think it's quite OK to just know what the numbers mean and require that they're all present, in order. An empty section would be denoted by a zero size.
We could also get into compat vs. uncompat extensions... not sure how far it's worth going here...
This changes the splitstream format a bit, with the goal of allowing splitstreams to support ostree files as well (see #144), but it it imho a generally nice change.
The primary differences are:
The ability to reference file objects in the repo even if they are not part of the splitstream "content" will be useful for the ostree support to reference file content objects.
This change also allows more efficient GC enumeration, because we don't have to parse the entire splitstream to find the referenced objects.