Make a non-self-describing serde format (like
bincode, bitcode or
postcard) behave like a self-describing one by
transparently serializing a schema alongside the data.
In the simplest case you just add a newtype-wrapper around your data
let bytes = bitcode::serialize(&original)?;
let roundtripped: Original = bitcode::deserialize(&bytes)?;becomes
let bytes = bitcode::serialize(&SelfDescribed(&original))?;
let SelfDescribed(roundtripped) =
bitcode::deserialize::<SelfDescribed<Original>>(&bytes)?;Non-self-describing formats are generally faster and more compact than their self-describing counterparts. In exchange, however, they lose out on a really important feature: schema evolution.
When saving data using e.g. serde_json, you can later deserialize it using a
different type definition than the one used when serializing it. In particular,
you can:
- Re-order fields & variants.
- Add fields & variants with
#[serde(default)]and#[serde(skip_serializing_if)]. - Rename and alias fields & variants for backwards compatibility.
- Change field types by adding and removing certain wrappers like
Option<>and newtypes. - Change primitive field types in ways that make sense: in-range integer types, integer to float etc.
- Change other field types backwards-compatibly, using
#[serde(untagged)], usedeserialize_anyorserde_this_or_that.
Non-self-describing formats break if you do any of these things. This crate is a middle-ground, letting you recover all of these features, while not losing out on all the benefits of your favourite non-self-describing format.
use serde::{Deserialize, Serialize};
use serde_describe::SelfDescribed;
// Define a type that non-self-describing formats would generally struggle
// with, using skipped fields and untagged unions.
#[derive(Debug, PartialEq, Serialize, Deserialize)]
struct TopLevel {
int32: u32,
#[serde(default, skip_serializing_if = "Vec::is_empty")]
items: Vec<Item>,
}
#[derive(Debug, PartialEq, Serialize, Deserialize)]
#[serde(untagged)]
enum Item {
Int(u32),
Str(String),
}
// Serialize a `Vec<TopLevel>` using the `SelfDescribed` wrapper and bitcode, a
// particularly restrictive binary format.
let original = vec![
TopLevel {
int32: 10,
items: vec![Item::Str("hello".to_owned()), Item::Int(10)],
},
TopLevel {
int32: 20,
items: Vec::new(),
},
];
let bytes = bitcode::serialize(&SelfDescribed(&original))?;
// Then deserialize using the same wrapper to use the embedded schema.
let roundtripped: SelfDescribed<Vec<TopLevel>> =
bitcode::deserialize(&bytes)?;
assert_eq!(roundtripped, original);
# Ok::<_, Box<dyn std::error::Error>>(())See the example on
SelfDescribed
for a more involved case, with actual deserialization under a different type.
At a high level, the library prioritizes serde compatibility, deserialization speed and size on the wire at the expense of serialization speed.
-
Serde compatibility. It is possible to do a better job at schema inference than this crate does if types could be decorated with additional custom
deriveattributes. However, a lot of the rust ecosystem already derivesserdetypes, making this crate much easier to drop-in. -
Deserialization Speed. Should only introduce a minimal amount of overhead compared to the underlying format. The main, unavoidable, cost comes from no longer dispatching on field and variant indices, instead using their names. Otherwise, the only additional operations performed are validation and dispatching based on a relatively small, cache-friendly schema object. In addition to the schema object itself, the only other data the wrapper adds is discriminants to tag dynamically shaped data (enums, tagged or otherwise, and structs with skipped fields).
-
Size. Given a compact underlying format, the
SelfDescribedrepresentation should end up more compact than typical self-describing formats for large enough objects, as type information (shapes, field names and variant names) is deduplicated across the entire object. See the Advanced Usage section on how to go further and amortize the schema overhead across multiple objects as well. -
Serialization Speed. The library is optimized for data that is written once and read many times. As such, various, potentially expensive, schema optimization passes are performed to keep the schema size small (and not grow unbounded with the input data).
In not-super-extensive
benchmarks, as a rule of thumb,
you can expect serde_describe + postcard to beat other self-describing
binary formats, but have significant deserialization speed overhead compared to
bare postcard. Roughly:
- A few hundred bytes of uncompressed overhead, or ~5% for the given datasets, though this should be amortized.
- Serialization speed is around 30x slower (lol).
- Deserialization speed is around 1.5-2x slower.
Compared to a binary self-describing format like serde_cbor
- ~50-150% smaller objects uncompressed.
- ~20% smaller objects zstd-compressed.
- Deserialization speed is 10-20% faster.
Benchmarks can be run in serde_describe_benchmark_suite with criterion for
speed and cargo run benchmark_sizes for sizes. There is still a lot of room
for improvement both in benchmark methodology, and in optimizations both for
speed and schema size (as this dominates the space overhead).
The crate's current implementation has a few limitations. None of these are fundamental architectural constraints and they could all be removed backwards-compatibly if there is sufficient demand / motivation to do so.
- Objects must be < ~4GiB in size. More precisely, the various indices used internally by the library need to fit in 32-bit unsigned integers
- Structs can have at most 64 skippable fields. These are fields that, in
any given trace, appear as both present and absent. Fields that are always
present or always absent (irrespective of their
#[serde(skip_serializing_if)]attributes) do not count towards this limit. - No
no_stdsupport. could probably be added using heapless. - MSRV is recent. This is because I like using new features; if anyone has a requirement for an older MSRV, I'm easily persuadable.
- Untrusted input. The crate
forbid-s unsafe, and avoids input-dependent panics input, but DoS protections are left to the underlying format which should limit array sizes, maximum depth etc. It would be possible (and maybe desirable) forserde_describeto do this itself.
While pre-1.0 the format will change between 0.x to 0.y versions. Post 1.0,
the plan is to support the wire format practically forever, even across major
versions. Components are in place to do this already, but the burden to do so
at this point doesn't seem worth it.
As shown above, the simplest way to use this library is to simply wrap your
data in SelfDescribed. However, if you're storing many objects of the same
type and you need to deserialize them independently (so you can't simply
serialize a Vec<Item>), then saving each one as a separate SelfDescribed
object adds unnecessary overhead due to repeatedly serializing the schema with
each object
use serde::{Deserialize, Serialize};
use serde_describe::SelfDescribed;
#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct Item {
id: u64,
#[serde(default, skip_serializing_if="Option::is_none")]
tag: Option<String>,
}
let bytes1 = postcard::to_stdvec(
&SelfDescribed(Item { id: 1, tag: Some("foo".to_owned()) }),
)?;
let bytes2 = postcard::to_stdvec(
&SelfDescribed(Item { id: 2, tag: None }),
)?;
// Quite a few bytes per item, despite the objects' small size:
assert_eq!(bytes1.len(), 41);
assert_eq!(bytes2.len(), 34);
assert_eq!(bytes1.len() + bytes2.len(), 75);
let SelfDescribed(item1) =
postcard::from_bytes::<SelfDescribed<Item>>(&bytes1)?;
let SelfDescribed(item2) =
postcard::from_bytes::<SelfDescribed<Item>>(&bytes2)?;
assert_eq!([item1.id, item2.id], [1, 2]);
# Ok::<(), Box<dyn std::error::Error>>(())To avoid this overhead, serde_describe supports the SchemaBuilder API
which, while more complicated to use, makes schema serialization and
deserialization explicit and separate from the data
use serde::{Deserialize, Serialize};
use serde_describe::{SchemaBuilder, DescribedBy, Schema};
#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct Item {
id: u64,
#[serde(default, skip_serializing_if="Option::is_none")]
tag: Option<String>,
}
// A `SchemaBuilder` is used to trace multiple values, before serializing them
// and the inferred schema that describes all of them.
let mut builder = SchemaBuilder::new();
let trace1 = builder.trace(&Item { id: 1, tag: Some("foo".to_owned()) })?;
let trace2 = builder.trace(&Item { id: 2, tag: None })?;
let schema = builder.build()?;
// Save the three objects without the schema, and the schema separately.
let bytes1 = postcard::to_stdvec(&schema.describe_trace(trace1))?;
let bytes2 = postcard::to_stdvec(&schema.describe_trace(trace2))?;
let bytes_schema = postcard::to_stdvec(&schema)?;
// Each individual serialized object is now very small and the combined size of
// the two objects + schema is still smaller, as the schema is only saved once.
assert_eq!(bytes1.len(), 6);
assert_eq!(bytes2.len(), 2);
assert_eq!(bytes_schema.len(), 38);
assert_eq!(bytes1.len() + bytes2.len() + bytes_schema.len(), 46);
// To deserialize the data, first load the schema, then use it to deserialize
// `DescribedBy` objects for each of the items. Behind the scenes this uses
// `DeserializeSeed` so the underlying format needs to have a public API that
// supports it.
let schema = postcard::from_bytes::<Schema>(&bytes_schema)?;
let item1: Item = schema.deserialize_described(
&mut postcard::Deserializer::from_bytes(&bytes1),
)?;
let item2: Item = schema.deserialize_described(
&mut postcard::Deserializer::from_bytes(&bytes2),
)?;
assert_eq!([item1.id, item2.id], [1, 2]);
// You can also get a human-readable view of the inferred schema for debugging
// purposes.
assert_eq!(
format!("{:#}", schema),
"Item_4 { id: u64_1, tag?: some_3(str_2) }"
);
# Ok::<(), Box<dyn std::error::Error>>(())You can completely skip this section unless you're curious about the internal implementation of the library.
Serializing anything with serde_describe is a two-pass process
(SelfDescribed simply does both of these two phases for you)
- A serializer runs on the given value producing a
Traceand aSchema. - The underlying serializer is used to save the pair of
(schema, transform(schema, trace)).
At deserialization time, a Deserializer is built which wraps the underlying
deserializer and walks the schema graph performing coercions and calling the
appropriate serde::de::Visitor methods.
We'll tackle the trace, schema and transform in turn.
The Trace is a replayable representation of all the serde::Serializer calls
made on the internal serializer (serialize_u32, serialize_struct,
serialize_struct(field_name, ...) etc).
The upside of using this intermediate representation is that it side-steps any
risks of non-deterministic serialization (the same type resulting in two
different---potentially equivalent---sequences of Serializer calls if called
twice) and its structure is chosen such that matching back to the Schema
later is simpler. The downside is that it potentially doubles the amount of
memory required during serialization, and it is almost certainly slower than
serializing the same type again would be.
The schema contains interned type, field and variant names, as well as a
graph for the finite state machine describing how to parse the stored object.
Importantly, it is a representation of the serialized data, not of the actual abstract type definition from Rust. So if certain kinds of data are never observed, information about them isn't serialized, saving space. The upshot is that this makes a particular schema only useful with the data that was traced to generate it, not generally useful for any value of given type.
To peek behind the curtain and bring the point home let's look first at a human human-readable description of the inferred schema for a non-trivial object.
use serde::{Deserialize, Serialize};
use serde_describe::{SchemaBuilder, Schema, SelfDescribed};
#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct Parent {
#[serde(default, skip)]
always_skipped_by_serde: Child,
#[serde(default, skip_serializing_if = "Option::is_none")]
always_skipped_but_data_dependent: Option<Child>,
#[serde(default, skip_serializing_if = "Vec::is_empty")]
never_skipped: Vec<Child>,
#[serde(default, skip_serializing_if = "Option::is_none")]
sometimes_skipped: Option<u32>,
tag: Option<String>,
}
#[derive(Serialize, Deserialize, PartialEq, Default, Debug)]
enum Child {
#[default]
NoData,
Unused(u32),
WithData {
mixed: Option<String>,
always_some: Option<String>,
always_none: Option<String>,
}
}
let data = vec![
Parent {
always_skipped_by_serde: Child::NoData,
always_skipped_but_data_dependent: None,
never_skipped: vec![
Child::WithData {
mixed: None,
always_some: Some("value1".to_owned()),
always_none: None,
},
Child::WithData {
mixed: None,
always_some: Some("value2".to_owned()),
always_none: None,
},
Child::NoData,
],
sometimes_skipped: Some(1234),
tag: None,
},
Parent {
always_skipped_by_serde: Child::NoData,
always_skipped_but_data_dependent: None,
never_skipped: vec![
Child::WithData {
mixed: Some("value3".to_owned()),
always_some: Some("value4".to_owned()),
always_none: None,
},
],
sometimes_skipped: None,
tag: Some("tag_value".to_owned()),
},
];
// Print the serialized `SelfDescribed` object as RON making it easier to
// inspect. We'll look at this later.
eprintln!(
"{}",
ron::ser::to_string_pretty(
&SelfDescribed(&data),
ron::ser::PrettyConfig::default(),
).unwrap(),
);
// The `display_for_value` method on `Schema` does the work of tracing the
// value and converts it to a human-readable format. The output isn't
// stable, but is useful for debugging.
assert_eq!(
format!("{:#}", Schema::display_for_value(&data).unwrap()),
"\
seq_12(
Parent_11 {
always_skipped_but_data_dependent?: !,
never_skipped: seq_8(
union_7(
Child::WithData_5 {
mixed: union_4(none_1, some_3(str_2)),
always_some: some_3(..),
always_none: none_1,
},
Child::NoData_6,
),
),
sometimes_skipped?: some_10(u32_9),
tag: union_4(..),
},
)",
);Each type gets displayed with a _<id> suffix which maps to a unique node id
in the FSM graph. This way, the mostly tree-like schema can be displayed as
such, even though in reality it has cycles and repeated nodes. Children of
previously expanded nodes nodes are simply shown as ..; see for instance
union_4(..), not repeating the previous union_4(none_1, some_3(str_2)).
Getting back to the schema itself:
- Top-level, we see it's a
seq(sequence) ofParentstructs (braces mean struct). - The
always_skipped_by_serdefield doesn't show up at all. Its presence will not take up any space in serialized object - The
always_skipped_but_data_dependentfield shows up as optional (suffixed with?), but its type is!(the never type, like in rust). Its presence will not take up any space in serialized objects because it's inferred to never be present. - The
never_skippedfield, even though it /could/ be skipped byserdesimply shows up as "required" (the same astag), because it is always present in the traced data. - The
sometimes_skippedfield is sometimes persent and sometimes absent. As such, its presence will require 1 bit per serialized object. More detail on how this is actually represented in the transform section below. - The type of
sometimes_skippedissome_10(u32_9); because the value is neverNone, we don't actually need a discriminant for this type; when present, it's alwaysSome. - Unlike
sometimes_skipped,tagis sometimesNonesometimesSome, so we'll store it as a discriminated union betweenNoneandSome(String).Option-s are not treated in any special way, instead they act like any other unit struct and newtype struct respectively.
Looking at the children in never_skipped, note that the schema if unified
across different values of Parent and Child. Also note that Child::Unused
doesn't appear in union_7, because it's never serialized.
The transform function converts the object into a subset of serde features
that are supported by non-self-describing formats. In particular:
- Unions (as seen above) are used to turn heterogeneous sequences and maps into homogeneous collections of discriminated enums.
- Union discriminants rely on variant index, with original enum variant names deduplicated and stored in the schema.
- Structs are turned into tuples with field names, types and order deduplicated and stored in the schema.
- Structs (and struct variants) with skippable fields are turned into enums with the discriminant encoding a bitset of present fields. As mentioned above, bits are only used for fields that are sometimes present; fields that always absent or always present do not use up a bit.
- Discriminants with more than 256 values are chunked into nested enums of at
most 256 variants. This is because
serdeinsists on'staticvariant names. So the crate ships with a static array of variants named_00through to_ffwhich can be used any times they're required (this is internal only, matching to user types happens on the original, real enum variant names)
This is what the RON dump of the object from the Schema example, annotated
with comments and cleaned up a bit.
// seq_12
[
// Parent_11::_01 (discriminant means the `sometimes_skipped` field is present)
_01(
// .never_skipped: seq_8
[
// union_7::_00 => Child::WithData_5
_00(
// mixed: union_4::_00 => None
_00,
// always_some: some_3(str_2)
"value1", // (no Some("...") or discriminant, as it's never None)
),
// union_7::_00 => Child::WithData_5
_00(_00, "value2"),
// union_7::_01 => Child::NoData
_01,
],
// .sometimes_skipped: some_10
1234, // (no discriminant as it's never None when present)
// .tag: union_4
_00, // _00 => none_1
),
// Parent_11::_00 (`sometimes_skipped` absent)
_00(
// .never_skipped:
[
// union_7::_00 => Child::WithData_5
_00(
// .mixed: union_4::_01 => Some(String)
_01("value3"),
// .always_some: some_3(str_2)
"value4",
),
],
// tag: union_4
_01("tag_value"), // ::_01 => some_3(str_2)
),
]