New feature: allow passing context during deserialization #33
Description
Consider a type definition that has changed as the application evolved:
(** An old version of the data structure. *)
module Old = struct
type t = { foo: bar } [@@deriving bin_io]
end
(** A new version of the data structure. *)
module New = struct
type t = { foo: bar; sna: fu option } (* new version *) [@@deriving bin_io]
end
Consider a scenario in which an old version of the application wrote some binary data to storage using Old.bin_write_t and a new version of the application reads it back with New.bin_read_t. If we're lucky, an exception is raised at some point while reading the buffer; if we're not, we end up with a broken data structure in memory.
Without guardrails, this is the default behavior of bin_prot. As of this writing, the best way to avoid it is to store some version information (or some bin_shape) alongside the data, detect whether the type definitions have changed between serialization and deserialization, and fail deserialization in such cases. However, for many applications this is not acceptable, as it would cause user data loss. In fact, this problem affects an experimental branch of a fairly high-profile OCaml application I'm currently contributing to.
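To make the "store a version and fail on mismatch" workaround concrete, here is a minimal sketch. It uses only the OCaml standard library and a toy one-byte version prefix. It is not bin_prot's API, and `current_version`, `write_versioned`, `read_versioned` and `Schema_mismatch` are names made up for illustration.

```ocaml
(* Toy model using only the standard library; this is not bin_prot's API. *)
exception Schema_mismatch of { expected : int; found : int }

(* Schema version understood by this build of the application (hypothetical). *)
let current_version = 2

(* Prefix the serialized payload with a one-byte schema version. *)
let write_versioned (payload : string) : string =
  String.make 1 (Char.chr current_version) ^ payload

(* Check the version byte before handing the payload to any reader. *)
let read_versioned (stored : string) : string =
  if String.length stored = 0 then invalid_arg "read_versioned: empty input";
  let found = Char.code stored.[0] in
  if found <> current_version then
    raise (Schema_mismatch { expected = current_version; found })
  else String.sub stored 1 (String.length stored - 1)
```

Data written by an older build is rejected outright here, which is exactly the data-loss behavior this issue is trying to avoid.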
For most applications, a migration procedure would be much preferable: in this example, a version of bin_read_new_t that could be informed that it's reading an old_t and decide to introduce sna = None instead of random data or a failure.
This proposal introduces a few tweaks to bin_prot and ppx_bin_prot to allow such migrations:
- the API of bin_prot is changed a bit;
- the serialized format remains unchanged.
Proposal
Change the definition from
type 'a reader = buf -> pos_ref:pos_ref -> 'a
to
type ('a, 'ctx) reader = ctx:'ctx -> buf -> pos_ref:pos_ref -> 'a
where 'ctx is a context used to guide deserialization. In particular, for the problem at hand, user code would ensure that 'ctx contains information on the version of the schema used to write the data, and use this to guide transparent data migrations.
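The following sketch shows how the proposed reader shape could behave in practice. It is a toy model built on the standard library: `buf` is assumed to be `Bytes.t` here (bin_prot's real buffer is a Bigarray), and `read_byte` and `read_pair` are hypothetical names, not existing bin_prot functions.

```ocaml
(* Toy model with the standard library only; not bin_prot's actual types. *)
type buf = Bytes.t
type pos_ref = int ref

(* The proposed shape: every reader receives a caller-chosen context. *)
type ('a, 'ctx) reader = ctx:'ctx -> buf -> pos_ref:pos_ref -> 'a

(* A primitive reader stays polymorphic in 'ctx because it never inspects it. *)
let read_byte : (int, 'ctx) reader =
 fun ~ctx:_ buf ~pos_ref ->
  let v = Char.code (Bytes.get buf !pos_ref) in
  incr pos_ref;
  v

(* A combinator threads the same context to both of its sub-readers. *)
let read_pair (ra : ('a, 'ctx) reader) (rb : ('b, 'ctx) reader) :
    ('a * 'b, 'ctx) reader =
 fun ~ctx buf ~pos_ref ->
  let a = ra ~ctx buf ~pos_ref in
  let b = rb ~ctx buf ~pos_ref in
  (a, b)
```

Because primitive readers ignore `ctx`, they compose with any context type the application picks, so only readers that actually need the context constrain `'ctx`.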
A few changes will also be needed to ppx_bin_prot to generate simple versions of such readers.
Example use
(** A context used to deserialize a buffer, containing in particular information on which version of the storage schema was used. *)
type ctx = {
stored_version: int;
(** Saving and restoring this information is the responsibility of the application, not bin_prot *)
...
}
(**
The latest definition of our data schema.
Business logic tells us that if `sna` is somehow not stored, we should inject `None`.
*)
module New = struct
type t = { foo: bar; sna: fu option } [@@deriving bin_read ~ctx:ctx, bin_write]
end
(**
An older version of our data schema.
Kept in the code for the sake of migrations, but shouldn't be used otherwise.
*)
module Old = struct
type t = {foo: bar} [@@deriving bin_read]
end
(** The type we're interested in serializing/deserializing *)
type t = New.t = { foo: bar; sna: fu option } [@@deriving bin_write]
(** Deserialization code for `t`, written manually. *)
let bin_read_t ~ctx buf ~pos_ref =
if ctx.stored_version = latest_version then
(* Happy path, we can immediately use the code generated by `ppx_bin_prot`. *)
New.bin_read_t ~ctx buf ~pos_ref
else if ctx.stored_version = older_version then
(* Perform migration, using code generated by `ppx_bin_prot`. *)
let old = Old.bin_read_t ~ctx buf ~pos_ref in
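{
foo = old.Old.foo;
sna = None
}
else
(* Any other stored version is not handled by this example. *)
failwith "bin_read_t: unsupported stored_version"
Note that this version of bin_read_t works even if, for instance, the definition of bar has also changed.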
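For readers who want to run the migration logic without the proposed ppx support, here is a self-contained sketch in a toy byte format. It assumes `foo` and `sna` are small integers stored as single bytes and encodes the option as a 0/1 tag byte. `read_byte`, `Old.read`, `New.read` and `read_t` are hand-written stand-ins for what `ppx_bin_prot` would generate.

```ocaml
(* Toy model with the standard library only; names are illustrative. *)
type ctx = { stored_version : int }

let older_version = 1
let latest_version = 2

(* Primitive reader: ignores the context, reads one byte, advances the cursor. *)
let read_byte ~ctx:_ buf ~pos_ref =
  let v = Char.code (Bytes.get buf !pos_ref) in
  incr pos_ref;
  v

module Old = struct
  type t = { foo : int }

  let read ~ctx buf ~pos_ref = { foo = read_byte ~ctx buf ~pos_ref }
end

module New = struct
  type t = { foo : int; sna : int option }

  (* Toy option encoding: a 0/1 tag byte, then the payload if present. *)
  let read ~ctx buf ~pos_ref =
    let foo = read_byte ~ctx buf ~pos_ref in
    let sna =
      if read_byte ~ctx buf ~pos_ref = 0 then None
      else Some (read_byte ~ctx buf ~pos_ref)
    in
    { foo; sna }
end

(* Dispatch on the stored schema version and migrate old data on the fly. *)
let read_t ~ctx buf ~pos_ref : New.t =
  if ctx.stored_version = latest_version then New.read ~ctx buf ~pos_ref
  else if ctx.stored_version = older_version then
    let old = Old.read ~ctx buf ~pos_ref in
    { New.foo = old.Old.foo; sna = None }
  else failwith "read_t: unsupported stored_version"
```

With this in place, a buffer holding only the byte 42 and tagged as version 1 is read as a `New.t` whose `sna` is `None`, while a version 2 buffer goes through `New.read` unchanged.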
Detailed list of changes proposed
- Rewrite bin_prot to change the definition of 'a reader, as above.
- All the built-in readers (e.g. read_int) should now be generic (foo, 'ctx) reader, ignoring their argument ctx.
- Add an argument ctx to ppx_bin_prot to let users specify the type of the context. If unspecified, use '_.
Beyond this
Note that passing ctx would have applications beyond data migrations. For instance, users could take advantage of this to collect statistics during deserialization without resorting to thread-unsafe global variables and side-effects.
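As an illustration of the statistics use case, the sketch below passes a mutable counter record as the context. It is again a toy model on the standard library: the `stats` type and the `read_byte` and `read_byte_list` functions are hypothetical and only show how a context could replace a global variable.

```ocaml
(* Toy model with the standard library only; `stats` is a hypothetical context. *)
type stats = { mutable bytes_read : int; mutable values_read : int }

(* Read one byte and record it in the caller-supplied statistics. *)
let read_byte ~(ctx : stats) buf ~pos_ref =
  let v = Char.code (Bytes.get buf !pos_ref) in
  incr pos_ref;
  ctx.bytes_read <- ctx.bytes_read + 1;
  v

(* Read a length-prefixed list of bytes, counting one value per element. *)
let read_byte_list ~(ctx : stats) buf ~pos_ref =
  let n = read_byte ~ctx buf ~pos_ref in
  let rec loop i acc =
    if i = n then List.rev acc
    else begin
      ctx.values_read <- ctx.values_read + 1;
      loop (i + 1) (read_byte ~ctx buf ~pos_ref :: acc)
    end
  in
  loop 0 []
```

Each deserialization call can be given its own `stats` value, so concurrent reads do not need to share a global counter.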
We could also introduce ctx into bin_write and bin_size which could be used e.g. to collect statistics or pick from several possible formats. This goes beyond the scope of the current proposal.
There may be a way to @@derive the generation of migration code (and in fact, I have a few ideas for how to do this), but this is beyond the scope of the current proposal.