New feature: allow passing context during deserialization

Consider a type definition that has changed during the evolution of the application:

```ocaml
(** An old version of the data structure. *)
module Old = struct
  type t = { foo: bar } [@@deriving bin_io]
end

(** A new version of the data structure. *)
module New = struct
  type t = { foo: bar; sna: fu option } [@@deriving bin_io]
end
```

Consider a scenario in which an old version of the application wrote some binary data to storage using `Old.bin_write_t` and a new version of the application reads it back with `New.bin_read_t`. If we're lucky, we'll end up raising an exception at some point while reading our buffer, but if we're not, we'll end up with some broken data structure in memory.

Without guardrails, this is the default behavior for `bin_prot`. As of this writing, the best way to avoid this is to store some version information (or some `bin_shape`) alongside the data, detect whether the type definitions have changed between serialization and deserialization, and fail deserialization in such cases. However, for many applications, this is not acceptable, as **this would cause user data loss**. In fact, this problem affects an experimental branch of a fairly high-profile OCaml application I'm currently contributing to.
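For reference, the "store a version and fail on mismatch" workaround can be sketched as follows. This is a toy illustration using only the standard library, not `bin_prot` itself: the one-byte tag, `current_version`, `write_versioned`, `read_versioned` and `Version_mismatch` are all hypothetical names chosen for the example.

```ocaml
(* Toy stand-ins for a bin_prot buffer and position (assumption: plain bytes). *)
type buf = Bytes.t
type pos_ref = int ref

exception Version_mismatch of { expected : int; found : int }

(* Hypothetical schema version of the running application. *)
let current_version = 2

(* Write a one-byte version tag followed by the payload. *)
let write_versioned (payload : string) : buf =
  let b = Bytes.create (1 + String.length payload) in
  Bytes.set_uint8 b 0 current_version;
  Bytes.blit_string payload 0 b 1 (String.length payload);
  b

(* Read the tag first and refuse to decode the payload when it does not match. *)
let read_versioned (b : buf) ~(pos_ref : pos_ref) : string =
  let found = Bytes.get_uint8 b !pos_ref in
  incr pos_ref;
  if found <> current_version then
    raise (Version_mismatch { expected = current_version; found });
  let len = Bytes.length b - !pos_ref in
  let s = Bytes.sub_string b !pos_ref len in
  pos_ref := !pos_ref + len;
  s
```

The reader is safe, but anything written by an older version becomes unreadable, which is exactly the data-loss problem described above.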

For most applications, a migration procedure would be much preferable. In this example, that means a version of `New.bin_read_t` that can be told it is reading an `Old.t` and can decide to set `sna = None` instead of producing random data or failing.

This proposal introduces a few tweaks to `bin_prot` and `ppx_bin_prot` to allow such migrations:

- the API of `bin_prot` is changed a bit;
- the serialized format remains unchanged.

# Proposal

Change the definition from

```ocaml
type 'a reader = buf -> pos_ref:pos_ref -> 'a
```

to

```ocaml
type ('a, 'ctx) reader = ctx:'ctx -> buf -> pos_ref:pos_ref -> 'a
```

where `'ctx` is a context used to guide deserialization. In particular, for the problem at hand, user code would ensure that `'ctx` contains information on the version of the schema used to write the data, and use this to guide transparent data migrations.
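To make the shape of the new type concrete, here is a minimal sketch that does not depend on `bin_prot`. It assumes a toy buffer made of `Bytes.t`; `read_byte`, `read_pair` and `read_byte_or_default` are hypothetical helpers for illustration, not part of any existing API.

```ocaml
type buf = Bytes.t
type pos_ref = int ref

(* The proposed shape: every reader receives a caller-defined context. *)
type ('a, 'ctx) reader = ctx:'ctx -> buf -> pos_ref:pos_ref -> 'a

(* A primitive reader that ignores its context, so it works for any ['ctx]. *)
let read_byte : (int, 'ctx) reader =
 fun ~ctx:_ buf ~pos_ref ->
  let v = Bytes.get_uint8 buf !pos_ref in
  incr pos_ref;
  v

(* A combinator threads the same context to both sub-readers. *)
let read_pair (ra : ('a, 'ctx) reader) (rb : ('b, 'ctx) reader) :
    ('a * 'b, 'ctx) reader =
 fun ~ctx buf ~pos_ref ->
  let a = ra ~ctx buf ~pos_ref in
  let b = rb ~ctx buf ~pos_ref in
  (a, b)

(* A context-sensitive reader: the stored version decides whether a byte
   is consumed or a default value is injected. *)
type ctx = { stored_version : int }

let read_byte_or_default : (int, ctx) reader =
 fun ~ctx buf ~pos_ref ->
  if ctx.stored_version >= 2 then read_byte ~ctx buf ~pos_ref else 0
```

Context-agnostic readers such as `read_byte` compose freely with readers that require a specific context, since `'ctx` simply unifies with `ctx` at the use site.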

A few changes will also be needed to `ppx_bin_prot` to generate simple versions of such `reader`s.

# Example use

```ocaml
(** A context used to deserialize a buffer, containing in particular information on which version of the storage schema was used. *)
type ctx = {
  stored_version: int;
  (** Saving and restoring this information is the responsibility of the application, not bin_prot *)
  ...
}

(**
 The latest definition of our data schema.

 Business logic tells us that if `sna` is somehow not stored, we should inject `None`.
*)
module New = struct
  type t = { foo: bar; sna: fu option } [@@deriving bin_read ~ctx:ctx, bin_write]
end

(**
  An older version of our data schema.

  Kept in the code for the sake of migrations, but shouldn't be used otherwise.
 *)
module Old = struct
  type t = { foo: bar } [@@deriving bin_read]
end

(** The type we're interested in serializing/deserializing *)
type t = New.t = { foo: bar; sna: fu option } [@@deriving bin_write]

(** Deserialization code for `t`, written manually. *)
let bin_read_t ~ctx buf ~pos_ref =
  if ctx.stored_version = latest_version then
    (* Happy path, we can immediately use the code generated by `ppx_bin_prot`. *)
    New.bin_read_t ~ctx buf ~pos_ref
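  else if ctx.stored_version = older_version then
    (* Perform migration, using code generated by `ppx_bin_prot`. *)
    let old = Old.bin_read_t ~ctx buf ~pos_ref in
    {
      foo = old.Old.foo;
      sna = None
    }
  else
    (* Every branch must return a `t`; how to handle unknown versions is up to the application. *)
    failwith "bin_read_t: unsupported stored_version"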
```

Note that this version of `bin_read_t` works even if, for instance, the definition of `bar` has also changed.
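Since the example above relies on the proposed `ppx_bin_prot` extension, here is a hand-written variant that compiles today with the standard library alone. It is only a sketch under simplifying assumptions: `foo` is stored as a single byte, `sna` as a tag byte followed by an optional byte, and the version numbers `1` and `2` are made up for the illustration.

```ocaml
type buf = Bytes.t
type pos_ref = int ref
type ctx = { stored_version : int }

let read_byte buf ~pos_ref =
  let v = Bytes.get_uint8 buf !pos_ref in
  incr pos_ref;
  v

module Old = struct
  type t = { foo : int }

  let bin_read_t ~ctx:_ buf ~pos_ref = { foo = read_byte buf ~pos_ref }
end

module New = struct
  type t = { foo : int; sna : int option }

  let bin_read_t ~ctx:_ buf ~pos_ref =
    let foo = read_byte buf ~pos_ref in
    let sna =
      match read_byte buf ~pos_ref with
      | 0 -> None
      | _ -> Some (read_byte buf ~pos_ref)
    in
    { foo; sna }
end

(* Dispatch on the stored version and migrate old data on the fly. *)
let bin_read_t ~ctx buf ~pos_ref : New.t =
  match ctx.stored_version with
  | 2 -> New.bin_read_t ~ctx buf ~pos_ref
  | 1 ->
    let old = Old.bin_read_t ~ctx buf ~pos_ref in
    { New.foo = old.Old.foo; sna = None }
  | v -> failwith (Printf.sprintf "unsupported schema version %d" v)
```

The same bytes-on-disk layout is read in both cases; only the context tells the reader which layout to expect.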

# Detailed list of changes proposed

- Rewrite `bin_prot` to change the definition of `'a reader`, as above.
- All the built-in readers (e.g. `read_int`) should become polymorphic in the context, with type `(foo, 'ctx) reader`, and ignore their `ctx` argument.
- Add an argument `ctx` to `ppx_bin_prot` to let users specify the type of the context. If unspecified, use `'_`.
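One way to keep existing code working during such a transition would be a pair of adapters between the two signatures. The sketch below is an assumption about how this could look, not something `bin_prot` provides: `legacy_reader`, `lift` and `lower` are hypothetical names.

```ocaml
type buf = Bytes.t
type pos_ref = int ref

(* The current, context-free signature and the proposed one. *)
type 'a legacy_reader = buf -> pos_ref:pos_ref -> 'a
type ('a, 'ctx) reader = ctx:'ctx -> buf -> pos_ref:pos_ref -> 'a

(* Wrap a context-free reader so it fits the new signature for any ['ctx]. *)
let lift (r : 'a legacy_reader) : ('a, 'ctx) reader =
 fun ~ctx:_ buf ~pos_ref -> r buf ~pos_ref

(* Recover the old calling convention by fixing the context to unit. *)
let lower (r : ('a, unit) reader) : 'a legacy_reader =
 fun buf ~pos_ref -> r ~ctx:() buf ~pos_ref

(* A toy context-free reader used to exercise the adapters. *)
let legacy_read_byte : int legacy_reader =
 fun buf ~pos_ref ->
  let v = Bytes.get_uint8 buf !pos_ref in
  incr pos_ref;
  v
```

Because a lifted reader is polymorphic in `'ctx`, it can be used alongside readers that need a concrete context without any further plumbing.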


# Beyond this

Note that passing `ctx` would have applications beyond data migrations. For instance, users could take advantage of this to collect statistics during deserialization without resorting to thread-unsafe global variables and side-effects.
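As a rough illustration of the statistics use case, the context can carry mutable counters that are local to one deserialization call. The `stats` record and the toy `read_byte` and `read_list` readers below are hypothetical and use a plain `Bytes.t` buffer rather than `bin_prot`.

```ocaml
type buf = Bytes.t
type pos_ref = int ref

(* Hypothetical context carrying per-call counters instead of global state. *)
type stats = { mutable bytes_read : int; mutable lists_read : int }

let read_byte ~(ctx : stats) (buf : buf) ~(pos_ref : pos_ref) : int =
  let v = Bytes.get_uint8 buf !pos_ref in
  incr pos_ref;
  ctx.bytes_read <- ctx.bytes_read + 1;
  v

(* A length-prefixed list of bytes; the context is threaded to every element. *)
let read_list ~(ctx : stats) (buf : buf) ~(pos_ref : pos_ref) : int list =
  let n = read_byte ~ctx buf ~pos_ref in
  let rec go acc k =
    if k = 0 then List.rev acc
    else go (read_byte ~ctx buf ~pos_ref :: acc) (k - 1)
  in
  ctx.lists_read <- ctx.lists_read + 1;
  go [] n
```

Each caller owns its own `stats` value, so two threads deserializing at the same time do not share any counters.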

We could also introduce `ctx` into `bin_write` and `bin_size` which could be used e.g. to collect statistics or pick from several possible formats. This goes beyond the scope of the current proposal.

There may be a way to generate the migration code through `[@@deriving ...]` (and in fact, I have a few ideas for how to do this), but this is beyond the scope of the current proposal.
