Now that we have a new binary fixed format for serialization, it would be good to eventually drop the old JSON based serialization code, since maintaining duplicate implementations slows down development, and there is a risk that the implementations will get out of sync.
At work we have some tools that consume the JSON based cache format, and it would be non-trivial to update them to directly use the binary format. As a workaround, we could provide a tool that converts a binary cache into JSON files that resemble the current JSON format (they don't need to be 100% compatible since the format isn't documented anywhere).
To implement this, we can make the binary format "self-describing", i.e. it would have enough redundancy that we can write a simple generic parser that reads arbitrary data serialized using the format and converts it into JSON, and that doesn't need to know too many details of each possible object type that can be serialized (to simplify maintenance).
Here is one possible way to make the format self-describing:
- Each value/object is serialized as `<type tag><data>`, where the format of data depends on the type tag. The type tag is an 8-bit integer.
- Simple values are also encoded as a type tag followed by data. These include integers, strings, booleans, floats and `None`.
- We can have similar generic encoding for lists and dicts, such as `<type tag for list><number of items><arbitrary value>...`.
- AST nodes would also have distinct type tags, so that it would be possible to mix simple values and AST nodes.
- I would propose that AST nodes and types would be encoded using a format like this: `<type tag><field tag><arbitrary value>...<end tag>`. Field tags are also 8-bit integers that map to JSON keys. The end tag is a reserved 8-bit number that is distinct from all type tags.
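
To make the layout concrete, here is a rough sketch of what the writer side might look like. Everything specific in it is an assumption for illustration only: the tag numbers, the fixed 8-byte integers and floats, the 4-byte length prefixes, and the `write_value`/`write_node` helpers are not part of any existing code.

```python
import struct

# Hypothetical tag values; the proposal only requires that tags are 8-bit integers.
TAG_NONE, TAG_FALSE, TAG_TRUE, TAG_INT, TAG_FLOAT, TAG_STR, TAG_LIST, TAG_DICT = range(8)
TAG_FUNC_DEF = 50  # example type tag for one AST node kind
END_TAG = 255      # reserved, distinct from every type tag

# Hypothetical field tags for the example node.
FIELD_NAME = 0
FIELD_ARG_NAMES = 1


def write_value(out: bytearray, value: object) -> None:
    """Append <type tag><data> for a simple or container value."""
    if value is None:
        out.append(TAG_NONE)
    elif isinstance(value, bool):  # checked before int, since bool subclasses int
        out.append(TAG_TRUE if value else TAG_FALSE)
    elif isinstance(value, int):
        out.append(TAG_INT)
        out += struct.pack("<q", value)  # assumed: 64-bit little-endian
    elif isinstance(value, float):
        out.append(TAG_FLOAT)
        out += struct.pack("<d", value)
    elif isinstance(value, str):
        data = value.encode("utf-8")
        out.append(TAG_STR)
        out += struct.pack("<I", len(data)) + data
    elif isinstance(value, list):
        out.append(TAG_LIST)
        out += struct.pack("<I", len(value))  # <number of items>
        for item in value:
            write_value(out, item)
    elif isinstance(value, dict):
        out.append(TAG_DICT)
        out += struct.pack("<I", len(value))
        for key, item in value.items():
            write_value(out, key)
            write_value(out, item)
    else:
        raise TypeError(f"cannot serialize {type(value).__name__}")


def write_node(out: bytearray, type_tag: int, fields: dict[int, object]) -> None:
    """Append <type tag><field tag><value>...<end tag> for an AST node."""
    out.append(type_tag)
    for field_tag, value in fields.items():
        out.append(field_tag)
        write_value(out, value)
    out.append(END_TAG)


buf = bytearray()
write_node(buf, TAG_FUNC_DEF, {FIELD_NAME: "f", FIELD_ARG_NAMES: ["x", "y"]})
```

This sketch doesn't handle nodes nested inside fields; a real writer would dispatch to `write_node` for those, which works because node type tags and simple value tags share one tag space.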
Now we can implement a generic parser. It needs to know how to parse all simple values and container values, and it must have a mapping from valid type tags for AST nodes (including type objects) and the field tags to strings.
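
A generic parser along these lines could be quite small. The sketch below assumes made-up tag numbers, 8-byte integers and floats, and 4-byte length prefixes; the `NODE_NAMES` and `FIELD_NAMES` tables and the `.class` output key are likewise just illustrative choices. The point is that the two tables are the only per-type knowledge it needs.

```python
import json
import struct

# Hypothetical tag values, chosen only for this example.
TAG_NONE, TAG_FALSE, TAG_TRUE, TAG_INT, TAG_FLOAT, TAG_STR, TAG_LIST, TAG_DICT = range(8)
END_TAG = 255

# The only type-specific knowledge: names for node type tags and field tags.
NODE_NAMES = {50: "FuncDef"}
FIELD_NAMES = {0: "name", 1: "arg_names"}


def read_value(data: bytes, pos: int) -> tuple[object, int]:
    """Decode one value at pos; return (JSON-compatible value, offset after it)."""
    tag = data[pos]
    pos += 1
    if tag == TAG_NONE:
        return None, pos
    if tag in (TAG_FALSE, TAG_TRUE):
        return tag == TAG_TRUE, pos
    if tag == TAG_INT:
        return struct.unpack_from("<q", data, pos)[0], pos + 8
    if tag == TAG_FLOAT:
        return struct.unpack_from("<d", data, pos)[0], pos + 8
    if tag == TAG_STR:
        (n,) = struct.unpack_from("<I", data, pos)
        pos += 4
        return data[pos:pos + n].decode("utf-8"), pos + n
    if tag == TAG_LIST:
        (n,) = struct.unpack_from("<I", data, pos)
        pos += 4
        items = []
        for _ in range(n):
            item, pos = read_value(data, pos)
            items.append(item)
        return items, pos
    if tag == TAG_DICT:
        (n,) = struct.unpack_from("<I", data, pos)
        pos += 4
        result: dict[str, object] = {}
        for _ in range(n):
            key, pos = read_value(data, pos)
            item, pos = read_value(data, pos)
            result[str(key)] = item  # JSON object keys must be strings
        return result, pos
    if tag in NODE_NAMES:
        node: dict[str, object] = {".class": NODE_NAMES[tag]}
        while data[pos] != END_TAG:
            field_name = FIELD_NAMES[data[pos]]
            value, pos = read_value(data, pos + 1)
            node[field_name] = value
        return node, pos + 1  # step over the end tag
    raise ValueError(f"unknown type tag {tag} at offset {pos - 1}")


# A hand-built FuncDef node with name "f" and arg_names ["x", "y"].
SAMPLE = (
    bytes([50, 0, TAG_STR, 1, 0, 0, 0]) + b"f"
    + bytes([1, TAG_LIST, 2, 0, 0, 0])
    + bytes([TAG_STR, 1, 0, 0, 0]) + b"x"
    + bytes([TAG_STR, 1, 0, 0, 0]) + b"y"
    + bytes([END_TAG])
)

value, end = read_value(SAMPLE, 0)
print(json.dumps(value, indent=2))
```

Adding a new node type then only means adding entries to the two tables, not writing new parsing code.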
I think that this format would also make it easy to implement lazy deserialization. We can easily find the end of an arbitrary object in a serialized byte stream, without having to deserialize it. Instead of deserializing a FuncDef, for example, we could just find the end of the serialized representation and put the serialized byte string into a symbol table.
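
For example, skipping over a value without building any objects might look roughly like this. Again, the tag numbers, the fixed 8-byte integers and floats, the 4-byte length prefixes, and the `skip_value` name are all assumptions made for the sketch.

```python
import struct

# Hypothetical tag values, chosen only for this example.
TAG_NONE, TAG_FALSE, TAG_TRUE, TAG_INT, TAG_FLOAT, TAG_STR, TAG_LIST, TAG_DICT = range(8)
END_TAG = 255


def skip_value(data: bytes, pos: int) -> int:
    """Return the offset just past the value starting at pos, decoding nothing."""
    tag = data[pos]
    pos += 1
    if tag in (TAG_NONE, TAG_FALSE, TAG_TRUE):
        return pos
    if tag in (TAG_INT, TAG_FLOAT):
        return pos + 8
    if tag == TAG_STR:
        # The length prefix lets us jump over the payload, so a 0xFF byte
        # inside string data is never mistaken for the end tag.
        (n,) = struct.unpack_from("<I", data, pos)
        return pos + 4 + n
    if tag in (TAG_LIST, TAG_DICT):
        (n,) = struct.unpack_from("<I", data, pos)
        pos += 4
        count = n * 2 if tag == TAG_DICT else n  # dicts store key/value pairs
        for _ in range(count):
            pos = skip_value(data, pos)
        return pos
    if tag == END_TAG:
        raise ValueError("unexpected end tag")
    # Any other tag is assumed to be an AST node: <field tag><value>...<end tag>.
    while data[pos] != END_TAG:
        pos = skip_value(data, pos + 1)  # step over the field tag, then its value
    return pos + 1


# A node (type tag 50) with one string field and one list field, then an int.
stream = (
    bytes([50, 0, TAG_STR, 1, 0, 0, 0]) + b"f"
    + bytes([1, TAG_LIST, 2, 0, 0, 0])
    + bytes([TAG_STR, 1, 0, 0, 0]) + b"x"
    + bytes([TAG_STR, 1, 0, 0, 0]) + b"y"
    + bytes([END_TAG])
    + bytes([TAG_INT]) + struct.pack("<q", 7)
)

end = skip_value(stream, 0)
# Keep the raw bytes instead of a deserialized node; decode only on first use.
lazy_symbol_table = {"f": stream[:end]}
```

Note that the skipper doesn't need the tag-to-name mappings at all, only the sizes of the simple values and the reserved end tag.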
cc @ilevkivskyi