Add serialization design doc #362
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,342 @@ | ||
| # Protocol Serialization and Deserialization | ||
|
|
||
| This document will describe how objects are serialized and deserialized | ||
| according to some protocol, such as | ||
| [AWS RestJson1](https://smithy.io/2.0/aws/protocols/aws-restjson1-protocol.html), | ||
| based on information from a Smithy model. | ||
|
|
||
| ## Goals | ||
|
|
||
| * Shared - Protocols should be implemented as part of a shared library. If two | ||
| clients using the same protocol are installed, they should use a shared | ||
| implementation. These implementations should be as compact as possible while | ||
| still being robust. | ||
| * Hot-swappable - Implementations should be flexible enough to be swapped at | ||
| runtime if necessary. If a service supports more than one protocol, it should | ||
| be trivially easy to swap between them, even at runtime. | ||
| * Flexible - Implementations should be useable for purposes other than as a | ||
| component of making a request to a web service. Customers should be able to | ||
| feed well-formed data from any source into a protocol and have it transform | ||
| that data with no side-effects. | ||
|
|
||
| ## Terminology - `Protocol` vs protocol | ||
|
|
||
| In Smithy, a "protocol" is a method of communicating with a service over a | ||
| particular transport using a particular format. For example, the | ||
| `aws.protocols#restJson1` protocol communicates over an HTTP transport, makes | ||
| use of REST bindings, and formats structured HTTP payloads in JSON. | ||
|
|
||
| In Python, a | ||
| [`Protocol`](https://typing.readthedocs.io/en/latest/spec/protocol.html#protocols) | ||
| is a type that is used to define structural subtyping. For example, the | ||
| following shows a `Protocol` and two valid implementations of it: | ||
|
|
||
| ```python | ||
| from typing import Protocol | ||
|
|
||
| class ExampleProtocol(Protocol): | ||
| def greet(self, name: str) -> str: | ||
| return f"Hello {name}!" | ||
|
|
||
| class ExplicitImplementation(ExampleProtocol): | ||
| pass | ||
|
|
||
| class ImplicitImplementation: | ||
| def greet(self, name: str) -> str: | ||
| return f"Good day to you {name}." | ||
| ``` | ||
|
|
||
| Since this is *structural* subtyping, implementations aren't required to | ||
| actually inherit from the `Protocol` or otherwise declare that they're | ||
| implementing it. But they *can*, to make it more explicit or to inherit a | ||
| default implementation. The `Protocol` class itself cannot be instantiated, | ||
| however. | ||
|
|
||
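The behavior described above can be checked directly. The following is a minimal sketch, not part of the design itself; the names `Greeter`, `Implicit`, `welcome`, and `protocol_is_instantiable` are made up for illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Greeter(Protocol):
    def greet(self, name: str) -> str: ...


class Implicit:
    # No inheritance from Greeter: having a matching method is enough.
    def greet(self, name: str) -> str:
        return f"Good day to you {name}."


def welcome(greeter: Greeter, name: str) -> str:
    # Type checkers accept any object whose structure matches Greeter.
    return greeter.greet(name)


def protocol_is_instantiable() -> bool:
    # Instantiating a Protocol class directly raises TypeError.
    try:
        Greeter()
    except TypeError:
        return False
    return True


print(welcome(Implicit(), "Ada"))       # Good day to you Ada.
print(isinstance(Implicit(), Greeter))  # True
print(protocol_is_instantiable())       # False
```

Note that `isinstance` checks against a `runtime_checkable` protocol only verify that the named methods exist, not that their signatures match.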
| This overlap of terms can clearly cause confusion. To avoid that, | ||
| implementations of Python's `Protocol` type will be referred to using the literal | ||
| `Protocol` or the general term "interface". (A protocol *isn't* quite the same | ||
| thing as an interface in other programming languages, but for our purposes it's | ||
| close enough.) Smithy protocols will be referred to simply as "protocol"s or by | ||
| their specific protocol names (e.g. restJson1). | ||
|
|
||
| ## Schemas | ||
|
|
||
| The basic building block of Smithy is the "shape", a representation of data of a | ||
| given type with known properties called "members", additional constraints and | ||
| metadata called "traits", and an identifier. | ||
|
|
||
| For each shape contained in a service, a `Schema` object will be generated that | ||
| contains almost all of its information. Traits that are known to not affect | ||
| serialization or deserialization will be omitted from the generated `Schema` to | ||
| save space. | ||
|
|
||
| Schemas will form the backbone of serialization and deserialization, carrying | ||
| information that cannot be natively included in generated data classes. | ||
|
|
||
| The `Schema` class will be a read-only dataclass. The following shows its basic | ||
| definition, though the concrete definition may have a slightly different | ||
| implementation and/or additional helper methods. | ||
|
|
||
| ```python | ||
| @dataclass(kw_only=True, frozen=True) | ||
| class Schema: | ||
| id: ShapeID | ||
| shape_type: ShapeType | ||
| traits: dict[ShapeID, "Trait"] = field(default_factory=dict) | ||
| members: dict[str, "Schema"] = field(default_factory=dict) | ||
| member_target: "Schema | None" = None | ||
| member_index: int | None = None | ||
|
|
||
| @classmethod | ||
| def collection( | ||
| cls, | ||
| *, | ||
| id: ShapeID, | ||
| shape_type: ShapeType = ShapeType.STRUCTURE, | ||
| traits: list["Trait"] | None = None, | ||
| members: Mapping[str, "MemberSchema"] | None = None, | ||
| ) -> Self: | ||
| ... | ||
|
|
||
|
|
||
| @dataclass(kw_only=True, frozen=True) | ||
| class Trait: | ||
| id: "ShapeID" | ||
| value: "DocumentValue" = field(default_factory=dict) | ||
| ``` | ||
|
|
||
| Below is an example Smithy `structure` shape, followed by the `Schema` it would | ||
| generate. | ||
|
|
||
| ```smithy | ||
| namespace com.example | ||
|
|
||
| structure ExampleStructure { | ||
| member: Integer = 0 | ||
| } | ||
| ``` | ||
|
|
||
| ```python | ||
| EXAMPLE_STRUCTURE_SCHEMA = Schema.collection( | ||
| id=ShapeID("com.example#ExampleStructure"), | ||
|
Member
I assume ShapeID would be a kind of flyweight factory to cache previously created IDs? And ideally there could be constants that you can refer to that don't have to do any kind of cache/hash lookups for prelude trait IDs.
Contributor
Author
There's no cache currently, but it can be easily done. Honestly I could probably just add As for prelude traits, there's currently not centralized constants. There ARE centralized constants for prelude shapes (e.g. |
||
| members={ | ||
| "member": { | ||
| "target": INTEGER, | ||
mtdowling marked this conversation as resolved.
|
||
| "index": 0, | ||
mtdowling marked this conversation as resolved.
|
||
| "traits": [ | ||
| Trait(id=ShapeID("smithy.api#default"), value=0), | ||
|
Member
Document values don't need any kind of wrapping type?
|
||
| ], | ||
| }, | ||
| }, | ||
| ) | ||
| ``` | ||
|
|
||
| ## Shape Serializers and Serializeable Shapes | ||
JordonPhillips marked this conversation as resolved.
|
||
|
|
||
| Serialization will function by the interaction of two interfaces: | ||
| `ShapeSerializer`s and `SerializeableShape`s. | ||
|
|
||
| A `ShapeSerializer` is a class that is capable of taking a `Schema` and an | ||
| associated shape value and serializing it in some way. For example, a | ||
| `JSONShapeSerializer` could be written in Python to convert the shape to JSON. | ||
|
|
||
| A `SerializeableShape` is a class that has a `serialize` method that takes a | ||
| `ShapeSerializer` and calls the relevant methods needed to serialize it. All | ||
| generated shapes will implement the `SerializeableShape` interface, which will | ||
| then be the method by which all serialization is performed. | ||
|
|
||
| Using open interfaces in this way allows for great flexibility in the generated | ||
| Python code, which will be discussed more later. | ||
|
|
||
| In Python these interfaces will be represented as shown below: | ||
|
|
||
| ```python | ||
| @runtime_checkable | ||
| class ShapeSerializer(Protocol): | ||
|
|
||
| def begin_struct( | ||
| self, schema: "Schema" | ||
| ) -> AbstractContextManager["ShapeSerializer"]: | ||
| ... | ||
|
|
||
| def write_struct(self, schema: "Schema", struct: "SerializeableStruct") -> None: | ||
| with self.begin_struct(schema=schema) as struct_serializer: | ||
| struct.serialize_members(struct_serializer) | ||
|
|
||
| def begin_list( | ||
| self, | ||
| schema: "Schema", | ||
| size: int, | ||
| ) -> AbstractContextManager["ShapeSerializer"]: | ||
| ... | ||
|
|
||
| def begin_map( | ||
| self, | ||
| schema: "Schema", | ||
| size: int, | ||
| ) -> AbstractContextManager["MapSerializer"]: | ||
| ... | ||
|
|
||
| def write_null(self, schema: "Schema") -> None: | ||
| ... | ||
|
|
||
| def write_boolean(self, schema: "Schema", value: bool) -> None: | ||
| ... | ||
|
|
||
| def write_byte(self, schema: "Schema", value: int) -> None: | ||
| self.write_integer(schema, value) | ||
|
|
||
| def write_short(self, schema: "Schema", value: int) -> None: | ||
| self.write_integer(schema, value) | ||
|
|
||
| def write_integer(self, schema: "Schema", value: int) -> None: | ||
| ... | ||
|
|
||
| def write_long(self, schema: "Schema", value: int) -> None: | ||
| self.write_integer(schema, value) | ||
|
|
||
| def write_float(self, schema: "Schema", value: float) -> None: | ||
| ... | ||
|
|
||
| def write_double(self, schema: "Schema", value: float) -> None: | ||
| self.write_float(schema, value) | ||
|
|
||
| def write_big_integer(self, schema: "Schema", value: int) -> None: | ||
| self.write_integer(schema, value) | ||
|
|
||
| def write_big_decimal(self, schema: "Schema", value: Decimal) -> None: | ||
| ... | ||
|
|
||
| def write_string(self, schema: "Schema", value: str) -> None: | ||
| ... | ||
|
|
||
| def write_blob(self, schema: "Schema", value: bytes) -> None: | ||
| ... | ||
|
|
||
| def write_timestamp(self, schema: "Schema", value: datetime.datetime) -> None: | ||
| ... | ||
|
|
||
| def write_document(self, schema: "Schema", value: "Document") -> None: | ||
| ... | ||
|
|
||
|
|
||
| @runtime_checkable | ||
| class MapSerializer(Protocol): | ||
| def entry(self, key: str, value_writer: Callable[[ShapeSerializer], None]) -> None: | ||
| ... | ||
|
|
||
|
|
||
| @runtime_checkable | ||
| class SerializeableShape(Protocol): | ||
| def serialize(self, serializer: ShapeSerializer) -> None: | ||
| ... | ||
|
|
||
|
|
||
| @runtime_checkable | ||
| class SerializeableStruct(SerializeableShape, Protocol): | ||
| def serialize_members(self, serializer: ShapeSerializer) -> None: | ||
| ... | ||
| ``` | ||
|
|
||
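The `MapSerializer.entry` callback style may be the least familiar part of the interface above. The sketch below shows one way it could be used, under heavy simplification: `ValueCollector` and `DictMapSerializer` are hypothetical toy classes that build plain Python values, and `None` is passed wherever a real `Schema` would go.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class DictMapSerializer:
    """Toy MapSerializer that writes entries into a plain dict."""

    def __init__(self, target: dict[str, Any]) -> None:
        self._target = target

    def entry(self, key: str, value_writer: Callable[["ValueCollector"], None]) -> None:
        # The caller decides how to write the value; we only supply a serializer.
        collector = ValueCollector()
        value_writer(collector)
        self._target[key] = collector.value


class ValueCollector:
    """Toy serializer that remembers the last value written to it."""

    def __init__(self) -> None:
        self.value: Any = None

    def write_integer(self, schema: Any, value: int) -> None:
        self.value = value

    @contextmanager
    def begin_map(self, schema: Any, size: int) -> Iterator[DictMapSerializer]:
        target: dict[str, Any] = {}
        yield DictMapSerializer(target)
        # The map is complete once the with-block exits.
        self.value = target


def serialize_counts(serializer: ValueCollector, counts: dict[str, int]) -> None:
    # Roughly what generated code for a map of integers might look like.
    with serializer.begin_map(None, len(counts)) as map_serializer:
        for key, value in counts.items():
            map_serializer.entry(key, lambda s, v=value: s.write_integer(None, v))


collector = ValueCollector()
serialize_counts(collector, {"a": 1, "b": 2})
print(collector.value)  # {'a': 1, 'b': 2}
```

Passing a writer callback instead of a value lets the map serializer stay agnostic about the value type: the same `entry` method works for maps of integers, strings, or nested structures.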
| Below is an example Smithy `structure` shape, followed by the | ||
| `SerializeableShape` it would generate. | ||
|
|
||
| ```smithy | ||
| namespace com.example | ||
|
|
||
| structure ExampleStructure { | ||
| member: Integer = 0 | ||
| } | ||
| ``` | ||
|
|
||
| ```python | ||
| @dataclass(kw_only=True) | ||
| class ExampleStructure: | ||
| member: int = 0 | ||
|
|
||
| def serialize(self, serializer: ShapeSerializer): | ||
| serializer.write_struct(EXAMPLE_STRUCTURE_SCHEMA, self) | ||
|
|
||
| def serialize_members(self, serializer: ShapeSerializer): | ||
| serializer.write_integer( | ||
| EXAMPLE_STRUCTURE_SCHEMA.members["member"], self.member | ||
|
Member
To avoid any hashmap/hashing overhead, I would either use the member index or also generate constants that refer to each member
Contributor
Author
The member index can't be used as it refers to the order present in the model at generation time, which is not the same as the order in the members dict. Generating constants for each member would add massive bloat to the artifact size, which I think offsets whatever gains you might get in performance.
Member
Interesting that it would bloat it that much. I just worry that having to do string hashing even to grab the member schema is going to be an unnecessary perf hit. I'd benchmark it.
Contributor
Author
You have to add, at minimum, two extra lines of code for every member - one to define it and one to import it. We might end up switching to using a wildcard import though, which would eliminate the lines bit. |
||
| ) | ||
| ``` | ||
|
|
||
| ### Performing Serialization | ||
|
|
||
| To serialize a shape, all that is needed is an instance of the shape and a | ||
| serializer. The following shows how one might serialize a shape to JSON bytes: | ||
|
|
||
| ```python | ||
| >>> shape = ExampleStructure(member=9) | ||
| >>> serializer = JSONShapeSerializer() | ||
| >>> shape.serialize(serializer) | ||
| >>> print(serializer.get_result()) | ||
| b'{"member":9}' | ||
| ``` | ||
|
|
||
| At a high level, the process for performing serialization never changes. | ||
| Different implementations (such as for XML, CBOR, etc.) will all interact with | ||
| the shape in the same exact way. The same interface will be used to implement | ||
| HTTP bindings, event stream bindings, and any other sort of model-driven data | ||
| binding that may be needed. | ||
|
|
||
| These implementations can be swapped at any time without having to regenerate | ||
| the client, and can be used for purposes other than making client calls to a | ||
| service. A service could, for example, model its event structures and include | ||
| them in its client. A customer could then use the generated | ||
| `SerializeableShape`s to serialize those events without having to do so | ||
| manually. | ||
|
|
||
| ### Composing Serializers | ||
|
|
||
| While simple `ShapeSerializer`s can exist, the need to bind data to multiple | ||
| locations or with conditional formatting means that a single | ||
| `ShapeSerializer` may not be sufficient to implement a protocol, or even a | ||
| single content type. Instead, more complex protocols should *compose* multiple | ||
| `ShapeSerializer`s to achieve their intended purpose. The | ||
| `InterceptingSerializer` class aims, in part, to make this easier. | ||
|
|
||
| ```python | ||
| class InterceptingSerializer(ShapeSerializer, metaclass=ABCMeta): | ||
| @abstractmethod | ||
| def before(self, schema: Schema) -> ShapeSerializer: ... | ||
|
|
||
| @abstractmethod | ||
| def after(self, schema: Schema) -> None: ... | ||
|
|
||
| def write_boolean(self, schema: Schema, value: bool) -> None: | ||
| self.before(schema).write_boolean(schema, value) | ||
| self.after(schema) | ||
|
|
||
| [...] | ||
| ``` | ||
|
|
||
| The `before` method allows for dispatching to different serializers depending on | ||
| the schema. You may dispatch to different serializers depending on whether the | ||
| shape is bound to an HTTP header or query string, for example. | ||
|
|
||
| ```python | ||
| class HTTPBindingSerializer(InterceptingSerializer): | ||
| _header_serializer: ShapeSerializer | ||
| _query_serializer: ShapeSerializer | ||
|
|
||
| def before(self, schema: Schema) -> ShapeSerializer: | ||
| if HTTP_HEADER_TRAIT in schema.traits: | ||
| return self._header_serializer | ||
| elif HTTP_QUERY_TRAIT in schema.traits: | ||
| return self._query_serializer | ||
| ... | ||
| ``` | ||
|
|
||
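The dispatch pattern above can be exercised end to end with a small, self-contained sketch. It makes several simplifying assumptions: `MemberSchema` is a hypothetical stand-in for `Schema` whose `traits` dict maps a trait ID string to the trait's value, `KeyValueSerializer` is an invented sub-serializer, only `write_string` is implemented, and unbound members raise a `ValueError` instead of falling through to a payload serializer.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

HTTP_HEADER_TRAIT = "smithy.api#httpHeader"
HTTP_QUERY_TRAIT = "smithy.api#httpQuery"


@dataclass(frozen=True)
class MemberSchema:
    """Hypothetical stand-in for Schema: a name plus trait ID -> trait value."""

    name: str
    traits: dict[str, str] = field(default_factory=dict)


class KeyValueSerializer:
    """Collects (key, value) pairs, using one trait's value as the key."""

    def __init__(self, trait_id: str) -> None:
        self._trait_id = trait_id
        self.pairs: list[tuple[str, str]] = []

    def write_string(self, schema: MemberSchema, value: str) -> None:
        self.pairs.append((schema.traits[self._trait_id], value))


class InterceptingSerializer(ABC):
    @abstractmethod
    def before(self, schema: MemberSchema) -> KeyValueSerializer: ...

    @abstractmethod
    def after(self, schema: MemberSchema) -> None: ...

    def write_string(self, schema: MemberSchema, value: str) -> None:
        # Delegate to whichever serializer `before` selects for this schema.
        self.before(schema).write_string(schema, value)
        self.after(schema)


class HTTPBindingSerializer(InterceptingSerializer):
    def __init__(self) -> None:
        self.headers = KeyValueSerializer(HTTP_HEADER_TRAIT)
        self.query = KeyValueSerializer(HTTP_QUERY_TRAIT)

    def before(self, schema: MemberSchema) -> KeyValueSerializer:
        if HTTP_HEADER_TRAIT in schema.traits:
            return self.headers
        if HTTP_QUERY_TRAIT in schema.traits:
            return self.query
        raise ValueError(f"{schema.name} has no HTTP binding in this sketch")

    def after(self, schema: MemberSchema) -> None:
        pass  # Nothing to clean up in this sketch.


binding = HTTPBindingSerializer()
binding.write_string(MemberSchema("token", {HTTP_HEADER_TRAIT: "X-Token"}), "abc")
binding.write_string(MemberSchema("page", {HTTP_QUERY_TRAIT: "page"}), "2")
print(binding.headers.pairs)  # [('X-Token', 'abc')]
print(binding.query.pairs)    # [('page', '2')]
```

The caller only ever sees one serializer; routing by trait stays inside `before`, so adding another binding location means adding one more branch and one more sub-serializer.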
| Since each of these sub-serializers may only be able to handle shapes of a | ||
| certain type, they may want to inherit from `SpecificShapeSerializer`, which | ||
| raises an error by default for shape types whose `write_*` method is not | ||
| implemented. | ||
|
|
||
| ```python | ||
| class HTTPHeaderSerializer(SpecificShapeSerializer): | ||
| def write_boolean(self, schema: "Schema", value: bool) -> None: | ||
| ... | ||
|
|
||
| [...] | ||
| ``` | ||
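The document does not show `SpecificShapeSerializer` itself. A minimal sketch of the idea, assuming it simply raises from every write method unless a subclass overrides it, could look like the following. The class names, the choice of `NotImplementedError`, and the use of plain header-name strings in place of `Schema` objects are all assumptions made for illustration.

```python
class SpecificSerializerSketch:
    """Every write method fails unless a subclass overrides it."""

    def _unsupported(self, kind: str) -> None:
        raise NotImplementedError(
            f"{type(self).__name__} cannot serialize {kind} shapes"
        )

    def write_boolean(self, schema: str, value: bool) -> None:
        self._unsupported("boolean")

    def write_integer(self, schema: str, value: int) -> None:
        self._unsupported("integer")

    def write_string(self, schema: str, value: str) -> None:
        self._unsupported("string")


class HeaderSerializerSketch(SpecificSerializerSketch):
    """Handles only the shape types this sketch allows in headers."""

    def __init__(self) -> None:
        self.headers: dict[str, str] = {}

    def write_boolean(self, schema: str, value: bool) -> None:
        self.headers[schema] = "true" if value else "false"

    def write_string(self, schema: str, value: str) -> None:
        self.headers[schema] = value


header_serializer = HeaderSerializerSketch()
header_serializer.write_boolean("X-Dry-Run", True)
header_serializer.write_string("X-Token", "abc")
print(header_serializer.headers)  # {'X-Dry-Run': 'true', 'X-Token': 'abc'}

try:
    header_serializer.write_integer("X-Count", 3)
except NotImplementedError as error:
    print(error)  # HeaderSerializerSketch cannot serialize integer shapes
```

Failing loudly by default means a model that binds an unsupported shape type to a location is caught at serialization time rather than silently producing malformed output.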
member_index would rely on a member_list that it can index into, right?

No, it's meta-knowledge only the code generator knows about and only the generated deserialize method uses. In most cases it will match the ordering in the members dictionary of its parent, but not always (e.g. in the case of recursive members that get inserted later). I'm thinking of getting rid of this anyway - it's a performance optimization in Java but I don't know that it makes a real difference in Python.

Gotcha. It's handled internally. I think it might still end up as a performance boost even in Python, assuming array indexing is faster than hashmaps. When deserializing a type, the codec does the work of identifying which member to deserialize, and then hands that to the function used to build up the type. That function needs to determine what member schema it's given to know how to set the right value on the structure. I wonder if you can use the array index somehow to better handle dispatching in a way that doesn't require comparing strings (which is going to be slow). If this implies double-dispatch then that has its own perf hit, but just throwing out there that we want to avoid string hashing here probably.