Add serialization design doc #362

JordonPhillips · 2025-01-10T15:16:48Z

This adds a design doc for schema-based serialization. In the future this will be expanded with information about deserialization, codecs, and protocols.

This is a fairly rough first draft, which I blame on having covid, but should nevertheless be understandable.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mtdowling · 2025-01-13T20:55:43Z

designs/serialization.md

+    traits: dict[ShapeID, "Trait"] = field(default_factory=dict)
+    members: dict[str, "Schema"] = field(default_factory=dict)
+    member_target: "Schema | None" = None
+    member_index: int | None = None


member_index would rely on a member_list that it can index into, right?

No, it's meta-knowledge only the code generator knows about and only the generated deserialize method uses. In most cases it will match the ordering in the members dictionary of its parent, but not always (e.g. in the case of recursive members that get inserted later). I'm thinking of getting rid of this anyway - it's a performance optimization in Java but I don't know that it makes a real difference in Python.

Gotcha. It's handled internally. I think it might still end up as a performance boost even in Python, assuming array indexing is faster than hashmaps. When deserializing a type, the codec does the work of identifying which member to deserialize, and then hands that to the function used to build up the type. That function needs to determine what member schema it's given to know how to set the right value on the structure. I wonder if you can use the array index somehow to better handle dispatching in a way that doesn't require comparing strings (which is going to be slow). If this implies double-dispatch then that has its own perf hit, but just throwing out there that we probably want to avoid string hashing here.
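
To make the index-dispatch idea concrete, here is a minimal sketch. Everything in it is hypothetical and not taken from the design doc: `MemberSchema` is a stand-in for a member schema, and `ExampleStructureBuilder`, `consume`, and `_SETTERS` are illustrative names. It only shows the shape of dispatching on an integer index instead of a member name.

```python
from collections.abc import Callable
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MemberSchema:
    """Hypothetical stand-in for a member schema; only the fields used here."""

    name: str
    member_index: int


# Hypothetical generated member schemas for a two-member structure.
NAME_MEMBER = MemberSchema("name", 0)
COUNT_MEMBER = MemberSchema("count", 1)


class ExampleStructureBuilder:
    """Collects member values as a codec hands them over one at a time."""

    def __init__(self) -> None:
        self.name: str | None = None
        self.count: int | None = None

    def _set_name(self, value: Any) -> None:
        self.name = value

    def _set_count(self, value: Any) -> None:
        self.count = value

    def consume(self, schema: MemberSchema, value: Any) -> None:
        # Dispatch on the integer index: a tuple lookup, with no string
        # comparison or dict lookup keyed on the member name.
        _SETTERS[schema.member_index](self, value)


# Position i holds the setter for the member whose member_index is i.
_SETTERS: tuple[Callable[[ExampleStructureBuilder, Any], None], ...] = (
    ExampleStructureBuilder._set_name,
    ExampleStructureBuilder._set_count,
)
```

Whether this beats a name-keyed lookup in practice is an open question. CPython caches a `str` object's hash after it is first computed, so the gap may be smaller than in Java, and it would need to be measured.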

mtdowling · 2025-01-13T20:58:38Z

designs/serialization.md

+
+```python
+EXAMPLE_STRUCTURE_SCHEMA = Schema.collection(
+    id=ShapeID("com.example#ExampleStructure"),


I assume ShapeID would be a kind of flyweight factory to cache previously created IDs? And ideally there could be constants that you can refer to that don't have to do any kind of cache/hash lookups for prelude trait IDs.

There's no cache currently, but it can be easily done. Honestly I could probably just add @cache to the constructor and it should work fine.

As for prelude traits, there are currently no centralized constants. There ARE centralized constants for prelude shapes (e.g. smithy.api#String).
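
A rough sketch of what the cached-ID idea could look like, under assumptions: the `ShapeID` below is a simplified stand-in (not the real class), `from_str` is a hypothetical factory, and `DEFAULT_TRAIT_ID` is an illustrative constant name. It uses `functools.cache` so repeated requests for the same ID return the same object.

```python
from dataclasses import dataclass
from functools import cache


@dataclass(frozen=True)
class ShapeID:
    """Simplified stand-in for a shape ID of the form 'namespace#name'."""

    namespace: str
    name: str

    @staticmethod
    @cache
    def from_str(value: str) -> "ShapeID":
        # Hypothetical cached factory: the same input string always
        # returns the same ShapeID instance (flyweight behaviour).
        namespace, _, name = value.partition("#")
        return ShapeID(namespace, name)


# A module-level constant avoids even the cache lookup at the use site.
DEFAULT_TRAIT_ID = ShapeID.from_str("smithy.api#default")
```

One caveat with this approach: `functools.cache` is unbounded, so every distinct ID string stays alive for the life of the process. That is probably fine for a fixed set of generated IDs but worth keeping in mind.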

mtdowling · 2025-01-13T21:00:12Z

designs/serialization.md

+            "target": INTEGER,
+            "index": 0,
+            "traits": [
+                Trait(id=ShapeID("smithy.api#default"), value=0),


Document values don't need any kind of wrapping type?

DocumentValue is the unwrapped json-like representation

mtdowling · 2025-01-13T21:03:12Z

designs/serialization.md

+
+    def serialize_members(self, serializer: ShapeSerializer):
+        serializer.write_integer(
+            EXAMPLE_STRUCTURE_SCHEMA.members["member"], self.member


To avoid any hashmap/hashing overhead, I would either use the member index or also generate constants that refer to each member

The member index can't be used as it refers to the order present in the model at generation time, which is not the same as the order in the members dict. Generating constants for each member would add massive bloat to the artifact size, which I think offsets whatever gains you might get in performance.

Interesting that it would bloat it that much. I just worry that having to do string hashing even to grab the member schema is going to be an unnecessary perf hit. I'd benchmark it.

You have to add, at minimum, two extra lines of code for every member - one to define it and one to import it. We might end up switching to using a wildcard import though, which would eliminate the import lines.
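
A quick way to check the trade-off is a micro-benchmark like the sketch below. The names are illustrative: `EXAMPLE_STRUCTURE_SCHEMA` is a `SimpleNamespace` stand-in for the real schema object, and `MEMBER_SCHEMA` plays the role of a hypothetical generated constant. It compares the `members["member"]` lookup with reading a module-level constant.

```python
import timeit
from types import SimpleNamespace

# Stand-in for a generated schema with a name-keyed members dict.
EXAMPLE_STRUCTURE_SCHEMA = SimpleNamespace(
    members={"member": object(), "other": object()}
)

# Hypothetical generated constant pointing at the same member schema.
MEMBER_SCHEMA = EXAMPLE_STRUCTURE_SCHEMA.members["member"]


def lookup_by_name() -> object:
    # Attribute access plus a dict lookup keyed on the member name.
    return EXAMPLE_STRUCTURE_SCHEMA.members["member"]


def lookup_by_constant() -> object:
    # A single module-global read.
    return MEMBER_SCHEMA


def measure(number: int = 200_000) -> dict[str, float]:
    """Return total seconds spent on `number` calls of each lookup style."""
    return {
        "members dict": timeit.timeit(lookup_by_name, number=number),
        "constant": timeit.timeit(lookup_by_constant, number=number),
    }


if __name__ == "__main__":
    for label, seconds in measure().items():
        print(f"{label}: {seconds:.4f}s")
```

Numbers from a loop like this only bound the per-lookup cost. The effect on a real serializer depends on how much other work happens per member.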

mtdowling · 2025-01-13T21:08:44Z

designs/serialization.md

+        with self.begin_struct(schema=schema) as struct_serializer:
+            struct.serialize_members(struct_serializer)
+
+    def begin_list(self, schema: "Schema") -> AbstractContextManager["ShapeSerializer"]:


Not sure if it applies here, or if you account for it somehow elsewhere, but in smithy-java, we make sure to pass in a kind of state value to avoid needing to rely on things like state-capturing lambdas (i.e., closing over the outer scope in a lambda may require an allocation, so we instead pass in a generic value to thread through the serialization process). For example:

    /**
     * Begin a list and write zero or more values into it using the provided serializer.
     *
     * @param schema List schema.
     * @param listState State to pass into the consumer.
     * @param size Number of elements in the list, or -1 if unknown.
     * @param consumer Received in the context of the list and writes zero or more values.
     */
    <T> void writeList(Schema schema, T listState, int size, BiConsumer<T, ShapeSerializer> consumer);

This would apply generally to any nested type serde.

This method returns a context manager, which is responsible for maintaining the state of the list. The context manager presents itself as a ShapeSerializer and you add elements to the list by calling the normal serializer methods. Then any necessary finalization happens when the context manager's scope ends. For example, here's how you might write a list (taken from tests):

    with serializer.begin_list(schema) as ls:
        for element in self.list_member:
            ls.write_string(target_schema, element)

Here's an implementation of the context manager for JSON lists.

    class JSONListSerializer(InterceptingSerializer):
        _stream: "StreamingJSONEncoder"
        _parent: JSONShapeSerializer
        _is_first_entry = True

        def __init__(
            self, stream: "StreamingJSONEncoder", parent: JSONShapeSerializer
        ) -> None:
            self._stream = stream
            self._parent = parent

        def __enter__(self) -> Self:
            self._stream.write_array_start()
            return self

        def __exit__(
            self,
            exc_type: type[BaseException] | None,
            exc_value: BaseException | None,
            traceback: TracebackType | None,
        ) -> None:
            if not exc_value:
                self._stream.write_array_end()

        def before(self, schema: "Schema") -> ShapeSerializer:
            if self._is_first_entry:
                self._is_first_entry = False
            else:
                self._stream.write_more()
            return self._parent

When the context manager begins, it writes `[`, then before each element aside from the first it writes `,`, and finally when the context exits without error it writes `]`.

I don't think this is the same situation as a state-capturing lambda in Java.

Another element in the Java version is size hints - I'm not sure how valuable those would be. You can't pre-allocate lists in Python so I don't think it helps at all unless we have a protocol that needs to declare list size ahead of time on the wire.

Even with CBOR, knowing the size before serializing a list allows you to encode the list as a finite list rather than an indefinite list.

Co-authored-by: Michael Dowling <[email protected]>

Add serialization design doc

bc01a17

JordonPhillips requested a review from a team as a code owner January 10, 2025 15:16

mtdowling reviewed Jan 13, 2025

Fix typo

133df1e

Co-authored-by: Michael Dowling <[email protected]>

JordonPhillips mentioned this pull request Jan 14, 2025

Add deserialization design doc #363

Merged

Add size hints to serializer design

25bfb64

jonathan343 reviewed Jan 24, 2025

designs/serialization.md

jonathan343 previously approved these changes Jan 24, 2025

Clarify protocol vs Protocol

86ebd53

JordonPhillips dismissed jonathan343’s stale review via 86ebd53 January 28, 2025 15:12

JordonPhillips requested a review from jonathan343 January 28, 2025 15:26

jonathan343 approved these changes Jan 28, 2025

JordonPhillips merged commit 08bf50d into develop Jan 29, 2025
5 checks passed

JordonPhillips deleted the serialization-design branch January 29, 2025 13:48

3 participants