Skip to content

Conversation

@JordonPhillips
Copy link
Contributor

This adds a design doc for schema-based serialization. In the future this will be expanded with information about deserialization, codecs, and protocols.

This is a fairly rough first draft, which I blame on having covid, but should nevertheless be understandable.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

This adds a design doc for schema-based serialization. In the future
this will be expanded with information about deserialization, codecs,
and protocols.
@JordonPhillips JordonPhillips requested a review from a team as a code owner January 10, 2025 15:16
traits: dict[ShapeID, "Trait"] = field(default_factory=dict)
members: dict[str, "Schema"] = field(default_factory=dict)
member_target: "Schema | None" = None
member_index: int | None = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

member_index would rely on a member_list that it can index into, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's meta-knowledge only the code generator knows about and only the generated deserialize method uses. In most cases it will match the ordering in the members dictionary of its parent, but not always (e.g. in the case of recursive members that get inserted later). I'm thinking of getting rid of this anyway - it's a performance optimization in Java but I don't know that it makes a real difference in Python.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. It's handled internally. I think it might still end up as a performance boost even in Python, assuming array indexing is faster than hashmaps. When deserializing a type, the codec does the work of identifying which member to deserialize, and then hands that to the function used to build up the type. That function needs to determine what member schema it's given to know how to set the right value on the structure. I wonder if you can use the array index somehow to better handle dispatching in a wayt that doesn't require comparing strings (which is going to be slow). If this implies double-dispatch than that has its own perf hit, but just throwing out there that we want to avoid string hashing here probably.


```python
EXAMPLE_STRUCTURE_SCHEMA = Schema.collection(
id=ShapeID("com.example#ExampleStructure"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume ShapeID would be a kind of flyweight factory to cache previously created IDs? And ideally there could be constants that you can refer to that don't have to do any kind of cache/hash lookups for prelude trait IDs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no cache currently, but it can be easily done. Honestly I could probably just add @cache to the constructor and it should work fine.

As for prelude traits, there's currently not centralized constants. There ARE centralized constants for prelude shapes (e.g. smithy.api#String)

"target": INTEGER,
"index": 0,
"traits": [
Trait(id=ShapeID("smithy.api#default"), value=0),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document values don't need any kind of wrapping type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DocumentValue is the unwrapped json-like representation


def serialize_members(self, serializer: ShapeSerializer):
serializer.write_integer(
EXAMPLE_STRUCTURE_SCHEMA.members["member"], self.member
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid any hashmap/hashing overhead, I would either use the member index or also generate constants that refer to each member

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The member index can't be used as it refers to the order present in the model at generation time, which is not the same as the order in the members dict. Generating constants for each member would add massive bloat to the artifact size, which I think offsets whatever gains you might get in performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting that it would bloat it that much. I just worry that having to do string hashing even to grab the member schema is going to be an unnecessary perf hit. I'd benchmark it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to add, at minimum, two extra lines of code for every member - one to define and and one to import it. We might end up switching to using a wildcard import though, which would eliminate the lines bit.

with self.begin_struct(schema=schema) as struct_serializer:
struct.serialize_members(struct_serializer)

def begin_list(self, schema: "Schema") -> AbstractContextManager["ShapeSerializer"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it applies here, or if you account for it somehow elsewhere, but in smithy-java, we ensure to pass in a kind of state value to avoid needing to rely on things like state capturing lambdas (i.e., closing over the outer scope in a lambda may require an allocation, so we instead pass in a generic value to thread through the serialization process). For example:

    /**
     * Begin a list and write zero or more values into it using the provided serializer.
     *
     * @param schema    List schema.
     * @param listState State to pass into the consumer.
     * @param size      Number of elements in the list, or -1 if unknown.
     * @param consumer  Received in the context of the list and writes zero or more values.
     */
    <T> void writeList(Schema schema, T listState, int size, BiConsumer<T, ShapeSerializer> consumer);

This would apply generally to any nested type serde.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method returns a context manager, which is responsible for maintaining the state of the list. The context manager presents itself as a ShapeSerializer and you add elements to the list by calling the normal serializer methods. Then any necessary finalization happens when the context manager's scope ends. For example, here's how you might write a list (taken from tests):

with serializer.begin_list(schema) as ls:
    for element in self.list_member:
        ls.write_string(target_schema, element)

Here's an implementation of the context manager for JSON lists.

class JSONListSerializer(InterceptingSerializer):
    _stream: "StreamingJSONEncoder"
    _parent: JSONShapeSerializer
    _is_first_entry = True

    def __init__(
        self, stream: "StreamingJSONEncoder", parent: JSONShapeSerializer
    ) -> None:
        self._stream = stream
        self._parent = parent

    def __enter__(self) -> Self:
        self._stream.write_array_start()
        return self

    def __exit__(
        self,
        exc_type: type[BaseException] | None,
        exc_value: BaseException | None,
        traceback: TracebackType | None,
    ) -> None:
        if not exc_value:
            self._stream.write_array_end()

    def before(self, schema: "Schema") -> ShapeSerializer:
        if self._is_first_entry:
            self._is_first_entry = False
        else:
            self._stream.write_more()

        return self._parent

When the context manager begins, it writes [, then before each element aside from the first it writes ,, and finally when the context exits without error it writes ].

I don't think this is the same situation as a state-capturing lambda in Java

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another element in the Java version is size hints - I'm not sure how valuable those would be. You can't pre-allocate lists in Python so I don't think it helps at all unless we have a protocol that needs to declare list size ahead of time on the wire.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with CBOR, knowing the size before serializing a list allows you to encode the list as a finite list rather than an indefinite list.

Co-authored-by: Michael Dowling <[email protected]>
jonathan343
jonathan343 previously approved these changes Jan 24, 2025
@JordonPhillips JordonPhillips merged commit 08bf50d into develop Jan 29, 2025
5 checks passed
@JordonPhillips JordonPhillips deleted the serialization-design branch January 29, 2025 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants