-
Notifications
You must be signed in to change notification settings - Fork 24
Add serialization design doc #362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This adds a design doc for schema-based serialization. In the future this will be expanded with information about deserialization, codecs, and protocols.
| traits: dict[ShapeID, "Trait"] = field(default_factory=dict) | ||
| members: dict[str, "Schema"] = field(default_factory=dict) | ||
| member_target: "Schema | None" = None | ||
| member_index: int | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
member_index would rely on a member_list that it can index into, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's meta-knowledge only the code generator knows about and only the generated deserialize method uses. In most cases it will match the ordering in the members dictionary of its parent, but not always (e.g. in the case of recursive members that get inserted later). I'm thinking of getting rid of this anyway - it's a performance optimization in Java but I don't know that it makes a real difference in Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. It's handled internally. I think it might still end up as a performance boost even in Python, assuming array indexing is faster than hashmaps. When deserializing a type, the codec does the work of identifying which member to deserialize, and then hands that to the function used to build up the type. That function needs to determine what member schema it's given to know how to set the right value on the structure. I wonder if you can use the array index somehow to better handle dispatching in a wayt that doesn't require comparing strings (which is going to be slow). If this implies double-dispatch than that has its own perf hit, but just throwing out there that we want to avoid string hashing here probably.
|
|
||
| ```python | ||
| EXAMPLE_STRUCTURE_SCHEMA = Schema.collection( | ||
| id=ShapeID("com.example#ExampleStructure"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume ShapeID would be a kind of flyweight factory to cache previously created IDs? And ideally there could be constants that you can refer to that don't have to do any kind of cache/hash lookups for prelude trait IDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no cache currently, but it can be easily done. Honestly I could probably just add @cache to the constructor and it should work fine.
As for prelude traits, there's currently not centralized constants. There ARE centralized constants for prelude shapes (e.g. smithy.api#String)
| "target": INTEGER, | ||
| "index": 0, | ||
| "traits": [ | ||
| Trait(id=ShapeID("smithy.api#default"), value=0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document values don't need any kind of wrapping type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DocumentValue is the unwrapped json-like representation
|
|
||
| def serialize_members(self, serializer: ShapeSerializer): | ||
| serializer.write_integer( | ||
| EXAMPLE_STRUCTURE_SCHEMA.members["member"], self.member |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid any hashmap/hashing overhead, I would either use the member index or also generate constants that refer to each member
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The member index can't be used as it refers to the order present in the model at generation time, which is not the same as the order in the members dict. Generating constants for each member would add massive bloat to the artifact size, which I think offsets whatever gains you might get in performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting that it would bloat it that much. I just worry that having to do string hashing even to grab the member schema is going to be an unnecessary perf hit. I'd benchmark it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to add, at minimum, two extra lines of code for every member - one to define and and one to import it. We might end up switching to using a wildcard import though, which would eliminate the lines bit.
designs/serialization.md
Outdated
| with self.begin_struct(schema=schema) as struct_serializer: | ||
| struct.serialize_members(struct_serializer) | ||
|
|
||
| def begin_list(self, schema: "Schema") -> AbstractContextManager["ShapeSerializer"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it applies here, or if you account for it somehow elsewhere, but in smithy-java, we ensure to pass in a kind of state value to avoid needing to rely on things like state capturing lambdas (i.e., closing over the outer scope in a lambda may require an allocation, so we instead pass in a generic value to thread through the serialization process). For example:
/**
* Begin a list and write zero or more values into it using the provided serializer.
*
* @param schema List schema.
* @param listState State to pass into the consumer.
* @param size Number of elements in the list, or -1 if unknown.
* @param consumer Received in the context of the list and writes zero or more values.
*/
<T> void writeList(Schema schema, T listState, int size, BiConsumer<T, ShapeSerializer> consumer);This would apply generally to any nested type serde.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method returns a context manager, which is responsible for maintaining the state of the list. The context manager presents itself as a ShapeSerializer and you add elements to the list by calling the normal serializer methods. Then any necessary finalization happens when the context manager's scope ends. For example, here's how you might write a list (taken from tests):
with serializer.begin_list(schema) as ls:
for element in self.list_member:
ls.write_string(target_schema, element)Here's an implementation of the context manager for JSON lists.
class JSONListSerializer(InterceptingSerializer):
_stream: "StreamingJSONEncoder"
_parent: JSONShapeSerializer
_is_first_entry = True
def __init__(
self, stream: "StreamingJSONEncoder", parent: JSONShapeSerializer
) -> None:
self._stream = stream
self._parent = parent
def __enter__(self) -> Self:
self._stream.write_array_start()
return self
def __exit__(
self,
exc_type: type[BaseException] | None,
exc_value: BaseException | None,
traceback: TracebackType | None,
) -> None:
if not exc_value:
self._stream.write_array_end()
def before(self, schema: "Schema") -> ShapeSerializer:
if self._is_first_entry:
self._is_first_entry = False
else:
self._stream.write_more()
return self._parentWhen the context manager begins, it writes [, then before each element aside from the first it writes ,, and finally when the context exits without error it writes ].
I don't think this is the same situation as a state-capturing lambda in Java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another element in the Java version is size hints - I'm not sure how valuable those would be. You can't pre-allocate lists in Python so I don't think it helps at all unless we have a protocol that needs to declare list size ahead of time on the wire.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with CBOR, knowing the size before serializing a list allows you to encode the list as a finite list rather than an indefinite list.
Co-authored-by: Michael Dowling <[email protected]>
This adds a design doc for schema-based serialization. In the future this will be expanded with information about deserialization, codecs, and protocols.
This is a fairly rough first draft, which I blame on having covid, but should nevertheless be understandable.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.