diff --git a/designs/documents.md b/designs/documents.md new file mode 100644 index 000000000..1f1c2d23f --- /dev/null +++ b/designs/documents.md @@ -0,0 +1,446 @@ +# Document Types + +Smithy's [document type](https://smithy.io/2.0/spec/simple-types.html#document) +represents protocol-agnostic data of any type that can be expressed by the +Smithy data model. In other clients, documents were represented essentially as +JSON, but this fails to achieve a number of goals we have for the type. + +This specification describes how documents are defined and used in +smithy-python. + +## Goals + +* Extend documents to fully support the Smithy data model. +* Integrate documents with schema-based serialization and deserialization. +* Abstract away the underlying media type of documents. +* Allow documents to carry structured values. +* Make documents easily convertible to/from plain Python. + +## Specification + +The `Document` type in Python is a class with typed accessors for Smithy +data types that implements `SerializableStruct` and `DeserializableShape`. +Below is an example of what the base `Document` type will look like at the high +level, subsequent sections will discuss details of each component. + +```python +type DocumentValue = ( + Mapping[str, DocumentValue] + | Sequence[DocumentValue] + | str + | int + | float + | Decimal + | bool + | None + | bytes + | datetime.datetime +) + + +class Document: + _schema: Schema + + def __init__( + self, + value: DocumentValue | dict[str, "Document"] | list["Document"] = None, + *, + schema: Schema = _DOCUMENT, + ) -> None: + ... + + @property + def shape_type(self) -> ShapeType: + ... + + @property + def discriminator(self) -> ShapeID: + ... + + def is_none(self) -> bool: + ... + + def as_bytes(self) -> bytes: + ... + + def as_bool(self) -> bool: + ... + + def as_string(self) -> str: + ... + + def as_datetime(self) -> datetime.datetime: + ... + + def as_int(self) -> int: + ... + + def as_float(self) -> float: + ... + + def as_decimal(self) -> Decimal: + ... + + def as_list(self) -> list["Document"]: + ... + + def as_map(self) -> dict[str, "Document"]: + ... + + def as_value(self) -> DocumentValue: + ... + + def as_shape[S: DeserializableShape](self, shape_class: type[S]) -> S: + ... + + def serialize(self, serializer: ShapeSerializer) -> None: + serializer.write_document(self._schema, self) + + def serialize_contents(self, serializer: ShapeSerializer) -> None: + ... + + def serialize_members(self, serializer: ShapeSerializer) -> None: + ... + + def get(self, name: str, default: "Document | None" = None) -> "Document | None": + return self.as_map().get(name, default) + + def __getitem__(self, key: str | int | slice) -> "Document": + ... + + def __setitem__( + self, + key: str | int, + value: "Document | list[Document] | dict[str, Document] | DocumentValue", + ) -> None: + ... + + def __delitem__(self, key: str | int) -> None: + ... + + @classmethod + def from_shape(cls, shape: SerializableShape) -> "Document": + ... +``` + +### Shape Type + +`Document`s maintain awareness of their Smithy data type which is exposed as a +read-only `shape_type` property. This allows users to inspect the type of the +document's data in a static way that is more accurate than simply checking the +Python type of the underlying data. Python has only one type of `integer`, for +example, while Smithy has five. + +When constructed without an explicit shape type, a shape type will be guessed. +Python `integer`s will be assumed to be `LONG`s and Python `float`s will be +assumed to be `DOUBLE`s. + +`ShapeType` is an enum, so users can `match` on a document with exhaustiveness +checks. For example: + +```python +match document.shape_type: + case ShapeType.STRUCTURE | ShapeType.UNION: + print("In JSON, this would become a map") + case ShapeType.TIMESTAMP: + print("JSON has no native timestamp type") + case ShapeType.BLOB: + print("JSON has no native binary type") + case _: + pass +``` + +### Document Schemas + +`Document`s contain a private `Schema` from which their shape type is derived by +default. This schema is also used by default in serialization implementations. +While the `Document` class itself doesn't inspect anything but the shape type, +the schema is passed along to the `ShapeSerializer`, which may use the schema's +traits or defined members to influence serialization. This means that a +`Document` can fully capture any value that any structure, union, or other data +shape can capture and it will be serialized exactly as if it were that type. + +As a simple example, a `Document` that contains a timestamp may include a schema +that contains the `@timestampFormat` trait. This would be serialized in exactly +the same way as a modeled timestamp with the same trait. + +By default, the document's schema is a generic schema from the Smithy prelude, +based on the underlying type. A string shape takes on the base `String` for +example. Lists and maps by default use a schema with the shape type `DOCUMENT`. +Since it is not possible to determine whether a given `dict` is a map, +structure, or union, they are treated as maps by default. See +[structures and unions](#structures-and-unions) for more information on how +typed structure and union `Document`s may be created. + +It is also possible to pass in a schema when constructing a document by using +the `schema` keyword-only argument. + +### Accessors + +The `Document` type has accessors for each Python data type that is generated +from Smithy. This means, for example, that there is an `as_integer`, but not an +`as_short` or `as_big_decimal`. If it is necessary to distinguish between the +different integer or float sizes, the `shape_type` can be used. + +When any of the accessor methods are called, the underlying data is asserted to +be of the requested type before it is returned, raising an exception if it is +not. In some cases these types are coerced where they're compatible. For +example, attempting to access a `Decimal` from a document that contains a +`float` will return a `Decimal` representation of that float. + +The following example shows how accessors may be used: + +```python +# With no or limited prior knowledge, a match can be used to handle different +# possible types. +match document.shape_type: + case ShapeType.STRING: + print(document.as_string()) + case ShapeType.BLOB: + print(document.as_blob().decode("utf-8")) + case _: + pass + +# With concrete prior knowledge, the accessor can be called directly. +# If the underlying type doesn't match, an exception will be raised. +print(f"Total: {document.as_float()}") +``` + +Note that the values of lists and maps are `Document`s themselves. + +#### DocumentValue + +In addition to accessing Smithy data types, `Document`s will have a method to +return the document as a plain Python object without any context. These +`DocumentValue`s are similar to a JSON value, except that they include +`datetime.datetime`, `bytes`, and `Decimal` as possible types. `Document`s may +also be construced by providing a `DocumentValue`. Other implementations of the +`Document` type may not include this functionality, but for Python it is a huge +ease-of-use improvement for those that don't need the extra saftey and +functionality of the full `Document` type. + +Currently the base `Document` implementation lazily converts between the two +representations and caches the results. This is done for the sake of speed, but +has an unfortunate space cost. This behavior is not part of the API, however, +and MAY be changed without backwards compatibility implications. + +```python +>>> document = Document({"foo": "bar"}) +>>> document.shape_type +ShapeType.DOCUMENT +>>> document.as_value() +{"foo": "bar"} +>>> document["foo"] +Document(value="bar") +``` + +#### Structures and Unions + +Smithy structures and unions are generated into their own individual classes in +Python, and so it isn't possible to have a unique accessor for each of them. +Instead, `Document`s have a generic `as_shape` method that supports converting +the `Document` to any `DeserializableShape`. If the document does not +structurally match the target shape, an error will be raised. + +Similarly, a `from_shape` method allows for converting any `SerializableShape` +into a document. When a `Document` is constructed this way, it will retain the +schema of the shape it was constructed from. + +```python +@dataclass +class ExampleStruct: + foo: str + bar: str | None = None + + @classmethod + def deserialize(cls, deserializer: ShapeDeserializer) -> Self: + ... + +document = Document({"foo": "spam"}) +print(document.as_shape(ExampleStruct)) +# ExampleStruct(foo="spam") + +structure = ExampleStruct(foo="spam", bar="eggs") +print(Document.from_shape(structure)) +# Document( +# value={"foo": Document(value="spam"), "bar": Document(value="eggs")}, +# schema=EXAMPLE_STRUCT_SCHEMA, +# ) +``` + +These methods use the `serialize` and `deserialize` methods under the hood, +providing private `ShapeSerializer`s and `ShapeDeserializer`s to handle the +heavy lifting. + +### Container Methods + +`Document`s are often containers, and so to make `Document` usage as natural as +possible, most of Python's built-in +[container methods](https://docs.python.org/3/reference/datamodel.html#emulating-container-types) +will be supported. If the underlying type cannot support that method, an +exception will be raised. The following list specifies which container methods +specifically will be supported and what shape types they apply to. + +* `__len__` - Valid for lists, maps, structures, and unions. +* `__getitem__` - Valid for lists, maps, structures, and unions. +* `get` - Valid for maps, structures, and unions. This is equivalent to `dict`'s + `get` method which allows for a default value to be specified. +* `__setitem__` - Valid for lists, maps, and structures, and unions. Unions and + structures may only use keys representing members that are in their schema. + Unions may only replace the current value. +* `__delitem__` - Valid for lists, maps, and structures. Notably not valid for + unions as a union MUST contain exactly one member. +* `__iter__` - Valid for lists, maps, structures, and unions. +* `__contains__` - Valid for lists, maps, structures, and unions. + +Notably none of these methods are supported for strings or blobs, though some of +them could be. This is because the intent of implementing these methods is to +allow the `Document` to function as a container, and neither strings nor blobs +are really containers. There is also some potential for ambiguity if these were +to be implemented in the case of the underlying value being coercable into +another type, such as a base64 string that could be coerced into a blob. Since +the default behavior is to throw an exception, this can be changed at a later +date in a backwards-compatible way. + +### Serialization and Deserialization + +The major advantage of the `Document` type over just using `DocumentValue` +directly is that it integrates with schema-based serialization and +deserialization. Since `Document` implements `SerializableShape` and +`DeserializableShape` it can be directly used with any `ShapeSerializer` or +`ShapeDeserializer` without any additional effort. The same `Document` can be +serialized to and from XML, JSON, CBOR, etc. without any loss of information. + +#### Protocol-Specific Documents + +Some media types may not support all of the Smithy data model. JSON, for +example, does not have a native capability to store blobs. In current AWS +protocols that use JSON, blobs are stored as base64-encoded strings. The base +`Document` class is unaware of this, but a `JSONDocument` subclass can smooth +over the issue by attempting to coerce a string into a blob when `as_blob` +is called. + +```python +class JSONDocument(Document): + @override + def as_blob(self) -> bytes: + if self.shape_type is ShapeType.STRING: + return b64decode(self.as_string()) + return super().as_blob() +``` + +This same strategy can be used to support timestamps based on the +`@timestampFormat`, to support representing `Decimal`s as strings to avoid +precision loss, etc. These specialized `Document`s can also contain +configuration for these and any number of other features. + +Protocol-specific documents are also instrumental in fully supporting late-bound +types. + +#### Late-Bound Types + +`Document`s may also be used to implement a system of late-bound types. That is, +clients and servers may exchange `Document`s which represent particular modeled +shapes. AWS services, for example, may send structured events to trigger a +Lambda function. If the shape of that event is known, the `Document` may be +converted to that shape. + +If the specific shape is known ahead of time with external knowledge, then +`as_shape` can be used directly. But there are cases where the `Document` may +represent one of several shapes. For this reason, `Document`s have a +`discriminator` property that delclares what shape it represents. By default +this is the id of the document's schema, but that schema itself will be a +default schema from the prelude unless the document is specifically constructed +with one. + +To reliably communicate the discriminator, it needs to be embedded in the +document itself. How this is done is protocol-specific, and thus is the domain +of protocol-specific documents. A JSON-based protocol could, for example, +embed the shape id as the value of a top-level `__type` property. + +```json +{ + "__type": "smithy.example#ExampleStruct", + "foo": "spam", + "bar": "eggs" +} +``` + +A `TypeRegistry` would then be used to get the concrete class, which is then used +to convert the `Document` into the target class. + +```python +>>> registry.deserialize(document) +``` + +##### Type Registries + +A `TypeRegistry` is essentially a nestable mapping of `ShapeID` to +`DeserializableShape`. If a given shape ID can't be found within the local set +of known shapes, it delegates to a sub-registry. + +```python +class TypeRegistry: + def __init__( + self, + types: dict[ShapeID, DeserializableShape], + sub_registry: "TypeRegistry | None" = None + ): + self._types = types + self._sub_registry = sub_registry + + def get(self, shape: ShapeID) -> type[DeserializableShape]: + ... + + def deserialize(self, document: Document) -> DeserializableShape: + return document.as_shape(self.get(document.discriminator)) +``` + +It also provides a convenience method for deserializing documents that can be +used to further customize how documents are deserialized, such as if the +document's concrete value is nested alongside the discriminator, like below. + +```json +{ + "__type": "smithy.example#ExampleStruct", + "__value": { + "foo": "spam", + "bar": "eggs" + } +} +``` + +#### Botocore Interop + +It may not be obvious, `Document`'s serialization and deserialization methods +enable interoperability with botocore-style inputs and outputs without any +additional effort. This is because the keys of any underlying dicts will be +based on the member names from the schema, which is exactly what botocore does. + +```python +head_object_input_dict = { + "Bucket": "spam", + "Key": "eggs", +} +head_object_input = Document(head_object_input_dict).as_shape(HeadObjectInput) + +head_object_output = HeadObjectOutput( + delete_marker=False, + last_modified=datetime(2015, 1, 1), + [...] +) +head_object_output_dict = Document.from_shape(head_object_output).as_value() +# {"DeleteMarker": False, "LastModified": datetime(2015, 1, 1), ...} +``` + +This isn't 100% foolproof, however. There are several instances where botocore +special-cases the name of a key or the type of its value. In those cases, the +conversion would fail. + +There is also a performance tradeoff to be considered when using `Document`s for +this purpose. Having to construct an intermediate representation and only then +constructing the output type adds extra allocations and runtime in general. In +most cases this should not be significant, especially if done at a boundary +location. A more efficient solution could be implemented with direct +`ShapeSerializer`s and `ShapeDeserializer`s, which could also incorporate +botocore's special-cased members.