| 1 | +# Support structured data in TFX through `struct2tensor` and `DataView` |
| 2 | + |
| 3 | +Status | Proposed |
| 4 | +:------------ | :-------------------------------------------------------------- |
| 5 | +**Author(s)** | Zhuo Peng ([email protected]) |
| 6 | +**Sponsor** | Zhitao Li ([email protected]) |
| 7 | +**Updated** | 2021-03-05 |
| 8 | + |
| 9 | +## Objective |
| 10 | + |
| 11 | +This RFC proposes several additions to TFX in order to support building ML |
| 12 | +pipelines that process __structurally richer__ data that TFX does not have |
| 13 | +a priori knowledge of how to parse. Such knowledge is provided by the |
| 14 | +user, through __`struct2tensor`__ (showcased in this RFC) or other TensorFlow |
| 15 | +graphs, and made available to all TFX components through __Standardized TFX |
| 16 | +inputs__ and __`DataView`s__. |
| 17 | + |
| 18 | +### Background |
| 19 | + |
| 20 | +### `struct2tensor` |
| 21 | + |
| 22 | +[`struct2tensor`](https://github.com/google/struct2tensor) is a library to |
| 23 | +create TF graphs (a `struct2tensor` |
| 24 | +"[expression](https://github.com/google/struct2tensor/blob/master/g3doc/api_docs/python/s2t/Expression.md)") |
| 25 | +that parse serialized Protocol Buffers (protobuf) into a representation (a bag |
| 26 | +of TF (composite) Tensors) that preserves the protobuf structure (for example |
| 27 | +`tf.RaggedTensor`s and `tf.SparseTensor`s). It also allows manipulation of such |
| 28 | +structure. |
| 29 | + |
| 30 | +### Standardized TFX inputs |
| 31 | + |
| 32 | +The |
| 33 | +[Standardized TFX inputs RFC](https://github.com/1025KB/community/blob/875c04645f9029cb3c5d75bfdb8bf63e5560e9d9/rfcs/20191017-tfx-standardized-inputs.md) |
| 34 | +introduced a common in-memory data representation to TFX components and an I/O |
| 35 | +abstraction layer that produces the representation. The chosen representation, |
| 36 | +Apache Arrow, is powerful enough to represent protobuf-like structured data, or |
| 37 | +what the `tf.Tensor`, `tf.RaggedTensor`, or `tf.SparseTensor` logically |
| 38 | +represent. |
| 39 | + |
| 40 | +### Goal |
| 41 | + |
| 42 | +* Propose a `TFXIO` for `struct2tensor`. |
| 43 | + * Note that although designed for `struct2tensor`, this `TFXIO` only sees |
| 44 | + the TF Graph that `struct2tensor` builds, which means it can support other |
| 45 | + TF Graphs that decode string records into (composite) Tensors. |
| 46 | + |
| 47 | +* Propose the orchestration support needed by the proposed `TFXIO`. |
| 48 | + |
| 49 | +### Non Goal |
| 50 | + |
| 51 | +* Address how components / libraries can handle the new Tensor / Arrow types. |
| 52 | + For example, TF Transform needs to be able to accept `tf.RaggedTensor`s and |
| 53 | + output `tf.RaggedTensor`s. These need to be addressed separately in each |
| 54 | + component, perhaps by separate designs, if needed. |
| 55 | +* Address how TF Serving can serve a model that has a (composite) |
| 56 | + Tensor-based Predict signature, or any other signature that does not use |
| 57 | + `struct2tensor` to parse input protobufs. In this doc, it is |
| 58 | + assumed that the exported serving graph would take a |
| 59 | + dense 1-D Tensor of dtype `tf.string` |
| 60 | + whose values are serialized protobufs. |
| 61 | + - The reason why the above problem might be relevant to this design is |
| 62 | + that in certain use cases, it might be desirable to use a different |
| 63 | + format in serving than in training (e.g. using protobufs in |
| 64 | + training while using JSON in serving -- as long as |
| 65 | + they parse to the same (composite) |
| 66 | + tensors fed into the model graph). |
| 67 | + |
| 68 | + |
| 69 | +## Motivation |
| 70 | + |
| 71 | +TFX has historically assumed that `tf.Example` is the data payload format and |
| 72 | +it is the only format fully supported by all the components. `tf.Example` |
| 73 | +naturally represents flat data, while certain ML tasks need *structurally |
| 74 | +richer* logical representations. For example, in the list-wise ranking problem, |
| 75 | +one “example” input to the model consists of a list of documents to rank, and |
| 76 | +each document contains some features. [`tensorflow_ranking`](https://github.com/tensorflow/ranking) |
| 77 | +is a library that helps build such ranking models. Supporting |
| 78 | +`tensorflow_ranking` in TFX has been a frequently requested feature. |
| 79 | + |
| 80 | +<div align="center"> |
| 81 | +<img src='20210305-tfx-struct2tensor/tf_example_vs_elwc.png' width='700'> |
| 82 | +<p><i> |
| 83 | + left: flat data represented by tf.Examples<br> |
| 84 | + right: typical data for ranking problems -- each “example” contains |
| 85 | + several “candidates” |
| 86 | +</i></p> |
| 87 | +</div> |
| 88 | + |
| 89 | +While it’s possible to encode anything in `tf.Example`s, this approach poses |
| 90 | +challenges to any component that needs to understand the data (e.g. Data |
| 91 | +Validation and Model Validation), and would also lead to a bad user experience, |
| 92 | +as users are forced to devise hacks. |
| 93 | + |
| 94 | +It’s also possible to address the problem in a case-by-case fashion by making |
| 95 | +TFX support a standard “container format” for each category of problem. We have |
| 96 | +compared that with the generic solution based on `struct2tensor` in previous |
| 97 | +efforts and concluded that we do not want another first-class container |
| 98 | +format. |
| 99 | + |
| 100 | +Given that `struct2tensor` is able to decode an arbitrary protobuf (thus a good |
| 101 | +subset of all kinds of structured data) into a Tensor representation that |
| 102 | +preserves the structure (`tf.RaggedTensor`), we propose to |
| 103 | +solve the problem of supporting structured data in TFX through supporting |
| 104 | +`struct2tensor`. |
| 105 | + |
| 106 | +Thanks to Standardized TFX Inputs, a large portion of the solution is to create |
| 107 | +a `TFXIO` implementation for `struct2tensor`, and (as we will see later), the |
| 108 | +proper orchestration support needed for instantiating such a `TFXIO` in |
| 109 | +components. |
| 110 | + |
| 111 | +## Design Proposal |
| 112 | + |
| 113 | +### `GraphToTensorTFXIO` |
| 114 | + |
| 115 | +<div align="center"><img src='20210305-tfx-struct2tensor/graph_to_tensor_tfxio.png' width='700'></div> |
| 116 | + |
| 117 | +The diagram above shows how the proposed `GraphToTensorTFXIO` works: |
| 118 | + |
| 119 | +* (1) The “Proto storage” is any format that Apache Beam can read to |
| 120 | + produce a `PCollection[bytes]`. While the simplest example of such a format |
| 121 | + is TFRecord, it does not have to be a row-based format. The only requirement |
| 122 | + is that Beam can read it and produce `PCollection[bytes]`. |
| 123 | + |
| 124 | +* (2) The design relies on the fact that a `struct2tensor` query compiles to |
| 125 | + a TF graph that converts a string tensor (containing serialized protos) to a |
| 126 | + set of composite tensors, and thus can be stored in a file (SavedModel). |
| 127 | + |
| 128 | +* (3) For Beam-based components, `TFXIO` creates a PTransform that decodes |
| 129 | + the serialized records of protos to (batched) tensors using the saved TF |
| 130 | + graph, and then converts the tensors to Arrow RecordBatches. |
| 131 | + |
| 132 | +* (4) `TFXIO` will also create `TensorRepresentations` according to the output |
| 133 | + signature of the saved TF graph, so that the following is an identity: |
| 134 | + [PICTURE 3: figure not included] |
| 135 | + |
| 136 | +* (5) For TF trainers, `TFXIO` creates a `tf.data.Dataset` that: |
| 137 | + |
| 138 | + - reads the serialized records of protobufs as a string tensor |
| 139 | + - `.map()`s the string tensor to decode it into tensors using the saved |
| 140 | + `struct2tensor` query. |
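The data flow described in steps (2) to (4) can be pictured with a toy, standard-library-only sketch. It is not the proposed implementation: JSON stands in for serialized protobufs, plain Python lists stand in for tensors and Arrow columns, and `decode_batch` and `to_rows` are hypothetical names rather than `struct2tensor` or `TFXIO` APIs. What it tries to show is how a decoder can turn a batch of serialized records into a ragged column (flat values plus row splits) without losing the per-record structure.

```python
import json
from typing import Dict, List, Tuple

# A "ragged" column: flat values plus row_splits, similar in spirit to how a
# tf.RaggedTensor or an Arrow list array stores variable-length rows.
Ragged = Tuple[List[float], List[int]]


def decode_batch(records: List[bytes]) -> Dict[str, Ragged]:
    """Hypothetical stand-in for the saved decoding graph."""
    values: List[float] = []
    row_splits = [0]
    for record in records:
        documents = json.loads(record)["documents"]
        values.extend(doc["score"] for doc in documents)
        # Each record contributes one row; its end offset is recorded here.
        row_splits.append(len(values))
    return {"documents.score": (values, row_splits)}


def to_rows(column: Ragged) -> List[List[float]]:
    """Rebuilds the per-record lists from (values, row_splits)."""
    values, row_splits = column
    return [values[s:e] for s, e in zip(row_splits, row_splits[1:])]


# Three "examples" with 2, 0 and 1 candidate documents respectively.
records = [
    json.dumps({"documents": [{"score": 0.5}, {"score": 0.25}]}).encode(),
    json.dumps({"documents": []}).encode(),
    json.dumps({"documents": [{"score": 1.0}]}).encode(),
]
batch = decode_batch(records)
rows = to_rows(batch["documents.score"])
```

The empty second record survives as an empty row, which is the kind of structure a flat encoding would have to work around.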
| 141 | + |
| 142 | +### `struct2tensor` query as an artifact |
| 143 | + |
| 144 | +The saved `struct2tensor` query (a TF SavedModel) should be an artifact in its |
| 145 | +own right, rather than merely a property of the Examples artifact. It may be |
| 146 | +updated frequently (e.g. when new fields are added to the protobuf to be |
| 147 | +parsed), and updates affect most components that consume it, so it needs to |
| 148 | +become part of the provenance of any affected artifact. It may also be updated |
| 149 | +independently of the Examples artifact. A pipeline may use multiple |
| 150 | +`struct2tensor` queries, and the user may determine, for each component, which |
| 151 | +query to apply to the input Examples. |
| 152 | + |
| 153 | +To make it a proper artifact, the following orchestration changes are proposed: |
| 154 | + |
| 155 | + * A new artifact type, DataView |
| 156 | + * New properties in the Examples artifact |
| 157 | + * `container_format` (e.g. `FORMAT_TF_RECORD_GZIP`) |
| 158 | + * `payload_format` (e.g. `FORMAT_TF_EXAMPLE`, `FORMAT_PROTO`) |
| 159 | + * `data_view_uri` |
| 160 | + * `data_view_id` (the MLMD artifact id of DataView) |
| 161 | + * A new custom component, DataViewProvider, that takes the `module_file` |
| 162 | + (which contains the `struct2tensor` query) as an ExecutionProperty and |
| 163 | + no input Artifact, and outputs a DataView Artifact. |
| 164 | + * A new custom component, DataViewBinder, that takes Examples and DataView as |
| 165 | + input, and outputs Examples Artifacts that are identical to the input except |
| 166 | + that their `data_view_uri` and `data_view_id` properties are populated. |
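The binding step can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed component: `Artifact` and `bind_data_view` are hypothetical names, properties are modeled as a plain string dictionary, and the new artifact id is passed in by hand where a real system would get it from MLMD. The property names `data_view_uri` and `data_view_id` are the ones listed above.

```python
import dataclasses
from typing import Dict


@dataclasses.dataclass
class Artifact:
    """Hypothetical stand-in for an MLMD-backed TFX artifact."""
    artifact_id: int
    type_name: str
    uri: str
    properties: Dict[str, str] = dataclasses.field(default_factory=dict)


def bind_data_view(examples: Artifact, data_view: Artifact,
                   new_id: int) -> Artifact:
    """Returns a new Examples artifact that points at the given DataView.

    The output shares the input's URI (no data is copied); only the
    DataView-related properties differ.
    """
    bound_properties = dict(examples.properties)  # copy, do not mutate input
    bound_properties["data_view_uri"] = data_view.uri
    bound_properties["data_view_id"] = str(data_view.artifact_id)
    return dataclasses.replace(
        examples, artifact_id=new_id, properties=bound_properties)


examples = Artifact(1, "Examples", "/data/examples/1",
                    {"payload_format": "FORMAT_PROTO"})
data_view = Artifact(7, "DataView", "/data/views/7")
bound = bind_data_view(examples, data_view, new_id=2)
```

Returning a fresh artifact instead of mutating the input is what lets the same Examples be bound to several DataViews.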
| 167 | + |
| 168 | +With the proposed new properties in Examples artifact, some logic to determine |
| 169 | +which `TFXIO` implementation to use to read an Examples artifact is needed. Thus |
| 170 | +we also propose a util function that lives in TFX to create a `TFXIO` given an |
| 171 | +Examples artifact. |
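A minimal sketch of the selection logic such a utility might contain is shown below. The function name `select_tfxio_kind`, the returned labels, the fallback to `FORMAT_TF_EXAMPLE` when `payload_format` is missing, and the error for unbound non-`tf.Example` payloads are all assumptions made for illustration; the RFC does not specify them.

```python
from typing import Mapping


def select_tfxio_kind(properties: Mapping[str, str]) -> str:
    """Picks a TFXIO flavor from Examples artifact properties (hypothetical)."""
    # Assumption: artifacts without the new property hold tf.Examples.
    payload_format = properties.get("payload_format", "FORMAT_TF_EXAMPLE")
    data_view_uri = properties.get("data_view_uri", "")

    if data_view_uri:
        # A bound DataView means records are decoded by the saved TF graph.
        return "GraphToTensorTFXIO"
    if payload_format == "FORMAT_TF_EXAMPLE":
        return "tf.Example-based TFXIO"
    # Assumption: other payloads cannot be parsed without a DataView.
    raise ValueError(
        f"No TFXIO known for payload_format={payload_format!r} "
        "without a bound DataView")


bound_kind = select_tfxio_kind({
    "payload_format": "FORMAT_PROTO",
    "data_view_uri": "/data/views/7",
})
legacy_kind = select_tfxio_kind({})
```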
| 172 | + |
| 173 | +The topology of a pipeline may look like the right half of the following |
| 174 | +diagram: |
| 175 | + |
| 176 | +<div align="center"> |
| 177 | +<img src='20210305-tfx-struct2tensor/data_view_components.png' width='700'> |
| 178 | +<p><i> |
| 179 | + left: a tf.Example-based pipeline topology<br> |
| 180 | + right: proposed topology of a struct2tensor-based pipeline |
| 181 | +</i></p> |
| 182 | +</div> |
| 183 | + |
| 184 | + |
| 185 | +Note that: |
| 186 | + |
| 187 | +* The outputs of DataViewBinder are different instances of the Examples |
| 188 | + artifacts than the input ones. Thus MLMD will be able to record events that |
| 189 | + establish the lineage of the input and output. |
| 190 | + |
| 191 | +* This design allows multiple DataViews to be bound to the same data, yielding |
| 192 | + different bound Examples artifacts. |
| 193 | + |
| 194 | +* This design also allows components to take Examples without a bound DataView |
| 195 | + as input (this way TFDV will be able to analyze both adapted and unadapted |
| 196 | + data, and establish links between raw proto fields and transformed ones). |
| 197 | + |
| 198 | +### Garbage Collection of Artifacts |
| 199 | + |
| 200 | +In this section we discuss some of the constraints / requirements that this |
| 201 | +proposal imposes on the design of GC (at the time of writing this doc, there’s |
| 202 | +not a concrete plan yet). |
| 203 | + |
| 204 | +#### Artifacts sharing URIs -- GC for Examples Artifacts |
| 205 | + |
| 206 | +DataViewBinder outputs an Examples Artifact that shares its URI with its input. |
| 207 | +While MLMD allows this, the garbage collector must be aware, when deciding |
| 208 | +whether to delete a URI, that multiple Artifacts may share it: the URI can be |
| 209 | +deleted only if all the Artifacts referring to it are being GC’ed. |
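The constraint can be stated as a small check. The sketch below is only one possible way to express it, since no GC design exists yet; `deletable_uris` and the id-to-URI mapping are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def deletable_uris(artifact_uris: Dict[int, str],
                   collected_ids: Set[int]) -> Set[str]:
    """Returns URIs whose referring artifacts are ALL being collected."""
    referrers = defaultdict(set)
    for artifact_id, uri in artifact_uris.items():
        referrers[uri].add(artifact_id)
    # A URI is safe to delete only if no surviving artifact refers to it.
    return {uri for uri, ids in referrers.items() if ids <= collected_ids}


# Artifacts 1 (raw Examples) and 2 (bound Examples) share a URI.
artifact_uris = {
    1: "/data/examples/1",
    2: "/data/examples/1",
    3: "/data/examples/2",
}
partial = deletable_uris(artifact_uris, collected_ids={1, 3})
full = deletable_uris(artifact_uris, collected_ids={1, 2})
```

Collecting only artifact 1 leaves `/data/examples/1` in place because artifact 2 still refers to it.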
| 210 | + |
| 211 | +#### Artifacts referring to multiple URIs -- GC for DataView Artifacts |
| 212 | + |
| 213 | +Note that a component that consumes adapted data only needs to use the output |
| 214 | +Examples Artifact from DataViewBinder, which means that at execution time only |
| 215 | +the URI of the Examples Artifact will be “locked”. However, that Examples |
| 216 | +Artifact also refers to the URI of a DataView Artifact. The garbage collector |
| 217 | +needs to be aware of the existence of that URI and lock it as well. |
| 218 | + |
| 219 | +One way to add such support is to have an extension property in an Artifact, |
| 220 | +say `gc_context`, which could contain additional URIs. The DataView |
| 221 | +component could then set that property. |
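As an illustration of how a garbage collector might consume such a property, the sketch below assumes `gc_context` holds a JSON-encoded list of URIs. That encoding, and the helper name `uris_to_lock`, are assumptions; the proposal only says the property could contain additional URIs.

```python
import json
from typing import Mapping, Set


def uris_to_lock(uri: str, properties: Mapping[str, str]) -> Set[str]:
    """Returns the artifact's own URI plus any URIs listed in gc_context."""
    # Assumption: gc_context is a JSON list of URIs; absent means no extras.
    extra_uris = json.loads(properties.get("gc_context", "[]"))
    return {uri, *extra_uris}


bound_examples_properties = {
    "data_view_uri": "/data/views/7",
    "gc_context": json.dumps(["/data/views/7"]),
}
locked = uris_to_lock("/data/examples/1", bound_examples_properties)
unbound_locked = uris_to_lock("/data/examples/1", {})
```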