# Support structured data in TFX through `struct2tensor` and `DataView`

Status        | Proposed
:------------ | :--------------------------------------------------------------
**Author(s)** | Zhuo Peng ([email protected])
**Sponsor**   | Zhitao Li ([email protected])
**Updated**   | 2021-03-05

## Objective

This RFC proposes several additions to TFX in order to support building ML
pipelines that process __structurally richer__ data that TFX does not have
a priori knowledge about how to parse. Such knowledge is provided by the
user, through __`struct2tensor`__ (showcased in this RFC) or other TensorFlow
graphs, and made available to all TFX components through __Standardized TFX
inputs__ and __`DataView`s__.

### Background

#### `struct2tensor`

[`struct2tensor`](https://github.com/google/struct2tensor) is a library for
creating TF graphs (a `struct2tensor`
"[expression](https://github.com/google/struct2tensor/blob/master/g3doc/api_docs/python/s2t/Expression.md)")
that parse serialized Protocol Buffers (protobuf) into a representation (a bag
of TF (composite) Tensors, for example `tf.RaggedTensor`s and
`tf.SparseTensor`s) that preserves the protobuf structure. It also allows
manipulation of such structure.
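
For illustration, below is a minimal sketch of building and evaluating a
`struct2tensor` expression, assuming the API as documented in the
`struct2tensor` repository (`create_expression_from_proto`,
`Expression.project`, `calculate_prensors`). The proto module
(`my_proto_pb2.Session`) and the projected field paths are hypothetical
placeholders.

```python
import struct2tensor as s2t
import tensorflow as tf

# Hypothetical generated proto module; substitute the protobuf you actually
# parse.
from my_protos import my_proto_pb2


def parse_sessions(serialized: tf.Tensor):
  """Parses a 1-D string tensor of serialized Session protos into a prensor."""
  # Build an expression that describes how to parse the proto.
  expr = s2t.create_expression_from_proto(
      serialized, my_proto_pb2.Session.DESCRIPTOR)
  # Project the (hypothetical) paths of interest. Evaluating the projection
  # yields a "prensor", whose leaves are (composite) tensors, e.g.
  # tf.RaggedTensor, preserving the repeated/nested structure of the proto.
  projected = expr.project(["event.query", "event.action.doc_id"])
  [prensor] = s2t.calculate_prensors([projected])
  return prensor
```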

#### Standardized TFX inputs

The
[Standardized TFX inputs RFC](https://github.com/1025KB/community/blob/875c04645f9029cb3c5d75bfdb8bf63e5560e9d9/rfcs/20191017-tfx-standardized-inputs.md)
introduced a common in-memory data representation to TFX components and an I/O
abstraction layer that produces that representation. The chosen representation,
Apache Arrow, is powerful enough to represent protobuf-like structured data, or
what `tf.Tensor`, `tf.RaggedTensor`, and `tf.SparseTensor` logically represent.
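
For intuition, the sketch below shows the same ragged (list-valued) feature
held both as a `tf.RaggedTensor` and as an Arrow `list<string>` column in a
`RecordBatch`; the feature name and values are made up for illustration.

```python
import pyarrow as pa
import tensorflow as tf

# Three "examples", each with a variable number of values for the
# (illustrative) "query_tokens" feature.
ragged = tf.ragged.constant([["red", "shoes"], ["laptop"], []])

# The logically equivalent Arrow representation: a list<string> column.
record_batch = pa.RecordBatch.from_arrays(
    [pa.array([["red", "shoes"], ["laptop"], []], type=pa.list_(pa.string()))],
    names=["query_tokens"])

# Both carry the same nesting information: row lengths [2, 1, 0].
print(ragged.row_lengths())
print(record_batch.column(0))
```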

### Goal

* Propose a `TFXIO` for `struct2tensor`.
  * Note that although designed for `struct2tensor`, this `TFXIO` only sees
    the TF Graph that `struct2tensor` builds, which means it can support other
    TF Graphs that decode string records into (composite) Tensors.

* Propose the orchestration support needed by the proposed `TFXIO`.

### Non Goal

* Address how components / libraries can handle the new Tensor / Arrow types.
  For example, TF Transform needs to be able to accept `tf.RaggedTensor`s and
  output `tf.RaggedTensor`s. These need to be addressed separately in each
  component, perhaps by separate designs, if needed.
* Address how TF Serving can allow serving a model that has a (composite)
  Tensor-based Predict signature, or any other signature that does not use
  `struct2tensor` to parse input protobufs. In this doc, it is assumed that the
  exported serving graph would take a dense 1-D Tensor of dtype `tf.string`
  whose values are serialized protobufs.
  - The reason why the above problem might be relevant to this design is that
    in certain use cases, it might be desirable to use a different format in
    serving than in training (e.g. using protobufs in training while using
    JSON in serving, as long as both parse to the same (composite) tensors fed
    into the model graph).

## Motivation

TFX has historically assumed that `tf.Example` is the data payload format, and
it is the only format fully supported by all the components. `tf.Example`
naturally represents flat data, while certain ML tasks need *structurally
richer* logical representations. For example, in the list-wise ranking problem,
one “example” input to the model consists of a list of documents to rank, and
each document contains some features.
[`tensorflow_ranking`](https://github.com/tensorflow/ranking) is a library that
helps build such ranking models. Supporting `tensorflow_ranking` in TFX has
been a frequently requested feature.

<div align="center">
<img src='20210305-tfx-struct2tensor/tf_example_vs_elwc.png' width='700'>
<p><i>
left: flat data represented by tf.Examples<br>
right: typical data for ranking problems -- each “example” contains
several “candidates”
</i></p>
</div>

While it is possible to encode anything in `tf.Example`s, this approach poses
challenges to any component that needs to understand the data (e.g. Data
Validation and Model Validation), and would also lead to a bad user experience,
as users are forced to devise hacks.

It is also possible to address the problem in a case-by-case fashion by making
TFX support a standard “container format” for each category of problem. We have
compared that approach with the generic solution based on `struct2tensor` in
previous efforts and concluded that we do not want another first-class-citizen
container format.

Given that `struct2tensor` is able to decode an arbitrary protobuf (and thus a
good subset of all kinds of structured data) into a Tensor representation that
preserves the structure (`tf.RaggedTensor`), we propose to solve the problem of
supporting structured data in TFX by supporting `struct2tensor`.

Thanks to Standardized TFX Inputs, a large portion of the solution is to create
a `TFXIO` implementation for `struct2tensor` and (as we will see later) the
proper orchestration support needed for instantiating such a `TFXIO` in
components.

## Design Proposal

### `GraphToTensorTFXIO`

<div align="center"><img src='20210305-tfx-struct2tensor/graph_to_tensor_tfxio.png' width='700'></div>

The diagram above shows how the proposed `GraphToTensorTFXIO` works:

* (1) The “Proto storage” is a format that Apache Beam can read from and
  produce a `PCollection[bytes]`. While the most naive example of such a format
  is TFRecord, it does not have to be a row-based format. The only requirement
  is that Beam can read it and produce a `PCollection[bytes]`.

* (2) It relies on the fact that the `struct2tensor` query can be compiled to
  a TF graph that converts a string tensor (containing serialized protos) to a
  set of composite tensors, and thus can be stored in a file (a SavedModel); a
  minimal sketch appears after this list.

* (3) For Beam-based components, the `TFXIO` creates a PTransform that decodes
  the serialized records of protos to (batched) tensors using the saved TF
  graph, then converts the tensors to Arrow RecordBatches.

* (4) The `TFXIO` will also create `TensorRepresentations` according to the
  output signature of the saved TF graph, so that converting the decoded
  tensors to RecordBatches and back (via those `TensorRepresentations`) is an
  identity.

* (5) For TF trainers, the `TFXIO` creates a `tf.data.Dataset` (sketched below)
  that:

  - reads the serialized records of protobufs as a string tensor
  - `.map()`s the string tensor to decode it into tensors using the saved
    `struct2tensor` query.
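
The sketch below illustrates points (2) and (5) under stated assumptions: the
decoder is a toy TF function standing in for a compiled `struct2tensor` query,
and the file paths are placeholders.

```python
import tensorflow as tf


class Decoder(tf.Module):
  """A stand-in for the saved decoding graph.

  Any TF function mapping a dense 1-D string tensor of serialized records to
  (composite) tensors would do; a compiled `struct2tensor` query is one such
  function.
  """

  @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
  def decode(self, serialized):
    # Illustrative only: split each record into tokens, producing a
    # tf.RaggedTensor.
    return {"tokens": tf.strings.split(serialized)}


# (2) Store the decoding graph as a SavedModel so that every component (and
# the TFXIO) loads exactly the same graph.
tf.saved_model.save(Decoder(), "/tmp/saved_decoder")

# (5) A tf.data pipeline along the lines described above: read serialized
# records, batch them, and .map() the saved decoder over each batch.
decoder = tf.saved_model.load("/tmp/saved_decoder")
dataset = (
    tf.data.TFRecordDataset("/tmp/examples.tfrecord.gz",
                            compression_type="GZIP")
    .batch(64)
    .map(decoder.decode))
```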

### `struct2tensor` query as an artifact

We realize that the saved `struct2tensor` query (a TF SavedModel) should be an
artifact, rather than merely a property of the Examples artifact: it may be
updated frequently (e.g. new fields in the protobuf to be parsed can be added),
and updates will affect most components that consume it, so it needs to become
part of the provenance of the affected artifacts. It may also be updated
independently of the Examples artifact. A pipeline may use multiple
`struct2tensor` queries, and the user may determine, for each component, which
query to apply to the input Examples.

To make it a proper artifact, the following orchestration changes are proposed:

* A new artifact type, DataView.
* New properties in the Examples artifact:
  * `container_format` (e.g. `FORMAT_TF_RECORD_GZIP`)
  * `payload_format` (e.g. `FORMAT_TF_EXAMPLE`, `FORMAT_PROTO`)
  * `data_view_uri`
  * `data_view_id` (the MLMD artifact id of the DataView)
* A new custom component, DataViewProvider, that takes the module_file (which
  contains the `struct2tensor` query) as an ExecutionProperty and no input
  Artifact, and outputs a DataView Artifact.
* A new custom component, DataViewBinder, that takes Examples and a DataView as
  input, and outputs Examples Artifacts that are identical to the input except
  that their DataView-related properties (`data_view_uri`, `data_view_id`) are
  populated.

With the proposed new properties in the Examples artifact, some logic is needed
to determine which `TFXIO` implementation to use to read an Examples artifact.
Thus we also propose a utility function that lives in TFX to create a `TFXIO`
given an Examples artifact.
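
As an illustration only, such a utility might look roughly like the following.
`GraphToTensorTFXIO` is the TFXIO proposed in this doc, `TFExampleRecord`
stands for the existing `tf.Example`-based TFXIO, and both the way the Examples
properties are read and the constructor parameters shown are hypothetical.

```python
def make_tfxio(examples_artifact, file_pattern):
  # Hypothetical property accessors; the RFC proposes `payload_format` and
  # `data_view_uri` as properties of the Examples artifact.
  payload_format = examples_artifact.payload_format
  data_view_uri = examples_artifact.data_view_uri
  if payload_format == "FORMAT_PROTO" and data_view_uri:
    # The DataView URI points at the SavedModel holding the decoding graph.
    return GraphToTensorTFXIO(
        file_pattern=file_pattern, saved_decoder_path=data_view_uri)
  # Otherwise fall back to the tf.Example-based TFXIO.
  return TFExampleRecord(file_pattern=file_pattern)
```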

The topology of a pipeline may look like the right half of the following
diagram:

<div align="center">
<img src='20210305-tfx-struct2tensor/data_view_components.png' width='700'>
<p><i>
left: a tf.Example-based pipeline topology<br>
right: proposed topology of a struct2tensor-based pipeline
</i></p>
</div>
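
In TFX DSL terms, the right-hand topology might be wired roughly as follows.
The constructor and argument names of `DataViewProvider` and `DataViewBinder`
are illustrative, since their exact interfaces are part of this proposal rather
than an existing API; paths are placeholders.

```python
# Raw serialized protos stored as TFRecords (path is a placeholder).
example_gen = ImportExampleGen(input_base="/data/serialized_protos")

# Produces a DataView artifact from a module file that contains the
# struct2tensor query (path is a placeholder).
data_view_provider = DataViewProvider(
    module_file="/pipeline/struct2tensor_query_module.py")

# Binds the DataView to the Examples, producing new Examples artifacts whose
# DataView-related properties are populated.
data_view_binder = DataViewBinder(
    examples=example_gen.outputs["examples"],
    data_view=data_view_provider.outputs["data_view"])

# Downstream components consume the bound Examples.
statistics_gen = StatisticsGen(
    examples=data_view_binder.outputs["output_examples"])
```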

Note that:

* The outputs of DataViewBinder are different Examples artifact instances from
  the input ones. Thus MLMD will be able to record events that establish the
  lineage between the input and the output.

* This design allows multiple DataViews to be bound to the same data, yielding
  different bound Examples artifacts.

* This design also allows components to take Examples without a bound DataView
  as input (this way TFDV will be able to analyze both adapted and unadapted
  data, and establish links between raw proto fields and transformed ones).

### Garbage Collection of Artifacts

In this section we discuss some of the constraints / requirements that this
proposal imposes on the design of GC (at the time of writing this doc, there is
no concrete plan yet).

#### Artifacts sharing URIs -- GC for Examples Artifacts

DataViewBinder outputs an Examples Artifact that shares its URI with its input.
While MLMD allows this, the garbage collector must be aware, when deciding
whether to delete a URI, that multiple Artifacts may share it; the URI can be
deleted only if all of the referring Artifacts are being GC’ed.

#### Artifacts referring to multiple URIs -- GC for DataView Artifacts

Note that a component that consumes adapted data only needs to use the output
Examples Artifact from DataViewBinder, which means that at execution time only
the URI of the Examples Artifact will be “locked”. However, that Examples
Artifact also refers to the URI of a DataView Artifact. The garbage collector
needs to be aware of the existence of that URI and also lock it appropriately.

One way to add such support is to have an extension property in an Artifact,
say `gc_context`, which could contain additional URIs. Then the DataView
component would be able to set that property.