Commit f207efa: Updated internal links by external github links in read_custom_datasets.md (PiperOrigin-RevId: 638021937)

# Read Custom Datasets

## Overview

[TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord), a simple format for storing a sequence of binary records, is the default and recommended data format supported by TensorFlow Model Garden (TMG) for performance reasons. The [tf.train.Example](https://www.tensorflow.org/api_docs/python/tf/train/Example) message (or protobuf) is a flexible message type that represents a `{"string": value}` mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as [TFX](https://www.tensorflow.org/tfx/).

If your dataset is already encoded as `tf.train.Example` and in TFRecord format, please check the various [dataloaders](https://github.com/tensorflow/models/tree/master/official/vision/dataloaders/) we have created to handle standard input formats for classification, detection and segmentation. If the dataset is not in the recommended format, or not in a standard structure that can be handled by the provided [dataloaders](https://github.com/tensorflow/models/tree/master/official/vision/dataloaders), the following sections outline the steps to:

*   Encode the data using the [tf.train.Example](https://www.tensorflow.org/api_docs/python/tf/train/Example) message, and then serialize, write, and read [tf.train.Example](https://www.tensorflow.org/api_docs/python/tf/train/Example) messages to and from `.tfrecord` files.
*   Customize the dataloader to read, decode, and parse the input data.
## Convert the dataset into tf.train.Example and TFRecord

The primary reason for converting a dataset into TFRecord format in TensorFlow is to improve input data reading performance during training. Reading data from disk or over a network can be a bottleneck in the training process, and using the TFRecord format can help to streamline this process and improve overall training speed.

The TFRecord format is a binary format that stores data as serialized records. This makes it efficient for reading, as records can be streamed sequentially without expensive per-record parsing.

Additionally, the TFRecord format is designed to be scalable and efficient for large datasets. It can be split into multiple files and read from multiple threads in parallel, improving overall input pipeline performance.
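The parallel-read point above can be sketched with `tf.data.TFRecordDataset`, whose `num_parallel_reads` argument reads several shard files concurrently. This is a minimal, self-contained sketch; the file paths and record contents are illustrative, not part of any real dataset.

```python
import tensorflow as tf

# Write two tiny shards so the read path below has something to consume.
# (In practice these would be produced by your conversion pipeline.)
paths = ['/tmp/demo-00000-of-00002.tfrecord',
         '/tmp/demo-00001-of-00002.tfrecord']
for i, path in enumerate(paths):
    with tf.io.TFRecordWriter(path) as writer:
        writer.write(b'record-%d' % i)

# num_parallel_reads lets tf.data pull from several shard files
# concurrently instead of draining them one at a time.
dataset = tf.data.TFRecordDataset(paths, num_parallel_reads=tf.data.AUTOTUNE)

# Parallel reads interleave shards, so sort for a deterministic view.
records = sorted(r.numpy() for r in dataset)
print(records)  # [b'record-0', b'record-1']
```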
### Instructions

To convert a dataset into TFRecord format in TensorFlow, you need to:

*   first convert the data to TensorFlow's Feature format;
*   then create a feature message using `tf.train.Example`;
*   and lastly serialize the `tf.train.Example` message into a TFRecord file using `tf.io.TFRecordWriter`. The `tf.train.Example` holds the protobuf message (the data).

More concretely:

1.  **Convert your data to TensorFlow's Feature format using `tf.train.Feature`:**
<dl><dd>
A `tf.train.Feature` is a dictionary containing data types that can be serialized to a TFRecord format. The `tf.train.Feature` message type can accept one of the following three types:

*   tf.train.BytesList
*   tf.train.FloatList
*   tf.train.Int64List

Based on the type of values in the dataset, you must first convert them into one of the types above. Below are simple helper functions that perform the conversion and return a `tf.train.Feature` object. Refer to the helper builder [class](https://github.com/tensorflow/models/blob/ea1054c5885ad8b8ff847db02c010f8b51e25f5b/official/core/tf_example_builder.py#L100).

**tf.train.Int64List:** This type is used to represent a list of 64-bit integer values. Below is an example of how to put int data into an `Int64List`.

```python
def add_ints_feature(self, key: str,
                     value: Union[int, Sequence[int]]) -> TfExampleBuilder:
  ....
  return self.add_feature(key, tf.train.Feature(
      int64_list=tf.train.Int64List(value=_to_array(value))))
```

**tf.train.BytesList:** This type is used to represent a list of byte strings, which can be used to store arbitrary data as a string of bytes.

```python
def add_bytes_feature(self, key: str,
                      value: BytesValueType) -> TfExampleBuilder:
  ....
  return self.add_feature(key, tf.train.Feature(
      bytes_list=tf.train.BytesList(value=_to_bytes_array(value))))
```

**tf.train.FloatList:** This type is used to represent a list of floating-point values. Below is a conversion example.

```python
def add_floats_feature(self, key: str,
                       value: Union[float, Sequence[float]]) -> TfExampleBuilder:
  ....
  return self.add_feature(key, tf.train.Feature(
      float_list=tf.train.FloatList(value=_to_array(value))))
```

Note: The exact steps for converting your data to TensorFlow's Feature format will depend on the structure of your data. You may need to create multiple Feature objects for each record, depending on the number of features in your data.
</dd></dl>
2.  **Map the features using `tf.train.Example`:**
<dl><dd>

Fundamentally, a `tf.train.Example` is a `{"string": tf.train.Feature}` mapping. With the `tf.train.Feature` values from above, we can now map them in a `tf.train.Example`. The mapping from keys to features in a `tf.train.Example` varies based on the use case.

For example:

```python
feature = {
    'feature0': _int64_feature(feature0),
    'feature1': _int64_feature(feature1),
    'feature2': _bytes_feature(feature2),
    'feature3': _float_feature(feature3),
}
tf.train.Example(features=tf.train.Features(feature=feature))
```

Sample usage of the helper builder [class](https://github.com/tensorflow/models/blob/ea1054c5885ad8b8ff847db02c010f8b51e25f5b/official/core/tf_example_builder.py#L100):

```python
>>> example_builder = TfExampleBuilder()
>>> example = (
...     example_builder.add_bytes_feature('feature_a', 'foobarbaz')
...     .add_ints_feature('feature_b', [1, 2, 3])
...     .example)
```

</dd></dl>
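To see the mapping work end to end, the sketch below builds a `tf.train.Example` from a feature dictionary, serializes it, and parses it back with the protobuf `FromString` classmethod. Feature names and values are illustrative, not a required schema.

```python
import tensorflow as tf

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Illustrative feature names; use whatever schema your task needs.
example = tf.train.Example(features=tf.train.Features(feature={
    'feature0': _int64_feature(7),
    'feature2': _bytes_feature(b'foobarbaz'),
}))

serialized = example.SerializeToString()            # bytes on the wire
restored = tf.train.Example.FromString(serialized)  # protobuf round-trip
print(restored.features.feature['feature0'].int64_list.value[0])  # 7
```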
3.  **Serialize the data:**
<dl><dd>

To serialize the `tf.train.Example` message into a TFRecord file, use the TensorFlow APIs `tf.io.TFRecordWriter` and `SerializeToString()`. Here is some code that iterates over annotations, processes them, and writes them into TFRecords. Refer to the [code](https://github.com/tensorflow/models/blob/ea1054c5885ad8b8ff847db02c010f8b51e25f5b/official/vision/data/tfrecord_lib.py#L118).

```python
def write_tf_record_dataset(output_path, tf_example_iterator, num_shards):
  writers = [
      tf.io.TFRecordWriter(
          output_path + '-%05d-of-%05d.tfrecord' % (i, num_shards))
      for i in range(num_shards)
  ]
  ....

  for idx, record in enumerate(tf_example_iterator):
    if idx % LOG_EVERY == 0:
      logging.info('Processed %d examples', idx)  # periodic progress log
    tf_example = process_features(record)
    writers[idx % num_shards].write(tf_example.SerializeToString())
```

</dd></dl>
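The sharded write-then-read cycle above can be sketched as a runnable snippet. Output path, shard count, and the `id` feature are illustrative stand-ins, not part of the Model Garden API.

```python
import tensorflow as tf

num_shards = 2
output_path = '/tmp/custom_ds'  # illustrative location
shard_paths = [output_path + '-%05d-of-%05d.tfrecord' % (i, num_shards)
               for i in range(num_shards)]
writers = [tf.io.TFRecordWriter(p) for p in shard_paths]

# Round-robin ten tiny examples across the shards.
for idx in range(10):
    example = tf.train.Example(features=tf.train.Features(feature={
        'id': tf.train.Feature(int64_list=tf.train.Int64List(value=[idx])),
    }))
    writers[idx % num_shards].write(example.SerializeToString())
for w in writers:
    w.close()

# Read everything back to confirm nothing was lost.
count = sum(1 for _ in tf.data.TFRecordDataset(shard_paths))
print(count)  # 10
```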
### Example

Here is an [example](https://github.com/tensorflow/models/blob/master/official/vision/data/create_coco_tf_record.py) of how to create TFRecord files in TensorFlow. In this example, we convert the raw COCO dataset to TFRecord format. The resulting TFRecord files can then be used to train the model.
## Decoder

With a customized dataset in TFRecord, a customized [Decoder](https://github.com/tensorflow/models/blob/ea1054c5885ad8b8ff847db02c010f8b51e25f5b/official/vision/examples/starter/example_input.py#L30) is typically needed. The decoder decodes a TF Example record and returns a dictionary of decoded tensors. Below are some essential steps to customize a decoder.

### Instructions

To create a custom data loader for a new dataset, you need to follow the steps below:

*   **Create a subclass**

<dl><dd>
Create `class CustomizeDecoder(decoder.Decoder)`. The `CustomizeDecoder` class should be a subclass of the [generic decoder interface](https://github.com/tensorflow/models/blob/master/official/vision/dataloaders/decoder.py) and must implement all the abstract methods. In particular, it should implement the abstract method `decode`, which decodes the serialized example into tensors.

The constructor defines the mapping between the field names and the values of an input `tf.Example`. There is no limit on the number of fields to decode; it depends on the use case.

Below are the `tf.Example` decoders for classification and object detection tasks. Here we define two fields, for image bytes and labels, for the classification task, whereas object detection uses ten fields.
```python
class Decoder(decoder.Decoder):

  def __init__(self):
    self._keys_to_features = {
        'image/encoded':
            tf.io.FixedLenFeature((), tf.string, default_value=''),
        'image/class/label':
            tf.io.FixedLenFeature((), tf.int64, default_value=-1)
    }
    ....
```

Sample constructor for object detection:

```python
class Decoder(decoder.Decoder):

  def __init__(self):
    self._keys_to_features = {
        'image/encoded': tf.io.FixedLenFeature((), tf.string),
        'image/height': tf.io.FixedLenFeature((), tf.int64, -1),
        'image/width': tf.io.FixedLenFeature((), tf.int64, -1),
        'image/object/bbox/xmin': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/xmax': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/ymin': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/ymax': tf.io.VarLenFeature(tf.float32),
        'image/object/class/label': tf.io.VarLenFeature(tf.int64),
        'image/object/area': tf.io.VarLenFeature(tf.float32),
        'image/object/is_crowd': tf.io.VarLenFeature(tf.int64),
    }
    ....
```

</dd></dl>
*   **Abstract method implementation and return type**

<dl><dd>

The method `decode()` decodes the serialized example into tensors. It takes a serialized string tensor argument that encodes the data, and returns the decoded tensors, i.e. a dictionary mapping field key names to decoded tensors. The output will be consumed by methods in the Parser.

```python
class Decoder(decoder.Decoder):

  def __init__(self):
    ....

  def decode(self,
             serialized_example: tf.train.Example) -> Mapping[str, tf.Tensor]:
    return tf.io.parse_single_example(
        serialized_example, self._keys_to_features)
```

</dd></dl>
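The decode path can be exercised standalone. The sketch below uses the classification schema shown above but calls `tf.io.parse_single_example` directly, omitting the `decoder.Decoder` base class (which lives in the Model Garden repo) so the snippet is self-contained; the fake image bytes and label are illustrative.

```python
import tensorflow as tf

# Schema from the classification decoder above.
keys_to_features = {
    'image/encoded':
        tf.io.FixedLenFeature((), tf.string, default_value=''),
    'image/class/label':
        tf.io.FixedLenFeature((), tf.int64, default_value=-1),
}

# Build and serialize a fake record, then decode it back into tensors.
example = tf.train.Example(features=tf.train.Features(feature={
    'image/encoded': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b'jpeg-bytes'])),
    'image/class/label': tf.train.Feature(
        int64_list=tf.train.Int64List(value=[5])),
}))

decoded = tf.io.parse_single_example(
    example.SerializeToString(), keys_to_features)
print(int(decoded['image/class/label']))  # 5
```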
### Example

Creating a Decoder is an optional step and it varies with the use case. Below are some use cases where we have included the Decoder and Parser based on the requirements.

| Use case | Components provided |
| --- | --- |
| [Classification](https://github.com/tensorflow/models/blob/master/official/vision/dataloaders/classification_input.py) | Both Decoder and Parser |
| [Object Detection](https://github.com/tensorflow/models/blob/master/official/vision/dataloaders/tf_example_decoder.py) | Only Decoder |