# TF-NLP Data Processing

## Code locations

Open-sourced data processing libraries:
[tensorflow_models/official/nlp/data/](https://github.com/tensorflow/models/tree/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data)

## Preprocess data offline vs. TFDS

Inside TF-NLP, there are flexible ways to provide training data to the input
pipeline: 1) using Python scripts, Beam, or Flume to process and tokenize the
data offline; 2) reading the text data directly from
[TFDS](https://www.tensorflow.org/datasets/api_docs/python/tfds) and using
[TF.Text](https://www.tensorflow.org/tutorials/tensorflow_text/intro) for
tokenization and preprocessing inside the tf.data input pipeline.
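
For instance, option 2 keeps tokenization inside the pipeline. A minimal
sketch, assuming a hypothetical WordPiece vocab file and using the TFDS
`imdb_reviews` dataset purely as a stand-in:

```python
import tensorflow_datasets as tfds
import tensorflow_text as tf_text

# Hypothetical vocab path; any WordPiece vocab file works here.
tokenizer = tf_text.BertTokenizer('/path/to/vocab.txt', lower_case=True)

# `imdb_reviews` is only a stand-in for a raw-text TFDS dataset.
ds = tfds.load('imdb_reviews', split='train')
# Tokenization happens inside the tf.data pipeline, not offline.
ds = ds.map(lambda ex: tokenizer.tokenize(ex['text']))
```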

### Preprocessing scripts

We have implemented data preprocessing for multiple datasets in the following
Python scripts:

*   [create_pretraining_data.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/create_pretraining_data.py)

*   [create_finetuning_data.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/create_finetuning_data.py)

Then, the processed files containing `tf.Example` protos should be specified in
the `input_path` argument of
[`DataConfig`](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/core/config_definitions.py#L28).
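
For example, a minimal sketch of pointing a training `DataConfig` at the
produced files (the path and batch size below are hypothetical):

```python
from official.core import config_definitions as cfg

# Hypothetical values; input_path points at the tf.Example files written by
# the preprocessing scripts above.
train_data = cfg.DataConfig(
    input_path='/tmp/train_data.tf_record',
    global_batch_size=512,
    is_training=True)
```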

### TFDS usage

For convenience and consolidation, we built a common
[input_reader.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/core/input_reader.py)
library to standardize input reading, with a built-in path for TFDS. Setting
the `tfds_name`, `tfds_data_dir`, and `tfds_split` arguments in
[`DataConfig`](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/core/config_definitions.py#L28)
lets the tf.data pipeline read from the corresponding dataset inside TFDS.
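
As a sketch, the same config can point at a TFDS dataset instead of
preprocessed files (the dataset name `glue/mnli` is just an illustration):

```python
from official.core import config_definitions as cfg

train_data = cfg.DataConfig(
    tfds_name='glue/mnli',  # illustrative TFDS dataset name
    tfds_split='train',
    global_batch_size=512,
    is_training=True)
```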

## DataLoaders

To manage multiple datasets and processing functions, we defined the
[DataLoader](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/data_loader.py)
class to work with the
[data loader factory](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/data_loader_factory.py).

Each dataloader defines the tf.data input pipeline inside its `load` method:

```python
@abc.abstractmethod
def load(
    self,
    input_context: Optional[tf.distribute.InputContext] = None
) -> tf.data.Dataset:
  """Returns a tf.data.Dataset instance."""
```
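
As a sketch of how a concrete dataloader plugs into the factory (the config
class, feature name, and parsing logic below are hypothetical):

```python
import dataclasses
from typing import Optional

import tensorflow as tf
from official.core import config_definitions as cfg
from official.nlp.data import data_loader
from official.nlp.data import data_loader_factory


@dataclasses.dataclass
class MyDataConfig(cfg.DataConfig):
  """Hypothetical config; a distinct class keys the factory lookup."""
  seq_length: int = 128


@data_loader_factory.register_data_loader_cls(MyDataConfig)
class MyDataLoader(data_loader.DataLoader):
  """Hypothetical loader for tf.Example files with an 'input_ids' feature."""

  def __init__(self, params: MyDataConfig):
    self._params = params

  def load(self, input_context: Optional[tf.distribute.InputContext] = None):
    batch_size = self._params.global_batch_size
    if input_context:
      # Batch per replica when running under tf.distribute.
      batch_size = input_context.get_per_replica_batch_size(batch_size)
    dataset = tf.data.TFRecordDataset(
        tf.io.gfile.glob(self._params.input_path))

    def _parse(record):
      features = {
          'input_ids':
              tf.io.FixedLenFeature([self._params.seq_length], tf.int64)
      }
      return tf.io.parse_single_example(record, features)

    return dataset.map(_parse).batch(batch_size, drop_remainder=True)
```

A task can then look up the registered loader with
`data_loader_factory.get_data_loader(params)` instead of naming the class
directly.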

Then, the `load` method is called inside each NLP task's `build_inputs` method,
and the trainer wraps it to create distributed datasets:

```python
def build_inputs(self, params, input_context=None):
  """Returns tf.data.Dataset for pretraining."""
  data_loader = YourDataLoader(params)
  return data_loader.load(input_context)
```

By default, in the example above, `params` is the `train_data` or
`validation_data` field of the `task` field of the experiment config; `params`
is an instance of `DataConfig`.

It is important to note that, for TPU training, the entire `load` method runs
on the TPU workers, which requires that the function not access outside
resources, e.g. the task attributes.
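
Conceptually, the trainer's wrapping looks like the following sketch (`task`
and `params` are assumed from the surrounding training setup); `tf.distribute`
invokes the function once per input pipeline with an `InputContext`, which is
why `load` executes on the workers:

```python
import tensorflow as tf

strategy = tf.distribute.get_strategy()
# Each worker invokes the function with its own InputContext, so everything
# inside build_inputs/load executes on the workers.
train_dataset = strategy.distribute_datasets_from_function(
    lambda input_context: task.build_inputs(params, input_context))
```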

To work with raw text features, we need to use the `DataLoader`s that handle
text data with TF.Text. You can take the following dataloader as a reference:

*   [sentence_prediction_dataloader.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/sentence_prediction_dataloader.py)
    for BERT GLUE fine-tuning using TFDS with raw text features.

## Speed up training using the tf.data service and dynamic sequence length on TPUs

With TF 2.x, we can enable some types of dynamic shapes on TPUs, thanks to the
TF 2.x programming model and the TPUStrategy/XLA work.

Depending on the data distribution, we see 50% to 90% speedups on typical text
data for BERT pretraining applications relative to padded, static-shape inputs.

To enable dynamic sequence lengths, we need to use the tf.data service for
global bucketizing over sequences. To enable it, you can simply add
`--enable_tf_data_service` when you start experiments.
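
Conceptually, enabling the service hands the dataset to a shared dispatcher
via the tf.data service API; a minimal sketch with the raw API (the dispatcher
address below is hypothetical):

```python
import tensorflow as tf

# Hypothetical dispatcher address; the training environment provides the
# real one when the flag is set.
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode='parallel_epochs',
        service='grpc://tf-data-dispatcher:5050'))
```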

To pair with the tf.data service, we need to use dataloaders that have the
bucketizing function implemented. You can take the following dataloader as a
reference:

*   [pretrain_dynamic_dataloader.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/pretrain_dynamic_dataloader.py)
    for BERT pretraining on tokenized datasets.
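
The core idea in such a loader is length-based batching; a minimal sketch
using tf.data's built-in bucketing (the boundaries, batch sizes, and feature
name below are illustrative):

```python
import tensorflow as tf


def _element_length(example):
  # Assumes each example carries unpadded token ids; the feature name is
  # illustrative.
  return tf.size(example['input_word_ids'])

# One batch size per bucket: shorter sequences pack into larger batches.
dataset = dataset.bucket_by_sequence_length(
    element_length_func=_element_length,
    bucket_boundaries=[129, 257, 385],
    bucket_batch_sizes=[512, 256, 170, 128])
```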
