# TF-NLP Data Processing

## Code locations

Open-sourced data processing libraries:
[tensorflow_models/official/nlp/data/](https://github.com/tensorflow/models/tree/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data)

## Preprocess data offline vs. TFDS

Inside TF-NLP, there are two flexible ways to provide training data to the
input pipeline:

1.  Process and tokenize the data offline using Python scripts, Beam, or Flume.
2.  Read the text data directly from
    [TFDS](https://www.tensorflow.org/datasets/api_docs/python/tfds) and use
    [TF.Text](https://www.tensorflow.org/tutorials/tensorflow_text/intro) for
    tokenization and preprocessing inside the tf.data input pipeline.

### Preprocessing scripts

We have implemented data preprocessing for multiple datasets in the following
Python scripts:

* [create_pretraining_data.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/create_pretraining_data.py)

* [create_finetuning_data.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/create_finetuning_data.py)

The processed files, which contain serialized `tf.Example` protos, should then
be passed to the `input_path` argument of
[`DataConfig`](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/core/config_definitions.py#L28).
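
For example, a `DataConfig` pointing at the generated files could look roughly
like the sketch below; the file pattern is a placeholder, and
`global_batch_size`/`is_training` are the common `DataConfig` fields assumed
here.

```python
from official.core import config_definitions as cfg

# Minimal sketch: point `input_path` at the generated tf.Example files.
train_data = cfg.DataConfig(
    input_path='/path/to/train_data.tf_record*',  # placeholder file pattern
    global_batch_size=512,
    is_training=True)
```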

### TFDS usages

For convenience and consolidation, we built a common
[input_reader.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/core/input_reader.py)
library to standardize input reading, with built-in support for TFDS.
Setting the `tfds_name`, `tfds_data_dir`, and `tfds_split` arguments in
[`DataConfig`](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/core/config_definitions.py#L28)
lets the tf.data pipeline read from the corresponding dataset inside TFDS.
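
For instance, reading a TFDS dataset directly could look like the following
sketch; the dataset name and data directory are placeholders.

```python
from official.core import config_definitions as cfg

# Sketch: read raw text examples straight from TFDS instead of
# pre-generated tf.Example files.
train_data = cfg.DataConfig(
    tfds_name='glue/mnli',          # placeholder TFDS dataset name
    tfds_split='train',
    tfds_data_dir='/path/to/tfds',  # optional; placeholder directory
    global_batch_size=32,
    is_training=True)
```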

## DataLoaders

To manage multiple datasets and processing functions, we defined the
[DataLoader](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/data_loader.py)
class to work with the
[data loader factory](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/data_loader_factory.py).

Each dataloader defines the tf.data input pipeline inside the `load` method.

```python
@abc.abstractmethod
def load(
    self,
    input_context: Optional[tf.distribute.InputContext] = None
) -> tf.data.Dataset:
```
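
A concrete dataloader might look roughly like the sketch below. It registers a
hypothetical config class with the data loader factory and delegates file
reading to `input_reader.InputReader`. The config fields, feature names, and
the exact `InputReader` arguments are assumptions, so treat this as an outline
rather than library code.

```python
import dataclasses
from typing import Mapping, Optional

import tensorflow as tf
from official.core import config_definitions as cfg
from official.core import input_reader
from official.nlp.data import data_loader
from official.nlp.data import data_loader_factory


@dataclasses.dataclass
class MyDataConfig(cfg.DataConfig):
  """Hypothetical data config; `seq_length` is an assumed extra field."""
  seq_length: int = 128


@data_loader_factory.register_data_loader_cls(MyDataConfig)
class MyDataLoader(data_loader.DataLoader):
  """Parses pre-tokenized tf.Example records (feature names are assumed)."""

  def __init__(self, params: MyDataConfig):
    self._params = params
    self._seq_length = params.seq_length

  def _decode(self, record: tf.Tensor) -> Mapping[str, tf.Tensor]:
    """Decodes a serialized tf.Example into a dict of int32 tensors."""
    name_to_features = {
        'input_word_ids': tf.io.FixedLenFeature([self._seq_length], tf.int64),
        'input_mask': tf.io.FixedLenFeature([self._seq_length], tf.int64),
    }
    example = tf.io.parse_single_example(record, name_to_features)
    return {k: tf.cast(v, tf.int32) for k, v in example.items()}

  def load(
      self,
      input_context: Optional[tf.distribute.InputContext] = None
  ) -> tf.data.Dataset:
    """Builds the tf.data pipeline via the shared InputReader."""
    reader = input_reader.InputReader(
        params=self._params, decoder_fn=self._decode)
    return reader.read(input_context)
```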

Then, the `load` method is called inside each NLP task's `build_inputs` method,
and the trainer wraps it to create distributed datasets.

```python
def build_inputs(self, params, input_context=None):
  """Returns tf.data.Dataset for pretraining."""
  data_loader = YourDataLoader(params)
  return data_loader.load(input_context)
```

By default, in the example above, `params` is the `train_data` or
`validation_data` field of the `task` field of the experiment config, and it is
an instance of a `DataConfig` subclass.
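
Instead of hard-coding a loader class, the tasks in the repository typically
look the dataloader up through the data loader factory, keyed by the config's
type. A sketch of that pattern:

```python
from official.nlp.data import data_loader_factory

def build_inputs(self, params, input_context=None):
  """Looks up the dataloader registered for this DataConfig subclass."""
  loader = data_loader_factory.get_data_loader(params)
  return loader.load(input_context)
```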

It is important to note that, for TPU training, the entire `load` method runs
on the TPU workers, so the function must not access outside resources such as
the task attributes.

To work with raw text features, we need `DataLoader`s that handle the text data
with TF.Text. You can take the following dataloader as a reference (a short
tokenization sketch follows the list):

* [sentence_prediction_dataloader.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/sentence_prediction_dataloader.py)
  for BERT GLUE fine-tuning using TFDS with raw text features.
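
As a rough illustration of on-the-fly tokenization inside the tf.data pipeline,
the sketch below WordPiece-tokenizes raw text with TF.Text. The vocab path,
dataset, and feature names are placeholders, and this is not the exact logic in
the dataloader above.

```python
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_text as tf_text

# Placeholder vocab file; lower_case matches uncased BERT vocabularies.
tokenizer = tf_text.BertTokenizer('/path/to/vocab.txt', lower_case=True)

def tokenize(batch):
  # `batch['sentence']` has shape [batch_size]; tokenize() returns a
  # RaggedTensor [batch, words, wordpieces], which we flatten per example.
  token_ids = tokenizer.tokenize(batch['sentence']).merge_dims(-2, -1)
  return {'input_word_ids': token_ids, 'label_ids': batch['label']}

dataset = tfds.load('glue/sst2', split='train')
dataset = dataset.batch(32).map(
    tokenize, num_parallel_calls=tf.data.experimental.AUTOTUNE)
```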

## Speed up training using the tf.data service and dynamic sequence length on TPUs

With TF 2.x, we can enable some types of dynamic shapes on TPUs, thanks to the
TF 2.x programming model and the TPUStrategy/XLA work.

Depending on the data distribution, we see a 50% to 90% speed-up on typical
text data for BERT pretraining applications, relative to inputs padded to a
static shape.

To enable dynamic sequence lengths, we need to use the tf.data service for
global bucketing over sequence lengths. To enable it, you can simply add
`--enable_tf_data_service` when you start experiments.

To pair with the tf.data service, we need to use dataloaders that have the
bucketing function implemented. You can take the following dataloader as a
reference (a minimal bucketing sketch follows the list):

* [pretrain_dynamic_dataloader.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/pretrain_dynamic_dataloader.py)
  for BERT pretraining on the tokenized datasets.
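
The core of such a dataloader is a bucketing transform roughly like the sketch
below. The `input_mask` feature name, bucket boundaries, and batch sizes are
illustrative assumptions, not the values used in the dataloader above.

```python
import tensorflow as tf

def bucketize(dataset: tf.data.Dataset) -> tf.data.Dataset:
  """Sketch: batch examples of similar length together (illustrative values)."""

  def element_length_fn(example):
    # Number of non-padding tokens, based on an assumed 'input_mask' feature.
    return tf.reduce_sum(tf.cast(example['input_mask'], tf.int32))

  return dataset.apply(
      tf.data.experimental.bucket_by_sequence_length(
          element_length_fn,
          bucket_boundaries=[128, 256, 384],
          # One batch size per bucket (len(boundaries) + 1 entries).
          bucket_batch_sizes=[512, 256, 170, 128],
          drop_remainder=True))
```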