# Frequently Asked Questions


<!-- ## Scope of the document

Scope of this document is to cover FAQs related to TensorFlow-Models-NLP. -->

## Introduction

The goal of this document is to capture Frequently Asked Questions (FAQs)
related to TensorFlow-Models-NLP (TF-NLP). The source of these questions is
limited to external resources (GitHub, StackOverflow, Google Groups, etc.).

## FAQs of TF-NLP

--------------------------------------------------------------------------------

**Q1: How do I cite TF-NLP when its libraries are used in external research code
bases?**

If you use the TensorFlow Model Garden in your research GitHub repositories,
please cite this repository in your publication. The citation is at the following
[location](https://github.com/tensorflow/models#citing-tensorflow-model-garden).

--------------------------------------------------------------------------------

**Q2: How do I load NLP pretrained models?**

*   [**How to initialize from a checkpoint:**](https://github.com/tensorflow/models/blob/master/official/nlp/docs/pretrained_models.md#how-to-load-pretrained-models)
    If you use the TF-NLP training library, you can specify the checkpoint path
    directly when launching your job. For example, follow the BERT
    [fine-tuning command](https://github.com/tensorflow/models/blob/master/official/nlp/docs/train.md#fine-tuning-squad-with-a-pre-trained-bert-checkpoint)
    to initialize the model from the checkpoint specified by \
    `--params_override=task.init_checkpoint=PATH_TO_INIT_CKPT`

*   [**How to load a TF-Hub SavedModel:**](https://github.com/tensorflow/models/blob/master/official/nlp/docs/pretrained_models.md#how-to-load-tf-hub-savedmodel)
    TF-NLP's fine-tuning tasks, such as question answering (SQuAD) and sentence
    prediction (GLUE), support loading a model from TF-Hub. These built-in tasks
    expose a `task.hub_module_url` parameter. To set this parameter, follow the
    BERT
    [fine-tuning command](https://github.com/tensorflow/models/blob/master/official/nlp/docs/train.md#fine-tuning-sentence-classification-with-bert-from-tf-hub),
    and replace `--params_override=task.init_checkpoint=...` with \
    `--params_override=task.hub_module_url=TF_HUB_URL`.

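If you only need the pretrained encoder outside the TF-NLP training driver, the
same TF-Hub SavedModel can also be loaded directly in Python. A minimal sketch
(the hub handle below is illustrative; substitute the model you need):

```
import tensorflow_hub as hub

# Illustrative TF-Hub handle; replace it with the encoder you want to load.
hub_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
encoder = hub.KerasLayer(hub_url, trainable=True)
```
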
--------------------------------------------------------------------------------

**Q3: How do I change the pretraining loss function for BERT?**

You can change the pretraining loss function in the
[code](https://github.com/tensorflow/models/blob/d93c7e932de27522b2fa3b115f58d06d6f640537/official/nlp/tasks/masked_lm.py#L76)
here.

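For orientation, the default masked-LM objective is essentially a weighted
sparse categorical cross-entropy over the masked positions, and a replacement
loss should follow the same shape contract. A hypothetical sketch (the function
and argument names are illustrative, not the exact signature in `masked_lm.py`):

```
import tensorflow as tf


def custom_mlm_loss(lm_labels, lm_logits, lm_weights):
  """Illustrative replacement: label-smoothed cross-entropy on masked positions."""
  num_classes = tf.shape(lm_logits)[-1]
  one_hot_labels = tf.one_hot(lm_labels, depth=num_classes, dtype=lm_logits.dtype)
  per_position_loss = tf.keras.losses.categorical_crossentropy(
      one_hot_labels, lm_logits, from_logits=True, label_smoothing=0.1)
  # Average only over the positions that were actually masked (weight == 1).
  weights = tf.cast(lm_weights, per_position_loss.dtype)
  return tf.reduce_sum(per_position_loss * weights) / (
      tf.reduce_sum(weights) + 1e-6)
```
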
--------------------------------------------------------------------------------

**Q4: The
[transformer code](https://github.com/tensorflow/models/blob/d93c7e932de27522b2fa3b115f58d06d6f640537/official/nlp/modeling/models/seq2seq_transformer.py#L31)
extends `keras.Model`. Can I use constructs like `model.fit()` for training, as I
would for any TF2/Keras model? Are there tutorials or starting points for setting
up the training and evaluation of a transformer model using TF-NLP?**

The native Keras `fit()` and `predict()` methods do not work for the seq2seq
transformer model. TF Model Garden instead uses the workflow defined
[here](https://github.com/tensorflow/models/blob/d93c7e932de27522b2fa3b115f58d06d6f640537/official/nlp/docs/train.md#model-garden-nlp-common-training-driver).
This
[code](https://github.com/tensorflow/models/blob/91d543a1a976e513822f03e63cf7e7d2dc0d92e1/official/nlp/tasks/translation.py)
defines the translation task.

--------------------------------------------------------------------------------

**Q5: Is there an easy way to set up a model server from a checkpoint (as
opposed to an exported SavedModel)?**

A model server requires a SavedModel. If you just want to inspect the outputs,
this [colab](https://www.tensorflow.org/tfmodels/nlp/customize_encoder) can
help.

--------------------------------------------------------------------------------

**Q6: Training with a global batch size of 4096 and a local batch size of 128 on
4x4 TPUs is very slow. Will the quality change if I move to 8x8 TPUs with the
local batch size fixed at 128 and a global batch size of 16384?**

These settings are part of the experiment configuration, which can be overridden
through the command line with the `--params_override`
[flag](https://github.com/tensorflow/models/blob/master/official/nlp/docs/train.md#overriding-configuration-via-yaml-and-flags).
It only supports scalar values. The
[implementation](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/modeling/hyperparams/params_dict.py#L339)
is here. For the quality question, see Q7 below.

--------------------------------------------------------------------------------

**Q7: Training with a global batch size of 4096 and a local batch size of 128 on
4x4 TPUs is very slow. Will the quality change if I move to 8x8 TPUs with the
local batch size fixed at 128 and a global batch size of 16384?**

The global batch size is the key factor. As you increase the batch size, you may
need to retune the learning rate to match the quality of the smaller batch size.
If the task is retrieval, we recommend using the global softmax; an example can
be found
[here](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/modeling/tf_utils.py#L225).

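One common starting point for the retuning (not specific to TF-NLP) is the
linear scaling heuristic: scale the learning rate by the same factor as the
global batch size, then verify empirically. A small illustration with example
numbers:

```
# Linear learning-rate scaling heuristic; always verify empirically for your task.
base_lr = 1e-4          # learning rate tuned at the original global batch size
base_batch_size = 4096
new_batch_size = 16384  # e.g. after moving from a 4x4 to an 8x8 slice
scaled_lr = base_lr * new_batch_size / base_batch_size
print(scaled_lr)        # 4e-4
```
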
--------------------------------------------------------------------------------

**Q8: In some TF-NLP
[examples](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/tasks/question_answering.py#L15),
the model output logits are cast to float32. Aren't the logits already a float
type?**

With mixed precision training, the activations inside the model can be in
bfloat16/float16. The model output logits are cast to float32 to make sure the
softmax and losses are calculated in float32. This avoids numeric issues that
may occur if the intermediate tensor flowing from the softmax to the loss is
float16 or bfloat16. You can also refer to the
[mixed precision guide](https://www.tensorflow.org/guide/mixed_precision#building_the_model)
for more information.

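As a small, self-contained illustration of the pattern (not taken from the
TF-NLP task code): under a mixed-precision policy the layer computes in
bfloat16, and the logits are cast back to float32 before the softmax/loss.

```
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

dense = tf.keras.layers.Dense(10)      # computes in bfloat16 under the policy
features = tf.random.uniform([8, 128])
labels = tf.random.uniform([8], maxval=10, dtype=tf.int32)

logits = dense(features)               # dtype: bfloat16
logits = tf.cast(logits, tf.float32)   # cast back before softmax/loss
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
```
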
--------------------------------------------------------------------------------

**Q9: Is it possible to use gradient clipping in the optimizer used with the BERT
encoder? If yes, is there an example of its usage?**

Yes. The AdamW optimizer has a `gradient_clip_norm` argument, and the newer
[Keras optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer)
also offer `global_clipnorm`, `clipnorm` and `clipvalue` as kwargs.

Please refer to the
[example](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/configs/experiments/glue_mnli_matched.yaml#L23)
below:

```
optimizer:
  adamw:
    beta_1: 0.9
    beta_2: 0.999
    weight_decay_rate: 0.05
    gradient_clip_norm: 0.0
  type: adamw
```

The legacy AdamW
[implementation](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/modeling/optimization/legacy_adamw.py#L78)
used for the BERT paper is here, with an example configuration
[[ref]](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/projects/detr/configs/detr.py#L88).

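With the newer Keras optimizer API, equivalent clipping can be set directly on
the optimizer. A short sketch with illustrative values (the `AdamW` class and
the clipping kwargs are available in recent TensorFlow releases):

```
import tensorflow as tf

# Values are illustrative; `global_clipnorm` can be swapped for `clipnorm` or
# `clipvalue` depending on the kind of clipping you want.
optimizer = tf.keras.optimizers.AdamW(
    learning_rate=3e-5,
    weight_decay=0.05,
    beta_1=0.9,
    beta_2=0.999,
    global_clipnorm=1.0,
)
```
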
--------------------------------------------------------------------------------

**Q10: I am trying to create an embedding table with 4.7 million rows and 512
dimensions. However, `nlp.modeling.layers.OnDeviceEmbedding` fails with the
following error: `UnknownError: Attempting to allocate 4.54G. That was not
possible. There are 2.94G free.` \
Is there a way to increase this capacity, or are there alternatives to
`OnDeviceEmbedding` that work in the same framework?**

An embedding table with 4.7 million rows and 512 dimensions is very big, and it
is placed on the TPU TensorCore. The tips below might help (see the rough memory
arithmetic after the list):

*   Try to reduce the number of rows.
*   Consider
    [mixed_precision_dtype](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/core/config_definitions.py#L147):
    'bfloat16' training to reduce memory cost.

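A rough back-of-the-envelope calculation (assuming 4 bytes per float32 element
and 2 bytes per bfloat16 element) shows why the table strains HBM and why a
lower-precision dtype helps:

```
rows, dim = 4_700_000, 512
float32_gib = rows * dim * 4 / 2**30   # ~8.96 GiB for a float32 table
bfloat16_gib = rows * dim * 2 / 2**30  # ~4.48 GiB in bfloat16
print(float32_gib, bfloat16_gib)
```
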
--------------------------------------------------------------------------------

**Q11: What is the difference between `seq_length` in glue_mnli_matched.yaml and
`max_position_embeddings` in bert_en_uncased_base.yaml? Why are they not the
same?**

`seq_length` is the padded input length and `max_position_embeddings` is the
size of the learned position embeddings. `seq_length` must always be less than
or equal to `max_position_embeddings`.

--------------------------------------------------------------------------------

**Q12: While running a model with the TF-NLP framework, I noticed that when the
number of validation steps is increased (even by 10), the experiments get much
slower. Is that expected?**

This is not expected for 10 validation steps. Recommended tips:

*   Increase the validation interval.
*   Use `--add_eval` to start a side-car job for evaluation.
*   Collect an xprof profile for the eval job. It is known that TF2 eager
    execution is slow.

--------------------------------------------------------------------------------

**Q13: How do I load checkpoints for the BERT model? Any recommendations on how
to deal with the variable-mismatch error?**

We recommend using `tf.train.Checkpoint` and managing the objects (including
inner layers) directly. The details on restoring the encoder weights can be
found
[here](https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert#restore_the_encoder_weights),
and the TF-NLP checkpoint tutorial is
[here](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/docs/pretrained_models.md#pre-trained-models).

The variable-mismatch error occurs because the classifier_model is not the same
as the threephil model. The recommendation is to read the checkpoint with the
same code and class as the threephil model; a Keras functional model cannot
guarantee that the Python objects match if the model-creation code is different.
More to read: https://www.tensorflow.org/guide/checkpoint

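For reference, a minimal sketch of restoring encoder weights with
`tf.train.Checkpoint`, assuming the `tensorflow_models` (tfm) pip package; the
sizes and path are illustrative, and the encoder must be built with the same
class and configuration that produced the checkpoint:

```
import tensorflow as tf
import tensorflow_models as tfm

# Build the encoder exactly as it was built for pretraining (sizes illustrative).
encoder = tfm.nlp.networks.BertEncoder(vocab_size=30522, num_layers=12)

checkpoint = tf.train.Checkpoint(encoder=encoder)
# Use .assert_consumed() instead of .expect_partial() to fail on any mismatch.
checkpoint.read("/path/to/pretrained/bert_model.ckpt").expect_partial()
```
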
--------------------------------------------------------------------------------

**Q14: Why does saving a Bert2Bert model instance fail without the label input
(i.e., target_ids)?**

Bert2Bert needs input_ids, input_mask, segment_ids and target_ids to train, so
you should save the model with all of these features provided.

If you only care about inference and there are no target_ids, you should not use
Keras `model.save()`; Keras does not support `None` as inputs. Instead, directly
define a `tf.Module` that wraps the Bert2Bert core model and save the
`tf.function` with the `tf.saved_model.save()` API. Refer to this
[example](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/serving/serving_modules.py#L414)
for the translation task. In general, seq2seq models are not friendly to Keras
assumptions.

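The export pattern looks roughly like the sketch below. The wrapper class, the
`decode` entry point, and the way the core model is called are illustrative
placeholders; the real serving module linked above builds its own features and
decoding call.

```
import tensorflow as tf


class Bert2BertInference(tf.Module):
  """Hypothetical inference-only wrapper: no target_ids in the signature."""

  def __init__(self, model):
    super().__init__()
    self._model = model

  @tf.function(input_signature=[
      tf.TensorSpec([None, None], tf.int32, name="input_ids"),
      tf.TensorSpec([None, None], tf.int32, name="input_mask"),
      tf.TensorSpec([None, None], tf.int32, name="segment_ids"),
  ])
  def decode(self, input_ids, input_mask, segment_ids):
    # Placeholder for the core model's inference/decoding entry point.
    return self._model.decode(input_ids, input_mask, segment_ids)


def export_for_inference(bert2bert_model, export_dir):
  module = Bert2BertInference(bert2bert_model)
  tf.saved_model.save(
      module, export_dir, signatures={"serving_default": module.decode})
```
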
--------------------------------------------------------------------------------

**Q15: How do I fix the TPU inference error with the Transformer?**

A likely cause of the error is having multiple inputs where the batch size of
one of them differs from the rest.

Some troubleshooting tips:

*   Resolve the batching issue by implementing signature batching.
*   Address the dynamic-dimension problem by setting `max_batch_size` and
    `allowed_batch_sizes` to 1.

--------------------------------------------------------------------------------

**Q16: Are there any models/methods that can improve the latency of the
feed-forward neural network portion of the transformer encoder block (on CPU and
GPU)?**

There are sparse-mixture and conditional-computation blocks that can speed this
up. The block-sparse feedforward
[layer](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/layers/block_diag_feedforward.py#L15)
might be promising for performance purposes. It works nicely on CPU and GPU
because the reshaping ops in this layer are essentially free there, and it
offers a speed-up for models of similar size (a caveat is that we observed some
quality drop with block-sparse feedforward in the past).

Refer to the
[Sparse Mixer encoder](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/networks/sparse_mixer.py)
network and the
[FNet encoder](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/networks/fnet.py)
network for more sparse-mixture references.

Conditional computation is a model architecture in which specific sections of
the computational graph are activated based on input conditions. Models
following this paradigm are efficient, especially when model capacity is
increased or inference latency must be reduced.

Refer to the
[ExpandCondense tensor network layer](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/layers/tn_expand_condense.py)
and the
[Gated linear feedforward layer](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/layers/gated_feedforward.py)
for FFN blocks. The techniques mentioned above work particularly well with long
sequence lengths.

Additional notes, depending on your specific use case:

*   For small student models, we used only one expert and route far fewer tokens
    to the FFN expert.
*   Set `routing_group_size` so that each routing combines the tokens of
    multiple sequences and selects, for example, 1/4 of the tokens.
*   This works well for distillation or when the model can be pretrained. There
    will be a quality gap, because many tokens skip the FFN computation.

--------------------------------------------------------------------------------

**Q17: How do I obtain final-layer embeddings from a model? Is there an
example?**

Refer to the `call`
[method](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/networks/bert_encoder.py#L280)
of the Transformer-based BERT encoder network. The `sequence_output` is the
last-layer embeddings, with shape `[batch_size, seq_length, hidden_size]`.

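A small runnable sketch of pulling `sequence_output` out of the encoder,
assuming the `tensorflow_models` (tfm) pip package; the tiny configuration and
inputs are illustrative:

```
import tensorflow as tf
import tensorflow_models as tfm

# A deliberately tiny encoder; real configurations come from the experiment yaml.
encoder = tfm.nlp.networks.BertEncoder(
    vocab_size=100, hidden_size=32, num_layers=2, num_attention_heads=2,
    max_sequence_length=16)

inputs = dict(
    input_word_ids=tf.constant([[5, 6, 7, 0, 0]], dtype=tf.int32),
    input_mask=tf.constant([[1, 1, 1, 0, 0]], dtype=tf.int32),
    input_type_ids=tf.zeros([1, 5], dtype=tf.int32),
)
outputs = encoder(inputs)
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, hidden_size]
```
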
--------------------------------------------------------------------------------

**Q18: Is it possible to convert public TF-Hub models like
[sentence-t5](https://tfhub.dev/google/collections/sentence-t5/1) for TPU use?**

The Inference Converter V2 can be used for this: it deploys the user-provided
function(s) on the XLA device (TPU or XLA GPU) and optimizes them.

--------------------------------------------------------------------------------

**Q19: Is it possible to have a dynamic batch size for `edit5` models using the
[sampling modules](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/ops/sampling_module.py)?**

This depends on the
[decoding algorithm](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/ops/decoding_module.py#L136).
For `beam_search`, the sampler must allocate a `[batch_size, beam_size, ...]`
buffer at initialization time, so the batch size is fixed and a dynamic batch
size is not easily achievable.

The AutoMUM distillation setup uses the same
[sampling module](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/ops/sampling_module.py),
which likewise makes the batch size static.

For greedy decoding it is possibly doable, since greedy decoding does not
require the `beam_size` buffer.

--------------------------------------------------------------------------------

**Q20: Is multi-label tagging distillation supported by text tagging
distillation?**

Currently the template only performs basic per-token binary classification. If
you intend to perform multi-label classification for each token, it shouldn't be
overly challenging: it mainly involves adjusting the number of classes and
switching to a multi-label loss.

--------------------------------------------------------------------------------

**Q21: The TFM
[BERT](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/networks/bert_encoder.py#L132)
intentionally uses an `OnDeviceEmbedding`. Is it possible to add an option for a
CPU-forced embedding table, i.e. placing the embeddings of transformer models on
the CPU to save HBM memory?**

For this optimization, users can simply place the word embeddings on the CPU.
Using the `input_word_embeddings` path in the
[BertEncoderV2](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/networks/bert_encoder.py#L238)
class is sufficient for optimizing HBM usage during serving.

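A rough sketch of the idea: pin the embedding lookup to the host and feed the
resulting dense embeddings to the encoder through its `input_word_embeddings`
path. The names and sizes below are illustrative; the exact input plumbing
follows the `BertEncoderV2` code linked above.

```
import tensorflow as tf

vocab_size, hidden_size = 30522, 768  # illustrative sizes

# Keep the embedding table in host memory rather than in HBM.
with tf.device("/CPU:0"):
  embedding_table = tf.Variable(
      tf.random.normal([vocab_size, hidden_size]), name="word_embeddings")


def embed_on_cpu(input_word_ids):
  """Looks up word embeddings on the CPU."""
  with tf.device("/CPU:0"):
    return tf.nn.embedding_lookup(embedding_table, input_word_ids)

# The returned embeddings are then passed to the encoder via its
# `input_word_embeddings` input instead of raw token ids.
```
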
--------------------------------------------------------------------------------

**Q22: Is there a possibility of getting TF2 versions of Gemini/MUM? Basically,
a checkpoint converter and a TF2 variant for instantiating the corresponding
Transformer?**

[JAX](https://github.com/google/jax) is the way forward at the moment for
Gemini.

--------------------------------------------------------------------------------

**Q23: Is it possible to perform MLM pretraining in text tagging as well?**

The MLM functionality in `text_tagging` is currently not available.

--------------------------------------------------------------------------------

## Glossary

Acronym | Meaning
------- | --------------------------
TFM     | TensorFlow Models
FAQs    | Frequently Asked Questions
TF      | TensorFlow