
Commit 323a0eb

Updated internal links with external github link in NLP FAQs.
PiperOrigin-RevId: 638776892
1 parent 081f599 commit 323a0eb


official/nlp/docs/faq.md

Lines changed: 360 additions & 0 deletions
# Frequently Asked Questions

<!-- ## Scope of the document

Scope of this document is to cover FAQs related to TensorFlow-Models-NLP. -->

## Introduction

The goal of this document is to capture Frequently Asked Questions (FAQs)
related to TensorFlow-Models-NLP (TF-NLP). The source of these questions is
limited to external resources (GitHub, StackOverflow, Google Groups, etc.).

## FAQs of TF-NLP

--------------------------------------------------------------------------------

**Q1: How do I cite TF-NLP when the libraries are used in external research code
bases?**

If you use TensorFlow Model Garden in your research GitHub repos, please cite
this repository in your publication. The citation is at the following
[location](https://github.com/tensorflow/models#citing-tensorflow-model-garden).

--------------------------------------------------------------------------------

**Q2: How do I load NLP pretrained models?**

*   [**How to initialize from a checkpoint:**](https://github.com/tensorflow/models/blob/master/official/nlp/docs/pretrained_models.md#how-to-load-pretrained-models)
    If you use the TF-NLP training library, you can specify the checkpoint path
    directly when launching your job. For example, follow the BERT
    [fine-tuning command](https://github.com/tensorflow/models/blob/master/official/nlp/docs/train.md#fine-tuning-squad-with-a-pre-trained-bert-checkpoint)
    to initialize the model from the checkpoint specified by \
    `--params_override=task.init_checkpoint=PATH_TO_INIT_CKPT`

*   [**How to load a TF-Hub SavedModel:**](https://github.com/tensorflow/models/blob/master/official/nlp/docs/pretrained_models.md#how-to-load-tf-hub-savedmodel)
    TF-NLP's fine-tuning tasks such as question answering (SQuAD) and sentence
    prediction (GLUE) support loading a model from TF-Hub. These built-in tasks
    support a specific `task.hub_module_url` parameter. To set this parameter,
    follow the BERT
    [fine-tuning command](https://github.com/tensorflow/models/blob/master/official/nlp/docs/train.md#fine-tuning-sentence-classification-with-bert-from-tf-hub)
    and replace `--params_override=task.init_checkpoint=...` with \
    `--params_override=task.hub_module_url=TF_HUB_URL` (see also the sketch
    after this list).
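
Outside the training driver, a TF-Hub BERT encoder can also be loaded directly
in Python. The snippet below is a minimal sketch, not TF-NLP code: the
`tensorflow_hub` package, the hub handle and the output keys are assumptions
that should be checked against the module you actually use.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Illustrative TF-Hub handle; substitute the encoder you actually use.
ENCODER_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

# Load the SavedModel as a Keras layer so it can be composed into a model.
encoder = hub.KerasLayer(ENCODER_HANDLE, trainable=True)

# BERT encoders on TF-Hub typically take three int32 tensors of shape
# [batch_size, seq_length].
inputs = dict(
    input_word_ids=tf.keras.Input(shape=(128,), dtype=tf.int32),
    input_mask=tf.keras.Input(shape=(128,), dtype=tf.int32),
    input_type_ids=tf.keras.Input(shape=(128,), dtype=tf.int32),
)
outputs = encoder(inputs)  # usually a dict with 'pooled_output', 'sequence_output'
model = tf.keras.Model(inputs=inputs, outputs=outputs["pooled_output"])
```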

--------------------------------------------------------------------------------

**Q3: How do I go about changing the pretraining loss function for BERT?**

You can change the pretraining loss function in the
[code](https://github.com/tensorflow/models/blob/d93c7e932de27522b2fa3b115f58d06d6f640537/official/nlp/tasks/masked_lm.py#L76)
here.
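
If you want to customize it in your own experiment, one option is to subclass
the task and override its loss-building hook. The following is a minimal
sketch, assuming a `build_losses` method and dictionary keys similar to the
linked masked_lm.py; the exact signature and keys may differ across TF-NLP
versions, and the sentence-order/NSP loss term is omitted for brevity.

```python
import tensorflow as tf
from official.nlp.tasks import masked_lm


class CustomLossMaskedLMTask(masked_lm.MaskedLMTask):
  """Illustrative MaskedLMTask variant with a customized masked-LM loss."""

  def build_losses(self, labels, model_outputs, metrics, aux_losses=None):
    # Keys below mirror the linked masked_lm.py; verify them against the
    # TF-NLP version you are using.
    logits = tf.cast(model_outputs['mlm_logits'], tf.float32)
    per_token_loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels['masked_lm_ids'], logits, from_logits=True)
    weights = tf.cast(labels['masked_lm_weights'], tf.float32)
    mlm_loss = tf.math.divide_no_nan(
        tf.reduce_sum(per_token_loss * weights), tf.reduce_sum(weights))
    if aux_losses:
      mlm_loss += tf.add_n(aux_losses)
    return mlm_loss
```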

--------------------------------------------------------------------------------

**Q4: The
[transformer code](https://github.com/tensorflow/models/blob/d93c7e932de27522b2fa3b115f58d06d6f640537/official/nlp/modeling/models/seq2seq_transformer.py#L31)
extends keras.Model. Can I use constructs like model.fit() for training as we do
for any TF2/Keras model? Are there any tutorials or starting points for setting
up the training and evaluation of a transformer model using TF-NLP?**

Keras Model's native `fit()` and `predict()` do not work for the seq2seq
transformer model. TF Model Garden uses the workflow defined
[here](https://github.com/tensorflow/models/blob/d93c7e932de27522b2fa3b115f58d06d6f640537/official/nlp/docs/train.md#model-garden-nlp-common-training-driver).
\
The
[code](https://github.com/tensorflow/models/blob/91d543a1a976e513822f03e63cf7e7d2dc0d92e1/official/nlp/tasks/translation.py)
defines the translation task.

--------------------------------------------------------------------------------

**Q5: Is there an easy way to set up a model server from a checkpoint (as
opposed to an exported saved_model)?**

A model server requires a saved_model. If you just want to inspect the outputs,
this [colab](https://www.tensorflow.org/tfmodels/nlp/customize_encoder) can
help.

--------------------------------------------------------------------------------

**Q6: Training with global batch size (4096) and local batch size (128) on 4x4
TPUs is very slow. Will the quality change by increasing TPUs to 8x8 with a
fixed local batch size (128) and global batch size (16392)?**

The experiment configuration (for example, the global batch size) can be
overridden by the `--params_override`
[FLAG](https://github.com/tensorflow/models/blob/master/official/nlp/docs/train.md#overriding-configuration-via-yaml-and-flags)
through the command line. It only supports scalar values. Please find the
[implementation](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/modeling/hyperparams/params_dict.py#L339)
here.

--------------------------------------------------------------------------------

**Q7: Training with global batch size (4096) and local batch size (128) on 4x4
TPUs is very slow. Will the quality change by increasing TPUs to 8x8 with a
fixed local batch size (128) and global batch size (16392)?**

The global batch size should be the key factor. As you increase the batch size,
you may need to tune the learning rate to match the quality of the smaller batch
size. If the task is retrieval, it is recommended to use the global softmax. An
example can be found
[here](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/modeling/tf_utils.py#L225).

--------------------------------------------------------------------------------

**Q8: In some TF-NLP
[examples](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/tasks/question_answering.py#L15),
the model output logits are cast to float32. Aren't logits already floats?**

For mixed precision training, the activations inside the model can be in
bfloat16/float16 format. The model output logits are cast to float32 to make
sure the softmax and losses are calculated in float32. This avoids the numeric
issues that may occur if the intermediate tensor flowing from the softmax to the
loss is float16 or bfloat16. You can also refer to the
[mixed precision guide](https://www.tensorflow.org/guide/mixed_precision#building_the_model)
for more information.
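
As a minimal Keras sketch (independent of TF-NLP), the pattern looks like this:

```python
import tensorflow as tf

# Run intermediate layers in float16 while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

inputs = tf.keras.Input(shape=(64,), dtype=tf.float32)
x = tf.keras.layers.Dense(256, activation="relu")(inputs)  # float16 activations
logits = tf.keras.layers.Dense(10)(x)                       # still float16
# Cast the logits back to float32 so the softmax/loss run in full precision.
logits = tf.cast(logits, tf.float32)
model = tf.keras.Model(inputs, logits)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```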

--------------------------------------------------------------------------------

**Q9: Is it possible to use gradient clipping in the optimizer used in the BERT
encoder? If yes, is there any sample of its usage?**

We have the `gradient_clip_norm` argument in AdamW. The newer
[Keras optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer)
also offer `global_clipnorm`, `clipnorm` and `clipvalue` as kwargs.

Please refer to the
[example](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/configs/experiments/glue_mnli_matched.yaml#L23)
below:

```
optimizer:
  adamw:
    beta_1: 0.9
    beta_2: 0.999
    weight_decay_rate: 0.05
    gradient_clip_norm: 0.0
  type: adamw
```

The legacy AdamW
[implementation](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/modeling/optimization/legacy_adamw.py#L78)
used for the BERT paper is here, with a usage
[[ref]](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/projects/detr/configs/detr.py#L88).
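
With the newer Keras optimizers, the equivalent clipping can be requested
directly via keyword arguments. A minimal sketch (not TF-NLP specific, assuming
a recent TF release where `tf.keras.optimizers.AdamW` is available; the same
clipping kwargs are accepted by `tf.keras.optimizers.Adam`):

```python
import tensorflow as tf

# Clip by the global norm of all gradients; `clipnorm` and `clipvalue` are the
# per-variable alternatives.
optimizer = tf.keras.optimizers.AdamW(
    learning_rate=3e-5,
    weight_decay=0.01,
    global_clipnorm=1.0,
)
```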

--------------------------------------------------------------------------------

**Q10: I am trying to create an embedding table with 4.7 million rows and 512
dimensions. However, `nlp.modeling.layers.OnDeviceEmbedding` fails with the
following error: UnknownError: Attempting to allocate 4.54G. That was not
possible. There are 2.94G free. \
Is there a way to increase this capacity, or are there alternatives to
OnDeviceEmbedding that work in the same framework?**

An embedding with 4.7 million rows and 512 dimensions is very big, and it will
be placed on the TPU TensorCore. The tips below might help (see the sketch after
this list for the memory arithmetic):

*   Try to reduce the number of rows.
*   Consider
    [mixed_precision_dtype](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/core/config_definitions.py#L147):
    'bfloat16' training to reduce the memory cost.
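
To see why the table is large, here is a quick back-of-the-envelope calculation
(plain Python, illustrative only); the bfloat16 figure is in the same ballpark
as the 4.54G allocation in the error message.

```python
# Approximate memory footprint of the embedding table alone.
rows, dims = 4_700_000, 512

bytes_bf16 = rows * dims * 2  # bfloat16: 2 bytes per element
bytes_f32 = rows * dims * 4   # float32: 4 bytes per element

print(f"bfloat16: {bytes_bf16 / 2**30:.2f} GiB")  # ~4.48 GiB
print(f"float32:  {bytes_f32 / 2**30:.2f} GiB")   # ~8.96 GiB
```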

--------------------------------------------------------------------------------

**Q11: What is the difference between seq_length in glue_mnli_matched.yaml and
max_position_embeddings in bert_en_uncased_base.yaml? Why are they not the
same?**

`seq_length` is the padded input length and `max_position_embeddings` is the
size of the learned position embeddings. The `seq_length` value should always be
less than or equal to the `max_position_embeddings` value
(seq_length <= max_position_embeddings).

--------------------------------------------------------------------------------

**Q12: While running a model using the TF-NLP framework, we noticed that when
the number of validation steps is increased (even by 10), the experiments get
much slower. Is that expected?**

This is not expected for 10 validation steps. Recommended tips:

*   Increase the validation interval.
*   Use `--add_eval` to start a side-car job for eval.
*   Collect an xprof profile for the eval job. It is known that TF2 eager
    execution is slow.

--------------------------------------------------------------------------------

**Q13: How do I load checkpoints for the BERT model? Any recommendations on how
to deal with the variable mismatch error?**

We recommend using `tf.train.Checkpoint` and managing the objects (including
inner layers) directly. The details on restoring the encoder weights can be
found
[here](https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert#restore_the_encoder_weights).
More on the TF-NLP checkpoints is
[here](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/docs/pretrained_models.md#pre-trained-models).
\
The variable mismatch error is due to the classifier_model not being equal to
the threephil model. The recommendation is to use the same code and class of the
threephil model to read the checkpoint. The Keras functional model cannot
guarantee that the Python objects are matched if the model creation code is
different. \
More to read: https://www.tensorflow.org/guide/checkpoint
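
A minimal object-based restore looks like the sketch below; the checkpoint key
(`encoder`), the constructor arguments and the path are illustrative and must
match how the checkpoint was actually written.

```python
import tensorflow as tf
from official.nlp.modeling import networks

# Build the encoder with the same code/class that produced the checkpoint.
encoder = networks.BertEncoder(vocab_size=30522, num_layers=12)

# Object-based checkpointing: the attribute name ('encoder') must match the
# name used when the checkpoint was saved.
checkpoint = tf.train.Checkpoint(encoder=encoder)
status = checkpoint.read("/path/to/bert/ckpt")  # illustrative path

# assert_existing_objects_matched() surfaces mismatches for the objects you
# built; expect_partial() silences warnings about parts you intentionally skip.
status.expect_partial().assert_existing_objects_matched()
```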

--------------------------------------------------------------------------------

**Q14: Why does saving a Bert2Bert model instance fail without passing the label
input, i.e. target_ids?**

Bert2Bert needs input_ids, input_mask, segment_ids and target_ids to train. You
should save the model with all features provided.

If you care about inference and there is no target_ids, you should not use Keras
model.save(). Keras does not support None as inputs. Instead, directly define a
tf.Module that includes the Bert2Bert core model and save the tf.function using
the tf.saved_model.save() API. Refer to the
[example](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/serving/serving_modules.py#L414)
for the translation task. Usually, the seq2seq model is not friendly to Keras
assumptions.
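
The general pattern is sketched below; the wrapper class, the input signature
and the model call are illustrative placeholders, not the actual Bert2Bert API.

```python
import tensorflow as tf


class InferenceModule(tf.Module):
  """Wraps a seq2seq model for serving without label inputs (illustrative)."""

  def __init__(self, model):
    super().__init__()
    self.model = model  # e.g., the Bert2Bert core model

  @tf.function(input_signature=[{
      "input_ids": tf.TensorSpec([None, None], tf.int32),
      "input_mask": tf.TensorSpec([None, None], tf.int32),
      "segment_ids": tf.TensorSpec([None, None], tf.int32),
  }])
  def serve(self, inputs):
    # Hypothetical decode call; replace with the model's actual inference API.
    return self.model(inputs, training=False)


# Usage sketch:
# module = InferenceModule(bert2bert_model)
# tf.saved_model.save(
#     module, "/tmp/bert2bert_export",
#     signatures={"serving_default": module.serve})
```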

--------------------------------------------------------------------------------

**Q15: How do I fix the TPU inference error with the Transformer?**

A potential cause of the error is having several inputs where the batch size of
one of them differs from the rest.

Here are some explanations and troubleshooting tips:

*   Resolve the batching issue by implementing signature batching.
*   Address the dynamic dimension problem by setting `max_batch_size` and
    `allowed_batch_sizes` to 1.

--------------------------------------------------------------------------------

**Q16: Are there any models/methods that can improve the latency of the
feed-forward neural network portion of the transformer encoder block (on CPU and
GPU)?**

There are sparse mixture and conditional computation blocks to speed this up.
The block sparse feedforward
[layer](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/layers/block_diag_feedforward.py#L15)
might be promising for performance purposes. It works nicely on CPU and GPU
since the reshaping ops in this layer are free on CPUs/GPUs. It offers a
speed-up for models of similar sizes (a caveat is that we observed some quality
drop with block sparse feedforward in the past).

Refer to the
[Sparse Mixer encoder](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/networks/sparse_mixer.py)
network and the
[FNet encoder](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/networks/fnet.py)
network for more sparse mixture references.

Conditional computation is a model architecture where specific sections of the
computational graph are activated based on input conditions. Models following
this paradigm gain efficiency, through either increased model capacity or
reduced inference latency.

Refer to the
[ExpandCondense tensor network layer](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/layers/tn_expand_condense.py)
and the
[Gated linear feedforward layer](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/layers/gated_feedforward.py)
for FFN blocks. The above techniques work particularly well with long sequence
lengths.

Please refer to the additional notes below based on your specific use cases:

*   For small student models, we used only 1 expert and routed far fewer tokens
    to the FFN expert.
*   We need to set routing_group_size so that each routing combines the tokens
    in multiple sequences and selects, for example, 1/4 of the tokens.
*   This works well in the case of distillation or when we can pretrain the
    model. There will be a quality gap because a lot of tokens skip the FFN
    computation.

--------------------------------------------------------------------------------

**Q17: How do I obtain final-layer embeddings from a model? Is there an
example?**

Refer to the `call`
[method](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/networks/bert_encoder.py#L280)
of the Transformer-based BERT encoder network. The `sequence_output` is the
last-layer embeddings of shape [batch_size, seq_length, hidden_size].
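
For illustration, a minimal sketch of pulling `sequence_output` out of a
`BertEncoder` is shown below; the constructor arguments and output keys follow
the linked file but should be double-checked against your TF-NLP version.

```python
import tensorflow as tf
from official.nlp.modeling import networks

encoder = networks.BertEncoder(vocab_size=30522, num_layers=12, hidden_size=768)

batch_size, seq_length = 2, 16
dummy_inputs = dict(
    input_word_ids=tf.zeros((batch_size, seq_length), dtype=tf.int32),
    input_mask=tf.ones((batch_size, seq_length), dtype=tf.int32),
    input_type_ids=tf.zeros((batch_size, seq_length), dtype=tf.int32),
)

outputs = encoder(dummy_inputs)
# Last-layer token embeddings: [batch_size, seq_length, hidden_size].
sequence_output = outputs["sequence_output"]
# Pooled [CLS] representation: [batch_size, hidden_size].
pooled_output = outputs["pooled_output"]
```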

--------------------------------------------------------------------------------

**Q18: Is it possible to convert public TF hub models like
[sentence-t5](https://tfhub.dev/google/collections/sentence-t5/1) for TPU use?**

The Inference Converter V2 deploys user-provided function(s) on the XLA device
(TPU or XLA GPU) and optimizes them.

--------------------------------------------------------------------------------

**Q19: Is it possible to have a dynamic batch size for `edit5` models using the
[sampling modules](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/ops/sampling_module.py)?**

This may depend on the
[decoding algorithm](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/ops/decoding_module.py#L136).
For `beam_search`, the source of the issue is that at sampling initialization
time it needs to allocate the [batch_size, beam_size, ...] buffer, so the batch
size is fixed. However, note that it may not be easily achievable.

Users can also see that the AutoMUM distillation
[sampling module](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/ops/sampling_module.py)
makes the batch size static.

Possibly, for greedy decoding it can be done, since greedy decoding doesn't
require the `beam_size`.

--------------------------------------------------------------------------------

**Q20: Is multi-label tagging distillation supported by text tagging
distillation?**

Currently the template only does basic per-token binary classification. If you
intend to perform multi-label classification for each token, it shouldn't be
overly challenging. It mainly involves adjusting the number of classes and
switching to a multi-label loss.
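
As an illustration of the loss switch (not the actual template code), a
per-token multi-label head typically pairs independent sigmoid outputs with a
binary cross-entropy loss:

```python
import tensorflow as tf

# Per-token multi-label loss sketch.
# logits, labels: [batch_size, seq_length, num_labels]; labels are in {0, 1}.
loss_fn = tf.keras.losses.BinaryCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)


def multi_label_token_loss(labels, logits, token_mask):
  # token_mask: [batch_size, seq_length], 1 for real tokens, 0 for padding.
  per_token = loss_fn(labels, logits)  # averaged over labels -> [batch, seq]
  per_token *= tf.cast(token_mask, per_token.dtype)
  return tf.math.divide_no_nan(
      tf.reduce_sum(per_token),
      tf.reduce_sum(tf.cast(token_mask, per_token.dtype)))
```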

--------------------------------------------------------------------------------

**Q21: The TFM
[Bert](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/networks/bert_encoder.py#L132)
intentionally utilizes an `OnDeviceEmbedding`. Is it possible to incorporate an
option to implement `CPU-forced` embedding table ideas by putting the embeddings
for transformer models on CPU to save HBM memory?**

For this optimization, users can simply place the word embeddings on CPU.
Utilizing the `input_word_embeddings` path in the
[BertEncoderV2](https://github.com/tensorflow/models/blob/12cfda05b3fd34a3dd7b3271cd922cd00d0d0c41/official/nlp/modeling/networks/bert_encoder.py#L238)
class is sufficient for optimizing HBM usage during serving.
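
A minimal sketch of the idea is shown below, assuming the serving setup honors
explicit device placement (whether the gather actually stays on the CPU depends
on how the model is compiled and served); the variable name and shapes are
illustrative.

```python
import tensorflow as tf

vocab_size, hidden_size = 4_700_000, 512  # illustrative sizes

# Keep the big table (and the gather) on the host to save accelerator HBM.
with tf.device("/CPU:0"):
  embedding_table = tf.Variable(
      tf.random.truncated_normal([vocab_size, hidden_size], stddev=0.02),
      name="word_embeddings")


@tf.function
def embed_on_cpu(input_word_ids):
  with tf.device("/CPU:0"):
    return tf.nn.embedding_lookup(embedding_table, input_word_ids)

# The resulting [batch_size, seq_length, hidden_size] tensor can then be fed to
# the encoder through its word-embedding input path (e.g., the
# `input_word_embeddings` input of BertEncoderV2) instead of raw token ids.
```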

--------------------------------------------------------------------------------

**Q22: Is there a possibility of getting TF2 versions of Gemini/MUM? Basically,
a checkpoint converter and a TF2-variant of instantiating the corresponding
Transformer?**

[JAX](https://github.com/google/jax) is the way forward at the moment for
Gemini.

--------------------------------------------------------------------------------

**Q23: Is it possible to perform MLM pretraining in text tagging as well?**

The MLM functionality in `text_tagging` is currently not available.

--------------------------------------------------------------------------------

## Glossary

Acronym | Meaning
------- | --------------------------
TFM     | TensorFlow Models
FAQs    | Frequently Asked Questions
TF      | TensorFlow
