
Commit e84fae2

sjrl and agnieszka-m authored
Migrating to use native Pytorch AMP (#2827)
* Started making changes to use native Pytorch AMP
* Updated compute_loss functions to use torch.cuda.amp.autocast
* Updating docstrings
* Add use_amp to trainer_checkpoint
* Removed mentions of apex and started to add the necessary warnings
* Removing unused instances of use_amp variable
* Added fast training test for FARMReader. Needed to add max_query_length as a parameter in FARMReader.__init__ and FARMReader.train
* Make max_query_length optional in FARMReader.train
* Update lg

Co-authored-by: Agnieszka Marzec <[email protected]>
Co-authored-by: agnieszka-m <[email protected]>
1 parent 35e9ff2 commit e84fae2
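The core of this migration, per the commit message, is swapping apex for PyTorch's built-in AMP: the forward and loss computation runs under `torch.cuda.amp.autocast`, and gradient scaling is handled by `torch.cuda.amp.GradScaler`. A self-contained sketch of that pattern on a toy model (illustrative only; this is not code from the commit, which applies the same pattern inside Haystack's Trainer and compute_loss functions):

```python
import torch
from torch import nn

# Toy stand-ins so the pattern runs end to end.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(16, 2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
use_amp = torch.cuda.is_available()  # autocast/GradScaler become no-ops on CPU here

scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # use_amp is now a plain bool

for step in range(10):
    inputs = torch.randn(8, 16, device=device)
    labels = torch.randint(0, 2, (8,), device=device)
    optimizer.zero_grad()
    # Forward pass and loss run under autocast when AMP is enabled.
    with torch.cuda.amp.autocast(enabled=use_amp):
        logits = model(inputs)
        loss = loss_fn(logits, labels)
    # GradScaler scales the loss for backward, then unscales before the optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```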

15 files changed: +253 -275 lines changed


docs/_src/api/api/evaluation.md

Lines changed: 1 addition & 1 deletion
@@ -143,7 +143,7 @@ Computes Transformer-based similarity of predicted answer to gold labels to deri

 Returns per QA pair a) the similarity of the most likely prediction (top 1) to all available gold labels
 b) the highest similarity of all predictions to gold labels
-c) a matrix consisting of the similarities of all the predicitions compared to all gold labels
+c) a matrix consisting of the similarities of all the predictions compared to all gold labels

 **Arguments**:

docs/_src/api/api/reader.md

Lines changed: 18 additions & 32 deletions
@@ -149,7 +149,7 @@ def train(data_dir: str,
 evaluate_every: int = 300,
 save_dir: Optional[str] = None,
 num_processes: Optional[int] = None,
-use_amp: str = None,
+use_amp: bool = False,
 checkpoint_root_dir: Path = Path("model_checkpoints"),
 checkpoint_every: Optional[int] = None,
 checkpoints_to_keep: int = 3,
@@ -193,14 +193,10 @@ Note that the evaluation report is logged at evaluation level INFO while Haystac
 - `num_processes`: The number of processes for `multiprocessing.Pool` during preprocessing.
 Set to value of 1 to disable multiprocessing. When set to 1, you cannot split away a dev set from train set.
 Set to None to use all CPU cores minus one.
-- `use_amp`: Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model.
-Available options:
-None (Don't use AMP)
-"O0" (Normal FP32 training)
-"O1" (Mixed Precision => Recommended)
-"O2" (Almost FP16)
-"O3" (Pure FP16).
-See details on: https://nvidia.github.io/apex/amp.html
+- `use_amp`: Whether to use automatic mixed precision (AMP) natively implemented in PyTorch to improve
+training speed and reduce GPU memory usage.
+For more information, see (Haystack Optimization)[https://haystack.deepset.ai/guides/optimization]
+and (Automatic Mixed Precision Package - Torch.amp)[https://pytorch.org/docs/stable/amp.html].
 - `checkpoint_root_dir`: The Path of a directory where all train checkpoints are saved. For each individual
 checkpoint, a subdirectory with the name epoch_{epoch_num}_step_{step_num} is created.
 - `checkpoint_every`: Save a train checkpoint after this many steps of training.
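With this hunk, `use_amp` becomes a plain boolean rather than an apex optimization-level string. A hedged usage sketch of `FARMReader.train` with AMP enabled, assuming the Haystack v1.x API; the model name and file paths are illustrative:

```python
from haystack.nodes import FARMReader

# Illustrative model and data locations; any SQuAD-format training file works.
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

reader.train(
    data_dir="data/squad20",
    train_filename="train-v2.0.json",
    n_epochs=1,
    batch_size=16,
    use_amp=True,  # after this commit: PyTorch-native mixed precision, no "O0"/"O1" levels
    save_dir="my_model",
)
```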
@@ -237,7 +233,7 @@ def distil_prediction_layer_from(
 evaluate_every: int = 300,
 save_dir: Optional[str] = None,
 num_processes: Optional[int] = None,
-use_amp: str = None,
+use_amp: bool = False,
 checkpoint_root_dir: Path = Path("model_checkpoints"),
 checkpoint_every: Optional[int] = None,
 checkpoints_to_keep: int = 3,
@@ -284,7 +280,7 @@ A list containing torch device objects and/or strings is supported (For example
 [torch.device('cuda:0'), "mps", "cuda:1"]). When specifying `use_gpu=False` the devices
 parameter is not used and a single cpu device is used for inference.
 - `student_batch_size`: Number of samples the student model receives in one batch for training
-- `student_batch_size`: Number of samples the teacher model receives in one batch for distillation
+- `teacher_batch_size`: Number of samples the teacher model receives in one batch for distillation
 - `n_epochs`: Number of iterations on the whole training data set
 - `learning_rate`: Learning rate of the optimizer
 - `max_seq_len`: Maximum text length (in tokens). Everything longer gets cut down.
@@ -296,14 +292,10 @@ Options for different schedules are available in FARM.
 - `num_processes`: The number of processes for `multiprocessing.Pool` during preprocessing.
 Set to value of 1 to disable multiprocessing. When set to 1, you cannot split away a dev set from train set.
 Set to None to use all CPU cores minus one.
-- `use_amp`: Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model.
-Available options:
-None (Don't use AMP)
-"O0" (Normal FP32 training)
-"O1" (Mixed Precision => Recommended)
-"O2" (Almost FP16)
-"O3" (Pure FP16).
-See details on: https://nvidia.github.io/apex/amp.html
+- `use_amp`: Whether to use automatic mixed precision (AMP) natively implemented in PyTorch to improve
+training speed and reduce GPU memory usage.
+For more information, see (Haystack Optimization)[https://haystack.deepset.ai/guides/optimization]
+and (Automatic Mixed Precision Package - Torch.amp)[https://pytorch.org/docs/stable/amp.html].
 - `checkpoint_root_dir`: the Path of directory where all train checkpoints are saved. For each individual
 checkpoint, a subdirectory with the name epoch_{epoch_num}_step_{step_num} is created.
 - `checkpoint_every`: save a train checkpoint after this many steps of training.
@@ -347,7 +339,7 @@ def distil_intermediate_layers_from(
 evaluate_every: int = 300,
 save_dir: Optional[str] = None,
 num_processes: Optional[int] = None,
-use_amp: str = None,
+use_amp: bool = False,
 checkpoint_root_dir: Path = Path("model_checkpoints"),
 checkpoint_every: Optional[int] = None,
 checkpoints_to_keep: int = 3,
@@ -389,8 +381,7 @@ that gets split off from training data for eval.
 A list containing torch device objects and/or strings is supported (For example
 [torch.device('cuda:0'), "mps", "cuda:1"]). When specifying `use_gpu=False` the devices
 parameter is not used and a single cpu device is used for inference.
-- `student_batch_size`: Number of samples the student model receives in one batch for training
-- `student_batch_size`: Number of samples the teacher model receives in one batch for distillation
+- `batch_size`: Number of samples the student model and teacher model receives in one batch for training
 - `n_epochs`: Number of iterations on the whole training data set
 - `learning_rate`: Learning rate of the optimizer
 - `max_seq_len`: Maximum text length (in tokens). Everything longer gets cut down.
@@ -402,21 +393,16 @@ Options for different schedules are available in FARM.
 - `num_processes`: The number of processes for `multiprocessing.Pool` during preprocessing.
 Set to value of 1 to disable multiprocessing. When set to 1, you cannot split away a dev set from train set.
 Set to None to use all CPU cores minus one.
-- `use_amp`: Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model.
-Available options:
-None (Don't use AMP)
-"O0" (Normal FP32 training)
-"O1" (Mixed Precision => Recommended)
-"O2" (Almost FP16)
-"O3" (Pure FP16).
-See details on: https://nvidia.github.io/apex/amp.html
+- `use_amp`: Whether to use automatic mixed precision (AMP) natively implemented in PyTorch to improve
+training speed and reduce GPU memory usage.
+For more information, see (Haystack Optimization)[https://haystack.deepset.ai/guides/optimization]
+and (Automatic Mixed Precision Package - Torch.amp)[https://pytorch.org/docs/stable/amp.html].
 - `checkpoint_root_dir`: the Path of directory where all train checkpoints are saved. For each individual
 checkpoint, a subdirectory with the name epoch_{epoch_num}_step_{step_num} is created.
 - `checkpoint_every`: save a train checkpoint after this many steps of training.
 - `checkpoints_to_keep`: maximum number of train checkpoints to save.
 - `caching`: whether or not to use caching for preprocessed dataset and teacher logits
 - `cache_path`: Path to cache the preprocessed dataset and teacher logits
-- `distillation_loss_weight`: The weight of the distillation loss. A higher weight means the teacher outputs are more important.
 - `distillation_loss`: Specifies how teacher and model logits should be compared. Can either be a string ("mse" for mean squared error or "kl_div" for kl divergence loss) or a callable loss function (needs to have named parameters student_logits and teacher_logits)
 - `temperature`: The temperature for distillation. A higher temperature will result in less certainty of teacher outputs. A lower temperature means more certainty. A temperature of 1.0 does not change the certainty of the model.
 - `processor`: The processor to use for preprocessing. If None, the default SquadProcessor is used.
@@ -663,7 +649,7 @@ Example:
 **Arguments**:

 - `question`: Question string
-- `documents`: List of documents as string type
+- `texts`: A list of Document texts as a string type
 - `top_k`: The maximum number of answers to return

 **Returns**:
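The final hunk renames the documented argument from `documents` to `texts` for prediction on raw strings. A hedged call sketch, assuming this docstring belongs to Haystack v1.x `FARMReader.predict_on_texts`; the question and passage are made up:

```python
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

# `texts` is a plain list of strings, matching the updated docstring above.
prediction = reader.predict_on_texts(
    question="Who is the father of Arya Stark?",
    texts=["Arya Stark is a daughter of Lord Eddard Stark and Lady Catelyn Stark."],
    top_k=3,
)
print(prediction)
```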

docs/_src/api/api/retriever.md

Lines changed: 10 additions & 14 deletions
@@ -946,7 +946,7 @@ def train(data_dir: str,
 weight_decay: float = 0.0,
 num_warmup_steps: int = 100,
 grad_acc_steps: int = 1,
-use_amp: str = None,
+use_amp: bool = False,
 optimizer_name: str = "AdamW",
 optimizer_correct_bias: bool = True,
 save_dir: str = "../saved_models/dpr",
@@ -984,12 +984,10 @@ you should use the file_system strategy.
 - `epsilon`: epsilon parameter of optimizer
 - `weight_decay`: weight decay parameter of optimizer
 - `grad_acc_steps`: number of steps to accumulate gradient over before back-propagation is done
-- `use_amp`: Whether to use automatic mixed precision (AMP) or not. The options are:
-"O0" (FP32)
-"O1" (Mixed Precision)
-"O2" (Almost FP16)
-"O3" (Pure FP16).
-For more information, refer to: https://nvidia.github.io/apex/amp.html
+- `use_amp`: Whether to use automatic mixed precision (AMP) natively implemented in PyTorch to improve
+training speed and reduce GPU memory usage.
+For more information, see (Haystack Optimization)[https://haystack.deepset.ai/guides/optimization]
+and (Automatic Mixed Precision Package - Torch.amp)[https://pytorch.org/docs/stable/amp.html].
 - `optimizer_name`: what optimizer to use (default: AdamW)
 - `num_warmup_steps`: number of warmup steps
 - `optimizer_correct_bias`: Whether to correct bias in optimizer
@@ -1305,7 +1303,7 @@ def train(data_dir: str,
 weight_decay: float = 0.0,
 num_warmup_steps: int = 100,
 grad_acc_steps: int = 1,
-use_amp: str = None,
+use_amp: bool = False,
 optimizer_name: str = "AdamW",
 optimizer_correct_bias: bool = True,
 save_dir: str = "../saved_models/mm_retrieval",
@@ -1345,12 +1343,10 @@ very similar (high score by BM25) to query but do not contain the answer)-
 - `epsilon`: Epsilon parameter of optimizer.
 - `weight_decay`: Weight decay parameter of optimizer.
 - `grad_acc_steps`: Number of steps to accumulate gradient over before back-propagation is done.
-- `use_amp`: Whether to use automatic mixed precision (AMP) or not. The options are:
-"O0" (FP32)
-"O1" (Mixed Precision)
-"O2" (Almost FP16)
-"O3" (Pure FP16).
-For more information, refer to: https://nvidia.github.io/apex/amp.html
+- `use_amp`: Whether to use automatic mixed precision (AMP) natively implemented in PyTorch to improve
+training speed and reduce GPU memory usage.
+For more information, see (Haystack Optimization)[https://haystack.deepset.ai/guides/optimization]
+and (Automatic Mixed Precision Package - Torch.amp)[https://pytorch.org/docs/stable/amp.html].
 - `optimizer_name`: What optimizer to use (default: TransformersAdamW).
 - `num_warmup_steps`: Number of warmup steps.
 - `optimizer_correct_bias`: Whether to correct bias in optimizer.
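Retriever training gets the same boolean `use_amp` flag. A hedged sketch of DPR training with native AMP enabled, assuming the Haystack v1.x `DensePassageRetriever` API; data paths, filenames, and hyperparameters are illustrative:

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=InMemoryDocumentStore(),
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)

# Assumes a DPR-format training file at the given (illustrative) location.
retriever.train(
    data_dir="data/dpr_training",
    train_filename="train.json",
    n_epochs=1,
    batch_size=4,
    grad_acc_steps=4,
    use_amp=True,  # boolean after this commit; previously an apex "O*" string
    save_dir="saved_models/dpr",
)
```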

haystack/modeling/data_handler/processor.py

Lines changed: 5 additions & 3 deletions
@@ -72,7 +72,7 @@ def __init__(
 :param dev_filename: The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set
 will be a slice of the train set.
 :param test_filename: The name of the file containing test data.
-:param dev_split: The proportion of the train set that will sliced. Only works if dev_filename is set to None
+:param dev_split: The proportion of the train set that will be sliced. Only works if `dev_filename` is set to `None`.
 :param data_dir: The directory in which the train, test and perhaps dev files can be found.
 :param tasks: Tasks for which the processor shall extract labels from the input data.
 Usually this includes a single, default task, e.g. text classification.
@@ -137,7 +137,7 @@ def load(
 If None and 0.0 < dev_split < 1.0 the dev set
 will be a slice of the train set.
 :param test_filename: The name of the file containing test data.
-:param dev_split: The proportion of the train set that will sliced.
+:param dev_split: The proportion of the train set that will be sliced.
 Only works if dev_filename is set to None
 :param kwargs: placeholder for passing generic parameters
 :return: An instance of the specified processor.
@@ -217,6 +217,7 @@ def convert_from_transformers(
 tokenizer_class=None,
 tokenizer_args=None,
 use_fast=True,
+max_query_length=64,
 **kwargs,
 ):
 tokenizer_args = tokenizer_args or {}
@@ -238,6 +239,7 @@
 metric="squad",
 data_dir="data",
 doc_stride=doc_stride,
+max_query_length=max_query_length,
 )
 elif task_type == "embeddings":
 processor = InferenceProcessor(tokenizer=tokenizer, max_seq_len=max_seq_len)
@@ -396,7 +398,7 @@ def __init__(
 :param dev_filename: The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set
 will be a slice of the train set.
 :param test_filename: None
-:param dev_split: The proportion of the train set that will sliced. Only works if dev_filename is set to None
+:param dev_split: The proportion of the train set that will be sliced. Only works if `dev_filename` is set to `None`.
 :param doc_stride: When the document containing the answer is too long it gets split into part, strided by doc_stride
 :param max_query_length: Maximum length of the question (in number of subword tokens)
 :param proxies: proxy configuration to allow downloads of remote datasets.

haystack/modeling/infer.py

Lines changed: 5 additions & 0 deletions
@@ -130,6 +130,7 @@ def load(
 multithreading_rust: bool = True,
 use_auth_token: Optional[Union[bool, str]] = None,
 devices: Optional[List[Union[str, torch.device]]] = None,
+max_query_length: int = 64,
 **kwargs,
 ):
 """
@@ -178,6 +179,7 @@
 `transformers-cli login` (stored in ~/.huggingface) will be used.
 Additional information can be found here
 https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
+:param max_query_length: Only QA: Maximum length of the question in number of tokens.
 :return: An instance of the Inferencer.
 """
 if tokenizer_args is None:
@@ -228,6 +230,7 @@
 tokenizer_args=tokenizer_args,
 use_fast=use_fast,
 use_auth_token=use_auth_token,
+max_query_length=max_query_length,
 **kwargs,
 )

@@ -241,6 +244,8 @@
 "Please set a lower value for doc_stride (Suggestions: doc_stride=128, max_seq_len=384) "
 )
 processor.doc_stride = doc_stride
+if hasattr(processor, "max_query_length"):
+    processor.max_query_length = max_query_length

 return cls(
 model,
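The new `max_query_length` argument is threaded from `Inferencer.load` into the QA processor (and, per the commit message, also exposed on `FARMReader.__init__` and `FARMReader.train`). A hedged sketch of setting it when loading a QA inferencer; the model name and value are illustrative:

```python
from haystack.modeling.infer import Inferencer

# Cap questions at 48 subword tokens instead of the default 64; the loaded
# processor's max_query_length is updated accordingly (see the hunk above).
inferencer = Inferencer.load(
    model_name_or_path="deepset/roberta-base-squad2",
    task_type="question_answering",
    max_query_length=48,
    gpu=False,
)
```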

haystack/modeling/model/adaptive_model.py

Lines changed: 1 addition & 1 deletion
@@ -280,7 +280,7 @@ def load( # type: ignore
 * vocab.txt vocab file for language model, turning text to Wordpiece Tokens

 :param load_dir: Location where the AdaptiveModel is stored.
-:param device: To which device we want to sent the model, either torch.device("cpu") or torch.device("cuda").
+:param device: Specifies the device to which you want to send the model, either torch.device("cpu") or torch.device("cuda").
 :param strict: Whether to strictly enforce that the keys loaded from saved model match the ones in
 the PredictionHead (see torch.nn.module.load_state_dict()).
 :param processor: Processor to populate prediction head with information coming from tasks.