README_En.md: 3 additions & 0 deletions
@@ -16,6 +16,7 @@ This is the Chinese version of CLIP. We use a large-scale Chinese image-text pai
<br><br>

# News
+* 2023.3.20 Support [gradient accumulation](#gradient-accumulation) in contrastive learning to simulate the training effect of a larger batch size.
* 2023.2.16 Support [FlashAttention](https://github.com/HazyResearch/flash-attention) to improve training speed and reduce memory usage. See [flash_attention_En.md](flash_attention_En.md) for more information.
* 2023.1.15 Support the conversion of PyTorch models into [ONNX](https://onnx.ai/) or [TensorRT](https://developer.nvidia.com/tensorrt) formats (and provide pretrained TensorRT models) to improve inference speed and meet deployment requirements. See [deployment_En.md](deployment_En.md) for more information.
* 2022.12.12 Implement [FLIP](https://arxiv.org/abs/2212.00794) strategy, which can be [activated](#FLIP) during finetuning (Thanks [@zwkkk](https://github.com/zwkkk) for [the PR](https://github.com/OFA-Sys/Chinese-CLIP/pull/26) ❤️)
@@ -348,6 +349,8 @@ The configuration for training includes:
+ `grad-checkpointing`: <span id="checkpointing"></span>Use [gradient checkpointing](https://pytorch.org/docs/stable/checkpoint.html), which discards activations during the forward pass and recomputes them in the backward pass; this strategy trades extra computation and iteration time for lower GPU memory cost. (`store_true` argument; just add `--grad-checkpointing` in the script to activate it. Requires PyTorch > 1.8.0. See the first sketch after this list.)
+ `mask-ratio`: <span id="FLIP"></span>Use the [FLIP](https://arxiv.org/abs/2212.00794) strategy, which randomly masks a ratio of image patches to save GPU memory and speed up training. Defaults to 0.0, which disables the strategy. (Sketched below.)
+ `use-flash-attention`: Whether to use [FlashAttention](https://arxiv.org/abs/2205.14135), which can significantly speed up finetuning and reduce memory usage. (`store_true` argument; after configuring the environment, just add `--use-flash-attention` in the script to activate it. See [flash_attention_En.md](flash_attention_En.md) for more information.)
++ `accum-freq`: <span id="gradient-accumulation"></span>Gradient accumulation frequency; defaults to 1. Specify an integer greater than 1 to enable gradient accumulation and simulate a larger batch size. If the batch size of a worker is `m`, the total batch size is `accum_freq * m * GPUs`. (See the sketch after this list.)
++ `gather-with-grad`: Whether to enable full distributed gradients for the feature gather; off by default. (Illustrated below.)
+ Outputs
+ `name`: the specified output path. Hyperparameter logs, training logs, and checkpoints will be saved at `${DATAPATH}/experiments/${name}/`.
+ `save-step-frequency` and `save-epoch-frequency`: the intervals (in steps and in epochs, respectively) at which checkpoints are saved.
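The sketches below illustrate the strategies behind these flags. First, gradient checkpointing: a minimal, hypothetical PyTorch example (not the repository's actual code) in which each block's activations are dropped after the forward pass and recomputed during backward:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedEncoder(torch.nn.Module):
    """Toy encoder whose blocks are checkpointed: intermediate
    activations are not stored, trading compute for GPU memory."""

    def __init__(self, dim: int = 512, num_blocks: int = 4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.GELU())
            for _ in range(num_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # use_reentrant=False is the recommended mode on recent PyTorch
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedEncoder()
out = model(torch.randn(8, 512, requires_grad=True))
out.sum().backward()  # blocks are re-run here to rebuild activations
```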
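Next, the FLIP-style `mask-ratio`. A hedged sketch of the idea (hypothetical helper, not Chinese-CLIP's implementation): randomly keep only a fraction `1 - mask_ratio` of the image patches before they enter the vision encoder, so each step processes fewer tokens:

```python
import torch

def random_mask_patches(patches: torch.Tensor, mask_ratio: float) -> torch.Tensor:
    """patches: (batch, num_patches, dim). Keep a random subset of
    patches per image, as in FLIP; fewer tokens per step means less
    memory and faster training."""
    b, n, d = patches.shape
    num_keep = max(1, int(n * (1.0 - mask_ratio)))
    # random scores -> a random permutation of patch indices per image
    keep = torch.rand(b, n, device=patches.device).argsort(dim=1)[:, :num_keep]
    return patches.gather(1, keep.unsqueeze(-1).expand(-1, -1, d))

# with --mask-ratio 0.5, roughly half of the patches survive
kept = random_mask_patches(torch.randn(2, 196, 768), mask_ratio=0.5)
print(kept.shape)  # torch.Size([2, 98, 768])
```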
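For `accum-freq`, the arithmetic can be seen in a minimal training-loop sketch (hypothetical names): stepping the optimizer once every `accum_freq` micro-batches of per-worker size `m` approximates one step on a batch of `accum_freq * m` per worker, i.e. `accum_freq * m * GPUs` in total. Note that for a contrastive loss this equivalence is only approximate, since the loss couples all samples in a batch; implementations typically cache features across micro-batches to widen the effective similarity matrix.

```python
import torch

accum_freq = 4                     # corresponds to --accum-freq 4
model = torch.nn.Linear(512, 512)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def micro_batches():
    # stand-in for a dataloader yielding per-worker batches of size m=32
    for _ in range(8):
        yield torch.randn(32, 512), torch.randn(32, 512)

optimizer.zero_grad()
for step, (x, target) in enumerate(micro_batches()):
    loss = torch.nn.functional.mse_loss(model(x), target)
    # divide so the accumulated gradient is the mean over the simulated
    # large batch rather than a sum of micro-batch means
    (loss / accum_freq).backward()
    if (step + 1) % accum_freq == 0:
        optimizer.step()      # one update per accum_freq micro-batches
        optimizer.zero_grad()
```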
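Finally, `gather-with-grad`. In distributed contrastive training each worker gathers features from all other workers to build the similarity matrix; a plain `all_gather` returns tensors detached from the autograd graph, while the flag makes the gather differentiable. A hedged sketch along the lines of common open-source CLIP training code (not necessarily the exact code here):

```python
import torch
import torch.distributed as dist
import torch.distributed.nn  # autograd-aware collectives

def gather_features(feats: torch.Tensor, gather_with_grad: bool) -> torch.Tensor:
    """Gather image/text features from all workers for the contrastive loss."""
    if gather_with_grad:
        # differentiable gather: gradients flow back through every worker's chunk
        gathered = torch.distributed.nn.all_gather(feats)
    else:
        # plain all_gather yields detached tensors; re-insert the local
        # chunk so the local features still receive gradients
        gathered = [torch.zeros_like(feats) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, feats)
        gathered[dist.get_rank()] = feats
    return torch.cat(gathered, dim=0)
```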