Commit 0925e08
Merge pull request #73 from DtYXs/master: Support gradient accumulation
2 parents: 5bbd36a + b51f9b9

12 files changed (+133 lines, -42 lines)

README.md
Lines changed: 4 additions & 1 deletion

```diff
@@ -16,6 +16,7 @@
 <br><br>

 # News
+* 2023.3.20 Added [gradient accumulation](#gradient_accumulation) support for contrastive learning, to simulate the training effect of a larger batch size.
 * 2023.2.16 Added [FlashAttention](https://github.com/HazyResearch/flash-attention) support to speed up training and reduce memory usage; see [flash_attention.md](flash_attention.md) for details.
 * 2023.1.15 Added support for deploying [ONNX](https://onnx.ai/) and [TensorRT](https://developer.nvidia.com/tensorrt) models (pretrained TensorRT models provided) to speed up feature inference for deployment; see [deployment.md](deployment.md) for details.
 * 2022.12.12 Implemented the [FLIP](https://arxiv.org/abs/2212.00794) training strategy, which can be [activated](#FLIP) during finetuning (thanks to [@zwkkk](https://github.com/zwkkk) for [contributing the code](https://github.com/OFA-Sys/Chinese-CLIP/pull/26) ❤️)
@@ -345,8 +346,10 @@ bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}
 + `valid-batch-size`: per-worker batch size during validation. (Make sure that `total validation samples > batch-size * number of GPUs`, so at least one validation batch is formed.)
 + `valid-step-interval` and `valid-epoch-interval`: validation frequency in steps/epochs; specify -1 to skip validation during training.
 + `grad-checkpointing`: <span id="checkpointing"></span>use the [recomputation strategy](https://pytorch.org/docs/stable/checkpoint.html), which does not keep intermediate activations during the forward pass, trading training time for a smaller memory footprint; useful when GPU memory is insufficient. (`store_true` argument; just add `--grad-checkpointing` in the script. Currently requires Pytorch>1.8.0.)
-+ `mask-ratio`: <span id="FLIP"></span>following the [FLIP](https://arxiv.org/abs/2212.00794) strategy, randomly mask a given ratio of image patches during finetuning to reduce memory usage and speed up training. Defaults to 0.0, which disables the strategy.
++ `mask-ratio`: <span id="FLIP"></span>following the [FLIP](https://arxiv.org/abs/2212.00794) strategy, randomly mask a given ratio of image patches during finetuning to reduce memory usage and speed up training. Defaults to 0.0, which disables the strategy.
 + `use-flash-attention`: use [FlashAttention](https://arxiv.org/abs/2205.14135), which significantly speeds up Chinese-CLIP finetuning and reduces memory usage without affecting results. (`store_true` argument; after setting up the environment, just add `--use-flash-attention` in the script. See [flash_attention.md](flash_attention.md) for details.)
++ `accum-freq`: <span id="gradient_accumulation"></span>gradient accumulation frequency, default 1. Specify an integer greater than 1 to enable contrastive-learning gradient accumulation and simulate a larger batch size. If the per-GPU batch size is `m`, the total batch size is `accum_freq * m * number of GPUs`.
++ `gather-with-grad`: whether to gather features with full gradients in distributed training, off by default.
 + Output options
 + `name`: specifies the output path. Hyperparameter logs, training logs, and produced checkpoints are all saved under `${DATAPATH}/experiments/${name}/`.
 + `save-step-frequency` and `save-epoch-frequency`: the step or epoch interval for saving checkpoints.
```

(In this capture the `mask-ratio` bullet is removed and re-added with no visible difference; the underlying change is likely whitespace-only.)
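The `grad-checkpointing` option above maps onto PyTorch's recomputation API. A minimal, hypothetical sketch of the mechanism (a toy block, independent of Chinese-CLIP; `use_reentrant` assumes a recent PyTorch):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A toy block standing in for a transformer layer. With checkpointing, the
# activations inside `block` are not stored during the forward pass; they
# are recomputed during backward, trading compute for memory.
block = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 16),
)
x = torch.randn(2, 16, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)
loss = y.sum()
loss.backward()  # gradients are identical to the non-checkpointed path

print(x.grad.shape)  # torch.Size([2, 16])
```

The memory saving grows with the depth of the checkpointed segment, which is why the README recommends it only when GPU memory is insufficient.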

README_En.md
Lines changed: 3 additions & 0 deletions

```diff
@@ -16,6 +16,7 @@ This is the Chinese version of CLIP. We use a large-scale Chinese image-text pai
 <br><br>

 # News
+* 2023.3.20 Support [gradient accumulation](#gradient-accumulation) in contrastive learning to simulate the training effect of a larger batch size.
 * 2023.2.16 Support [FlashAttention](https://github.com/HazyResearch/flash-attention) to improve training speed and reduce memory usage. See [flash_attention_En.md](flash_attention_En.md) for more information.
 * 2023.1.15 Support the conversion of Pytorch models into [ONNX](https://onnx.ai/) or [TensorRT](https://developer.nvidia.com/tensorrt) formats (and provide pretrained TensorRT models) to improve inference speed and meet deployment requirements. See [deployment_En.md](deployment_En.md) for more information.
 * 2022.12.12 Implement [FLIP](https://arxiv.org/abs/2212.00794) strategy, which can be [activated](#FLIP) during finetuning (Thanks [@zwkkk](https://github.com/zwkkk) for [the PR](https://github.com/OFA-Sys/Chinese-CLIP/pull/26) ❤️)
@@ -348,6 +349,8 @@ The configuration for training includes:
 + `grad-checkpointing`: <span id="checkpointing"></span>use [gradient checkpointing](https://pytorch.org/docs/stable/checkpoint.html), which does not keep the activations during forward computation; this strategy trades more computation and iteration time for less GPU memory cost. (`store_true` argument; just add `--grad-checkpointing` in the script to activate it. Requires Pytorch>1.8.0.)
 + `mask-ratio`: <span id="FLIP"></span>use the [FLIP](https://arxiv.org/abs/2212.00794) strategy, which randomly masks a ratio of image patches to save GPU memory and speed up training. Defaults to 0.0, which disables the strategy.
 + `use-flash-attention`: whether to use [FlashAttention](https://arxiv.org/abs/2205.14135), which can significantly speed up the finetune process and reduce memory usage. (`store_true` argument; after configuring the environment, just add `--use-flash-attention` in the script to activate it. See [flash_attention_En.md](flash_attention_En.md) for more information.)
++ `accum-freq`: <span id="gradient-accumulation"></span>gradient accumulation frequency, default 1. Specify an integer greater than 1 to enable gradient accumulation and simulate a larger batch size: if the batch size for a worker is `m`, the total batch size is `accum_freq * m * GPUs`.
++ `gather-with-grad`: whether to enable full distributed gradients for the feature gather, off by default.
 + Outputs
 + `name`: specified output path. Hyperparameter logs, training logs, and checkpoints will be saved at `${DATAPATH}/experiments/${name}/`.
 + `save-step-frequency` and `save-epoch-frequency`: the intervals for saving checkpoints.
```
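The `accum_freq * m * GPUs` arithmetic is easy to check numerically. A sketch with hypothetical numbers (per-worker batch `m = 128`, 8 GPUs, `accum_freq = 4`; none of these values come from the run scripts):

```python
# Hypothetical configuration for illustration only.
m, gpus, accum_freq = 128, 8, 4

# Total simulated batch size per optimizer step:
global_batch = accum_freq * m * gpus

# In the contrastive (InfoNCE) loss, each sample is contrasted against every
# other pair in the step, so the negatives grow with the simulated batch:
negatives_per_sample = global_batch - 1

print(global_batch, negatives_per_sample)  # 4096 4095
```

This is why accumulation helps contrastive learning specifically: more accumulated batches mean more negatives per step, not just a smoother gradient estimate.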

cn_clip/eval/zeroshot_evaluation.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -192,7 +192,7 @@ def run(model, classifier, dataloader, args):
         model_info[k] = v

     model = CLIP(**model_info)
-    convert_weights(model)
+    convert_weights(model)

     # See https://discuss.pytorch.org/t/valueerror-attemting-to-unscale-fp16-gradients/81372
     if args.precision == "amp" or args.precision == "fp32":
```

(The `convert_weights(model)` line is removed and re-added; the difference, likely indentation or whitespace, is not visible in this capture.)

cn_clip/training/main.py
Lines changed: 2 additions & 2 deletions

```diff
@@ -163,10 +163,10 @@ def main():
     )
     num_batches = data["train"].dataloader.num_batches
     if args.max_steps is not None:
-        args.max_epochs = ceil(args.max_steps / num_batches)
+        args.max_epochs = ceil(args.max_steps * args.accum_freq / num_batches)
     else:
         assert args.max_epochs is not None and args.max_epochs > 0
-        args.max_steps = num_batches * args.max_epochs
+        args.max_steps = (num_batches // args.accum_freq) * args.max_epochs
     total_steps = args.max_steps
     scheduler = cosine_lr(optimizer, args.lr, args.warmup, total_steps)
```
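The two changed lines keep `max_steps` counted in optimizer steps while each step now consumes `accum_freq` batches from the dataloader. A sketch of the arithmetic with hypothetical numbers (1000 batches per epoch, `accum_freq = 4`):

```python
from math import ceil

num_batches = 1000  # batches per epoch (hypothetical)
accum_freq = 4      # batches consumed per optimizer step

# An epoch now yields fewer optimizer steps:
steps_per_epoch = num_batches // accum_freq  # 250

# Given max_steps, the epochs needed grow by accum_freq, because each
# optimizer step walks through accum_freq batches of data:
max_steps = 600
max_epochs = ceil(max_steps * accum_freq / num_batches)  # ceil(2.4) = 3

# Given max_epochs, total optimizer steps shrink by the same factor:
max_steps_from_epochs = steps_per_epoch * 5  # 5 epochs -> 1250 steps

print(steps_per_epoch, max_epochs, max_steps_from_epochs)  # 250 3 1250
```

Without the `accum_freq` factor, a run configured by `max_steps` would silently train on `accum_freq` times more data than intended.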
172172

cn_clip/training/params.py
Lines changed: 12 additions & 0 deletions

```diff
@@ -174,6 +174,18 @@ def parse_args():
         action="store_true",
         help="Enable flash attention."
     )
+    parser.add_argument(
+        "--accum-freq",
+        type=int,
+        default=1,
+        help="Update the model every --accum-freq steps."
+    )
+    parser.add_argument(
+        "--gather-with-grad",
+        default=False,
+        action="store_true",
+        help="Enable full distributed gradients for the feature gather."
+    )
     # arguments for distributed training
     parser.add_argument(
         "--local_rank",
```
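The two new options behave like standard argparse flags. A minimal stand-alone sketch (mirroring the additions above; the help strings are paraphrased, and the sample command line is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--accum-freq",
    type=int,
    default=1,
    help="Update the model every --accum-freq steps.",
)
parser.add_argument(
    "--gather-with-grad",
    default=False,
    action="store_true",
    help="Enable full distributed gradients for the feature gather.",
)

# argparse converts dashes to underscores in attribute names, which is why
# the training code reads args.accum_freq and args.gather_with_grad.
args = parser.parse_args(["--accum-freq", "4", "--gather-with-grad"])
print(args.accum_freq, args.gather_with_grad)  # 4 True
```

With no flags given, `args.accum_freq` stays 1 and `args.gather_with_grad` stays False, so existing run scripts keep their old behavior.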

cn_clip/training/train.py
Lines changed: 99 additions & 38 deletions

```diff
@@ -8,6 +8,7 @@
 import torch
 import torch.nn as nn
 from torch.cuda.amp import autocast
+import torch.distributed.nn
 import torch.distributed as dist

 from cn_clip.clip.model import convert_state_dict
@@ -16,33 +17,45 @@
 def is_master(args):
     return args.rank == 0

-def get_loss(model, images, texts, loss_img, loss_txt, args):
-    image_features, text_features, logit_scale = model(images, texts, args.mask_ratio)
+def get_loss(model, images, texts, loss_img, loss_txt, args, accum_image_features=None, accum_text_features=None, accum_idx=-1):
+    if args.accum_freq == 1:
+        image_features, text_features, logit_scale = model(images, texts, args.mask_ratio)
+    else:
+        assert accum_image_features and accum_text_features and accum_idx != -1
+        chunk_image_features, chunk_text_features, logit_scale = model(images, texts, args.mask_ratio)
+        image_features = torch.cat(
+            accum_image_features[:accum_idx] + [chunk_image_features] + accum_image_features[accum_idx + 1:])
+        text_features = torch.cat(
+            accum_text_features[:accum_idx] + [chunk_text_features] + accum_text_features[accum_idx + 1:])
     logit_scale = logit_scale.mean()
     if args.aggregate:
         world_size = dist.get_world_size()
         rank = dist.get_rank()

         # We gather tensors from all gpus to get more negatives to contrast with.
-        gathered_image_features = [
-            torch.zeros_like(image_features) for _ in range(world_size)
-        ]
-        gathered_text_features = [
-            torch.zeros_like(text_features) for _ in range(world_size)
-        ]
-        dist.all_gather(gathered_image_features, image_features)
-        dist.all_gather(gathered_text_features, text_features)
-
-        all_image_features = torch.cat(
-            [image_features]
-            + gathered_image_features[:rank]
-            + gathered_image_features[rank + 1 :]
-        )
-        all_text_features = torch.cat(
-            [text_features]
-            + gathered_text_features[:rank]
-            + gathered_text_features[rank + 1 :]
-        )
+        if args.gather_with_grad:
+            all_image_features = torch.cat(torch.distributed.nn.all_gather(image_features), dim=0)
+            all_text_features = torch.cat(torch.distributed.nn.all_gather(text_features), dim=0)
+        else:
+            gathered_image_features = [
+                torch.zeros_like(image_features) for _ in range(world_size)
+            ]
+            gathered_text_features = [
+                torch.zeros_like(text_features) for _ in range(world_size)
+            ]
+            dist.all_gather(gathered_image_features, image_features)
+            dist.all_gather(gathered_text_features, text_features)
+
+            all_image_features = torch.cat(
+                [image_features]
+                + gathered_image_features[:rank]
+                + gathered_image_features[rank + 1 :]
+            )
+            all_text_features = torch.cat(
+                [text_features]
+                + gathered_text_features[:rank]
+                + gathered_text_features[rank + 1 :]
+            )

         # this is needed to send gradients back everywhere.
         logits_per_image = logit_scale * all_image_features @ all_text_features.t()
@@ -94,17 +107,22 @@ def train(model, data, epoch, optimizer, scaler, scheduler, args, global_trained
     if sampler is not None:
         sampler.set_epoch(epoch)

-    num_batches_per_epoch = dataloader.num_batches
+    num_steps_per_epoch = dataloader.num_batches // args.accum_freq
     data_iter = iter(dataloader)

+    if args.accum_freq > 1:
+        accum_images, accum_texts, accum_image_features, accum_text_features = [], [], [], []
+
     end = time.time()
     epoch_trained_steps = 0
-    for i in range(global_trained_steps - num_batches_per_epoch * epoch, num_batches_per_epoch):
+    for i in range(0, dataloader.num_batches):
         batch = next(data_iter)
-        step = num_batches_per_epoch * epoch + i
+
+        i_accum = i // args.accum_freq
+        step = num_steps_per_epoch * epoch + i_accum
         # reach the args.max_steps, exit training:
         if step >= args.max_steps:
-            logging.info("Stopping training due to step {} has reached max_steps {}".format(step, args.max_steps))
+            logging.info("Stopping training due to step {} has reached max_steps {}".format(step, args.max_steps // args.accum_freq))
             return epoch_trained_steps
         scheduler(step)
@@ -120,18 +138,60 @@ def train(model, data, epoch, optimizer, scaler, scheduler, args, global_trained

         m = model.module

-        # with automatic mixed precision.
-        if args.precision == "amp":
-            with autocast():
-                total_loss, acc = get_loss(model, images, texts, loss_img, loss_txt, args)
-                scaler.scale(total_loss).backward()
-                scaler.step(optimizer)
-            scaler.update()
-
-        else:
-            total_loss, acc = get_loss(model, images, texts, loss_img, loss_txt, args)
-            total_loss.backward()
-            optimizer.step()
+        if args.accum_freq == 1:
+            # with automatic mixed precision.
+            if args.precision == "amp":
+                with autocast():
+                    total_loss, acc = get_loss(model, images, texts, loss_img, loss_txt, args)
+                    scaler.scale(total_loss).backward()
+                    scaler.step(optimizer)
+                scaler.update()
+
+            else:
+                total_loss, acc = get_loss(model, images, texts, loss_img, loss_txt, args)
+                total_loss.backward()
+                optimizer.step()
+        else:
+            # First, cache the features without any gradient tracking.
+            with torch.no_grad():
+                with autocast(enabled=(args.precision == "amp")):
+                    chunk_image_features, chunk_text_features, _ = model(images, texts)
+                accum_image_features.append(chunk_image_features)
+                accum_text_features.append(chunk_text_features)
+
+                accum_images.append(images)
+                accum_texts.append(texts)
+
+            # If (i + 1) % accum_freq is not zero, move on to the next batch.
+            if ((i + 1) % args.accum_freq) > 0:
+                # FIXME this makes data time logging unreliable when accumulating
+                continue
+
+            # Now, ready to take gradients for the last accum_freq batches.
+            # Re-do the forward pass for those batches, and use the cached features from the other batches as negatives.
+            # Call backwards each time, but only step optimizer at the end.
+            optimizer.zero_grad()
+            for j in range(args.accum_freq):
+                images = accum_images[j]
+                texts = accum_texts[j]
+                with autocast(enabled=(args.precision == "amp")):
+                    # `total_loss` and `acc` are coarsely sampled, taking only the last result in the loop.
+                    # Although each result should be the same in theory, it will be slightly different in practice.
+                    total_loss, acc = get_loss(model, images, texts, loss_img, loss_txt, args, accum_image_features, accum_text_features, j)
+                if args.precision == "amp":
+                    scaler.scale(total_loss).backward()
+                else:
+                    total_loss.backward()
+
+            if args.precision == "amp":
+                scaler.step(optimizer)
+                scaler.update()
+            else:
+                optimizer.step()
+
+        # reset gradient accum, if enabled
+        if args.accum_freq > 1:
+            accum_images, accum_texts, accum_image_features, accum_text_features = [], [], [], []

         # Note: we clamp to 4.6052 = ln(100), as in the original paper.
         m.logit_scale.data = torch.clamp(m.logit_scale.data, 0, 4.6052)
@@ -142,10 +202,11 @@ def train(model, data, epoch, optimizer, scaler, scheduler, args, global_trained
         epoch_trained_steps += 1

         if is_master(args) and ((step + 1) % args.log_interval) == 0:
-            num_samples = (i + 1) * len(images) * args.world_size
+            batch_size = len(images) * args.accum_freq
+            num_samples = (i_accum + 1) * batch_size * args.world_size
             samples_per_epoch = dataloader.num_samples
-            percent_complete = 100.0 * (i + 1) / num_batches_per_epoch
+            percent_complete = 100.0 * (i_accum + 1) / num_steps_per_epoch

             logging.info(
                 f"Global Steps: {step + 1}/{args.max_steps} | " +
                 f"Train Epoch: {epoch + 1} [{num_samples}/{samples_per_epoch} ({percent_complete:.0f}%)] | " +
@@ -156,7 +217,7 @@ def train(model, data, epoch, optimizer, scaler, scheduler, args, global_trained
                 f"Batch Time: {batch_time:.3f}s | " +
                 f"LR: {optimizer.param_groups[0]['lr']:5f} | " +
                 f"logit_scale: {m.logit_scale.data:.3f} | " +
-                f"Global Batch Size: {len(images) * args.world_size}"
+                f"Global Batch Size: {batch_size * args.world_size}"
             )

         if args.val_data is not None and args.valid_step_interval is not None and ((step + 1) % args.valid_step_interval) == 0:
```
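The cache-then-re-forward scheme in `train.py` can be verified on toy tensors. The sketch below is a simplified, hypothetical setup (linear maps standing in for the two towers, a fixed logit scale, no distributed gather): it caches chunk features without gradients, re-forwards each chunk while splicing it into the cached features as negatives, and accumulates backward calls, reproducing the single large-batch gradient.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "towers" standing in for the image and text encoders.
img_enc = torch.nn.Linear(8, 4, bias=False)
txt_enc = torch.nn.Linear(8, 4, bias=False)

def encode(images, texts):
    return (F.normalize(img_enc(images), dim=-1),
            F.normalize(txt_enc(texts), dim=-1))

def clip_loss(img_f, txt_f, scale=10.0):  # fixed scale sidesteps the logit_scale subtlety
    logits = scale * img_f @ txt_f.t()
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

images, texts = torch.randn(4, 8), torch.randn(4, 8)

# Reference: one large-batch forward/backward.
i_f, t_f = encode(images, texts)
clip_loss(i_f, t_f).backward()
ref_grad = img_enc.weight.grad.clone()

# Accumulation with accum_freq = 2: cache features without gradients, then
# re-forward each chunk and splice it into the cache, as get_loss() does.
img_enc.zero_grad(); txt_enc.zero_grad()
accum_freq, chunk = 2, 2
with torch.no_grad():
    cached = [encode(images[k*chunk:(k+1)*chunk], texts[k*chunk:(k+1)*chunk])
              for k in range(accum_freq)]
for j in range(accum_freq):
    i_j, t_j = encode(images[j*chunk:(j+1)*chunk], texts[j*chunk:(j+1)*chunk])
    all_i = torch.cat([c[0] for c in cached[:j]] + [i_j] + [c[0] for c in cached[j+1:]])
    all_t = torch.cat([c[1] for c in cached[:j]] + [t_j] + [c[1] for c in cached[j+1:]])
    clip_loss(all_i, all_t).backward()  # gradients flow only through chunk j

# The accumulated gradient matches the single large-batch gradient.
print(torch.allclose(ref_grad, img_enc.weight.grad, atol=1e-5))  # True
```

This is why plain loss-averaging accumulation is not enough for contrastive learning: each chunk's loss must still see the full batch of negatives, which is exactly what the cached features provide.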

run_scripts/coco-cn_finetune_vit-b-16_rbt-base.sh
Lines changed: 2 additions & 0 deletions

```diff
@@ -46,6 +46,7 @@ context_length=52
 warmup=6
 batch_size=1024
 valid_batch_size=128
+accum_freq=1
 lr=3e-5
 wd=0.001
 max_epochs=20
@@ -75,6 +76,7 @@ python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=$
           --valid-batch-size=${valid_batch_size} \
           --valid-step-interval=${valid_step_interval} \
           --valid-epoch-interval=${valid_epoch_interval} \
+          --accum-freq=${accum_freq} \
           --lr=${lr} \
           --wd=${wd} \
           --max-epochs=${max_epochs} \
```

run_scripts/flickr30k_finetune_vit-b-16_rbt-base.sh
Lines changed: 2 additions & 0 deletions

```diff
@@ -46,6 +46,7 @@ context_length=52
 warmup=100
 batch_size=128
 valid_batch_size=128
+accum_freq=1
 lr=5e-5
 wd=0.001
 max_epochs=3 # or you can alternatively specify --max-steps
@@ -75,6 +76,7 @@ python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=$
           --valid-batch-size=${valid_batch_size} \
           --valid-step-interval=${valid_step_interval} \
           --valid-epoch-interval=${valid_epoch_interval} \
+          --accum-freq=${accum_freq} \
           --lr=${lr} \
           --wd=${wd} \
           --max-epochs=${max_epochs} \
```

run_scripts/flickr30k_finetune_vit-b-16_rbt-base_flip.sh
Lines changed: 2 additions & 0 deletions

```diff
@@ -46,6 +46,7 @@ context_length=52
 warmup=100
 batch_size=128
 valid_batch_size=128
+accum_freq=1
 lr=5e-5
 wd=0.001
 max_epochs=3 # or you can alternatively specify --max-steps
@@ -76,6 +77,7 @@ python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=$
           --valid-batch-size=${valid_batch_size} \
           --valid-step-interval=${valid_step_interval} \
           --valid-epoch-interval=${valid_epoch_interval} \
+          --accum-freq=${accum_freq} \
           --lr=${lr} \
           --wd=${wd} \
           --max-epochs=${max_epochs} \
```

run_scripts/muge_finetune_vit-b-16_rbt-base.sh
Lines changed: 2 additions & 0 deletions

```diff
@@ -46,6 +46,7 @@ context_length=52
 warmup=100
 batch_size=128
 valid_batch_size=128
+accum_freq=1
 lr=5e-5
 wd=0.001
 max_epochs=3 # or you can alternatively specify --max-steps
@@ -75,6 +76,7 @@ python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=$
           --valid-batch-size=${valid_batch_size} \
           --valid-step-interval=${valid_step_interval} \
           --valid-epoch-interval=${valid_epoch_interval} \
+          --accum-freq=${accum_freq} \
           --lr=${lr} \
           --wd=${wd} \
           --max-epochs=${max_epochs} \
```
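All four run scripts wire in `accum_freq=1`, which leaves the feature off; raising the value is how accumulation is activated. A hypothetical edit (variable names as in the scripts, values invented for illustration):

```shell
# Simulate a 4x larger batch without growing per-GPU memory for one forward pass.
batch_size=128
accum_freq=4

# The flag is forwarded to cn_clip.training.main like the other options:
extra_args="--batch-size=${batch_size} --accum-freq=${accum_freq}"
echo "${extra_args}"  # --batch-size=128 --accum-freq=4
```

Note the time cost: each optimizer step re-runs the forward pass for the accumulated batches, so wall-clock time per step grows roughly with `accum_freq`.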
