
Commit 2c38d03 (merge, 2 parents: 0925e08 + 0d4013d)

Merge pull request #101 from DtYXs/pytorch2.0_adaption

Adaptation to PyTorch 2.0

12 files changed: +32 −27 lines

README.md

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@
 <br><br>

 # News
+* 2023.5.9 Chinese-CLIP now supports PyTorch 2.0.
 * 2023.3.20 Added [gradient accumulation](#gradient_accumulation) support for contrastive learning, simulating the training effect of a larger batch size.
 * 2023.2.16 Added [FlashAttention](https://github.com/HazyResearch/flash-attention) support, improving training speed and reducing memory usage; see [flash_attention.md](flash_attention.md) for details.
 * 2023.1.15 Added support for deploying [ONNX](https://onnx.ai/) and [TensorRT](https://developer.nvidia.com/tensorrt) models (pretrained TensorRT models provided), improving feature-inference speed for deployment; see [deployment.md](deployment.md) for details.

README_En.md

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@ This is the Chinese version of CLIP. We use a large-scale Chinese image-text pai
 <br><br>

 # News
+* 2023.5.9 Chinese-CLIP has been adapted to PyTorch 2.0.
 * 2023.3.20 Support [gradient accumulation](#gradient-accumulation) in contrastive learning to simulate the training effect of a larger batch size.
 * 2023.2.16 Support [FlashAttention](https://github.com/HazyResearch/flash-attention) to improve training speed and reduce memory usage. See [flash_attention_En.md](flash_attention_En.md) for more information.
 * 2023.1.15 Support the conversion of PyTorch models into [ONNX](https://onnx.ai/) or [TensorRT](https://developer.nvidia.com/tensorrt) formats (and provide pretrained TensorRT models) to improve inference speed and meet deployment requirements. See [deployment_En.md](deployment_En.md) for more information.

cn_clip/training/main.py

Lines changed: 6 additions & 3 deletions

@@ -48,7 +48,7 @@ def main():
     args = parse_args()

     # Set distributed group
-    args.local_device_rank = max(args.local_rank, 0)
+    args.local_device_rank = int(os.environ["LOCAL_RANK"])
     torch.cuda.set_device(args.local_device_rank)
     args.device = torch.device("cuda", args.local_device_rank)

@@ -108,7 +108,7 @@ def main():

     if args.grad_checkpointing:
         assert not torch_version_str_compare_lessequal(torch.__version__, "1.8.0"), \
-                "Currently our grad_checkpointing is not compatible with torch version <= 1.8.0."
+            "Currently our grad_checkpointing is not compatible with torch version <= 1.8.0."
         model.set_grad_checkpointing()
         logging.info("Grad-checkpointing activated.")

@@ -133,6 +133,9 @@ def main():
     # In other cases, set find_unused_parameters to False
     find_unused_parameters = torch_version_str_compare_lessequal(torch.__version__, "1.8.0")
     model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_device_rank], find_unused_parameters=find_unused_parameters)
+    # Have to set this when activating grad checkpointing in Pytorch >= 2.0.0
+    if args.grad_checkpointing and not torch_version_str_compare_lessequal(torch.__version__, "1.14.0"):
+        model._set_static_graph()

     if args.precision == "fp16":
         convert_weights(model)

@@ -218,7 +221,7 @@ def main():
         model.load_state_dict(sd)
         # Restore the epoch and steps info, reload the dataset and dataloader for the resume epoch
         if not args.reset_data_offset:
-            start_epoch = checkpoint["epoch"] - 1
+            start_epoch = checkpoint["epoch"]
             steps = checkpoint["step"]
             data = get_data(args,
                             epoch_id=start_epoch,
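The hunks above gate behavior on `torch_version_str_compare_lessequal`, whose implementation is not part of this diff. A minimal sketch under the assumption that it compares numeric release tuples and ignores local build suffixes such as `+cu117` (the real helper in `cn_clip` may differ); the `"1.14.0"` threshold matches the fact that PyTorch 2.0 pre-release nightlies were versioned 1.14 before the rename:

```python
import re

def torch_version_str_compare_lessequal(version_a, version_b):
    # Hypothetical sketch: compare two version strings by their numeric
    # release components, ignoring suffixes such as "+cu117" or "a0".
    def release_tuple(version):
        # Cut at the first letter or "+" sign, then parse the dotted digits.
        core = re.split(r"[+a-zA-Z]", version, maxsplit=1)[0].rstrip(".")
        return tuple(int(part) for part in core.split(".") if part)

    a, b = release_tuple(version_a), release_tuple(version_b)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so "1.8" compares like "1.8.0"
    b += (0,) * (width - len(b))
    return a <= b

print(torch_version_str_compare_lessequal("2.0.1+cu117", "1.14.0"))  # → False
```

With this semantics, `not torch_version_str_compare_lessequal(torch.__version__, "1.14.0")` is true exactly for PyTorch 2.0 and later, which is when the commit calls `model._set_static_graph()` for grad checkpointing under DDP.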

cn_clip/training/params.py

Lines changed: 0 additions & 6 deletions

@@ -187,12 +187,6 @@ def parse_args():
         help="enable full distributed gradient for feature gather"
     )
     # arguments for distributed training
-    parser.add_argument(
-        "--local_rank",
-        type=int,
-        default=-1,
-        help="For distributed training: local_rank."
-    )
     parser.add_argument(
         "--skip-aggregate",
         default=False,
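With `--local_rank` removed from the CLI, the launcher is expected to hand the per-node rank to each worker through the environment instead, which is the convention `torch.distributed.launch --use_env` and `torchrun` follow. A minimal sketch of that handoff (`get_local_device_rank` is an illustrative helper, not part of the repo):

```python
import os

def get_local_device_rank(default=0):
    # torchrun (and torch.distributed.launch --use_env) export LOCAL_RANK,
    # RANK, and WORLD_SIZE into each worker's environment; read the
    # per-node device rank from there instead of a --local_rank argument.
    return int(os.environ.get("LOCAL_RANK", default))

os.environ["LOCAL_RANK"] = "3"   # simulate a launcher-spawned worker
print(get_local_device_rank())   # → 3
```

Note that the `main.py` change in this commit uses `os.environ["LOCAL_RANK"]` directly, so it fails fast with a `KeyError` when the script is run outside a launcher rather than silently falling back to rank 0.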

flash_attention.md

Lines changed: 9 additions & 6 deletions

@@ -6,9 +6,12 @@ Chinese-CLIP training now supports acceleration through [FlashAttention]

 ## Environment preparation

-+ Nvidia GPUs with the **Volta** or **Ampere** architecture (e.g. A100, RTX 3090, T4, RTX 2080); see [this table](https://en.wikipedia.org/wiki/CUDA#GPUs_supported) for the GPU models of each Nvidia architecture.
-+ CUDA 11 with NVCC.
-+ **FlashAttention**: install FlashAttention by running `pip install flash-attn`; see the [FlashAttention project repository](https://github.com/HazyResearch/flash-attention).
++ Nvidia GPUs with the **Turing**, **Ampere**, **Ada**, or **Hopper** architecture (e.g. H100, A100, RTX 3090, T4, RTX 2080); see [this table](https://en.wikipedia.org/wiki/CUDA#GPUs_supported) for the GPU models of each Nvidia architecture.
++ CUDA 11.4 or above.
++ PyTorch 1.12 or above.
++ **FlashAttention**: install FlashAttention by running `pip install flash-attn`.
+
+See the [FlashAttention project repository](https://github.com/HazyResearch/flash-attention) for more information.

 ## Use it in Chinese-CLIP!

@@ -17,7 +20,7 @@

 ## Training speed and memory usage comparison

-Enabling FlashAttention significantly speeds up Chinese-CLIP finetuning and reduces its memory usage without affecting accuracy. Our experiments were run on a machine with 8 A100 GPUs (80GB memory).
+Enabling FlashAttention significantly speeds up Chinese-CLIP finetuning and reduces its memory usage without affecting accuracy. Our experiments were run on a machine with 8 A100 GPUs (80GB memory), using FlashAttention 0.2.8 and PyTorch 1.10.1.

 For finetuning at the same batch size, we list the FP16 batch time and memory usage of each model scale before and after enabling FlashAttention: training becomes faster and more memory-efficient, with larger gains for bigger models.

@@ -31,7 +34,7 @@
 <td width="120%">CN-CLIP<sub>RN50</sub></td><td>1200*8</td><td>1.710</td><td>1.680</td><td>1.02×</td>
 </tr>
 <tr align="center">
-<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>400*8</td><td>1.477</td><td>0.960</td><td>1.54×</td>
+<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>450*8</td><td>1.477</td><td>0.960</td><td>1.54×</td>
 </tr>
 <tr align="center">
 <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>128*8</td><td>1.293</td><td>0.785</td><td>1.65×</td>

@@ -55,7 +58,7 @@
 <td width="120%">CN-CLIP<sub>RN50</sub></td><td>1200*8</td><td>79</td><td>75</td>
 </tr>
 <tr align="center">
-<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>400*8</td><td>80</td><td>56</td>
+<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>450*8</td><td>80</td><td>56</td>
 </tr>
 <tr align="center">
 <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>128*8</td><td>77</td><td>50</td>

flash_attention_En.md

Lines changed: 9 additions & 6 deletions

@@ -6,9 +6,12 @@ Chinese-CLIP now supports the acceleration of training process through [FlashAtt

 ## Environmental Preparation

-+ Nvidia GPUs **with Volta or Ampere architecture** (such as A100, RTX 3090, T4, and RTX 2080). Please refer to [this document](https://en.wikipedia.org/wiki/CUDA#GPUs_supported) for the corresponding GPUs of each Nvidia architecture.
-+ CUDA 11, NVCC
-+ **FlashAttention**: Install FlashAttention by executing `pip install flash-attn`. Please refer to the [FlashAttention project repository](https://github.com/HazyResearch/flash-attention).
++ Nvidia GPUs **with Turing, Ampere, Ada, or Hopper architecture** (such as H100, A100, RTX 3090, T4, and RTX 2080). Please refer to [this document](https://en.wikipedia.org/wiki/CUDA#GPUs_supported) for the corresponding GPUs of each Nvidia architecture.
++ CUDA 11.4 and above.
++ PyTorch 1.12 and above.
++ **FlashAttention**: Install FlashAttention by executing `pip install flash-attn`.
+
+Please refer to the [FlashAttention project repository](https://github.com/HazyResearch/flash-attention) for more information.

 ## Use it in Chinese-CLIP!

@@ -17,7 +20,7 @@ Applying FlashAttention to the finetune process of Chinese-CLIP is very simple,

 ## Training Speed and Memory Usage Comparison

-Enabling FlashAttention can significantly speed up the finetune process and reduce the memory usage of Chinese-CLIP without affecting the precision. Our experiments are conducted on an 8-card A100 GPU (80GB memory) machine.
+Enabling FlashAttention can significantly speed up the finetune process and reduce the memory usage of Chinese-CLIP without affecting the precision. Our experiments are conducted on an 8-card A100 GPU (80GB memory) machine, with FlashAttention 0.2.8 and PyTorch 1.10.1.

 We present the comparison of the batch time and memory usage of FP16 precision finetune for each scale model. The improvement in training speed and reduction in memory usage are more significant for larger models.

@@ -31,7 +34,7 @@ We present the comparison of the batch time and memory usage of FP16 precision f
 <td width="120%">CN-CLIP<sub>RN50</sub></td><td>1200*8</td><td>1.710</td><td>1.680</td><td>1.02×</td>
 </tr>
 <tr align="center">
-<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>400*8</td><td>1.477</td><td>0.960</td><td>1.54×</td>
+<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>450*8</td><td>1.477</td><td>0.960</td><td>1.54×</td>
 </tr>
 <tr align="center">
 <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>128*8</td><td>1.293</td><td>0.785</td><td>1.65×</td>

@@ -55,7 +58,7 @@ We present the comparison of the batch time and memory usage of FP16 precision f
 <td width="120%">CN-CLIP<sub>RN50</sub></td><td>1200*8</td><td>79</td><td>75</td>
 </tr>
 <tr align="center">
-<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>400*8</td><td>80</td><td>56</td>
+<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>450*8</td><td>80</td><td>56</td>
 </tr>
 <tr align="center">
 <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>128*8</td><td>77</td><td>50</td>
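The architecture requirement above can be checked at runtime: Turing corresponds to compute capability 7.5, Ampere to 8.0/8.6, Ada to 8.9, and Hopper to 9.0. A small sketch, where the 7.5 threshold is our reading of the FlashAttention requirements rather than an official API:

```python
def supports_flash_attention(compute_capability):
    # Turing (sm_75), Ampere (sm_80/86), Ada (sm_89), and Hopper (sm_90)
    # all have compute capability >= 7.5; Volta (sm_70) and older do not.
    major, minor = compute_capability
    return (major, minor) >= (7, 5)

# On a CUDA machine one would pass the tuple returned by
# torch.cuda.get_device_capability(), e.g.:
#   supports_flash_attention(torch.cuda.get_device_capability(0))
print(supports_flash_attention((8, 0)))  # A100 (Ampere) → True
```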

run_scripts/coco-cn_finetune_vit-b-16_rbt-base.sh

Lines changed: 1 addition & 1 deletion

@@ -57,7 +57,7 @@ text_model=RoBERTa-wwm-ext-base-chinese
 use_augment="--use-augment"
 # use_augment=""

-python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
+python3 -m torch.distributed.launch --use_env --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
           --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
           --train-data=${train_data} \
           --val-data=${val_data} \

run_scripts/flickr30k_finetune_vit-b-16_rbt-base.sh

Lines changed: 1 addition & 1 deletion

@@ -57,7 +57,7 @@ text_model=RoBERTa-wwm-ext-base-chinese
 use_augment="--use-augment"
 # use_augment=""

-python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
+python3 -m torch.distributed.launch --use_env --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
           --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
           --train-data=${train_data} \
           --val-data=${val_data} \

run_scripts/flickr30k_finetune_vit-b-16_rbt-base_flip.sh

Lines changed: 1 addition & 1 deletion

@@ -58,7 +58,7 @@ mask_ratio=0.5 # use flip: set mask ratio
 use_augment="--use-augment"
 # use_augment=""

-python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
+python3 -m torch.distributed.launch --use_env --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
           --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
           --train-data=${train_data} \
           --val-data=${val_data} \

run_scripts/muge_finetune_vit-b-16_rbt-base.sh

Lines changed: 1 addition & 1 deletion

@@ -57,7 +57,7 @@ text_model=RoBERTa-wwm-ext-base-chinese
 use_augment="--use-augment"
 # use_augment=""

-python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
+python3 -m torch.distributed.launch --use_env --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
           --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
           --train-data=${train_data} \
           --val-data=${val_data} \
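The only change in each launch script is the added `--use_env` flag, which makes `torch.distributed.launch` export `LOCAL_RANK` into each worker's environment instead of appending a `--local_rank` argument. A minimal illustration of that environment-variable handoff (simulating one worker, not the actual launcher):

```shell
# Simulate what the launcher does for one worker: set LOCAL_RANK in the
# environment, then let the training script read it via os.environ.
LOCAL_RANK=2 python3 -c 'import os; print(int(os.environ["LOCAL_RANK"]))'
# prints: 2
```

On recent PyTorch versions, `torchrun` is the successor to `python -m torch.distributed.launch` and always uses the environment-variable convention.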
