@@ -1,11 +1,12 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
 import inspect
+import os
 import re
 from contextlib import contextmanager
 from copy import deepcopy
 from functools import partial, wraps
 from types import MethodType
-from typing import Any, Dict, List, Literal, Optional, Tuple, Union
+from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, TypeVar, Union
 
 import json
 import torch
@@ -1539,6 +1540,32 @@ class Llama3Template(Llama3TemplateMixin, Template):
     Template(['<s>'], ['<|User|>:{{QUERY}}\n<|Bot|>:'], ['<eoa>\n'], ['<eoa>'], INTERNLM_SYSTEM,
              ['<s><|System|>:{{SYSTEM}}\n']))
 
+_T = TypeVar('_T')
+
+_log_set = set()  # log once
+
+
+def get_env_args(args_name: str,
+                 type_func: Callable[[str], _T] = int,
+                 default_value: Optional[_T] = None) -> Optional[_T]:
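+    """Resolve `args_name` from the upper-cased environment variable of the
+    same name if it is set (parsed with `type_func`); otherwise fall back to
+    `default_value`. E.g. `HD_NUM=16` makes `get_env_args('hd_num', int, 24)`
+    return 16. The resolved value is logged once per distinct message."""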
+    args_name_upper = args_name.upper()
+    value = os.getenv(args_name_upper)
+    if value is None:
+        value = default_value
+        log_info = (f'Setting {args_name}: {default_value}. '
+                    f'You can adjust this hyperparameter through the environment variable: `{args_name_upper}`.')
+    else:
+        value = type_func(value)
+        log_info = f'Using environment variable `{args_name_upper}`, setting {args_name}: {value}.'
+    if log_info not in _log_set:
+        _log_set.add(log_info)
+        logger.info(log_info)
+    return value
+
 
 class Internlm2Template(ChatmlTemplate):
     system = INTERNLM_SYSTEM
@@ -1595,12 +1618,15 @@ def _encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
 
         if self.version == 'v2.5':
             hd_num = 24
-            Image_transform = get_class_from_dynamic_module('ixc_utils.Image_transform', self.tokenizer.model_dir)
             if len(images) > 1:
                 hd_num = 6
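+            # The `HD_NUM` environment variable overrides the default hd_num (24, or 6 for multi-image).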
+            hd_num = get_env_args('hd_num', int, hd_num)
+            Image_transform = get_class_from_dynamic_module('ixc_utils.Image_transform', self.tokenizer.model_dir)
             images = [Image_transform(image, hd_num=hd_num) for image in images]
         elif self.version == 'v2-4khd':
             hd_num = 55
+            hd_num = get_env_args('hd_num', int, hd_num)
             HD_transform = get_class_from_dynamic_module('ixc_utils.HD_transform', self.tokenizer.model_dir)
             images = [HD_transform(image, hd_num=hd_num) for image in images]
         images = [self.model.vis_processor(image).to(dtype) for image in images]
@@ -1723,7 +1748,10 @@ def _encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
         images = example.get('images')
         if images:
             labels = inputs.get('labels')
-            pixel_values_images = [transform_image(image) for image in images]
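+            # `INPUT_SIZE` and `MAX_NUM` environment variables override the 448/12 defaults.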
+            input_size = get_env_args('input_size', int, 448)
+            max_num = get_env_args('max_num', int, 12)
+            pixel_values_images = [transform_image(image, input_size, max_num) for image in images]
             pixel_values = torch.cat(pixel_values_images, dim=0).to(self.model.dtype)
             image_bs = pixel_values.shape[0]
 
@@ -1784,7 +1811,9 @@ def replace_tag(self, media_type, index, example) -> List[Context]:
         if media_type == 'image':
             return image_context
         elif media_type == 'video':
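+            # The `VIDEO_SEGMENTS` environment variable overrides the default number of sampled frames.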
-            load_video = partial(load_video_internvl, num_segments=self.video_segments)
+            video_segments = get_env_args('video_segments', int, self.video_segments)
+            load_video = partial(load_video_internvl, num_segments=video_segments)
             return _replace_video2image(load_video, example, lambda i: [f'Frame{i + 1}: '] + image_context)
 
     def replace_object(self, index: int, example: Dict[str, Any]) -> List[Context]:
@@ -1816,7 +1844,10 @@ def _encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
         images = example.get('images')
         if images:
            has_video = bool(example.get('videos'))
-            pixel_values = [transform_image(image, max_num=1 if has_video else 12) for image in images]
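+            # `max_num` defaults to 1 for video inputs; `INPUT_SIZE`/`MAX_NUM` env vars override.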
+            input_size = get_env_args('input_size', int, 448)
+            max_num = get_env_args('max_num', int, 1 if has_video else 12)
+            pixel_values = [transform_image(image, input_size, max_num) for image in images]
             num_patches = [pv.shape[0] for pv in pixel_values]
             pixel_values = torch.cat(pixel_values).to(self.model.dtype)
         else:
@@ -1924,7 +1954,9 @@ def _encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
         processor = self.tokenizer.processor
         images = example.get('images') or []
         assert len(images) == 1, 'Florence series models only support input with a single image.'
-        image_tensors = transform_image(images[0])
+        input_size = get_env_args('input_size', int, 448)
+        max_num = get_env_args('max_num', int, 12)
+        image_tensors = transform_image(images[0], input_size, max_num)
         example['_image'] = image_tensors
 
         # process bbox
@@ -2789,6 +2821,8 @@ def _encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
             use_image_id = False
             max_slice_nums = 1  # or 2
 
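+        # The `MAX_SLICE_NUMS` environment variable overrides the value chosen above.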
+        max_slice_nums = get_env_args('max_slice_nums', int, max_slice_nums)
         input_ids = inputs['input_ids']
         labels = inputs['labels']
         idx_list = _findall(input_ids, -100)