fix torch_dtype (#954)

Jintao-Huang · web-flow · commit 9074a2fd3665 · 2024-05-17T17:59:24.000+08:00
diff --git a/docs/source/LLM/命令行参数.md b/docs/source/LLM/命令行参数.md
@@ -32,10 +32,10 @@
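 - `--seed`: Global seed, default is `42`. Used to reproduce training results.
 - `--resume_from_checkpoint`: Used to resume training from a checkpoint, default is `None`. You can set it to a checkpoint path, e.g. `'output/qwen-7b-chat/vx-xxx/checkpoint-xxx'`, to resume training.
 - `--dtype`: The torch_dtype used when loading the base model, default is `'AUTO'`, i.e. the dtype is selected automatically: if the machine does not support bf16, fp16 is used; if `MODEL_MAPPING` specifies a torch_dtype for the model, that dtype is used; otherwise bf16 is used. Available values: 'bf16', 'fp16', 'fp32'.
-- `--dataset`: Used to select the training datasets, default is `[]`. The available datasets are listed in [Supported datasets](支持的模型和数据集.md#数据集). To train on multiple datasets, separate them with ',' or ' ', for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. Modelscope Hub, HuggingFace Hub, and local paths are supported, along with subset selection and dataset sampling. Each dataset is specified in the following format: `[HF or MS:]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. At minimum, only dataset_name, dataset_id, or dataset_path is required. For custom datasets, see the [dataset customization and extension document](自定义与拓展.md#自定义数据集).
-   - Supports the MS and HF hubs as well as dataset_sample. e.g. 'MS::alpaca-zh#200', 'HF::jd-sentiment-zh#200' (the default hub is controlled by the `USE_UF` environment variable, default is MS).
+- `--dataset`: Used to select the training datasets, default is `[]`. The available datasets are listed in [Supported datasets](支持的模型和数据集.md#数据集). To train on multiple datasets, separate them with ',' or ' ', for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. Modelscope Hub, HuggingFace Hub, and local paths are supported, along with subset selection and dataset sampling. Each dataset is specified in the following format: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. At minimum, only dataset_name, dataset_id, or dataset_path is required. For custom datasets, see the [dataset customization and extension document](自定义与拓展.md#自定义数据集).
+   - Supports the MS and HF hubs as well as dataset_sample. e.g. 'MS::alpaca-zh#2000', 'HF::jd-sentiment-zh#2000' (the default hub is controlled by the `USE_UF` environment variable, default is MS).
    - Finer-grained control over subsets: by default, the subsets specified at registration are used (or 'default' if none were specified at registration). e.g. 'sharegpt-gpt4'. If subsets are specified, the corresponding subsets of the dataset are used. e.g. 'sharegpt-gpt4:default/V3_format#2000'. Subsets are separated by '/'.
-   - Support for dataset_id. e.g. 'AI-ModelScope/alpaca-gpt4-data-zh#20', 'HF::llm-wizard/alpaca-gpt4-data-zh#20', hurner/alpaca-gpt4-data-zh#20, HF::shibing624/alpaca-zh#20. If the dataset_id is already registered, the preprocessing function, subsets, split, etc. specified at registration are used. Otherwise `SmartPreprocessor` is used, which supports 4 dataset formats, with the 'default' subsets and split set to 'train'. The supported dataset formats are described in the [dataset customization and extension document](自定义与拓展.md#自定义数据集).
+   - Support for dataset_id. e.g. 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id is already registered, the preprocessing function, subsets, split, etc. specified at registration are used. Otherwise `SmartPreprocessor` is used, which supports 4 dataset formats, with the 'default' subsets and split set to 'train'. The supported dataset formats are described in the [dataset customization and extension document](自定义与拓展.md#自定义数据集).
    - Support for dataset_path. e.g. '1.jsonl#5000'. (A relative path is resolved relative to the running directory.)
 - `--val_dataset`: Specifies a separate validation set, in the same format as the `dataset` argument. If this argument is used, `dataset_test_ratio` no longer takes effect.
 - `--dataset_seed`: Seed for dataset processing, default is `42`. It is applied as a random_state and does not affect the global seed.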
diff --git a/docs/source_en/LLM/Command-line-parameters.md b/docs/source_en/LLM/Command-line-parameters.md
@@ -30,10 +30,10 @@
 - `--seed`: Global seed, default is `42`. Used to reproduce training results.
 - `--resume_from_checkpoint`: For resuming training from checkpoint, default is `None`. You can set this to the path of a checkpoint, e.g. `'output/qwen-7b-chat/vx-xxx/checkpoint-xxx'`, to resume training.
 - `--dtype`: torch_dtype when loading base model, default is `'AUTO'`, i.e. intelligently select dtype: if machine does not support bf16, use fp16; if `MODEL_MAPPING` specifies torch_dtype for corresponding model, use its dtype; otherwise use bf16. Options include: 'bf16', 'fp16', 'fp32'.
-- `--dataset`: Used to select the training dataset, default is `[]`. You can see the list of available datasets [here](Supported-models-datasets.md#Datasets). If you need to train with multiple datasets, you can use ',' or ' ' to separate them, for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. It supports Modelscope Hub/HuggingFace Hub/local paths, subset selection, and dataset sampling. The specified format for each dataset is as follows: `[HF or MS:]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. The simplest case requires specifying only dataset_name, dataset_id, or dataset_path. Customizing datasets can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset)
-  - Supports MS and HF hub, as well as dataset_sample. For example, 'MS::alpaca-zh#200', 'HF::jd-sentiment-zh#200' (the default hub used is controlled by the `USE_UF` environment variable, default is MS).
+- `--dataset`: Used to select the training dataset, default is `[]`. You can see the list of available datasets [here](Supported-models-datasets.md#Datasets). If you need to train with multiple datasets, you can use ',' or ' ' to separate them, for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. It supports Modelscope Hub/HuggingFace Hub/local paths, subset selection, and dataset sampling. The specified format for each dataset is as follows: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. The simplest case requires specifying only dataset_name, dataset_id, or dataset_path. Customizing datasets can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset)
+  - Supports MS and HF hub, as well as dataset_sample. For example, 'MS::alpaca-zh#2000', 'HF::jd-sentiment-zh#2000' (the default hub used is controlled by the `USE_UF` environment variable, default is MS).
   - More fine-grained control over subsets: It uses the subsets specified during registration by default (if not specified during registration, it uses 'default'). For example, 'sharegpt-gpt4'. If subsets are specified, it uses the corresponding subset of the dataset. For example, 'sharegpt-gpt4:default/V3_format#2000'. Separated by '/'.
-  - Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#20', 'HF::llm-wizard/alpaca-gpt4-data-zh#20', hurner/alpaca-gpt4-data-zh#20, HF::shibing624/alpaca-zh#20. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, support 4 dataset formats, and use 'default' subsets, with split set to 'train'. The supported dataset formats can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
+  - Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, support 4 dataset formats, and use 'default' subsets, with split set to 'train'. The supported dataset formats can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
   - Support for dataset_path. For example, '1.jsonl#5000' (if it is a relative path, it is relative to the running directory).
 - `--val_dataset`: Specify separate validation datasets with the same format of the `dataset` argument. If using `val_dataset`, the `dataset_test_ratio` will be ignored.
 - `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as random_state, does not affect global seed.
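The `--dataset` format documented above, `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`, can be illustrated with a small parser. This is only a sketch of how such a string might be split; `DatasetSpec` and `parse_dataset_spec` are hypothetical names and are not taken from the swift code base, whose actual parsing may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetSpec:
    """Hypothetical container for one parsed --dataset entry."""
    hub: Optional[str]  # 'HF', 'MS', or None when no hub prefix is given
    name: str  # dataset_name, dataset_id, or dataset_path
    subsets: List[str] = field(default_factory=list)
    sample: Optional[int] = None  # value after '#', if present


def parse_dataset_spec(spec: str) -> DatasetSpec:
    """Split '[HF or MS::]name[:subset1/subset2/...][#sample]' into parts."""
    hub = None
    rest = spec
    # Optional hub prefix, terminated by a double colon.
    for prefix in ('HF', 'MS'):
        if rest.startswith(prefix + '::'):
            hub, rest = prefix, rest[len(prefix) + 2:]
            break
    # Optional sample count after the last '#'.
    sample = None
    if '#' in rest:
        rest, sample_str = rest.rsplit('#', 1)
        sample = int(sample_str)
    # Optional subsets after the first remaining ':', separated by '/'.
    subsets: List[str] = []
    if ':' in rest:
        rest, subset_str = rest.split(':', 1)
        subsets = [s for s in subset_str.split('/') if s]
    return DatasetSpec(hub=hub, name=rest, subsets=subsets, sample=sample)


if __name__ == '__main__':
    for example in ('MS::alpaca-zh#2000',
                    'sharegpt-gpt4:default/V3_format#2000',
                    'HF::llm-wizard/alpaca-gpt4-data-zh#2000',
                    '1.jsonl#5000'):
        print(parse_dataset_spec(example))
```

The sketch treats the first single ':' as the subset separator, so it would misread paths that themselves contain a colon (for example a Windows drive letter); a real implementation would need to handle that case.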
diff --git a/swift/llm/utils/model.py b/swift/llm/utils/model.py
@@ -4141,7 +4141,7 @@ def get_torch_dtype(model_dir: str) -> Dtype:
     torch_dtype = model_config.get('torch_dtype', None)
     if isinstance(torch_dtype, str):
         torch_dtype = eval(f'torch.{torch_dtype}')
-    if torch_dtype == torch.float32:
+    if torch_dtype in {torch.float32, None}:
         torch_dtype = torch.float16
     return torch_dtype
 
diff --git a/tests/llm/test_run.py b/tests/llm/test_run.py
@@ -444,7 +444,7 @@ def test_trainer(self):
         dataset = MsDataset.load('clue', subset_name='tnews')
         num_labels = max(dataset['train']['label']) + 1
         model = Model.from_pretrained(model_dir, task='text-classification', num_labels=num_labels)
-        train_dataset, val_dataset = dataset['train'].to_hf_dataset(), dataset['validation']
+        train_dataset, val_dataset = dataset['train'].to_hf_dataset(), dataset['validation'].to_hf_dataset()
         train_dataset: HfDataset = train_dataset.select(range(100))
         val_dataset: HfDataset = val_dataset.select(range(20))
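The `get_torch_dtype` hunk in `swift/llm/utils/model.py` changes the fallback condition from `torch_dtype == torch.float32` to `torch_dtype in {torch.float32, None}`, so a model config with no `torch_dtype` entry now also resolves to fp16 instead of returning `None`. A minimal sketch of that rule follows; it uses plain dtype names rather than `torch` objects so that it runs without PyTorch, and `resolve_dtype_name` is a hypothetical helper, not a function in swift.

```python
from typing import Optional


def resolve_dtype_name(model_config: dict) -> str:
    """Mirror the fallback rule from the diff, using dtype names as strings.

    Assumption: the config stores torch_dtype as a name such as 'bfloat16',
    or omits the key entirely.
    """
    name: Optional[str] = model_config.get('torch_dtype', None)
    # Before the fix only 'float32' fell back to 'float16';
    # after the fix a missing value (None) falls back as well.
    if name in {'float32', None}:
        return 'float16'
    return name


if __name__ == '__main__':
    print(resolve_dtype_name({'torch_dtype': 'bfloat16'}))
    print(resolve_dtype_name({'torch_dtype': 'float32'}))
    print(resolve_dtype_name({}))
```

With PyTorch installed, a name returned this way can be turned into a dtype object with `getattr(torch, name)`.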