Skip to content

Commit 9074a2f

Browse files
authored
fix torch_dtype (#954)
1 parent 1133773 commit 9074a2f

File tree

4 files changed

+8
-8
lines changed

4 files changed

+8
-8
lines changed

docs/source/LLM/命令行参数.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -32,10 +32,10 @@
3232
- `--seed`: 全局的seed, 默认使用`42`. 用于复现训练效果.
3333
- `--resume_from_checkpoint`: 用于断点续训, 默认为`None`. 你可以将其设置为checkpoint的路径, 例如: `'output/qwen-7b-chat/vx-xxx/checkpoint-xxx'`, 来进行断点续训.
3434
- `--dtype`: 基模型载入时的torch_dtype, 默认为`'AUTO'`, 即智能选择dtype: 如果机器不支持bf16, 则使用fp16, 如果`MODEL_MAPPING`中对应模型有指定torch_dtype, 则使用其对应dtype, 否则使用bf16. 你可以选择的值包括: 'bf16', 'fp16', 'fp32'.
35-
- `--dataset`: 用于选择训练的数据集, 默认为`[]`. 可以选择的数据集可以查看[支持的数据集](支持的模型和数据集.md#数据集). 如果需要使用多个数据集进行训练, 你可以使用','或者' '进行分割, 例如: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. 支持Modelscope Hub/HuggingFace Hub/本地路径、subsets选择与数据集采样, 每个数据集指定格式如下: `[HF or MS:]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`, 最简只需要指定dataset_name、dataset_id或者dataset_path即可. 自定义数据集可以查看[数据集的自定义与拓展文档](自定义与拓展.md#自定义数据集).
36-
- 支持MS和HF hub, 以及dataset_sample的支持. e.g. 'MS::alpaca-zh#200', 'HF::jd-sentiment-zh#200' (默认使用的hub, 由`USE_UF`环境变量控制, 默认MS).
35+
- `--dataset`: 用于选择训练的数据集, 默认为`[]`. 可以选择的数据集可以查看[支持的数据集](支持的模型和数据集.md#数据集). 如果需要使用多个数据集进行训练, 你可以使用','或者' '进行分割, 例如: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. 支持Modelscope Hub/HuggingFace Hub/本地路径、subsets选择与数据集采样, 每个数据集指定格式如下: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`, 最简只需要指定dataset_name、dataset_id或者dataset_path即可. 自定义数据集可以查看[数据集的自定义与拓展文档](自定义与拓展.md#自定义数据集).
36+
- 支持MS和HF hub, 以及dataset_sample的支持. e.g. 'MS::alpaca-zh#2000', 'HF::jd-sentiment-zh#2000' (默认使用的hub, 由`USE_UF`环境变量控制, 默认MS).
3737
- 对subsets更细粒度的控制: 默认使用注册时指定的subsets(注册时未指定则使用'default'). e.g. 'sharegpt-gpt4'. 如果指定subsets则使用对应子集的数据集. e.g. 'sharegpt-gpt4:default/V3_format#2000'. 使用'/'进行分隔.
38-
- dataset_id的支持. e.g. 'AI-ModelScope/alpaca-gpt4-data-zh#20', 'HF::llm-wizard/alpaca-gpt4-data-zh#20', hurner/alpaca-gpt4-data-zh#20, HF::shibing624/alpaca-zh#20. 如果dataset_id已经注册,则会使用注册时的预处理函数、subsets、split等. 否则使用`SmartPreprocessor`, 支持4种数据集格式, 并使用'default'的subsets, split设置为'train'. 支持的数据集格式可以查看[数据集的自定义与拓展文档](自定义与拓展.md#自定义数据集).
38+
- dataset_id的支持. e.g. 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. 如果dataset_id已经注册,则会使用注册时的预处理函数、subsets、split等. 否则使用`SmartPreprocessor`, 支持4种数据集格式, 并使用'default'的subsets, split设置为'train'. 支持的数据集格式可以查看[数据集的自定义与拓展文档](自定义与拓展.md#自定义数据集).
3939
- dataset_path的支持. e.g. '1.jsonl#5000'. (如果是相对路径,则为相对于运行目录的相对路径).
4040
- `--val_dataset`: 用于指定单独的验证集, 格式和`dataset`参数相同, 如果使用本参数, 则`dataset_test_ratio`不再生效.
4141
- `--dataset_seed`: 用于指定数据集处理的seed, 默认为`42`. 以random_state形式存在, 不影响全局seed.

docs/source_en/LLM/Command-line-parameters.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -30,10 +30,10 @@
3030
- `--seed`: Global seed, default is `42`. Used to reproduce training results.
3131
- `--resume_from_checkpoint`: For resuming training from checkpoint, default is `None`. You can set this to the path of a checkpoint, e.g. `'output/qwen-7b-chat/vx-xxx/checkpoint-xxx'`, to resume training.
3232
- `--dtype`: torch_dtype when loading base model, default is `'AUTO'`, i.e. intelligently select dtype: if machine does not support bf16, use fp16; if `MODEL_MAPPING` specifies torch_dtype for corresponding model, use its dtype; otherwise use bf16. Options include: 'bf16', 'fp16', 'fp32'.
33-
- `--dataset`: Used to select the training dataset, default is `[]`. You can see the list of available datasets [here](Supported-models-datasets.md#Datasets). If you need to train with multiple datasets, you can use ',' or ' ' to separate them, for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. It supports Modelscope Hub/HuggingFace Hub/local paths, subset selection, and dataset sampling. The specified format for each dataset is as follows: `[HF or MS:]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. The simplest case requires specifying only dataset_name, dataset_id, or dataset_path. Customizing datasets can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset)
34-
- Supports MS and HF hub, as well as dataset_sample. For example, 'MS::alpaca-zh#200', 'HF::jd-sentiment-zh#200' (the default hub used is controlled by the `USE_UF` environment variable, default is MS).
33+
- `--dataset`: Used to select the training dataset, default is `[]`. You can see the list of available datasets [here](Supported-models-datasets.md#Datasets). If you need to train with multiple datasets, you can use ',' or ' ' to separate them, for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. It supports Modelscope Hub/HuggingFace Hub/local paths, subset selection, and dataset sampling. The specified format for each dataset is as follows: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. The simplest case requires specifying only dataset_name, dataset_id, or dataset_path. Customizing datasets can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset)
34+
- Supports MS and HF hub, as well as dataset_sample. For example, 'MS::alpaca-zh#2000', 'HF::jd-sentiment-zh#2000' (the default hub used is controlled by the `USE_UF` environment variable, default is MS).
3535
- More fine-grained control over subsets: It uses the subsets specified during registration by default (if not specified during registration, it uses 'default'). For example, 'sharegpt-gpt4'. If subsets are specified, it uses the corresponding subset of the dataset. For example, 'sharegpt-gpt4:default/V3_format#2000'. Separated by '/'.
36-
- Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#20', 'HF::llm-wizard/alpaca-gpt4-data-zh#20', hurner/alpaca-gpt4-data-zh#20, HF::shibing624/alpaca-zh#20. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, support 4 dataset formats, and use 'default' subsets, with split set to 'train'. The supported dataset formats can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
36+
- Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, support 4 dataset formats, and use 'default' subsets, with split set to 'train'. The supported dataset formats can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
3737
- Support for dataset_path. For example, '1.jsonl#5000' (if it is a relative path, it is relative to the running directory).
3838
- `--val_dataset`: Specify separate validation datasets with the same format of the `dataset` argument. If using `val_dataset`, the `dataset_test_ratio` will be ignored.
3939
- `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as random_state, does not affect global seed.

swift/llm/utils/model.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4141,7 +4141,7 @@ def get_torch_dtype(model_dir: str) -> Dtype:
41414141
torch_dtype = model_config.get('torch_dtype', None)
41424142
if isinstance(torch_dtype, str):
41434143
torch_dtype = eval(f'torch.{torch_dtype}')
4144-
if torch_dtype == torch.float32:
4144+
if torch_dtype in {torch.float32, None}:
41454145
torch_dtype = torch.float16
41464146
return torch_dtype
41474147

tests/llm/test_run.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -444,7 +444,7 @@ def test_trainer(self):
444444
dataset = MsDataset.load('clue', subset_name='tnews')
445445
num_labels = max(dataset['train']['label']) + 1
446446
model = Model.from_pretrained(model_dir, task='text-classification', num_labels=num_labels)
447-
train_dataset, val_dataset = dataset['train'].to_hf_dataset(), dataset['validation']
447+
train_dataset, val_dataset = dataset['train'].to_hf_dataset(), dataset['validation'].to_hf_dataset()
448448
train_dataset: HfDataset = train_dataset.select(range(100))
449449
val_dataset: HfDataset = val_dataset.select(range(20))
450450

0 commit comments

Comments
 (0)