Hardware: 3 × RTX 3090 (24 GB each)
Environment:
windows 10
python 3.11.7
torch 2.2.1
cuda 12.4
Because NCCL is not available on Windows, I changed line 226 of state.py to self.backend = "gloo".
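For reference, the idea behind that one-line change can be sketched as a backend fallback. This is a simplified illustration with a hypothetical pick_backend helper, not the actual accelerate code; a real check could also query torch.distributed.is_nccl_available():

```python
import platform

def pick_backend() -> str:
    """Choose a torch.distributed backend usable on this OS (sketch)."""
    if platform.system() == "Windows":
        # NCCL has no Windows build; gloo works for CPU and GPU tensors
        # over TCP, so it is the usual Windows fallback.
        return "gloo"
    return "nccl"  # fastest choice for multi-GPU Linux

print(pick_backend())
```

Hard-coding "gloo" in accelerate/state.py has the same effect on a Windows-only machine, but a conditional like the one above survives library upgrades more gracefully.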
The launch command:
torchrun --standalone --nnodes=1 --nproc_per_node=3 finetune_demo/finetune_hf.py finetune_demo/data/ THUDM/chatglm3-6b finetune_demo/configs/lora.yaml finetune_demo/configs/ds_zero_2.json
The error output:
(base) PS C:\Users\ZETTAKIT\Desktop\chatglm_lora> torchrun --standalone --nnodes=1 --nproc_per_node=3 finetune_demo/finetune_hf.py finetune_demo/data/ THUDM/chatglm3-6b finetune_demo/configs/lora.yaml finetune_demo/configs/ds_zero_2.json
[2024-03-26 19:08:34,133] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-03-26 19:08:34,237] torch.distributed.run: [WARNING]
[2024-03-26 19:08:34,237] torch.distributed.run: [WARNING] *****************************************
[2024-03-26 19:08:34,237] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-26 19:08:34,237] torch.distributed.run: [WARNING] *****************************************
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:47<00:00, 6.73s/it]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
Loading checkpoint shards: 57%|██████████████████████████████████████████████████████████████████▊ | 4/7 [00:37<00:28, 9.39s/it]train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 65992
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\accelerate\accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
Loading checkpoint shards: 86%|████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 6/7 [01:03<00:11, 11.86s/it]finetune_demo/configs/ds_zero_2.json The specified checkpoint sn(finetune_demo/configs/ds_zero_2.json) has not been saved. Please search for the correct chkeckpoint in the model output directory
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:12<00:00, 10.39s/it]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
Loading checkpoint shards: 86%|████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 6/7 [01:06<00:11, 11.81s/it]train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 65992
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:10<00:00, 10.13s/it]
Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\accelerate\accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 65992
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\accelerate\accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
finetune_demo/configs/ds_zero_2.json The specified checkpoint sn(finetune_demo/configs/ds_zero_2.json) has not been saved. Please search for the correct chkeckpoint in the model output directory
max_steps is given, it will override any value given in num_train_epochs
finetune_demo/configs/ds_zero_2.json The specified checkpoint sn(finetune_demo/configs/ds_zero_2.json) has not been saved. Please search for the correct chkeckpoint in the model output directory
***** Running Prediction *****
Num examples = 1223
Batch size = 16
[W socket.cpp:464] [c10d] The server socket has failed to bind to [win10-4]:50049 (system error: 10048 - 通常每个套接字地址(协议/网络地址/端口)只允许使用一次。 [Only one usage of each socket address (protocol/network address/port) is normally permitted.]).
[W socket.cpp:464] [c10d] The server socket has failed to bind to win10-4:50049 (system error: 10013 - 以一种访问权限不允许的方式做了一个访问套接字的尝试。 [An attempt was made to access a socket in a way forbidden by its access permissions.]).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\anaconda3\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\multiprocessing\spawn.py", line 131, in _main
    prepare(preparation_data)
  File "C:\ProgramData\anaconda3\Lib\multiprocessing\spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\ProgramData\anaconda3\Lib\multiprocessing\spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\ZETTAKIT\Desktop\chatglm_lora\finetune_demo\finetune_hf.py", line 148, in <module>
    class FinetuningConfig(object):
  File "C:\Users\ZETTAKIT\Desktop\chatglm_lora\finetune_demo\finetune_hf.py", line 155, in FinetuningConfig
    default_factory=Seq2SeqTrainingArguments(output_dir='./output')
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 129, in __init__
  File "C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\transformers\training_args.py", line 1551, in __post_init__
    and (self.device.type != "cuda")
         ^^^^^^^^^^^
  File "C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\transformers\training_args.py", line 2027, in device
    return self._setup_devices
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\transformers\utils\generic.py", line 63, in __get__
    cached = self.fget(obj)
             ^^^^^^^^^^^^^^
  File "C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\transformers\training_args.py", line 1963, in _setup_devices
    self.distributed_state = PartialState(
                             ^^^^^^^^^^^^^
  File "C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\accelerate\state.py", line 227, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "C:\ProgramData\anaconda3\Lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\site-packages\torch\distributed\rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\site-packages\torch\distributed\rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 150: invalid start byte
There seem to be two separate problems:
1.
[W socket.cpp:464] [c10d] The server socket has failed to bind to [win10-4]:50049 (system error: 10048 - 通常每个套接字地址(协议/网络地址/端口)只允许使用一次。 [Only one usage of each socket address (protocol/network address/port) is normally permitted.]).
[W socket.cpp:464] [c10d] The server socket has failed to bind to win10-4:50049 (system error: 10013 - 以一种访问权限不允许的方式做了一个访问套接字的尝试。 [An attempt was made to access a socket in a way forbidden by its access permissions.]).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
2.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 150: invalid start byte
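On problem 1: error 10048 (WSAEADDRINUSE) means two processes tried to bind the same rendezvous port. Judging from the traceback, the Windows spawn start method re-imports finetune_hf.py in a child process, which re-runs the module-level Seq2SeqTrainingArguments and tries to open a second process-group store on the already-bound port. A minimal, standalone reproduction of the double bind (not the repo's code):

```python
import socket

# The first socket plays the role of the torchrun rendezvous server;
# the second bind on the same (address, port) fails with 10048 on
# Windows (EADDRINUSE / errno 98 on Linux).
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))          # let the OS pick a free port
first.listen()
port = first.getsockname()[1]

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))  # same address/port -> fails
except OSError as exc:
    print("bind failed, errno:", exc.errno)
finally:
    second.close()
    first.close()
```

If re-execution on spawn is indeed what is happening, the usual Windows fix is to make sure everything in finetune_hf.py that starts training runs under an if __name__ == "__main__": guard, so spawned children do not re-run the module-level setup.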
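On problem 2: the UnicodeDecodeError looks consistent with a locale issue rather than a separate bug. On Chinese Windows the Winsock error text is localized and stored in the ANSI code page (GBK), and the TCPStore path apparently decodes that text as UTF-8. A small demonstration of the decode failure (an illustration of this hypothesis, not the torch code path):

```python
# The localized 10048 message from the log, encoded the way a Chinese
# Windows system stores it (GBK); decoding those bytes as UTF-8 fails
# just like the TCPStore call in the traceback.
localized = "通常每个套接字地址(协议/网络地址/端口)只允许使用一次。".encode("gbk")
try:
    localized.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # e.g. "'utf-8' codec can't decode byte ..."
```

So the UnicodeDecodeError is likely just the 10048 socket failure being re-reported through a decoder that assumes UTF-8; fixing the port conflict should make it disappear.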
Any help would be appreciated!