Hardware: 3 × RTX 3090 (24 GB each)
Environment:
windows 10
python 3.11.7
torch 2.2.1
cuda 12.4
Because NCCL is not available on Windows, I changed line 226 of state.py to self.backend = "gloo".
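For reference, the idea behind that one-line change can be sketched as a backend fallback. This is a simplified illustration with a hypothetical pick_backend helper, not the actual accelerate code; a real check could also query torch.distributed.is_nccl_available():

```python
import platform

def pick_backend() -> str:
    """Choose a torch.distributed backend usable on this OS (sketch)."""
    if platform.system() == "Windows":
        # NCCL has no Windows build; gloo works for CPU and GPU tensors
        # over TCP, so it is the usual Windows fallback.
        return "gloo"
    return "nccl"  # fastest choice for multi-GPU Linux

print(pick_backend())
```

Hard-coding "gloo" in accelerate/state.py has the same effect on a Windows-only machine, but a conditional like the one above survives library upgrades more gracefully.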
The launch command:
torchrun --standalone --nnodes=1 --nproc_per_node=3 finetune_demo/finetune_hf.py finetune_demo/data/ THUDM/chatglm3-6b finetune_demo/configs/lora.yaml finetune_demo/configs/ds_zero_2.json
The error output:
(base) PS C:\Users\ZETTAKIT\Desktop\chatglm_lora> torchrun --standalone --nnodes=1 --nproc_per_node=3 finetune_demo/finetune_hf.py finetune_demo/data/ THUDM/chatglm3-6b finetune_demo/configs/lora.yaml finetune_demo/configs/ds_zero_2.json
[2024-03-26 19:08:34,133] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-03-26 19:08:34,237] torch.distributed.run: [WARNING]
[2024-03-26 19:08:34,237] torch.distributed.run: [WARNING] *****************************************
[2024-03-26 19:08:34,237] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-26 19:08:34,237] torch.distributed.run: [WARNING] *****************************************
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:47<00:00, 6.73s/it]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
Loading checkpoint shards: 57%|██████████████████████████████████████████████████████████████████▊ | 4/7 [00:37<00:28, 9.39s/it]train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 65992
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\accelerate\accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
Loading checkpoint shards: 86%|████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 6/7 [01:03<00:11, 11.86s/it]finetune_demo/configs/ds_zero_2.json The specified checkpoint sn(finetune_demo/configs/ds_zero_2.json) has not been saved. Please search for the correct chkeckpoint in the model output directory
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:12<00:00, 10.39s/it]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
Loading checkpoint shards: 86%|████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 6/7 [01:06<00:11, 11.81s/it]train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 65992
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:10<00:00, 10.13s/it]
Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\accelerate\accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 65992
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1223
})
C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\accelerate\accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
finetune_demo/configs/ds_zero_2.json The specified checkpoint sn(finetune_demo/configs/ds_zero_2.json) has not been saved. Please search for the correct chkeckpoint in the model output directory
max_steps is given, it will override any value given in num_train_epochs
finetune_demo/configs/ds_zero_2.json The specified checkpoint sn(finetune_demo/configs/ds_zero_2.json) has not been saved. Please search for the correct chkeckpoint in the model output directory
***** Running Prediction *****
Num examples = 1223
Batch size = 16
[W socket.cpp:464] [c10d] The server socket has failed to bind to [win10-4]:50049 (system error: 10048 - 通常每个套接字地址(协议/网络地址/端口)只允许使用一次。 [Only one usage of each socket address (protocol/network address/port) is normally permitted.]).
[W socket.cpp:464] [c10d] The server socket has failed to bind to win10-4:50049 (system error: 10013 - 以一种访问权限不允许的方式做了一个访问套接字的尝试。 [An attempt was made to access a socket in a way forbidden by its access permissions.]).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\anaconda3\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\multiprocessing\spawn.py", line 131, in _main
    prepare(preparation_data)
  File "C:\ProgramData\anaconda3\Lib\multiprocessing\spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\ProgramData\anaconda3\Lib\multiprocessing\spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\ZETTAKIT\Desktop\chatglm_lora\finetune_demo\finetune_hf.py", line 148, in <module>
    class FinetuningConfig(object):
  File "C:\Users\ZETTAKIT\Desktop\chatglm_lora\finetune_demo\finetune_hf.py", line 155, in FinetuningConfig
    default_factory=Seq2SeqTrainingArguments(output_dir='./output')
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 129, in __init__
  File "C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\transformers\training_args.py", line 1551, in __post_init__
    and (self.device.type != "cuda")
         ^^^^^^^^^^^
  File "C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\transformers\training_args.py", line 2027, in device
    return self._setup_devices
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\transformers\utils\generic.py", line 63, in __get__
    cached = self.fget(obj)
             ^^^^^^^^^^^^^^
  File "C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\transformers\training_args.py", line 1963, in _setup_devices
    self.distributed_state = PartialState(
                             ^^^^^^^^^^^^^
  File "C:\Users\ZETTAKIT\AppData\Roaming\Python\Python311\site-packages\accelerate\state.py", line 227, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "C:\ProgramData\anaconda3\Lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\site-packages\torch\distributed\rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\site-packages\torch\distributed\rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 150: invalid start byte
There seem to be two separate problems:
1.
[W socket.cpp:464] [c10d] The server socket has failed to bind to [win10-4]:50049 (system error: 10048 - 通常每个套接字地址(协议/网络地址/端口)只允许使用一次。 [Only one usage of each socket address (protocol/network address/port) is normally permitted.]).
[W socket.cpp:464] [c10d] The server socket has failed to bind to win10-4:50049 (system error: 10013 - 以一种访问权限不允许的方式做了一个访问套接字的尝试。 [An attempt was made to access a socket in a way forbidden by its access permissions.]).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
2.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 150: invalid start byte
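On problem 1: error 10048 (WSAEADDRINUSE) means two processes tried to bind the same rendezvous port. Judging from the traceback, the Windows spawn start method re-imports finetune_hf.py in a child process, which re-runs the module-level Seq2SeqTrainingArguments and tries to open a second process-group store on the already-bound port. A minimal, standalone reproduction of the double bind (not the repo's code):

```python
import socket

# The first socket plays the role of the torchrun rendezvous server;
# the second bind on the same (address, port) fails with 10048 on
# Windows (EADDRINUSE / errno 98 on Linux).
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))          # let the OS pick a free port
first.listen()
port = first.getsockname()[1]

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))  # same address/port -> fails
except OSError as exc:
    print("bind failed, errno:", exc.errno)
finally:
    second.close()
    first.close()
```

If re-execution on spawn is indeed what is happening, the usual Windows fix is to make sure everything in finetune_hf.py that starts training runs under an if __name__ == "__main__": guard, so spawned children do not re-run the module-level setup.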
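On problem 2: the UnicodeDecodeError looks consistent with a locale issue rather than a separate bug. On Chinese Windows the Winsock error text is localized and stored in the ANSI code page (GBK), and the TCPStore path apparently decodes that text as UTF-8. A small demonstration of the decode failure (an illustration of this hypothesis, not the torch code path):

```python
# The localized 10048 message from the log, encoded the way a Chinese
# Windows system stores it (GBK); decoding those bytes as UTF-8 fails
# just like the TCPStore call in the traceback.
localized = "通常每个套接字地址(协议/网络地址/端口)只允许使用一次。".encode("gbk")
try:
    localized.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # e.g. "'utf-8' codec can't decode byte ..."
```

So the UnicodeDecodeError is likely just the 10048 socket failure being re-reported through a decoder that assumes UTF-8; fixing the port conflict should make it disappear.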
Any help would be appreciated!