Checklist
Bug Description
Adding NPROC_PER_NODE=8 \ to the training script triggers this error: torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
If I remove NPROC_PER_NODE=8 and --deepspeed zero2, training runs normally.
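EADDRINUSE on port 29500 means that another process already holds that port when the distributed rendezvous server tries to listen on it. The snippet below is a minimal diagnostic sketch, not part of ms-swift or PyTorch: `port_is_free` and `find_free_port` are hypothetical helper names, and the check simply attempts a plain TCP bind, which may not match every detail of how the training launcher opens its socket.

```python
import socket


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
    return True


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    default_port = 29500  # the port named in the error message
    if port_is_free(default_port):
        print(f"port {default_port} is free")
    else:
        print(f"port {default_port} is in use; a free alternative: {find_free_port()}")
```

If the port turns out to be busy, a leftover process from an earlier run is one possible cause. torchrun accepts a `--master_port` option for choosing a different port; whether a `MASTER_PORT` environment variable is forwarded by the launcher used here is an assumption to verify.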
How to Reproduce
# 2 * 21GiB
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=16 \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model Qwen/Qwen3-VL-4B-Instruct \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#10000' \
              'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \
              'swift/VideoChatGPT:Generic#2000' \
    --load_from_cache_file true \
    --split_dataset_ratio 0.01 \
    --tuner_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --attn_impl flash_attn \
    --padding_free true \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --freeze_aligner true \
    --packing true \
    --gradient_checkpointing true \
    --vit_gradient_checkpointing false \
    --gradient_accumulation_steps 2 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --deepspeed zero2 \
    --dataset_num_proc 4 \
    --dataloader_num_workers 4
Additional Information
No response