
[Bug]: torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use #9018

@qy0720


Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

Adding NPROC_PER_NODE=8 \ to the training script raises the following error: torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use

If I remove NPROC_PER_NODE=8 and --deepspeed zero2, training runs normally.
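For anyone triaging this: EADDRINUSE means some other process is already bound to port 29500, which is the default master port for torch.distributed launches. A leftover process from an earlier run is one possible cause, though nothing in this report confirms it. The sketch below is a quick way to check, using only the Python standard library. The helper names port_is_free and pick_free_port are made up for illustration, and passing the result through a MASTER_PORT environment variable is an assumption about how the launcher picks its port, not something stated in this issue.

```python
import socket

DEFAULT_PORT = 29500  # the port named in the error message


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another socket already holds the port.
            return False
        return True


def pick_free_port() -> int:
    """Ask the OS for a currently unused TCP port (port 0 means 'any')."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    if port_is_free(DEFAULT_PORT):
        print(f"port {DEFAULT_PORT} is free")
    else:
        # Hypothetical workaround: export MASTER_PORT with this value
        # before launching, assuming the launcher honors that variable.
        print(f"port {DEFAULT_PORT} is busy; candidate: {pick_free_port()}")
```

If the port turns out to be busy, it may be worth finding and stopping the process that holds it before retrying. Note that the check is only a snapshot: another process can still take the port between the check and the launch.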

How to Reproduce

# 2 * 21GiB
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=16 \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model Qwen/Qwen3-VL-4B-Instruct \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#10000' \
              'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \
              'swift/VideoChatGPT:Generic#2000' \
    --load_from_cache_file true \
    --split_dataset_ratio 0.01 \
    --tuner_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --attn_impl flash_attn \
    --padding_free true \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --freeze_aligner true \
    --packing true \
    --gradient_checkpointing true \
    --vit_gradient_checkpointing false \
    --gradient_accumulation_steps 2 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --deepspeed zero2 \
    --dataset_num_proc 4 \
    --dataloader_num_workers 4

Additional Information

No response
