[Question]: Is there a version compatibility mapping between paddlepaddle-xpu and paddlenlp? #11089

@YoctoHan

Please ask your question

The environment on my machine is:

pip list | grep paddle
paddle2onnx          2.0.1
paddlefsl            1.1.0
paddlenlp            3.0.0b4.post20250825
paddlepaddle-xpu     3.3.0.dev20250912
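
When reporting a version pairing like the one above, the same information can also be collected from inside the Python interpreter that actually runs the training job. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of PaddlePaddle or PaddleNLP.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this interpreter's environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names taken from the `pip list` output in this issue.
    names = ["paddlepaddle-xpu", "paddlenlp", "paddle2onnx", "paddlefsl"]
    for name, ver in installed_versions(names).items():
        print(f"{name:20s} {ver}")
```

Running this with the same `python` binary that launches `run_pretrain.py` avoids confusion when several environments are present on one machine.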

I run the following script:

#!/bin/bash

# Log directory and timestamped log file
LOG_DIR="/workspace/pre_training/logs"
LOG_FILE="${LOG_DIR}/pre_trainning_$(date +%Y%m%d_%H%M%S).log"

# Create the log directory if it does not exist
mkdir -p "${LOG_DIR}"

# Switch to the working directory
cd /workspace/PaddleNLP/llm || exit 1

# Run the training script in the background and write its output to the log file
nohup python -u -m paddle.distributed.launch \
    --gpus "0,1,2,3,4,5,6,7" \
    run_pretrain.py /workspace/PaddleNLP/llm/config/aiXcoder/pretrain_argument.json \
    > "${LOG_FILE}" 2>&1 &

# Tell the user where the log is
echo "Training started. Logs are being written to: ${LOG_FILE}"

# Follow the log in real time
tail -f "${LOG_FILE}"

The error is:

Traceback (most recent call last):
  File "/workspace/PaddleNLP/llm/run_pretrain.py", line 598, in <module>
    main()
  File "/workspace/PaddleNLP/llm/run_pretrain.py", line 576, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/trainer/trainer.py", line 863, in train
    train_dataloader = self.get_train_dataloader()
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/trainer/trainer.py", line 1761, in get_train_dataloader
    return _DataLoader(
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/data/dist_dataloader.py", line 96, in __init__
    self._dataloader = paddle.io.DataLoader(
  File "/usr/local/lib/python3.10/dist-packages/paddle/io/reader.py", line 517, in __init__
    assert timeout >= 0, "timeout should be a non-negative value"
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
LAUNCH INFO 2025-09-14 16:55:11,700 Exit code 1
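
The last frame of the traceback can be reproduced without Paddle. The sketch below copies only the assertion text shown in the traceback; `check_timeout` and `classify` are hypothetical helpers for illustration. It shows why the failure surfaces as a `TypeError` rather than an `AssertionError`: the comparison `None >= 0` raises before the assert can evaluate its condition.

```python
def check_timeout(timeout):
    # Same assertion text as the line reported in the traceback
    # (paddle/io/reader.py, line 517); everything else here is illustrative.
    assert timeout >= 0, "timeout should be a non-negative value"
    return timeout


def classify(timeout):
    """Report which exception, if any, the check raises for a given value."""
    try:
        check_timeout(timeout)
        return "ok"
    except AssertionError:
        return "AssertionError"
    except TypeError:
        return "TypeError"


if __name__ == "__main__":
    for value in (0, -1, None):
        print(repr(value), "->", classify(value))
```

A `TypeError` at this line therefore indicates that `timeout` arrived as `None`, not that a negative number was passed.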

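The cause is a version mismatch between PaddleNLP and PaddlePaddle: the arguments passed to paddle.io.DataLoader are shifted out of position, so timeout ends up as None. After I changed the paddle.io.DataLoader initialization in paddlenlp/data/dist_dataloader.py to the following, training runs normally, but I am still worried that there may be other mismatch problems.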

self._dataloader = paddle.io.DataLoader(
    dataset=dataset,
    feed_list=feed_list,
    places=places,
    return_list=return_list,
    batch_sampler=batch_sampler,
    batch_size=batch_size,
    shuffle=shuffle,
    drop_last=drop_last,
    collate_fn=collate_fn,
    num_workers=num_workers,
    use_buffer_reader=use_buffer_reader,
    reader_buffer_size=2,
    prefetch_factor=prefetch_factor,
    use_shared_memory=use_shared_memory,
    timeout=timeout,
    worker_init_fn=worker_init_fn,
    persistent_workers=persistent_workers,
)
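
To illustrate why passing every argument by keyword, as in the call above, avoids this class of failure, here is a minimal sketch in plain Python. The two `loader_*` functions are hypothetical stand-ins and NOT the real `paddle.io.DataLoader` signatures; the issue does not state which parameter actually moved. The hypothetical helper `unsupported_kwargs` shows one way to check a keyword call against whatever signature is installed, which may help when looking for other mismatches.

```python
import inspect


# Hypothetical "older" signature, for illustration only.
def loader_old(dataset, batch_size=1, num_workers=0, timeout=0, worker_init_fn=None):
    return {"timeout": timeout, "worker_init_fn": worker_init_fn}


# Hypothetical "newer" signature: a parameter is inserted before `timeout`.
def loader_new(dataset, batch_size=1, num_workers=0, prefetch_factor=2,
               timeout=0, worker_init_fn=None):
    return {"timeout": timeout, "worker_init_fn": worker_init_fn}


def unsupported_kwargs(func, kwargs):
    """Return the sorted keyword names that `func` does not accept."""
    params = inspect.signature(func).parameters
    # A function that takes **kwargs accepts any keyword name.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return sorted(name for name in kwargs if name not in params)


if __name__ == "__main__":
    # A positional call written against the old signature silently shifts:
    # 0 lands in prefetch_factor and None lands in timeout.
    print(loader_new("data", 8, 2, 0, None))
    # The keyword call keeps each value attached to its name.
    print(loader_new(dataset="data", batch_size=8, num_workers=2,
                     timeout=0, worker_init_fn=None))
    # Keywords the older function would reject.
    print(unsupported_kwargs(loader_old, {"timeout": 0, "prefetch_factor": 2}))
```

With keyword arguments, a removed or renamed parameter fails immediately with an explicit "unexpected keyword argument" error instead of a shifted value. Applying `unsupported_kwargs` to the installed `paddle.io.DataLoader` and the keyword set used in `dist_dataloader.py` would be one possible check, assuming the installed class exposes an inspectable signature.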
