Please ask your question
The environment on my machine is:
pip list | grep paddle
paddle2onnx 2.0.1
paddlefsl 1.1.0
paddlenlp 3.0.0b4.post20250825
paddlepaddle-xpu 3.3.0.dev20250912
I ran the following script:
#!/bin/bash
# Define the log directory
LOG_DIR="/workspace/pre_training/logs"
LOG_FILE="${LOG_DIR}/pre_trainning_$(date +%Y%m%d_%H%M%S).log"
# Create the log directory (if it does not exist)
mkdir -p "${LOG_DIR}"
# Switch to the working directory
cd /workspace/PaddleNLP/llm || exit 1
# Run the script in the background and write its output to the log file
nohup python -u -m paddle.distributed.launch \
    --gpus "0,1,2,3,4,5,6,7" \
    run_pretrain.py /workspace/PaddleNLP/llm/config/aiXcoder/pretrain_argument.json \
    > "${LOG_FILE}" 2>&1 &
# Tell the user where the log is
echo "Training started. Logs are being written to: ${LOG_FILE}"
# Follow the log in real time
tail -f "${LOG_FILE}"
The error is:
Traceback (most recent call last):
  File "/workspace/PaddleNLP/llm/run_pretrain.py", line 598, in <module>
    main()
  File "/workspace/PaddleNLP/llm/run_pretrain.py", line 576, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/trainer/trainer.py", line 863, in train
    train_dataloader = self.get_train_dataloader()
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/trainer/trainer.py", line 1761, in get_train_dataloader
    return _DataLoader(
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/data/dist_dataloader.py", line 96, in __init__
    self._dataloader = paddle.io.DataLoader(
  File "/usr/local/lib/python3.10/dist-packages/paddle/io/reader.py", line 517, in __init__
    assert timeout >= 0, "timeout should be a non-negative value"
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
LAUNCH INFO 2025-09-14 16:55:11,700 Exit code 1
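The cause is a version mismatch between PaddleNLP and PaddlePaddle: the arguments passed to paddle.io.DataLoader are misaligned, so timeout ends up as None. After changing the paddle.io.DataLoader initialization in paddlenlp/data/dist_dataloader.py to use keyword arguments, training runs normally, but I am still worried that there may be other mismatch problems.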
self._dataloader = paddle.io.DataLoader(
    dataset=dataset,
    feed_list=feed_list,
    places=places,
    return_list=return_list,
    batch_sampler=batch_sampler,
    batch_size=batch_size,
    shuffle=shuffle,
    drop_last=drop_last,
    collate_fn=collate_fn,
    num_workers=num_workers,
    use_buffer_reader=use_buffer_reader,
    reader_buffer_size=2,
    prefetch_factor=prefetch_factor,
    use_shared_memory=use_shared_memory,
    timeout=timeout,
    worker_init_fn=worker_init_fn,
    persistent_workers=persistent_workers,
)
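For anyone hitting the same error, here is a minimal sketch of how this kind of misalignment can produce it. `loader_v1` and `loader_v2` are hypothetical stand-ins, not the real `paddle.io.DataLoader` signatures. The sketch assumes that the newer version inserted one parameter (called `reader_buffer_size` here, after the keyword in my fix) ahead of `timeout`. That is my inference from the traceback, not something I confirmed in the Paddle source.

```python
# Hypothetical, simplified signatures for illustration only.
# They are NOT copied from paddle.io.DataLoader.

def loader_v1(dataset, use_buffer_reader=True, prefetch_factor=2,
              use_shared_memory=True, timeout=0, worker_init_fn=None):
    """Stand-in for the signature the caller was written against."""
    assert timeout >= 0, "timeout should be a non-negative value"
    return {"timeout": timeout, "worker_init_fn": worker_init_fn}


def loader_v2(dataset, use_buffer_reader=True, reader_buffer_size=2,
              prefetch_factor=2, use_shared_memory=True, timeout=0,
              worker_init_fn=None):
    """Stand-in for a newer signature with one parameter inserted mid-list."""
    assert timeout >= 0, "timeout should be a non-negative value"
    return {"timeout": timeout, "worker_init_fn": worker_init_fn}


def call_positionally(loader):
    # Arguments in v1 order: the last two are timeout=0, worker_init_fn=None.
    # Against v2 every value after use_buffer_reader shifts one slot left,
    # so the trailing None lands in `timeout`.
    return loader([1, 2, 3], True, 2, True, 0, None)


def call_with_keywords(loader):
    # Keyword arguments bind by name, so an inserted parameter cannot shift them.
    return loader(dataset=[1, 2, 3], use_buffer_reader=True, prefetch_factor=2,
                  use_shared_memory=True, timeout=0, worker_init_fn=None)


if __name__ == "__main__":
    print(call_positionally(loader_v1))   # works against the old order
    try:
        call_positionally(loader_v2)      # None reaches `timeout >= 0`
    except TypeError as exc:
        print("TypeError:", exc)
    print(call_with_keywords(loader_v2))  # works against the new order
```

The keyword call is also why the modified `dist_dataloader.py` above keeps working if parameters are reordered. It does not protect against a parameter being renamed or removed.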