Replies: 2 comments
- #253: you can get answers from more people by asking there.
- Has this been resolved?
```
/lib/python3.10/site-packages/transformers/trainer.py", line 2383, in _save_checkpoint
    os.rename(staging_output_dir, output_dir)
FileExistsError: [Errno 17] File exists:xxxx
```

Multi-GPU fine-tuning raises this error, but single-GPU fine-tuning runs without it and saves its checkpoints normally.
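My guess at the cause: under torchrun, all four ranks run `_save_checkpoint` and race to `os.rename` the same staging directory on a shared filesystem, so whichever rank loses the race hits `FileExistsError`. A minimal sketch of the kind of guard I mean (`promote_checkpoint` is a hypothetical helper, not the actual transformers fix) would rename only on rank 0 and tolerate the collision:

```python
import os
import torch.distributed as dist

def promote_checkpoint(staging_output_dir: str, output_dir: str) -> None:
    """Hypothetical guard around the rename in the traceback above;
    promote_checkpoint is an illustrative name, not a transformers API."""
    # On a shared filesystem every rank sees the same staging directory,
    # so only let the main process publish the checkpoint.
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return
    try:
        os.rename(staging_output_dir, output_dir)
    except FileExistsError:
        # Another process already renamed the staging directory; the
        # checkpoint is in place, so the collision is harmless.
        pass
```

Later transformers releases appear to guard this rename themselves, so upgrading (or pinning a version without the staging-directory logic) may also avoid it.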
Here is my training script:
```bash
set -ex

PRE_SEQ_LEN=128
LR=2e-2
NUM_GPUS=4
MAX_SEQ_LEN=2048
DEV_BATCH_SIZE=1
GRAD_ACCUMULARION_STEPS=16
MAX_STEP=1000
SAVE_INTERVAL=500

DATESTR=`date +%Y%m%d-%H%M%S`
RUN_NAME=test1

BASE_MODEL_PATH=/data/resources/chatglm3_6B
DATASET_PATH=medical_prompt.json
OUTPUT_DIR=output/${RUN_NAME}-${DATESTR}-${PRE_SEQ_LEN}-${LR}

mkdir -p $OUTPUT_DIR

torchrun --standalone --nnodes=1 --nproc_per_node=$NUM_GPUS finetune.py \
    --train_format multi-turn \
    --train_file $DATASET_PATH \
    --max_seq_length $MAX_SEQ_LEN \
    --preprocessing_num_workers 1 \
    --model_name_or_path $BASE_MODEL_PATH \
    --output_dir $OUTPUT_DIR \
    --per_device_train_batch_size $DEV_BATCH_SIZE \
    --gradient_accumulation_steps $GRAD_ACCUMULARION_STEPS \
    --max_steps $MAX_STEP \
    --logging_steps 1 \
    --save_steps $SAVE_INTERVAL \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN 2>&1 | tee ${OUTPUT_DIR}/train.log
```
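For reference: with this configuration the effective global batch size is DEV_BATCH_SIZE × GRAD_ACCUMULARION_STEPS × NUM_GPUS = 1 × 16 × 4 = 64 sequences per optimizer step, and a checkpoint is written every 500 of the 1000 steps.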