I ran the following command:
!CUDA_VISIBLE_DEVICES=1 /home/crux/miniconda3/envs/transformers/bin/python finetune_hf.py data/AdvertiseGen_fix /home/crux/AI/LLM/LLM-quickstart/ChatGLM3-6B/ChatGLM3/chatglm3-6b configs/lora.yaml
Result:
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:01<00:00, 5.89it/s]
/home/crux/miniconda3/envs/transformers/lib/python3.12/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/crux/miniconda3/envs/transformers/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
--> Sanity check
'[gMASK]': 64790 -> -100
'sop': 64792 -> -100
'<|user|>': 64795 -> -100
……
……
……
'萌': 56842 -> 56842
'。': 31155 -> 31155
'': 2 -> 2
/home/crux/miniconda3/envs/transformers/lib/python3.12/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
Num examples = 114,599
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 3,000
Number of trainable parameters = 1,949,696
0%| | 0/3000 [00:00<?, ?it/s]
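A few notes on this output. The bitsandbytes warning means the installed wheel was built without CUDA support, so 8-bit optimizers and GPU quantization are unavailable; plain LoRA training still runs, but any quantized path would fail. A quick way to tell whether the problem is the wheel or the environment (a minimal sketch; recent bitsandbytes releases also ship a self-check via `python -m bitsandbytes`):

```python
import torch

# True means PyTorch itself can see the GPU; if so, the CPU-only
# bitsandbytes wheel is the problem, not the CUDA setup.
print(torch.cuda.is_available())

import bitsandbytes  # a CPU-only build prints the warning above at import time
# Reinstalling a CUDA-enabled build (e.g. `pip install -U bitsandbytes`)
# usually makes the warning go away.
```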
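The trainable-parameter count is consistent with LoRA rank 8 on ChatGLM3's fused `query_key_value` projection: 28 layers × 8 × (4096 + 4608) = 1,949,696, about 0.031% of the 6.2B total, and the line itself is what PEFT's `print_trainable_parameters()` emits. A sketch of where it comes from; the rank, alpha, and target module below are assumptions (the authoritative values live in `configs/lora.yaml`):

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base_model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

lora_config = LoraConfig(
    r=8,                                 # assumed rank; 28 * 8 * (4096 + 4608) matches the log
    lora_alpha=32,                       # assumed scaling factor
    target_modules=["query_key_value"],  # ChatGLM3's fused attention projection
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.0312...
```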
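The sanity check prints (token, input_id -> label) triples. Prompt-side tokens such as '[gMASK]', 'sop', and '<|user|>' are mapped to the label -100, while response tokens like '萌' keep their own ids: -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so only the response contributes to the loss. A minimal sketch of the masking idea (hypothetical helper, not the script's actual code):

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(prompt_ids: list[int], response_ids: list[int]) -> torch.Tensor:
    """Mask the prompt so only response tokens contribute to the loss."""
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return torch.tensor(labels)

# e.g. '[gMASK]' (64790) becomes -100, while '萌' (56842) stays 56842
print(build_labels([64790, 64792, 64795], [56842, 31155, 2]))
```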
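Finally, the Accelerate FutureWarning is only a deprecation notice and does not affect this run; the replacement it asks for is a `DataLoaderConfiguration` object, roughly as below (using the exact values from the warning; the import path may differ slightly across accelerate versions):

```python
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# The replacement the warning suggests, with the values it printed:
dataloader_config = DataLoaderConfiguration(
    dispatch_batches=None,
    split_batches=False,
    even_batches=True,
    use_seedable_sampler=True,
)
accelerator = Accelerator(dataloader_config=dataloader_config)
```

The "max_steps is given" line is likewise expected: 3,000 optimization steps at a total batch size of 4 touch at most 12,000 of the 114,599 training examples, so the run ends well inside the first epoch.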