Description
Launch script:
#!/bin/bash
#SBATCH --job-name=sft_sql_codes # name
#SBATCH --nodes=1 # nodes
#SBATCH -w wuhan-gpu-[17]
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=8 # number of cores per tasks
#SBATCH --gres=gpu:8 # number of gpus
#SBATCH --gpus-per-task=8 # number of gpus
export GPUS_PER_NODE=8
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=9901
srun --jobid $SLURM_JOBID bash -c '
accelerate launch train_causal_lm.py \
--per_device_train_batch_size 4 \
--block_size 4096 \
--seed 42 \
--pretrained_model_name_or_path seeklhy/codes-7b \
--epochs 4 \
--lr 5e-6 \
--warmup_ratio 0.05 \
--checkpointing_steps 100000 \
--mode sft \
--output_ckpt_dir ./ckpts/codes-7b-bird-with-evidence \
--text2sql_data_dir ./sft_bird_with_evidence_train_text2sql.json \
--table_num 6 \
--column_num 10
'
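For reference, the batch figures reported in the log below appear to follow directly from these settings: 8 processes × per_device_train_batch_size 4 = 32 sequences per batch, and 32 sequences × block_size 4096 = 131072 tokens per batch.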
Partial output from the run log:
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 8
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in --num_processes=1.
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
accelerator.is_main_process: True
accelerator.device: cuda:0
Namespace(per_device_train_batch_size=4, block_size=4096, seed=42, pretrained_model_name_or_path='seeklhy/codes-7b', epochs=4, lr=5e-06, warmup_ratio=0.05, checkpointing_steps=100000, tensorboard_log_dir='./train_logs', mode='sft', output_ckpt_dir='./ckpts/codes-7b-bird-with-evidence', save_all_states=False, pt_data_dir='./data/corpus.bin', resume_from_checkpoint=None, resume_tag=None, text2sql_data_dir='./sft_bird_with_evidence_train_text2sql.json', table_num=6, column_num=10)
tokens per batch: 131072
sequences per batch: 32
using LLM from: seeklhy/codes-7b
accelerator.is_main_process: False
accelerator.device: cuda:1
accelerator.is_main_process: False
accelerator.device: cuda:5
accelerator.is_main_process: False
accelerator.device: cuda:2
accelerator.is_main_process: False
accelerator.device: cuda:3
accelerator.is_main_process: False
accelerator.device: cuda:6
accelerator.is_main_process: False
accelerator.device: cuda:7
accelerator.is_main_process: False
accelerator.device: cuda:4
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB. GPU 0 has a total capacty of 79.33 GiB of which 5.53 GiB is free. Including non-PyTorch memory, this process has 73.78 GiB memory in use. Of the allocated memory 66.49 GiB is allocated by PyTorch, and 6.31 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
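Both the accelerate warning and the OOM message above point at knobs that may be relevant here. The sketch below only illustrates those two printed suggestions and is not a verified fix: it passes the launch values explicitly instead of relying on defaults, and it sets PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb as the OOM message recommends for fragmentation. Switching --mixed_precision from the default 'no' to bf16, and the 128 MiB split size, are my assumptions, not values from the original run.
# Hypothetical variant of the srun body above; only the flags named in the
# log output are added, all other arguments stay as in the original script.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128   # suggested by the OOM message; 128 is an arbitrary example value
srun --jobid $SLURM_JOBID bash -c '
accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --mixed_precision bf16 \
    --dynamo_backend no \
    train_causal_lm.py \
    --per_device_train_batch_size 4 \
    ...   # remaining arguments unchanged from the script above
'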