Describe the bug
When saving a checkpoint using the latest Megatron version (dd7c9f4) and the training script below, checkpoint saving fails with an assertion error.
#!/bin/bash
# Runs the "175B" parameter model
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))
CHECKPOINT_PATH=./outputs/ #<Specify path>
TENSORBOARD_LOGS_PATH=$2 #<Specify path>
VOCAB_FILE=$3 #<Specify path to file>/gpt2-vocab.json
MERGE_FILE=$4 #<Specify path to file>/gpt2-merges.txt
DATA_PATH=$5 #<Specify path and file prefix>_text_document
DISTRIBUTED_ARGS=(
--nproc_per_node $GPUS_PER_NODE
--nnodes $NUM_NODES
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
)
GPT_MODEL_ARGS=(
--num-layers 4
--hidden-size 4096
--num-attention-heads 32
--seq-length 2048
--max-position-embeddings 2048
--attention-backend auto # Can use (flash/fused/unfused/local)
)
TRAINING_ARGS=(
--micro-batch-size 1
# --global-batch-size 1536
# --rampup-batch-size 16 16 5859375
--train-iters 500000
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.95
--init-method-std 0.006
--clip-grad 1.0
--lr 6.0e-5
--lr-decay-style cosine
--min-lr 6.0e-6
--lr-warmup-fraction .001
--lr-decay-iters 430000
--bf16
--use-distributed-optimizer
# --optimizer-cpu-offload
# --overlap-cpu-optimizer-d2h-h2d
# --use-precision-aware-optimizer
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
)
DATA_ARGS=(
--data-path /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/jinliang_data/meg-gpt2_text_document
--vocab-file /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/jinliang_data/gpt2-vocab.json
--merge-file /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/jinliang_data/gpt2-merges.txt
# --vocab-file $VOCAB_FILE
# --merge-file $MERGE_FILE
# --split 949,50,1
)
EVAL_AND_LOGGING_ARGS=(
--log-interval 10
--save-interval 20
--eval-interval 1000
--save $CHECKPOINT_PATH
--load $CHECKPOINT_PATH
--eval-iters 10
# --tensorboard-dir $TENSORBOARD_LOGS_PATH
)
torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \
${GPT_MODEL_ARGS[@]} \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${DATA_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]}
Running this script reproduces the bug; checkpoint saving fails with:
[rank1]: assert tensors[key].shape == (gbuf_local_end - gbuf_local_start,), (
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AssertionError: (torch.Size([]), 0, 4096)
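For context, the assertion appears to come from the distributed optimizer's checkpoint path (--use-distributed-optimizer is enabled): it expects each saved shard to be a 1-D tensor whose length matches the rank's local slice of the gradient buffer, but here it receives a 0-dim tensor. Below is a minimal, hypothetical sketch of that shape check, with gbuf_local_start/gbuf_local_end set from the values in the error message; it is not Megatron's actual code.

import torch

# Minimal sketch of the failing shape check (hypothetical, not Megatron code).
# The bounds below are taken from the error message: start=0, end=4096.
gbuf_local_start, gbuf_local_end = 0, 4096

# Expected: a 1-D shard covering this rank's slice of the gradient buffer.
good_shard = torch.zeros(gbuf_local_end - gbuf_local_start)
assert good_shard.shape == (gbuf_local_end - gbuf_local_start,)  # passes

# Observed: a 0-dim (scalar) tensor, so the comparison fails and reports
# (torch.Size([]), 0, 4096), matching the traceback above.
bad_shard = torch.tensor(0.0)
assert bad_shard.shape == (gbuf_local_end - gbuf_local_start,), (
    bad_shard.shape,
    gbuf_local_start,
    gbuf_local_end,
)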
This bug was reported by @lilei199908.