MLM adaptation and Multitask Finetuning #284
Closed
Changes from all commits · 150 commits
d2c35fc
added train script but with prefix manually declared
f977b85
made new dataset
fcfbf17
minor adjustments
870dfd8
added capabilities for padding and prefix lm index
lintangsutawika 791bbd0
added finetune script
lintangsutawika 0f44b92
removed script
lintangsutawika 2ff0815
added adjustments and new dataset
f0a79f6
try mlm dataset
eb416c7
minor changes
c0bc21b
minor addition of import packages
82e824c
minor error fix
7bb17ec
minor error fix
9929766
samples follow how gpt dataset is loaded
861c41f
added masked_lm_prob
fe95115
fixed tokenizer abstractions for HF tokenizer
8ea5943
added mask id
aa0d146
added mask id
215e8cc
added mask id
b6eef43
added mask id
bfc73a5
added fix
1890f87
added bos and eos token id
01392a9
no need for sentinal token
923decb
add aux functions
4611d67
add aux functions
4356de3
add aux functions
f31c686
add pad_id
a3951e8
changed lm predictions to t5
97b9a92
changed lm predictions to t5
fe73a73
changed lm predictions to t5
6a9cb75
changed lm predictions to t5
469848f
changed lm predictions to t5
e68283f
tokenizer add mask, cls, sep tokens
476ae94
commit latest changes
72ff575
commit latest changes
3647291
added sentinal tokens
fcdc987
added sentinal tokens
d6fbe78
added sentinal tokens
c44daba
added additional_special_tokens
a2725d8
added additional_special_tokens
0e94245
check t5_input and output
b599ab6
check decoder in and decoder out
626b0ae
made into input and output tokens
6008937
made into input and output tokens
c1524db
made into input and output tokens
c59c061
made into input and output tokens
e677e16
made into input and output tokens
9ffaeb9
made into input and output tokens
d0a6a2f
made into input and output tokens
47fd987
made into input and output tokens
4f377e8
made into input and output tokens
5c0bf76
added eos
7c63e4b
added eos
871124c
test text_token
55a593d
test text_token
adb59ca
test text_token
d71afb4
test text_token
7b99bb7
test text_token
922b09d
assigned array
469a02d
assigned array
15cb6a0
assigned array
5b0bc17
hardcoded sequence length
0671c79
check again
6db5c9b
show sentinal tokens
lintangsutawika 8a58007
show sentinal tokens
lintangsutawika 8b0bbc2
show sentinal tokens
lintangsutawika 3d1b256
show sentinal tokens
lintangsutawika ce00fd9
add more special tokens
lintangsutawika 3bcc50c
changed how mlm data is loaded
lintangsutawika 76960f7
changed how mlm data is loaded
lintangsutawika 229d661
changed how mlm data is loaded
lintangsutawika 55e3df7
changed how mlm data is loaded
lintangsutawika 05dea6d
changed how mlm data is loaded
lintangsutawika 661c8bb
added new script
lintangsutawika 97d3810
added new script
lintangsutawika 71388ee
added new script
lintangsutawika b0f04d5
try t5 dataset
lintangsutawika cd43a54
try t5 dataset
lintangsutawika e0dc666
try t5 dataset
lintangsutawika 866cee1
try t5 dataset
lintangsutawika 0b56a7d
try t5 dataset
lintangsutawika 5bb512b
try t5 dataset
lintangsutawika 31d844f
try t5 dataset
lintangsutawika 1d21963
try t5 dataset
lintangsutawika 1429645
try t5 dataset
lintangsutawika f5341f8
try t5 dataset
lintangsutawika b05b175
try t5 dataset
lintangsutawika 59a6e32
try t5 dataset
lintangsutawika ab76d49
developing
lintangsutawika 0d8dfac
developing
lintangsutawika e629224
developing
lintangsutawika efcf50f
developing
lintangsutawika e5eb615
developing
lintangsutawika 2eee807
developing
lintangsutawika 5840a11
developing
lintangsutawika 6d38f73
test to see output of get_ltor_masks_and_position_ids
lintangsutawika 430fa6f
test to see output of get_ltor_masks_and_position_ids
lintangsutawika 444314f
add new script
26c837d
add new script
feb023c
add new script
f30b9b1
changed settings
0a9203a
changed settings
672a866
tidy up
3780e61
changed tokenizer and position embedding
2130c31
modifying mlm to reflect original implementation
26afe43
minor fix
c1b9816
minor fix
453822f
minor fix
a62266a
minor fix
02dda79
minor fix
80331cb
minor fix
350227d
minor fix
d0eecd4
minor fix
243cebe
minor fix
da22e0b
minor fix
083dce7
minor fix
541e9d6
minor fix
86bfc8a
minor fix
e21a448
minor fix
f47d678
minor fix
415b8bc
minor fix
79bd6f8
minor fix
ba19fdf
minor fix
d200f4d
minor fix
102a461
minor fix
e530440
minor fix
2568039
minor fix
e6b4120
minor fix
fd7fe97
minor fix
861fc7b
minor fix
21c1984
minor fix
14e8d0f
minor fix
920343f
minor fix
a68873d
minor fix
5d43986
minor fix
79e8c1a
set correct seq len
786d252
refined sampling method
9110520
refined sampling method
7db34b9
refined sampling method
d946515
refined sampling method
bb4e656
refined sampling method
2e7161d
refined sampling method
00473e4
first commit, adding non causal mlm dataset
5992776
fixed mlm dataset
83f5dee
fixed mlm dataset
3235c2d
fixed mlm dataset
5449978
fixed mlm dataset
95c9851
fixed mlm dataset
9ff6172
Merge branch 'bigscience-workshop:main' into mt0
451318f
minor changes
edfaa19
Merge branch 'mt0' of https://github.com/lintangsutawika/Megatron-Dee…
New file (157 added lines):

#!/bin/bash

EXPERIMENT_NAME=4B8-en-CD-FLM
REPO_PATH=experiments/$EXPERIMENT_NAME
CHECKPOINT_PATH=$REPO_PATH/checkpoints
TENSORBOARD_PATH=$REPO_PATH/tensorboard
CODECARBON_PATH=$REPO_PATH/codecarbon
LOGS_PATH=$REPO_PATH/logs

DATA_PATH=data/meg-gpt2-oscar-en-10k_text_document

# XXX: edit me
GPUS_PER_NODE=8
NNODES=1
PP_SIZE=2    # NLAYERS must be a multiple of PP_SIZE here
TP_SIZE=1    # always fixed to the size of a single node
DP_SIZE=$((NNODES*GPUS_PER_NODE/(PP_SIZE*TP_SIZE)))  # will get derived automatically by trainer

MICRO_BATCH_SIZE=32
GLOBAL_BATCH_SIZE=2048
TRAIN_ITER=131_072
SEQ_LEN=626

NLAYERS=24
NHIDDEN=4096
NHEADS=64
FFN_HIDDEN_SIZE=10240
MAX_POSITION_EMBEDDING=1280

SAVE_INTERVAL=1500

OPTIMIZER_ARGS=" \
    --optimizer adam \
    --adam-beta1 0.9 \
    --adam-beta2 0.999 \
    --adam-eps 1e-8 \
    --lr 2e-4 \
    --min-lr 1e-5 \
    --lr-decay-style cosine \
    --clip-grad 1.0 \
    --weight-decay 1e-1 \
    "

EXIT_OPTS=" \
    --exit-duration-in-mins 1190 \
    "

GPT_ARGS=" \
    --num-layers $NLAYERS \
    --hidden-size $NHIDDEN \
    --num-attention-heads $NHEADS \
    --ffn-hidden-size $FFN_HIDDEN_SIZE \
    --max-position-embeddings $SEQ_LEN \
    --position-embedding-type alibi \
    --seq-length $SEQ_LEN \
    --micro-batch-size $MICRO_BATCH_SIZE \
    --global-batch-size $GLOBAL_BATCH_SIZE \
    --train-iters $TRAIN_ITER \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path bigscience/tokenizer \
    --loss-scale 12 \
    --clip-grad 1.0 \
    --fp16 \
    --checkpoint-activations \
    $OPTIMIZER_ARGS \
    $EXIT_OPTS \
    "

OUTPUT_ARGS=" \
    --log-interval 1 \
    --save-interval $SAVE_INTERVAL \
    --eval-interval $TRAIN_ITER \
    --eval-iters 1 \
    --tensorboard-dir $TENSORBOARD_PATH \
    --tensorboard-queue-size 5 \
    --log-timers-to-tensorboard \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    "

ZERO_STAGE=1

config_json="./ds_config.json"

# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
cat <<EOT > $config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "train_batch_size": $GLOBAL_BATCH_SIZE,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": $ZERO_STAGE
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 12
  },
  "steps_per_print": 2000,
  "wall_clock_breakdown": false
}
EOT

DEEPSPEED_ARGS=" \
    --deepspeed \
    --deepspeed_config ${config_json} \
    --zero-stage ${ZERO_STAGE} \
    --deepspeed-activation-checkpointing \
    "

# export LAUNCHER="python -u -m torch.distributed.launch \
#     --nproc_per_node $GPUS_PER_NODE \
#     "
# #     --nnodes $NNODES \
# #     --master_addr $MASTER_ADDR \
# #     --master_port $MASTER_PORT \

export CMD=" \
    `pwd`/pretrain_gpt.py \
    --tensor-model-parallel-size $TP_SIZE \
    --pipeline-model-parallel-size $PP_SIZE \
    $GPT_ARGS \
    $OUTPUT_ARGS \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --data-path $DATA_PATH \
    --data-impl mmap \
    --split 949,50,1 \
    --distributed-backend nccl \
    $DEEPSPEED_ARGS \
    "

# # clear old checkpoint as it'd mismatch while we sort things out
# rm -rf $SAVE_CHECKPOINT_PATH

echo $CMD

# We create the folder where the logs and codecarbon will be stored.
mkdir -p $REPO_PATH
mkdir -p $LOGS_PATH
# to debug - add echo (it exits and prints what it would have launched)

# python -u -m torch.distributed.launch \
#     --nproc_per_node $GPUS_PER_NODE \
#     $CMD

deepspeed --num_gpus $GPUS_PER_NODE \
    $CMD

# srun '$LAUNCHER --node_rank $SLURM_PROCID $CMD' 2>&1 | tee -a $LOGS_PATH/main_log.txt
New file (156 added lines):

#!/bin/bash

EXPERIMENT_NAME=4B8-en-ND-MLM
REPO_PATH=experiments/$EXPERIMENT_NAME
CHECKPOINT_PATH=$REPO_PATH/checkpoints
TENSORBOARD_PATH=$REPO_PATH/tensorboard
CODECARBON_PATH=$REPO_PATH/codecarbon
LOGS_PATH=$REPO_PATH/logs

DATA_PATH=data/meg-gpt2-oscar-en-10k_text_document
TOKENIZER_PATH=bigscience-tokenizer-padded

# XXX: edit me
GPUS_PER_NODE=8
NNODES=1
PP_SIZE=2    # NLAYERS must be a multiple of PP_SIZE here
TP_SIZE=1    # always fixed to the size of a single node
DP_SIZE=$((NNODES*GPUS_PER_NODE/(PP_SIZE*TP_SIZE)))  # will get derived automatically by trainer

MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=512
TRAIN_ITER=48_562
INPUT_LEN=1675
TARGET_LEN=373
SEQ_LEN=$((INPUT_LEN+TARGET_LEN))

NLAYERS=24
NHIDDEN=4096
NHEADS=64
FFN_HIDDEN_SIZE=10240

SAVE_INTERVAL=1500

OPTIMIZER_ARGS=" \
    --optimizer adam \
    --adam-beta1 0.9 \
    --adam-beta2 0.999 \
    --adam-eps 1e-8 \
    --lr 2e-4 \
    --min-lr 1e-5 \
    --lr-decay-style cosine \
    --clip-grad 1.0 \
    --weight-decay 1e-1 \
    "

EXIT_OPTS=" \
    --exit-duration-in-mins 1190 \
    "

GPT_ARGS=" \
    --num-layers $NLAYERS \
    --hidden-size $NHIDDEN \
    --num-attention-heads $NHEADS \
    --ffn-hidden-size $FFN_HIDDEN_SIZE \
    --max-position-embeddings $SEQ_LEN \
    --position-embedding-type alibi \
    --seq-length $SEQ_LEN \
    --input-length $INPUT_LEN \
    --micro-batch-size $MICRO_BATCH_SIZE \
    --global-batch-size $GLOBAL_BATCH_SIZE \
    --train-iters $TRAIN_ITER \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path $TOKENIZER_PATH \
    --loss-scale 12 \
    --clip-grad 1.0 \
    --fp16 \
    --checkpoint-activations \
    $OPTIMIZER_ARGS \
    $EXIT_OPTS \
    "

OUTPUT_ARGS=" \
    --log-interval 1 \
    --save-interval $SAVE_INTERVAL \
    --eval-interval $TRAIN_ITER \
    --eval-iters 1 \
    --tensorboard-dir $TENSORBOARD_PATH \
    --tensorboard-queue-size 5 \
    --log-timers-to-tensorboard \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    "

ZERO_STAGE=1

config_json="./ds_config.json"

# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
cat <<EOT > $config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "train_batch_size": $GLOBAL_BATCH_SIZE,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": $ZERO_STAGE
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 12
  },
  "steps_per_print": 2000,
  "wall_clock_breakdown": false
}
EOT

DEEPSPEED_ARGS=" \
    --deepspeed \
    --deepspeed_config ${config_json} \
    --zero-stage ${ZERO_STAGE} \
    --deepspeed-activation-checkpointing \
    "

# export LAUNCHER="python -u -m torch.distributed.launch \
#     --nproc_per_node $GPUS_PER_NODE \
#     "
# #     --nnodes $NNODES \
# #     --master_addr $MASTER_ADDR \
# #     --master_port $MASTER_PORT \

export CMD=" \
    `pwd`/train_ND_MLM_gpt.py \
    --tensor-model-parallel-size $TP_SIZE \
    --pipeline-model-parallel-size $PP_SIZE \
    $GPT_ARGS \
    $OUTPUT_ARGS \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --data-path $DATA_PATH \
    --data-impl mmap \
    --split 949,50,1 \
    --distributed-backend nccl \
    $DEEPSPEED_ARGS \
    "

# # clear old checkpoint as it'd mismatch while we sort things out
# rm -rf $SAVE_CHECKPOINT_PATH

echo $CMD

# We create the folder where the logs and codecarbon will be stored.
mkdir -p $REPO_PATH
mkdir -p $LOGS_PATH
# to debug - add echo (it exits and prints what it would have launched)

deepspeed --num_gpus $GPUS_PER_NODE \
    $CMD

# srun '$LAUNCHER --node_rank $SLURM_PROCID $CMD' 2>&1 | tee -a $LOGS_PATH/main_log.txt
Review comments:
I'm thinking we need a way to compute these values from a given SEQ_LEN. Typically, given a noise_density, mean_noise_span_length, and sequence_length, we should be able to compute an input and target length, no? The reason why is that what we really care about is that SEQ_LEN is 2048 (for performance); the rest we don't really care about, as they are implementation details.
Yes, I agree. But I'm not sure where to put this function in.
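For reference, a minimal sketch of what such a helper could look like, following the span-corruption length arithmetic of T5's random_spans_helper. The function name, its placement, and the defaults (noise_density=0.15, mean_noise_span_length=3.0) are assumptions for illustration, not part of this PR:

```python
def compute_input_and_target_lengths(seq_length, noise_density=0.15, mean_noise_span_length=3.0):
    """Find the raw tokens_length whose span-corrupted (input, target) pair
    fits inside a total budget of seq_length tokens.

    Following the T5 convention: the corrupted input keeps all non-noise tokens
    plus one sentinel per noise span plus EOS; the target keeps all noise tokens
    plus one sentinel per noise span plus EOS.
    """
    def _lengths(tokens_length):
        num_noise_tokens = int(round(tokens_length * noise_density))
        num_noise_spans = max(1, int(round(num_noise_tokens / mean_noise_span_length)))
        num_nonnoise_tokens = tokens_length - num_noise_tokens
        input_length = num_nonnoise_tokens + num_noise_spans + 1
        target_length = num_noise_tokens + num_noise_spans + 1
        return input_length, target_length

    # Walk down from the full budget until input + target fits within seq_length.
    tokens_length = seq_length
    while sum(_lengths(tokens_length)) > seq_length:
        tokens_length -= 1

    input_length, target_length = _lengths(tokens_length)
    return tokens_length, input_length, target_length


# seq_length=2048 with these defaults gives (1860, 1675, 373), i.e. the
# INPUT_LEN=1675 / TARGET_LEN=373 values hardcoded in the 4B8-en-ND-MLM script above.
print(compute_input_and_target_lengths(2048))
```

One natural home for a helper like this would be next to the dataset/argument-parsing code that already consumes INPUT_LEN and TARGET_LEN, so the script only has to specify SEQ_LEN; where exactly it should live is left open here.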