Description
BioNeMo Framework Version
Bug Description
ESM2 training crashes when resuming from a checkpoint: loading the distributed checkpoint fails with a `CheckpointingException` for the optimizer state of `lm_head.dense.weight` (full traceback below).
Steps to Reproduce
- Run a first training with `--num-steps=20` and `--val-check-interval=10`:

```bash
python sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py --train-cluster-path=/data/train_clusters.parquet --train-database-path=/data/train.db --valid-cluster-path=/data/valid_clusters.parquet --valid-database-path=/data/validation.db --micro-batch-size=16 --num-nodes=1 --num-gpus=1 --limit-val-batches=1 --min-seq-length=1024 --max-seq-length=1024 --num-layers=33 --hidden-size=1280 --num-attention-heads=20 --ffn-hidden-size=5120 --create-tensorboard-logger --wandb-project=fix-tensorboard-logs --val-check-interval 10 --wandb-group main --num-steps=20 --resume-if-exists
```
- Run the same command again with `--num-steps=30`, so that `--resume-if-exists` picks up the step-20 checkpoint:

```bash
python sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py --train-cluster-path=/data/train_clusters.parquet --train-database-path=/data/train.db --valid-cluster-path=/data/valid_clusters.parquet --valid-database-path=/data/validation.db --micro-batch-size=16 --num-nodes=1 --num-gpus=1 --limit-val-batches=1 --min-seq-length=1024 --max-seq-length=1024 --num-layers=33 --hidden-size=1280 --num-attention-heads=20 --ffn-hidden-size=5120 --create-tensorboard-logger --wandb-project=fix-tensorboard-logs --val-check-interval 10 --wandb-group main --num-steps=30 --resume-if-exists
```
Error Messages and Logs
```
megatron.core.dist_checkpointing.core.CheckpointingException: Cannot find global shape metadata for N-D flattened tensor ShardedTensor(key='optimizer.state.exp_avg.module.lm_head.dense.weight', dtype=torch.float32, local_shape=(1280, 1280), global_shape=(1280, 1280), global_offset=(0, 0), axis_fragmentations=(1, 1), replica_id=(0, 0, 0), prepend_axis_num=0, allow_shape_mismatch=False, flattened_range=slice(0, 1638400, None)) in checkpoint metadata: {'module.embedding.word_embeddings.weight': {}, 'module.encoder.layers.self_attention.linear_proj.weight': {}, 'module.encoder.layers.self_attention.linear_proj.bias': {}, 'module.encoder.layers.self_attention.linear_qkv.layer_norm_weight': {}, 'module.encoder.layers.self_attention.linear_qkv.layer_norm_bias': {}, 'module.encoder.layers.self_attention.linear_qkv.weight': {}, 'module.encoder.layers.self_attention.linear_qkv.bias': {}, 'module.encoder.layers.mlp.linear_fc1.layer_norm_weight': {}, 'module.encoder.layers.mlp.linear_fc1.layer_norm_bias': {}, 'module.encoder.layers.mlp.linear_fc1.weight': {}, 'module.encoder.layers.mlp.linear_fc1.bias': {}, 'module.encoder.layers.mlp.linear_fc2.weight': {}, 'module.encoder.layers.mlp.linear_fc2.bias': {}, 'module.encoder.final_layernorm.weight': {}, 'module.encoder.final_layernorm.bias': {}, 'module.lm_head.dense.weight': {}, 'module.lm_head.dense.bias': {}, 'module.lm_head.layer_norm.weight': {}, 'module.lm_head.layer_norm.bias': {}, 'module.output_layer.bias': {}}
```
Docker Image
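Note that the metadata dump in the traceback lists only `module.*` parameter keys, while the key the resume step fails on is an `optimizer.state.*` entry for one of those same parameters. A small plain-Python check (key names copied from the log above; the full key set is abbreviated here) makes that mismatch explicit:

```python
# Subset of the keys present in the checkpoint metadata, per the traceback
# above (the remaining module.* keys are omitted for brevity).
metadata_keys = {
    "module.embedding.word_embeddings.weight",
    "module.lm_head.dense.weight",
    "module.lm_head.dense.bias",
}

# The sharded tensor the resume step cannot find.
missing_key = "optimizer.state.exp_avg.module.lm_head.dense.weight"

# Strip the optimizer-state prefix to recover the underlying parameter key.
param_key = missing_key.removeprefix("optimizer.state.exp_avg.")

print(param_key in metadata_keys)    # True  — the parameter itself is listed
print(missing_key in metadata_keys)  # False — its optimizer state is not
```

So the checkpoint metadata knows about the model weight but carries no entry for its optimizer moment, which is what the `CheckpointingException` reports.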
No response
System Information
Environment Details:
- OS: [e.g., Ubuntu 20.04]
- CPU: [e.g., Intel i9-12900K]
- RAM: [e.g., 64GB]
GPU Details:
- GPU Model: [e.g., NVIDIA RTX 4090]
- GPU Memory: [e.g., 24GB]
- CUDA Version: [e.g., 12.1]
- CUDA Driver: [e.g., 525.85.05]
- cuDNN Version: [e.g., 8.9.0]
Additional Context
log_resume_training_esm2_c61ef42b0bff5efc79b3f873c71f39893921f7a9.txt