I am trying to run the StarCoder pretraining code (/examples/pretrain_bigcode_model.slurm). I created a custom pretrain_starcoder.sh file:
#!/bin/bash
GPUS_PER_NODE=2
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
# File path setup
CHECKPOINT_PATH=/home/jupyter/Satya/Megatron/Model_starcoder/
TOKENIZER_FILE=/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
#WEIGHTS_TRAIN=/fsx/loubna/code/bigcode-data-mix/data/train_data_paths.txt.tmp
#WEIGHTS_VALID=/fsx/loubna/code/bigcode-data-mix/data/valid_data_paths.txt.tmp
mkdir -p $CHECKPOINT_PATH/tensorboard
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
GPT_ARGS="\
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--num-layers 40 \
--hidden-size 6144 \
--num-attention-heads 48 \
--attention-head-type multiquery \
--init-method-std 0.01275 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--attention-dropout 0.1 \
--hidden-dropout 0.1 \
--micro-batch-size 1 \
--global-batch-size 512 \
--lr 0.0003 \
--min-lr 0.00003 \
--train-iters 250000 \
--lr-decay-iters 250000 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--weight-decay .1 \
--adam-beta2 .95 \
--clip-grad 1.0 \
--bf16 \
--use-flash-attn \
--fim-rate 0.5 \
--log-interval 10 \
--save-interval 2500 \
--eval-interval 2500 \
--eval-iters 2 \
--use-distributed-optimizer \
--valid-num-workers 0 \
"
TENSORBOARD_ARGS="--tensorboard-dir ${CHECKPOINT_PATH}/tensorboard"
export NCCL_DEBUG=INFO
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_gpt.py \
$GPT_ARGS \
--tokenizer-type TokenizerFromFile \
--tokenizer-file $TOKENIZER_FILE \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
#--train-weighted-split-paths-path $WEIGHTS_TRAIN \
#--valid-weighted-split-paths-path $WEIGHTS_VALID \
--structured-logs \
--structured-logs-dir $CHECKPOINT_PATH/logs \
$TENSORBOARD_ARGS \
--wandb-entity-name loubnabnl \
--wandb-project-name bigcode-pretraining \
I haven't set the data path yet (see the note on the commented-out lines below).
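A note on the script above: the two commented-out --train-weighted-split-paths-path / --valid-weighted-split-paths-path lines sit in the middle of a backslash-continued command. Once a line starts with #, its trailing backslash is part of the comment, so the python command ends there and the shell later tries to run --structured-logs as its own command (that is the "--structured-logs: command not found" message near the bottom of the log, and it also matches structured_logs/tensorboard_dir/wandb_* showing up as False/None in the printed arguments). A minimal sketch of how the optional flags could be collected into a variable instead, keeping everything else unchanged:

```bash
# Sketch only: put the optional data-mix flags in a variable so commenting
# them out never breaks the backslash continuation chain of the launch command.
DATA_ARGS=""
# DATA_ARGS="--train-weighted-split-paths-path $WEIGHTS_TRAIN --valid-weighted-split-paths-path $WEIGHTS_VALID"

python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gpt.py \
    $GPT_ARGS \
    --tokenizer-type TokenizerFromFile \
    --tokenizer-file $TOKENIZER_FILE \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    $DATA_ARGS \
    --structured-logs \
    --structured-logs-dir $CHECKPOINT_PATH/logs \
    $TENSORBOARD_ARGS \
    --wandb-entity-name loubnabnl \
    --wandb-project-name bigcode-pretraining
```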
My current versions are:
CUDA - 11.0
PyTorch - 1.7.0 (I only found 1.7.1 and 1.7.0 for CUDA 11.0)
apex - 1.0
gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
2 AWS A100 GPUs.
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:20:1C.0 Off | 0 |
| N/A 24C P0 53W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:A0:1D.0 Off | 0 |
| N/A 25C P0 50W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
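To double-check that this PyTorch build really pairs with CUDA 11.0 and sees both A100s, I run a quick sanity check from the same conda env (sketch only, nothing here is Megatron-specific):

```bash
# Environment sanity check, run inside the starcoder conda env.
python -c "import torch; print('torch', torch.__version__, '| cuda', torch.version.cuda)"
python -c "import torch; print('gpus', torch.cuda.device_count(), '| capability', torch.cuda.get_device_capability(0))"
nvcc --version | tail -n 2   # should report the same CUDA release (11.0 here)
```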
When I run bash ./examples/pretrain_starcoder.sh, I get the following output:
Wandb import failed
Wandb import failed
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:TokenizerFromFile
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
Persistent fused layer norm kernel is supported from pytorch v1.11 (nvidia pytorch container paired with v1.11). Defaulting to no_persist_layer_norm=True
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_head_type ............................. multiquery
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout ............................. 600
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_seq_length .............................. 8192
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
eval_interval ................................... 2500
eval_iters ...................................... 2
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_signal_handler ............................. False
ffn_hidden_size ................................. 24576
fim_rate ........................................ 0.5
fim_spm_rate .................................... 0.5
finetune ........................................ False
finetune_from ................................... None
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
global_batch_size ............................... 512
glu_activation .................................. None
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 6144
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.01275
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 10
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0003
lr_decay_iters .................................. 250000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 2000
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 8192
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 3e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... True
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 48
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 40
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_opt_param_scheduler .................... False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.absolute
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
sample_rate ..................................... 1.0
save ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/
save_interval ................................... 2500
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 8192
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... None
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.1
structured_logs ................................. False
structured_logs_dir ............................. None
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
titles_data_path ................................ None
tokenizer_file .................................. /home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
tokenizer_type .................................. TokenizerFromFile
train_iters ..................................... 250000
train_samples ................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
transformer_pipeline_model_parallel_size ........ 1
transformer_timers .............................. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_one_sent_docs ............................... False
valid_num_workers ............................... 0
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
wandb_entity_name ............................... None
wandb_project_name .............................. None
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 512
> building TokenizerFromFile tokenizer ...
> padded vocab (size: 49152) with 0 dummy tokens (new size: 49152)
05:15:56.69 >>> Call to _initialize_distributed in File "/tmp/Megatron/megatron/initialize.py", line 220
05:15:56.69 220 | def _initialize_distributed():
05:15:56.69 222 | args = get_args()
05:15:56.69 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.69 224 | device_count = torch.cuda.device_count()
05:15:56.69 .......... device_count = 2
05:15:56.69 225 | if torch.distributed.is_initialized():
05:15:56.69 235 | if args.rank == 0:
05:15:56.69 236 | print('> initializing torch distributed ...', flush=True)
> initializing torch distributed ...
05:15:56.69 238 | if device_count > 0:
05:15:56.69 239 | device = args.rank % device_count
05:15:56.69 .................. device = 0
05:15:56.69 240 | if args.local_rank is not None:
05:15:56.69 241 | assert args.local_rank == device, \
05:15:56.69 245 | torch.cuda.set_device(device)
05:15:56.70 249 | torch.distributed.init_process_group(
05:15:56.70 250 | backend="gloo",#args.distributed_backend,
05:15:56.70 251 | world_size=args.world_size, rank=args.rank,
05:15:56.70 252 | timeout=timedelta(seconds=args.distributed_timeout))
05:15:56.70 249 | torch.distributed.init_process_group(
05:15:56.70 256 | if device_count > 0:
05:15:56.70 257 | if mpu.model_parallel_is_initialized():
05:15:56.70 260 | mpu.initialize_model_parallel(args.tensor_model_parallel_size,
05:15:56.70 261 | args.pipeline_model_parallel_size,
05:15:56.70 262 | args.virtual_pipeline_model_parallel_size,
05:15:56.70 263 | args.pipeline_model_parallel_split_rank)
05:15:56.70 260 | mpu.initialize_model_parallel(args.tensor_model_parallel_size,
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
05:15:56.70 <<< Return value from _initialize_distributed: None
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
05:15:56.70 >>> Call to _compile_dependencies in File "/tmp/Megatron/megatron/initialize.py", line 160
05:15:56.70 160 | def _compile_dependencies():
05:15:56.70 162 | args = get_args()
05:15:56.73 >>> Call to get_args in File "/tmp/Megatron/megatron/global_vars.py", line 38
05:15:56.73 38 | def get_args():
05:15:56.73 40 | _ensure_var_is_initialized(_GLOBAL_ARGS, 'args')
05:15:56.73 41 | return _GLOBAL_ARGS
05:15:56.73 <<< Return value from get_args: Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.73 162 | args = get_args()
05:15:56.73 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.73 168 | if torch.distributed.get_rank() == 0:
05:15:56.84 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584
05:15:56.84 ...... group = <object object at 0x7fe25503e6c0>
05:15:56.84 584 | def get_rank(group=group.WORLD):
05:15:56.84 600 | if _rank_not_in_group(group):
05:15:56.84 603 | _check_default_pg()
05:15:56.84 604 | if group == GroupMember.WORLD:
05:15:56.84 605 | return _default_pg.rank()
05:15:56.84 <<< Return value from get_rank: 0
05:15:56.84 168 | if torch.distributed.get_rank() == 0:
05:15:56.84 169 | start_time = time.time()
05:15:56.84 .............. start_time = 1686719756.846662
05:15:56.84 170 | print('> compiling dataset index builder ...')
> compiling dataset index builder ...
05:15:56.84 171 | from megatron.data.dataset_utils import compile_helper
05:15:56.84 .............. compile_helper = <function compile_helper at 0x7fe24b749280>
05:15:56.84 172 | compile_helper()
05:15:56.92 >>> Call to compile_helper in File "/tmp/Megatron/megatron/data/dataset_utils.py", line 81
05:15:56.92 81 | def compile_helper():
05:15:56.92 84 | import os
05:15:56.92 .......... os = <module 'os' from '/opt/conda/envs/starcoder/lib/python3.8/os.py'>
05:15:56.92 85 | import subprocess
05:15:56.92 .......... subprocess = <module 'subprocess' from '/opt/conda/envs/starcoder/lib/python3.8/subprocess.py'>
05:15:56.92 86 | path = os.path.abspath(os.path.dirname(__file__))
05:15:56.92 .......... path = '/tmp/Megatron/megatron/data'
05:15:56.92 87 | ret = subprocess.run(['make', '-C', path])
make: Entering directory '/tmp/Megatron/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/tmp/Megatron/megatron/data'
05:15:56.96 .......... ret = CompletedProcess(args=['make', '-C', '/tmp/Megatron/megatron/data'], returncode=0)
05:15:56.96 88 | if ret.returncode != 0:
05:15:56.96 <<< Return value from compile_helper: None
05:15:56.96 172 | compile_helper()
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
05:15:56.96 174 | 'seconds'.format(time.time() - start_time), flush=True)
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
05:15:56.96 174 | 'seconds'.format(time.time() - start_time), flush=True)
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
>>> done with dataset index builder. Compilation time: 0.114 seconds
05:15:56.96 181 | seq_len = args.seq_length
05:15:56.96 .......... seq_len = 8192
05:15:56.96 182 | attn_batch_size = \
05:15:56.96 183 | (args.num_attention_heads / args.tensor_model_parallel_size) * \
05:15:56.96 184 | args.micro_batch_size
05:15:56.96 183 | (args.num_attention_heads / args.tensor_model_parallel_size) * \
05:15:56.96 182 | attn_batch_size = \
05:15:56.96 .......... attn_batch_size = 48.0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 188 | seq_len % 4 == 0 and attn_batch_size % 4 == 0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 188 | seq_len % 4 == 0 and attn_batch_size % 4 == 0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 .......... custom_kernel_constraint = True
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 191 | custom_kernel_constraint and
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 192 | args.masked_softmax_fusion):
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 199 | if torch.distributed.get_rank() == 0:
05:15:56.96 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584
05:15:56.96 ...... group = <object object at 0x7fe25503e6c0>
05:15:56.96 584 | def get_rank(group=group.WORLD):
05:15:56.96 600 | if _rank_not_in_group(group):
05:15:56.96 603 | _check_default_pg()
05:15:56.96 604 | if group == GroupMember.WORLD:
05:15:56.96 605 | return _default_pg.rank()
05:15:56.96 <<< Return value from get_rank: 0
05:15:56.96 199 | if torch.distributed.get_rank() == 0:
05:15:56.96 200 | start_time = time.time()
05:15:56.96 .............. start_time = 1686719756.9662645
05:15:56.96 201 | print('> compiling and loading fused kernels ...', flush=True)
> compiling and loading fused kernels ...
05:15:56.96 202 | fused_kernels.load(args)
05:15:56.96 >>> Call to load in File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 4
05:15:56.96 ...... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.96 4 | def load(args):
05:15:56.96 5 | if torch.version.hip is None:
05:15:56.96 6 | print("running on CUDA devices")
running on CUDA devices
05:15:56.96 7 | from megatron.fused_kernels.cuda import load as load_kernels
05:15:58.87 .............. load_kernels = <function load at 0x7fe2422201f0>
05:15:58.87 12 | load_kernels(args)
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o
/usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (const char *const)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<const char *const &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(1375): here
instantiation of "__nv_bool pybind11::detail::object_api<Derived>::contains(T &&) const [with Derived=pybind11::handle, T=const char *const &]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/detail/internals.h(176): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(201): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::handle, pybind11::handle)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle &, pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here
instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::handle &, pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(755): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::handle, pybind11::handle, pybind11::none, pybind11::str)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle, pybind11::handle, pybind11::none, pybind11::str>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(971): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object, const pybind11::handle)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &, const pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here
instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::object &, const pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1401): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::cpp_function)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1407): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::cpp_function, pybind11::none, pybind11::none, const char [1])
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function, pybind11::none, pybind11::none, const char (&)[1]>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1418): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::tuple)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::tuple &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1812): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1830): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1831): here
10 errors detected in the compilation of "/tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu".
ninja: build stopped: subcommand failed.
05:16:05.35 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
05:16:05.35 !!! When calling: load_kernels(args)
05:16:05.35 !!! Call ended by exception
05:16:05.35 202 | fused_kernels.load(args)
05:16:05.39 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
05:16:05.39 !!! When calling: fused_kernels.load(args)
05:16:05.39 !!! Call ended by exception
Traceback (most recent call last):
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1516, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/starcoder/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "pretrain_gpt.py", line 158, in <module>
pretrain(train_valid_test_datasets_provider, model_provider,
File "/tmp/Megatron/megatron/training.py", line 107, in pretrain
initialize_megatron(extra_args_provider=extra_args_provider,
File "/tmp/Megatron/megatron/initialize.py", line 106, in initialize_megatron
_compile_dependencies()
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/snoop/tracer.py", line 173, in simple_wrapper
return function(*args, **kwargs)
File "/tmp/Megatron/megatron/initialize.py", line 202, in _compile_dependencies
fused_kernels.load(args)
File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 12, in load
load_kernels(args)
File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 70, in load
scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 42, in _cpp_extention_load_helper
return cpp_extension.load(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 969, in load
return _jit_compile(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1176, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1280, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1538, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
Traceback (most recent call last):
File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/envs/starcoder/bin/python', '-u', 'pretrain_gpt.py', '--local_rank=0', '--tensor-model-parallel-size', '1', '--pipeline-model-parallel-size', '1', '--num-layers', '40', '--hidden-size', '6144', '--num-attention-heads', '48', '--attention-head-type', 'multiquery', '--init-method-std', '0.01275', '--seq-length', '8192', '--max-position-embeddings', '8192', '--attention-dropout', '0.1', '--hidden-dropout', '0.1', '--micro-batch-size', '1', '--global-batch-size', '512', '--lr', '0.0003', '--min-lr', '0.00003', '--train-iters', '250000', '--lr-decay-iters', '250000', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '2000', '--weight-decay', '.1', '--adam-beta2', '.95', '--clip-grad', '1.0', '--bf16', '--use-flash-attn', '--fim-rate', '0.5', '--log-interval', '10', '--save-interval', '2500', '--eval-interval', '2500', '--eval-iters', '2', '--use-distributed-optimizer', '--valid-num-workers', '0', '--tokenizer-type', 'TokenizerFromFile', '--tokenizer-file', '/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json', '--save', '/home/jupyter/Satya/Megatron/Model_starcoder/', '--load', '/home/jupyter/Satya/Megatron/Model_starcoder/']' returned non-zero exit status 1.
examples/pretrain_starcoder.sh: line 75: --structured-logs: command not found
In the run above I also used a snoop trace; below is the main error:
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o
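For reference, this is roughly how I retry the kernel build after each change (a sketch, with the build path taken from the ninja log above):

```bash
# Clean retry sketch: remove the stale JIT build directory reported in the
# ninja log so the extension is rebuilt from scratch, limit parallel nvcc
# jobs so the first real compiler error is easy to spot, then relaunch.
rm -rf /tmp/Megatron/megatron/fused_kernels/cuda/build
export MAX_JOBS=1
bash ./examples/pretrain_starcoder.sh
```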