I am trying to run the StarCoder pretraining code (/examples/pretrain_bigcode_model.slurm). I created a custom pretrain_starcoder.sh file:
#!/bin/bash
GPUS_PER_NODE=2
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
# File path setup
CHECKPOINT_PATH=/home/jupyter/Satya/Megatron/Model_starcoder/
TOKENIZER_FILE=/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
#WEIGHTS_TRAIN=/fsx/loubna/code/bigcode-data-mix/data/train_data_paths.txt.tmp
#WEIGHTS_VALID=/fsx/loubna/code/bigcode-data-mix/data/valid_data_paths.txt.tmp
mkdir -p $CHECKPOINT_PATH/tensorboard
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
GPT_ARGS="\
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--num-layers 40 \
--hidden-size 6144 \
--num-attention-heads 48 \
--attention-head-type multiquery \
--init-method-std 0.01275 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--attention-dropout 0.1 \
--hidden-dropout 0.1 \
--micro-batch-size 1 \
--global-batch-size 512 \
--lr 0.0003 \
--min-lr 0.00003 \
--train-iters 250000 \
--lr-decay-iters 250000 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--weight-decay .1 \
--adam-beta2 .95 \
--clip-grad 1.0 \
--bf16 \
--use-flash-attn \
--fim-rate 0.5 \
--log-interval 10 \
--save-interval 2500 \
--eval-interval 2500 \
--eval-iters 2 \
--use-distributed-optimizer \
--valid-num-workers 0 \
"
TENSORBOARD_ARGS="--tensorboard-dir ${CHECKPOINT_PATH}/tensorboard"
export NCCL_DEBUG=INFO
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_gpt.py \
$GPT_ARGS \
--tokenizer-type TokenizerFromFile \
--tokenizer-file $TOKENIZER_FILE \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
#--train-weighted-split-paths-path $WEIGHTS_TRAIN \
#--valid-weighted-split-paths-path $WEIGHTS_VALID \
--structured-logs \
--structured-logs-dir $CHECKPOINT_PATH/logs \
$TENSORBOARD_ARGS \
--wandb-entity-name loubnabnl \
--wandb-project-name bigcode-pretraining \
I haven't set the data path yet (see the note on the commented-out lines below).
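A note on the script above: the two commented-out --train-weighted-split-paths-path / --valid-weighted-split-paths-path lines sit in the middle of a backslash-continued command. Once a line starts with #, its trailing backslash is part of the comment, so the python command ends there and the shell later tries to run --structured-logs as its own command (that is the "--structured-logs: command not found" message near the bottom of the log, and it also matches structured_logs/tensorboard_dir/wandb_* showing up as False/None in the printed arguments). A minimal sketch of how the optional flags could be collected into a variable instead, keeping everything else unchanged:

```bash
# Sketch only: put the optional data-mix flags in a variable so commenting
# them out never breaks the backslash continuation chain of the launch command.
DATA_ARGS=""
# DATA_ARGS="--train-weighted-split-paths-path $WEIGHTS_TRAIN --valid-weighted-split-paths-path $WEIGHTS_VALID"

python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gpt.py \
    $GPT_ARGS \
    --tokenizer-type TokenizerFromFile \
    --tokenizer-file $TOKENIZER_FILE \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    $DATA_ARGS \
    --structured-logs \
    --structured-logs-dir $CHECKPOINT_PATH/logs \
    $TENSORBOARD_ARGS \
    --wandb-entity-name loubnabnl \
    --wandb-project-name bigcode-pretraining
```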
My current versions are:
CUDA - 11.0
PyTorch - 1.7.0 (I only found 1.7.1 and 1.7.0 for CUDA 11.0)
apex - 1.0
gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
2 AWS A100 GPUs.
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:20:1C.0 Off | 0 |
| N/A 24C P0 53W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:A0:1D.0 Off | 0 |
| N/A 25C P0 50W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
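To double-check that this PyTorch build really pairs with CUDA 11.0 and sees both A100s, I run a quick sanity check from the same conda env (sketch only, nothing here is Megatron-specific):

```bash
# Environment sanity check, run inside the starcoder conda env.
python -c "import torch; print('torch', torch.__version__, '| cuda', torch.version.cuda)"
python -c "import torch; print('gpus', torch.cuda.device_count(), '| capability', torch.cuda.get_device_capability(0))"
nvcc --version | tail -n 2   # should report the same CUDA release (11.0 here)
```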
When I run bash ./examples/pretrain_starcoder.sh, I get the following output:
Wandb import failed
Wandb import failed
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:TokenizerFromFile
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
Persistent fused layer norm kernel is supported from pytorch v1.11 (nvidia pytorch container paired with v1.11). Defaulting to no_persist_layer_norm=True
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_head_type ............................. multiquery
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout ............................. 600
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_seq_length .............................. 8192
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
eval_interval ................................... 2500
eval_iters ...................................... 2
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_signal_handler ............................. False
ffn_hidden_size ................................. 24576
fim_rate ........................................ 0.5
fim_spm_rate .................................... 0.5
finetune ........................................ False
finetune_from ................................... None
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
global_batch_size ............................... 512
glu_activation .................................. None
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 6144
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.01275
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 10
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0003
lr_decay_iters .................................. 250000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 2000
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 8192
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 3e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... True
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 48
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 40
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_opt_param_scheduler .................... False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.absolute
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
sample_rate ..................................... 1.0
save ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/
save_interval ................................... 2500
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 8192
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... None
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.1
structured_logs ................................. False
structured_logs_dir ............................. None
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
titles_data_path ................................ None
tokenizer_file .................................. /home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
tokenizer_type .................................. TokenizerFromFile
train_iters ..................................... 250000
train_samples ................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
transformer_pipeline_model_parallel_size ........ 1
transformer_timers .............................. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_one_sent_docs ............................... False
valid_num_workers ............................... 0
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
wandb_entity_name ............................... None
wandb_project_name .............................. None
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 512
> building TokenizerFromFile tokenizer ...
> padded vocab (size: 49152) with 0 dummy tokens (new size: 49152)
05:15:56.69 >>> Call to _initialize_distributed in File "/tmp/Megatron/megatron/initialize.py", line 220
05:15:56.69 220 | def _initialize_distributed():
05:15:56.69 222 | args = get_args()
05:15:56.69 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.69 224 | device_count = torch.cuda.device_count()
05:15:56.69 .......... device_count = 2
05:15:56.69 225 | if torch.distributed.is_initialized():
05:15:56.69 235 | if args.rank == 0:
05:15:56.69 236 | print('> initializing torch distributed ...', flush=True)
> initializing torch distributed ...
05:15:56.69 238 | if device_count > 0:
05:15:56.69 239 | device = args.rank % device_count
05:15:56.69 .................. device = 0
05:15:56.69 240 | if args.local_rank is not None:
05:15:56.69 241 | assert args.local_rank == device, \
05:15:56.69 245 | torch.cuda.set_device(device)
05:15:56.70 249 | torch.distributed.init_process_group(
05:15:56.70 250 | backend="gloo",#args.distributed_backend,
05:15:56.70 251 | world_size=args.world_size, rank=args.rank,
05:15:56.70 252 | timeout=timedelta(seconds=args.distributed_timeout))
05:15:56.70 249 | torch.distributed.init_process_group(
05:15:56.70 256 | if device_count > 0:
05:15:56.70 257 | if mpu.model_parallel_is_initialized():
05:15:56.70 260 | mpu.initialize_model_parallel(args.tensor_model_parallel_size,
05:15:56.70 261 | args.pipeline_model_parallel_size,
05:15:56.70 262 | args.virtual_pipeline_model_parallel_size,
05:15:56.70 263 | args.pipeline_model_parallel_split_rank)
05:15:56.70 260 | mpu.initialize_model_parallel(args.tensor_model_parallel_size,
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
05:15:56.70 <<< Return value from _initialize_distributed: None
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
05:15:56.70 >>> Call to _compile_dependencies in File "/tmp/Megatron/megatron/initialize.py", line 160
05:15:56.70 160 | def _compile_dependencies():
05:15:56.70 162 | args = get_args()
05:15:56.73 >>> Call to get_args in File "/tmp/Megatron/megatron/global_vars.py", line 38
05:15:56.73 38 | def get_args():
05:15:56.73 40 | _ensure_var_is_initialized(_GLOBAL_ARGS, 'args')
05:15:56.73 41 | return _GLOBAL_ARGS
05:15:56.73 <<< Return value from get_args: Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.73 162 | args = get_args()
05:15:56.73 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.73 168 | if torch.distributed.get_rank() == 0:
05:15:56.84 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584
05:15:56.84 ...... group = <object object at 0x7fe25503e6c0>
05:15:56.84 584 | def get_rank(group=group.WORLD):
05:15:56.84 600 | if _rank_not_in_group(group):
05:15:56.84 603 | _check_default_pg()
05:15:56.84 604 | if group == GroupMember.WORLD:
05:15:56.84 605 | return _default_pg.rank()
05:15:56.84 <<< Return value from get_rank: 0
05:15:56.84 168 | if torch.distributed.get_rank() == 0:
05:15:56.84 169 | start_time = time.time()
05:15:56.84 .............. start_time = 1686719756.846662
05:15:56.84 170 | print('> compiling dataset index builder ...')
> compiling dataset index builder ...
05:15:56.84 171 | from megatron.data.dataset_utils import compile_helper
05:15:56.84 .............. compile_helper = <function compile_helper at 0x7fe24b749280>
05:15:56.84 172 | compile_helper()
05:15:56.92 >>> Call to compile_helper in File "/tmp/Megatron/megatron/data/dataset_utils.py", line 81
05:15:56.92 81 | def compile_helper():
05:15:56.92 84 | import os
05:15:56.92 .......... os = <module 'os' from '/opt/conda/envs/starcoder/lib/python3.8/os.py'>
05:15:56.92 85 | import subprocess
05:15:56.92 .......... subprocess = <module 'subprocess' from '/opt/conda/envs/starcoder/lib/python3.8/subprocess.py'>
05:15:56.92 86 | path = os.path.abspath(os.path.dirname(__file__))
05:15:56.92 .......... path = '/tmp/Megatron/megatron/data'
05:15:56.92 87 | ret = subprocess.run(['make', '-C', path])
make: Entering directory '/tmp/Megatron/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/tmp/Megatron/megatron/data'
05:15:56.96 .......... ret = CompletedProcess(args=['make', '-C', '/tmp/Megatron/megatron/data'], returncode=0)
05:15:56.96 88 | if ret.returncode != 0:
05:15:56.96 <<< Return value from compile_helper: None
05:15:56.96 172 | compile_helper()
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
05:15:56.96 174 | 'seconds'.format(time.time() - start_time), flush=True)
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
05:15:56.96 174 | 'seconds'.format(time.time() - start_time), flush=True)
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
>>> done with dataset index builder. Compilation time: 0.114 seconds
05:15:56.96 181 | seq_len = args.seq_length
05:15:56.96 .......... seq_len = 8192
05:15:56.96 182 | attn_batch_size = \
05:15:56.96 183 | (args.num_attention_heads / args.tensor_model_parallel_size) * \
05:15:56.96 184 | args.micro_batch_size
05:15:56.96 183 | (args.num_attention_heads / args.tensor_model_parallel_size) * \
05:15:56.96 182 | attn_batch_size = \
05:15:56.96 .......... attn_batch_size = 48.0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 188 | seq_len % 4 == 0 and attn_batch_size % 4 == 0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 188 | seq_len % 4 == 0 and attn_batch_size % 4 == 0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 .......... custom_kernel_constraint = True
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 191 | custom_kernel_constraint and
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 192 | args.masked_softmax_fusion):
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 199 | if torch.distributed.get_rank() == 0:
05:15:56.96 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584
05:15:56.96 ...... group = <object object at 0x7fe25503e6c0>
05:15:56.96 584 | def get_rank(group=group.WORLD):
05:15:56.96 600 | if _rank_not_in_group(group):
05:15:56.96 603 | _check_default_pg()
05:15:56.96 604 | if group == GroupMember.WORLD:
05:15:56.96 605 | return _default_pg.rank()
05:15:56.96 <<< Return value from get_rank: 0
05:15:56.96 199 | if torch.distributed.get_rank() == 0:
05:15:56.96 200 | start_time = time.time()
05:15:56.96 .............. start_time = 1686719756.9662645
05:15:56.96 201 | print('> compiling and loading fused kernels ...', flush=True)
> compiling and loading fused kernels ...
05:15:56.96 202 | fused_kernels.load(args)
05:15:56.96 >>> Call to load in File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 4
05:15:56.96 ...... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.96 4 | def load(args):
05:15:56.96 5 | if torch.version.hip is None:
05:15:56.96 6 | print("running on CUDA devices")
running on CUDA devices
05:15:56.96 7 | from megatron.fused_kernels.cuda import load as load_kernels
05:15:58.87 .............. load_kernels = <function load at 0x7fe2422201f0>
05:15:58.87 12 | load_kernels(args)
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o
/usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (const char *const)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<const char *const &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(1375): here
instantiation of "__nv_bool pybind11::detail::object_api<Derived>::contains(T &&) const [with Derived=pybind11::handle, T=const char *const &]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/detail/internals.h(176): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(201): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::handle, pybind11::handle)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle &, pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here
instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::handle &, pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(755): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::handle, pybind11::handle, pybind11::none, pybind11::str)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle, pybind11::handle, pybind11::none, pybind11::str>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(971): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object, const pybind11::handle)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &, const pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here
instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::object &, const pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1401): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::cpp_function)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1407): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::cpp_function, pybind11::none, pybind11::none, const char [1])
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function, pybind11::none, pybind11::none, const char (&)[1]>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1418): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::tuple)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::tuple &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1812): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1830): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1831): here
10 errors detected in the compilation of "/tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu".
ninja: build stopped: subcommand failed.
05:16:05.35 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
05:16:05.35 !!! When calling: load_kernels(args)
05:16:05.35 !!! Call ended by exception
05:16:05.35 202 | fused_kernels.load(args)
05:16:05.39 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
05:16:05.39 !!! When calling: fused_kernels.load(args)
05:16:05.39 !!! Call ended by exception
Traceback (most recent call last):
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1516, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/starcoder/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "pretrain_gpt.py", line 158, in <module>
pretrain(train_valid_test_datasets_provider, model_provider,
File "/tmp/Megatron/megatron/training.py", line 107, in pretrain
initialize_megatron(extra_args_provider=extra_args_provider,
File "/tmp/Megatron/megatron/initialize.py", line 106, in initialize_megatron
_compile_dependencies()
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/snoop/tracer.py", line 173, in simple_wrapper
return function(*args, **kwargs)
File "/tmp/Megatron/megatron/initialize.py", line 202, in _compile_dependencies
fused_kernels.load(args)
File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 12, in load
load_kernels(args)
File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 70, in load
scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 42, in _cpp_extention_load_helper
return cpp_extension.load(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 969, in load
return _jit_compile(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1176, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1280, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1538, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
Traceback (most recent call last):
File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/envs/starcoder/bin/python', '-u', 'pretrain_gpt.py', '--local_rank=0', '--tensor-model-parallel-size', '1', '--pipeline-model-parallel-size', '1', '--num-layers', '40', '--hidden-size', '6144', '--num-attention-heads', '48', '--attention-head-type', 'multiquery', '--init-method-std', '0.01275', '--seq-length', '8192', '--max-position-embeddings', '8192', '--attention-dropout', '0.1', '--hidden-dropout', '0.1', '--micro-batch-size', '1', '--global-batch-size', '512', '--lr', '0.0003', '--min-lr', '0.00003', '--train-iters', '250000', '--lr-decay-iters', '250000', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '2000', '--weight-decay', '.1', '--adam-beta2', '.95', '--clip-grad', '1.0', '--bf16', '--use-flash-attn', '--fim-rate', '0.5', '--log-interval', '10', '--save-interval', '2500', '--eval-interval', '2500', '--eval-iters', '2', '--use-distributed-optimizer', '--valid-num-workers', '0', '--tokenizer-type', 'TokenizerFromFile', '--tokenizer-file', '/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json', '--save', '/home/jupyter/Satya/Megatron/Model_starcoder/', '--load', '/home/jupyter/Satya/Megatron/Model_starcoder/']' returned non-zero exit status 1.
examples/pretrain_starcoder.sh: line 75: --structured-logs: command not found
In the run above I also used a snoop trace; below is the main error:
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o
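For reference, this is roughly how I retry the kernel build after each change (a sketch, with the build path taken from the ninja log above):

```bash
# Clean retry sketch: remove the stale JIT build directory reported in the
# ninja log so the extension is rebuilt from scratch, limit parallel nvcc
# jobs so the first real compiler error is easy to spot, then relaunch.
rm -rf /tmp/Megatron/megatron/fused_kernels/cuda/build
export MAX_JOBS=1
bash ./examples/pretrain_starcoder.sh
```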