Error building extension 'fused_adam' #80

@liweiyangv

Thanks for releasing the code. How can I fix the RuntimeError: Error building extension 'fused_adam' with deepspeed==0.16.1? The full output is below.

[2024-12-15 12:05:30,940] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.1, git-hash=unknown, git-branch=unknown
[2024-12-15 12:05:30,940] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-15 12:05:30,940] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-12-15 12:05:30,957] [INFO] [config.py:733:init] Config mesh_device None world_size = 1
[2024-12-15 12:05:37,492] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/yangliwei/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/yangliwei/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /mnt/data/yangliwei/anaconda3/envs/glamm/bin/nvcc -ccbin /mnt/data/yangliwei/anaconda3/envs/glamm/bin/x86_64-conda-linux-gnu-cc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/TH -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/THC -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++14 -c /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/mnt/data/yangliwei/anaconda3/envs/glamm/bin/nvcc -ccbin /mnt/data/yangliwei/anaconda3/envs/glamm/bin/x86_64-conda-linux-gnu-cc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/TH -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/THC -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++14 -c /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
<command-line>: fatal error: cuda_runtime.h: No such file or directory
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/mnt/data/yangliwei/code/groundingLMM-main/train.py", line 671, in <module>
main(args)
File "/mnt/data/yangliwei/code/groundingLMM-main/train.py", line 423, in main
model_engine, optimizer, scheduler = initialize_deepspeed(model, tokenizer, args)
File "/mnt/data/yangliwei/code/groundingLMM-main/train.py", line 395, in initialize_deepspeed
model_engine, optimizer, _, scheduler = deepspeed.initialize(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 315, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1284, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1361, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
return self.jit_load(verbose)
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
op_module = load(name=self.name,
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
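
The line that actually stops the build is the fatal error about cuda_runtime.h, so the nvcc inside the conda env apparently cannot find the CUDA headers. Below is a rough diagnostic sketch, not a fix, for checking where (or whether) that header exists. The helper names `candidate_cuda_roots` and `find_header` are made up for illustration, and the two include layouts it checks (`<root>/include` and `<root>/targets/*/include`) are assumptions about common CUDA toolkit and conda layouts that may not match every install.

```python
import os
import shutil
from pathlib import Path


def candidate_cuda_roots():
    """Collect directories that might hold a CUDA toolkit (best-effort guesses)."""
    roots = []
    # Environment variables commonly used to point at a toolkit or conda env.
    for var in ("CUDA_HOME", "CUDA_PATH", "CONDA_PREFIX"):
        value = os.environ.get(var)
        if value:
            roots.append(Path(value))
    nvcc = shutil.which("nvcc")
    if nvcc:
        # nvcc normally lives in <root>/bin/nvcc, so step up two levels.
        roots.append(Path(nvcc).resolve().parent.parent)
    return roots


def find_header(roots, header="cuda_runtime.h"):
    """Return every existing path to `header` under the assumed include layouts."""
    hits = []
    for root in roots:
        candidates = [root / "include" / header]
        candidates += sorted(root.glob(f"targets/*/include/{header}"))
        for path in candidates:
            if path.is_file() and path not in hits:
                hits.append(path)
    return hits


if __name__ == "__main__":
    roots = candidate_cuda_roots()
    print("Candidate CUDA roots:")
    for root in roots:
        print("  ", root)
    hits = find_header(roots)
    if hits:
        print("Found cuda_runtime.h at:")
        for path in hits:
            print("  ", path)
    else:
        print("cuda_runtime.h was not found under any candidate root.")
```

If nothing is found, the env probably has nvcc without the CUDA runtime headers. If the header turns up only under a different root than the one nvcc is in, pointing CUDA_HOME at that root before launching training may be worth trying.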
