Description
Thanks for releasing the code. How can I resolve the RuntimeError: Error building extension 'fused_adam' with deepspeed==0.16.1? The full log is below:
[2024-12-15 12:05:30,940] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.1, git-hash=unknown, git-branch=unknown
[2024-12-15 12:05:30,940] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-15 12:05:30,940] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-12-15 12:05:30,957] [INFO] [config.py:733:init] Config mesh_device None world_size = 1
[2024-12-15 12:05:37,492] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/yangliwei/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/yangliwei/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /mnt/data/yangliwei/anaconda3/envs/glamm/bin/nvcc -ccbin /mnt/data/yangliwei/anaconda3/envs/glamm/bin/x86_64-conda-linux-gnu-cc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/TH -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/THC -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++14 -c /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/mnt/data/yangliwei/anaconda3/envs/glamm/bin/nvcc -ccbin /mnt/data/yangliwei/anaconda3/envs/glamm/bin/x86_64-conda-linux-gnu-cc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/TH -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/include/THC -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/include -isystem /mnt/data/yangliwei/anaconda3/envs/glamm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++14 -c /mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
<command-line>: fatal error: cuda_runtime.h: No such file or directory
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/data/yangliwei/code/groundingLMM-main/train.py", line 671, in <module>
main(args)
File "/mnt/data/yangliwei/code/groundingLMM-main/train.py", line 423, in main
model_engine, optimizer, scheduler = initialize_deepspeed(model, tokenizer, args)
File "/mnt/data/yangliwei/code/groundingLMM-main/train.py", line 395, in initialize_deepspeed
model_engine, optimizer, _, scheduler = deepspeed.initialize(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 315, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1284, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1361, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
return self.jit_load(verbose)
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
op_module = load(name=self.name,
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/mnt/data/yangliwei/anaconda3/envs/glamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
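The build stops because the compiler cannot find cuda_runtime.h, even though nvcc itself is picked up from the glamm conda environment. One way to narrow this down might be to check whether the header exists in any of the usual include directories. The sketch below is only a diagnostic under stated assumptions: the function names are hypothetical, and the directories it searches (include/ and targets/x86_64-linux/include/ under CUDA_HOME, CUDA_PATH, and CONDA_PREFIX) are a guess at common install layouts, not something taken from the DeepSpeed or PyTorch documentation.

```python
import os
from pathlib import Path


def candidate_include_dirs(env=None):
    """List include directories that often hold CUDA headers.

    The layout is an assumption: <root>/include and
    <root>/targets/x86_64-linux/include for each root named by
    CUDA_HOME, CUDA_PATH, or CONDA_PREFIX.
    """
    env = os.environ if env is None else env
    dirs = []
    for var in ("CUDA_HOME", "CUDA_PATH", "CONDA_PREFIX"):
        root = env.get(var)
        if not root:
            continue  # variable unset or empty, nothing to search
        root = Path(root)
        dirs.append(root / "include")
        dirs.append(root / "targets" / "x86_64-linux" / "include")
    return dirs


def find_header(header="cuda_runtime.h", env=None):
    """Return the candidate directories that actually contain `header`."""
    return [d for d in candidate_include_dirs(env) if (d / header).is_file()]


if __name__ == "__main__":
    hits = find_header()
    if hits:
        for d in hits:
            print(f"found cuda_runtime.h in {d}")
    else:
        print("cuda_runtime.h not found under CUDA_HOME, CUDA_PATH, or CONDA_PREFIX")
```

If the script reports no hits, the environment probably has the nvcc binary without the CUDA development headers. If it finds the header somewhere other than the -isystem paths shown in the nvcc command above, then CUDA_HOME is likely pointing at the wrong root.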