torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file

Hello, I installed your package using `setup/setup.sh`. The single-GPU command in the tutorial works fine, but when I run the multi-GPU command `deepspeed --num_gpus 8 --num_nodes 2 --master_addr machine1 train.py --config conf/tutorial-gpt2-micro.yaml --nnodes 2 --nproc_per_node 8 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json --run_id tutorial-gpt2-micro-multi-node`, I get the following error:
>   File "miniconda3/envs/mistral/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1775, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: ~/.cache/torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory. 

I also tried running the same code in the same environment on a different machine, and there I get a different error:

>  File "miniconda3/envs/mistral/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1494, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

Do you have any idea how to resolve this? I installed all packages using `setup/setup.sh`, so my package versions should match the ones in your requirements files. Thanks!
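
Both tracebacks come from loading the JIT-compiled `fused_adam` extension: the first cannot find the compiled `.so`, and the second suggests `ninja` is not visible to the Python process. A minimal sketch of a check that could be run on each node to compare the two conditions is below. It uses only the standard library, the cache path is copied from the traceback above, and the function name `diagnose` is hypothetical, not part of this package or of PyTorch.

```python
import os
import shutil
import socket


def diagnose(ext_dir="~/.cache/torch_extensions/py38_cu113", name="fused_adam"):
    """Report whether ninja is on PATH and whether the cached extension exists.

    ext_dir and name default to the values seen in the traceback; adjust
    them if the cache lives elsewhere on a given machine.
    """
    root = os.path.expanduser(ext_dir)
    so_path = os.path.join(root, name, name + ".so")
    return {
        "host": socket.gethostname(),
        "ninja": shutil.which("ninja"),  # None if ninja is not on PATH
        "so_path": so_path,
        "so_exists": os.path.isfile(so_path),
    }


if __name__ == "__main__":
    # Print one "key: value" line per check so outputs from
    # different nodes can be compared side by side.
    for key, value in diagnose().items():
        print(f"{key}: {value}")
```

Running this inside the `mistral` conda environment on every machine would show whether the nodes differ in `ninja` availability or in the contents of the extension cache.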

torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file #196
