torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file #196

@yandachen

Hello, I installed your package using setup/setup.sh. The single-GPU command in the tutorial works fine, but when I run the multi-GPU command

deepspeed --num_gpus 8 --num_nodes 2 --master_addr machine1 train.py --config conf/tutorial-gpt2-micro.yaml --nnodes 2 --nproc_per_node 8 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json --run_id tutorial-gpt2-micro-multi-node

I get the following error:

File "miniconda3/envs/mistral/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1775, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 556, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1166, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: ~/.cache/torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory.
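
To see whether the shared object is actually present on a given node, the cache directory can be inspected directly. This is a minimal sketch, assuming the cache root shown in the traceback above; `find_built_extension` is a hypothetical helper written for this check, not part of PyTorch or DeepSpeed.

```python
from pathlib import Path

# Assumption: cache root copied from the traceback above; adjust per machine.
DEFAULT_ROOT = "~/.cache/torch_extensions/py38_cu113"


def find_built_extension(name, root=DEFAULT_ROOT):
    """Return the path <root>/<name>/<name>.so if that file exists, else None."""
    candidate = Path(root).expanduser() / name / f"{name}.so"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    # Prints the path to fused_adam.so, or None if it has not been built here.
    print(find_built_extension("fused_adam"))
```

Running this on each node would show whether the file is missing everywhere or only on some machines.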

I also tried running the same code in the same environment on a different machine, and this time I got a different error:

File "miniconda3/envs/mistral/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1494, in verify_ninja_availability
raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
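
This second error suggests that the `ninja` executable cannot be run from this environment on that machine. A quick way to confirm is sketched below using only the standard library; `ninja_version` is a hypothetical helper for this check and is not how PyTorch itself is guaranteed to perform it.

```python
import shutil
import subprocess


def ninja_version(executable="ninja"):
    """Return the string printed by `<executable> --version`, or None if it cannot be run."""
    path = shutil.which(executable)  # None when the executable is not on PATH
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    # Prints a version string if ninja is usable, otherwise None.
    print(ninja_version())
```

If this prints `None` inside the `mistral` conda environment, ninja is either not installed there or not on the PATH seen by the launched processes.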

Do you have any idea how to resolve this issue? I installed all packages using setup/setup.sh, so my package versions should match the ones in your requirements files. Thanks!
