BUG: No backend type associated with device type cpu #252

@Tizzzzy

Description

Python -VV

Python 3.12.0 | packaged by Anaconda, Inc. | (main, Oct  2 2023, 17:29:18) [GCC 11.2.0]

Pip Freeze

annotated-types==0.7.0
attrs==25.3.0
certifi==2025.7.14
charset-normalizer==3.4.2
docstring_parser==0.17.0
filelock==3.18.0
fire==0.7.0
fsspec==2025.7.0
idna==3.10
Jinja2==3.1.6
jsonschema==4.25.0
jsonschema-specifications==2025.4.1
MarkupSafe==3.0.2
mistral_common==1.8.3
mistral_inference==1.6.0
mpmath==1.3.0
networkx==3.5
numpy==2.3.2
nvidia-cublas-cu12==12.6.4.1
nvidia-cuda-cupti-cu12==12.6.80
nvidia-cuda-nvrtc-cu12==12.6.77
nvidia-cuda-runtime-cu12==12.6.77
nvidia-cudnn-cu12==9.5.1.17
nvidia-cufft-cu12==11.3.0.4
nvidia-cufile-cu12==1.11.1.6
nvidia-curand-cu12==10.3.7.77
nvidia-cusolver-cu12==11.7.1.2
nvidia-cusparse-cu12==12.5.4.2
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.6.85
nvidia-nvtx-cu12==12.6.77
pillow==11.3.0
pycountry==24.6.1
pydantic==2.11.7
pydantic-extra-types==2.10.5
pydantic_core==2.33.2
referencing==0.36.2
regex==2024.11.6
requests==2.32.4
rpds-py==0.26.0
safetensors==0.5.3
sentencepiece==0.2.0
setuptools==78.1.1
simple-parsing==0.1.7
sympy==1.14.0
termcolor==3.1.0
tiktoken==0.9.0
torch==2.7.1
triton==3.3.1
typing-inspection==0.4.1
typing_extensions==4.14.1
urllib3==2.5.0
wheel==0.45.1
xformers==0.0.31.post1

Reproduction Steps

  1. git clone https://github.com/mistralai/mistral-inference
  2. conda create -n moe python==3.12
  3. conda activate moe
  4. pip install mistral-inference
  5. export MISTRAL_MODEL=$HOME/mistral_models
  6. mkdir -p $MISTRAL_MODEL
  7. export M8x7B_DIR=$MISTRAL_MODEL/8x7b_instruct
  8. wget https://models.mistralcdn.com/mixtral-8x7b-v0-1/Mixtral-8x7B-v0.1-Instruct.tar
  9. mkdir -p $M8x7B_DIR
  10. tar -xf Mixtral-8x7B-v0.1-Instruct.tar -C $M8x7B_DIR
  11. torchrun --nproc-per-node 2 --no-python mistral-demo $M8x7B_DIR
  12. torchrun --nproc-per-node 2 --no-python mistral-chat $M8x7B_DIR --instruct

Expected Behavior

I expected the code to run successfully.

Additional Context

I have two A100 GPUs, each with 80 GB of memory.

I also tried setting these environment variables before running:
export CUDA_VISIBLE_DEVICES=0,1 and export TORCH_DISTRIBUTED_BACKEND=nccl

Neither of them helped.

Here is the full error output:

W0725 19:47:31.319000 1752 site-packages/torch/distributed/run.py:766] 
W0725 19:47:31.319000 1752 site-packages/torch/distributed/run.py:766] *****************************************
W0725 19:47:31.319000 1752 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0725 19:47:31.319000 1752 site-packages/torch/distributed/run.py:766] *****************************************
Prompt: [rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/conda/envs/moe/bin/mistral-chat", line 8, in <module>
[rank1]:     sys.exit(mistral_chat())
[rank1]:              ^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/moe/lib/python3.12/site-packages/mistral_inference/main.py", line 260, in mistral_chat
[rank1]:     fire.Fire(interactive)
[rank1]:   File "/opt/conda/envs/moe/lib/python3.12/site-packages/fire/core.py", line 135, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/moe/lib/python3.12/site-packages/fire/core.py", line 468, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/moe/lib/python3.12/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/moe/lib/python3.12/site-packages/mistral_inference/main.py", line 167, in interactive
[rank1]:     dist.broadcast(length_tensor, src=0)
[rank1]:   File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2714, in broadcast
[rank1]:     work = group.broadcast([tensor], opts)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: No backend type associated with device type cpu
W0725 19:47:45.085000 1752 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1817 closing signal SIGTERM
E0725 19:47:45.450000 1752 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 1818) of binary: mistral-chat
Traceback (most recent call last):
  File "/opt/conda/envs/moe/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
mistral-chat FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-25_19:47:45
  host      : 9dda69c50e7f
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1818)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
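
For context on the failure: torch.distributed dispatches collectives on the device of the tensor passed in, and nccl is a GPU-only backend, so dist.broadcast on a CPU tensor (the length_tensor in mistral_inference/main.py) has no registered backend to run on. Below is a minimal single-process sketch of the same call under gloo, which does register a CPU backend — this only illustrates the mechanism and is not the repository's actual initialization code:

```python
import os
import torch
import torch.distributed as dist

# Single-process demo (rank 0, world_size 1). gloo registers a CPU
# backend, so broadcasting a CPU tensor works. If the process group
# were initialized with nccl only, the same dist.broadcast call on a
# CPU tensor raises "No backend type associated with device type cpu".
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)

length_tensor = torch.tensor([0], dtype=torch.int64)  # lives on CPU
dist.broadcast(length_tensor, src=0)  # succeeds under gloo

dist.destroy_process_group()
```

PyTorch also accepts a device-to-backend mapping string, e.g. init_process_group(backend="cpu:gloo,cuda:nccl"), which registers a backend for both CPU and CUDA tensors; whether that is the appropriate fix here is for the maintainers to confirm.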

Suggested Solutions

No response

Labels: bug