Description
Python -VV
Python 3.12.0 | packaged by Anaconda, Inc. | (main, Oct 2 2023, 17:29:18) [GCC 11.2.0]

Pip Freeze
annotated-types==0.7.0
attrs==25.3.0
certifi==2025.7.14
charset-normalizer==3.4.2
docstring_parser==0.17.0
filelock==3.18.0
fire==0.7.0
fsspec==2025.7.0
idna==3.10
Jinja2==3.1.6
jsonschema==4.25.0
jsonschema-specifications==2025.4.1
MarkupSafe==3.0.2
mistral_common==1.8.3
mistral_inference==1.6.0
mpmath==1.3.0
networkx==3.5
numpy==2.3.2
nvidia-cublas-cu12==12.6.4.1
nvidia-cuda-cupti-cu12==12.6.80
nvidia-cuda-nvrtc-cu12==12.6.77
nvidia-cuda-runtime-cu12==12.6.77
nvidia-cudnn-cu12==9.5.1.17
nvidia-cufft-cu12==11.3.0.4
nvidia-cufile-cu12==1.11.1.6
nvidia-curand-cu12==10.3.7.77
nvidia-cusolver-cu12==11.7.1.2
nvidia-cusparse-cu12==12.5.4.2
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.6.85
nvidia-nvtx-cu12==12.6.77
pillow==11.3.0
pycountry==24.6.1
pydantic==2.11.7
pydantic-extra-types==2.10.5
pydantic_core==2.33.2
referencing==0.36.2
regex==2024.11.6
requests==2.32.4
rpds-py==0.26.0
safetensors==0.5.3
sentencepiece==0.2.0
setuptools==78.1.1
simple-parsing==0.1.7
sympy==1.14.0
termcolor==3.1.0
tiktoken==0.9.0
torch==2.7.1
triton==3.3.1
typing-inspection==0.4.1
typing_extensions==4.14.1
urllib3==2.5.0
wheel==0.45.1
xformers==0.0.31.post1

Reproduction Steps
- git clone https://github.com/mistralai/mistral-inference
- conda create -n moe python==3.12
- conda activate moe
- pip install mistral-inference
- export MISTRAL_MODEL=$HOME/mistral_models
- mkdir -p $MISTRAL_MODEL
- export M8x7B_DIR=$MISTRAL_MODEL/8x7b_instruct
- wget https://models.mistralcdn.com/mixtral-8x7b-v0-1/Mixtral-8x7B-v0.1-Instruct.tar
- mkdir -p $M8x7B_DIR
- tar -xf Mixtral-8x7B-v0.1-Instruct.tar -C $M8x7B_DIR
- torchrun --nproc-per-node 2 --no-python mistral-demo $M8x7B_DIR
- torchrun --nproc-per-node 2 --no-python mistral-chat $M8x7B_DIR --instruct
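Before launching torchrun, the GPU and NCCL setup can be verified with a short check like this (a minimal sketch, not one of the documented steps):

```python
# Quick environment sanity check (sketch; not part of the official instructions).
import torch
import torch.distributed as dist

print("CUDA available :", torch.cuda.is_available())   # expect True
print("GPU count      :", torch.cuda.device_count())   # expect 2
print("NCCL available :", dist.is_nccl_available())    # expect True
```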
Expected Behavior
I expect the code to run without errors.
Additional Context
I have two A100 GPUs, each with 80 GB of memory.
I also tried setting these environment variables before running:
export CUDA_VISIBLE_DEVICES=0,1 and export TORCH_DISTRIBUTED_BACKEND=nccl
Neither of them helped.
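From the traceback, the failure seems to come down to dist.broadcast being called on a CPU tensor while the process group only has the NCCL backend registered. Below is a minimal sketch of that pattern; it is my assumption about what mistral_inference does internally, not code taken from its source, and the file name minimal_repro.py is hypothetical.

```python
# minimal_repro.py (hypothetical file name), launched with:
#   torchrun --nproc-per-node 2 minimal_repro.py
# Sketch of the pattern I believe is failing; not taken from the
# mistral_inference source.
import os
import torch
import torch.distributed as dist

def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # NCCL-only process group: collectives are only registered for CUDA tensors.
    dist.init_process_group(backend="nccl")

    # Broadcasting a CPU tensor over an NCCL-only group raises
    # "RuntimeError: No backend type associated with device type cpu".
    length_tensor = torch.zeros(1, dtype=torch.int64)  # lives on the CPU
    dist.broadcast(length_tensor, src=0)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If that is indeed the cause, registering a CPU-capable backend alongside NCCL (e.g. init_process_group(backend="cpu:gloo,cuda:nccl")) or moving the tensor to the GPU before the broadcast might avoid the error, though I have not verified either against the library.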
Here is the full error output:
W0725 19:47:31.319000 1752 site-packages/torch/distributed/run.py:766]
W0725 19:47:31.319000 1752 site-packages/torch/distributed/run.py:766] *****************************************
W0725 19:47:31.319000 1752 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0725 19:47:31.319000 1752 site-packages/torch/distributed/run.py:766] *****************************************
Prompt: [rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/envs/moe/bin/mistral-chat", line 8, in <module>
[rank1]: sys.exit(mistral_chat())
[rank1]: ^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/envs/moe/lib/python3.12/site-packages/mistral_inference/main.py", line 260, in mistral_chat
[rank1]: fire.Fire(interactive)
[rank1]: File "/opt/conda/envs/moe/lib/python3.12/site-packages/fire/core.py", line 135, in Fire
[rank1]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/envs/moe/lib/python3.12/site-packages/fire/core.py", line 468, in _Fire
[rank1]: component, remaining_args = _CallAndUpdateTrace(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/envs/moe/lib/python3.12/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]: component = fn(*varargs, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/envs/moe/lib/python3.12/site-packages/mistral_inference/main.py", line 167, in interactive
[rank1]: dist.broadcast(length_tensor, src=0)
[rank1]: File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2714, in broadcast
[rank1]: work = group.broadcast([tensor], opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: No backend type associated with device type cpu
W0725 19:47:45.085000 1752 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1817 closing signal SIGTERM
E0725 19:47:45.450000 1752 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 1818) of binary: mistral-chat
Traceback (most recent call last):
File "/opt/conda/envs/moe/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/moe/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
mistral-chat FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-07-25_19:47:45
host : 9dda69c50e7f
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1818)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================Suggested Solutions
No response