Parallelization on RTX 4060 Ti cards. #1789
AntonThai2022 asked this question in Q&A.
I have four RTX 4060 Ti video cards, all connected to one PCI Express bridge. It is known that these cards do not support NVIDIA's direct P2P technology. I need to run a TensorRT-LLM engine on them. I built the engine with the command:
trtllm-build --checkpoint_dir /workspace/TensorRT-LLM/quantized-llama-3-70b-pp1-tp4-awq-w4a16-kvint8-gs64 --output_dir ./quantized-llama-3-70b --gemm_plugin auto
I am then trying to run it with the command:
mpirun -n 4 --allow-run-as-root python3 ../run.py --max_output_len=40 --tokenizer_dir ./llama70b_hf/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/7129260dd854a80eb10ace5f61c20324b472b31c/ --engine_dir quantized-llama-3-70b --input_text "In Bash, how do I list all text files?"
I use a ready-made checkpoint.
When I launch the engine, I get the following error:
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
Traceback (most recent call last):
File "/workspace/TensorRT-LLM/TensorRT-LLM/examples/llama/../run.py", line 632, in
main(args)
File "/workspace/TensorRT-LLM/TensorRT-LLM/examples/llama/../run.py", line 478, in main
runner = runner_cls.from_dir(**runner_kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 222, in from_dir
executor = trtllm.Executor(engine_dir, trtllm.ModelType.DECODER_ONLY,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in error: peer access is not supported between these two devices
Clearly the cards cannot communicate with each other directly over the PCI Express bus. How can I change the engine-build settings or the launch settings so that the cards exchange data through host RAM instead? Or do I need to rework the language model itself?
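For reference, a sketch of one possible workaround, assuming the inter-GPU traffic goes through NCCL: NCCL's `NCCL_P2P_DISABLE` environment variable forces communication to be staged through host (shared) memory instead of direct PCIe P2P, and it can be exported to all ranks via `mpirun -x`. Whether this is sufficient here is an assumption on my part, since TensorRT-LLM's custom all-reduce path may require direct peer access regardless of NCCL settings.

```shell
# Disable NCCL peer-to-peer transfers so inter-GPU traffic is staged
# through host memory (SHM/network transports) instead of PCIe P2P.
# Note: if the engine uses TensorRT-LLM's custom all-reduce plugin,
# that plugin may still attempt direct peer access regardless of
# this setting, in which case the engine may need to be rebuilt
# with the custom all-reduce disabled (option names vary by version).
mpirun -n 4 --allow-run-as-root \
  -x NCCL_P2P_DISABLE=1 \
  python3 ../run.py --max_output_len=40 \
    --tokenizer_dir ./llama70b_hf/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/7129260dd854a80eb10ace5f61c20324b472b31c/ \
    --engine_dir quantized-llama-3-70b \
    --input_text "In Bash, how do I list all text files?"
```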