model_analyzer profile with mig : DCGM initialization error #954

@jason-i-vv

Hardware: H800

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

model-analyzer --version
1.47.0

# base image
nvcr.io/nvidia/tritonserver:24.11-py3

Profiling works without problems when the container is given an entire GPU. After enabling MIG mode and starting the container on MIG instances, however, model-analyzer fails at startup with a DCGM initialization error.

docker run -ti --rm --gpus='"device=0:0,0:1"' --network=host -v $PWD:/mnt --name triton-server tritonserver-modelanalyzer:latest

model-analyzer profile \
  --model-repository=/mnt/models \
  --profile-models=densenet_onnx \
  --output-model-repository-path=results

[Model Analyzer] Initializing GPUDevice handles
CacheManager Init Failed. Error: -17
Traceback (most recent call last):
  File "/usr/local/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/entrypoint.py", line 263, in main
    gpus = GPUDeviceFactory().verify_requested_gpus(config.gpus)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/device/gpu_device_factory.py", line 39, in __init__
    self.init_all_devices()
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/device/gpu_device_factory.py", line 58, in init_all_devices
    dcgm_handle = dcgm_agent.dcgmStartEmbedded(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/monitor/dcgm/dcgm_agent.py", line 56, in wrapper
    return fn(*newargs, **newkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/monitor/dcgm/dcgm_agent.py", line 91, in dcgmStartEmbedded
    dcgm_structs._dcgmCheckReturn(ret)
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/monitor/dcgm/dcgm_structs.py", line 691, in _dcgmCheckReturn
    raise DCGMError(ret)
model_analyzer.monitor.dcgm.dcgm_structs.DCGMError_InitError: DCGM initialization error
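
One thing that may help narrow this down is confirming which MIG instances are actually visible inside the container before model-analyzer starts. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH in the container and that `nvidia-smi -L` lists MIG instances on lines of the form `MIG <profile> Device <n>: (UUID: MIG-...)`. The helper names `parse_mig_devices` and `visible_mig_devices` are hypothetical and are not part of model_analyzer or DCGM. It does not diagnose the DCGM error itself.

```python
import re
import shutil
import subprocess

# Assumed line format from `nvidia-smi -L`, for example:
#   MIG 1g.10gb     Device  0: (UUID: MIG-xxxxxxxx-...)
MIG_LINE = re.compile(
    r"^\s*MIG\s+(?P<profile>\S+)\s+Device\s+(?P<index>\d+):\s+\(UUID:\s+(?P<uuid>[^)]+)\)"
)


def parse_mig_devices(listing: str) -> list[dict]:
    """Extract MIG instances (profile, index, UUID) from `nvidia-smi -L` text."""
    devices = []
    for line in listing.splitlines():
        match = MIG_LINE.match(line)
        if match:
            devices.append(
                {
                    "profile": match["profile"],
                    "index": int(match["index"]),
                    "uuid": match["uuid"],
                }
            )
    return devices


def visible_mig_devices() -> list[dict]:
    """Run `nvidia-smi -L` and return the MIG instances it reports, if any."""
    if shutil.which("nvidia-smi") is None:
        # nvidia-smi is not available in this environment
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return parse_mig_devices(result.stdout)


if __name__ == "__main__":
    for device in visible_mig_devices():
        print(device)
```

Run inside the container started with `--gpus='"device=0:0,0:1"'`, this should print one entry per MIG instance passed through. An empty result would suggest the instances are not exposed to the container at all, which is a separate problem from the DCGM failure above.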
