Cuda Runtime (an illegal memory access was encountered) during Calibration on a Multi-GPU platform #1931

@xingyueye

Description

Hi all, a strange illegal memory access error occurs when I run
trtexec --onnx=model.onnx --int8 --calib=model_calibration.cache
on a multi-GPU platform. The errors are:

[04/18/2022-14:24:28] [I] [TRT] Starting Calibration.
[04/18/2022-14:24:28] [E] Error[1]: [calibrator.cpp::add::779] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[04/18/2022-14:24:28] [E] Error[1]: [executionContext.cpp::commonEmitDebugTensor::1258] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[04/18/2022-14:24:28] [E] Error[1]: [convolutionRunner.cpp::executeConv::458] Error Code 1: Cudnn (CUDNN_STATUS_BAD_PARAM)
[04/18/2022-14:24:28] [F] [TRT] [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[04/18/2022-14:24:28] [F] [TRT] [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[04/18/2022-14:24:28] [F] [TRT] [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[04/18/2022-14:24:28] [F] [TRT] [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[04/18/2022-14:24:28] [F] [TRT] [resources.h::operator()::445] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[04/18/2022-14:24:28] [F] [TRT] [resources.h::operator()::445] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[04/18/2022-14:24:28] [F] [TRT] [resources.h::operator()::445] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[04/18/2022-14:24:28] [F] [TRT] [resources.h::operator()::445] Error Code 1: Cuda Driver (an illegal memory access was encountered)

The same command runs correctly on a single-GPU platform.
My guess is that the calibration cache and the model weights are allocated on different devices, but I don't know how to force them onto the same one. Neither "--device" nor "CUDA_VISIBLE_DEVICES=0" works.
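
For anyone reproducing this, here is a minimal sketch of applying both settings to the same trtexec process. It is not a confirmed fix for the error above. It only makes sure the environment variable actually reaches trtexec and that the device index matches it. The helper name build_pinned_trtexec and its parameters are hypothetical; the trtexec flags are the ones from the command above plus --device. The sketch assumes that once CUDA_VISIBLE_DEVICES exposes a single GPU, that GPU is device 0 inside the process.

```python
import os
import shlex
import subprocess
import sys


def build_pinned_trtexec(onnx_path, calib_cache, gpu_index=0, base_env=None):
    """Return (command, env) that restrict a trtexec INT8 build to one GPU.

    Hypothetical helper for illustration; not part of TensorRT.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate devices by PCI bus ID so the index matches nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Expose only the chosen physical GPU to the child process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    command = [
        "trtexec",
        f"--onnx={onnx_path}",
        "--int8",
        f"--calib={calib_cache}",
        # Only one GPU is visible, so inside the process it is device 0.
        "--device=0",
    ]
    return command, env


if __name__ == "__main__":
    command, env = build_pinned_trtexec(
        "model.onnx", "model_calibration.cache", gpu_index=0
    )
    # Print the effective invocation so it can be compared with a manual run.
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(command))
    # Pass --run to actually launch trtexec (it must be on PATH).
    if "--run" in sys.argv:
        subprocess.run(command, env=env, check=True)
```

Passing the variable through the env argument rules out the case where it was exported in a different shell or dropped by a wrapper script (for example a container entrypoint) before trtexec started.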

Labels: triaged (Issue has been triaged by maintainers)
