Open
Labels
question (Further information is requested)
Description
Describe the question.
Hello,
I am encountering an intermittent error when using NVIDIA DALI to load data. The same code sometimes runs successfully, but sometimes fails with a critical pipeline error. I would like to understand the possible reasons for this behavior.
The failure location is also inconsistent: across different runs, the error is reported in different operators or at different lines of the pipeline definition. For example:
Critical error in pipeline:
Error in CPU operator 'nvidia.dali.fn.decoders.image',
which was used in the pipeline definition with the following traceback:
File ".../dali_loader.py", line 36, in create_lmdb_dali_train_pipeline
images = fn.decoders.image(images, device='cpu', output_type=types.RGB)
encountered:
Error in thread 0: CUDA runtime API error cudaErrorInvalidValue (1):
invalid argument
Current pipeline object is no longer valid.
Error in CPU operator 'nvidia.dali.fn.ones',
File ".../dali_loader.py", line 64, in create_lmdb_dali_train_pipeline
mask = fn.ones(shape=(crop, crop)).gpu()
CUDA runtime API error cudaErrorInvalidValue (1):
invalid argument
Current pipeline object is no longer valid.
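To check whether the failing operator really varies from run to run, the messages above can be tallied per operator and per CUDA error code. The following is only a minimal sketch: it assumes each run's output has been saved as text and that the messages follow the format of the excerpts above. The function name `summarize_dali_errors` and both regular expressions are hypothetical helpers written for this illustration and are not part of DALI.

```python
import re
from collections import Counter

# Patterns inferred from the error excerpts in this issue. They are
# assumptions about the message format, not a documented DALI interface.
OPERATOR_RE = re.compile(r"Error in (\w+) operator '([^']+)'")
CUDA_ERROR_RE = re.compile(r"CUDA runtime API error (\w+) \((\d+)\)")


def summarize_dali_errors(log_text):
    """Count failing (backend, operator) pairs and CUDA error names in a log.

    Returns two Counters: one keyed by (backend, operator name), one keyed
    by the CUDA error name (for example 'cudaErrorInvalidValue').
    """
    operators = Counter(OPERATOR_RE.findall(log_text))
    cuda_errors = Counter(name for name, _code in CUDA_ERROR_RE.findall(log_text))
    return operators, cuda_errors


if __name__ == "__main__":
    import sys

    # Usage: python summarize_errors.py run1.log run2.log ...
    # (the script and log file names are placeholders)
    combined = ""
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            combined += handle.read() + "\n"

    operators, cuda_errors = summarize_dali_errors(combined)
    for (backend, name), count in operators.most_common():
        print(f"{count:4d}  {backend}  {name}")
    for name, count in cuda_errors.most_common():
        print(f"{count:4d}  {name}")
```

If the same CUDA error shows up against many unrelated operators, that would suggest the operator named in the message is where the error was detected rather than where it originated, but this is only a diagnostic aid and does not identify the cause.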
Environment
OS: Ubuntu
GPU: A100
CUDA version: 12.2
NVIDIA DALI version: 1.51.2
PyTorch version: 2.4.1
Python version: 3.11.13
Data format: LMDB (ImageNet-style)
Check for duplicates
- I have searched the open bugs/issues and have found no duplicates for this bug report