Replies: 1 comment
Closing this discussion and reposting as an issue.
Dear community,
I am using DirectML for inference of UNet models trained with PyTorch. The UNets consist mostly of Conv3D + BatchNormalization + ReLU operations; no transformers are used.
The inference results are great, and I am now looking into optimizing the models for faster inference.
I hoped that converting the model weights to float16 would make inference about twice as fast; however, it took just as long as float32, and sometimes 5-10% longer.
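The float16 conversion was done roughly along these lines (a minimal sketch using onnxconverter_common; the file names and the keep_io_types flag are illustrative assumptions, not necessarily my exact setup):

```python
import onnx
from onnxconverter_common import float16

# Load the exported float32 model and convert its weights and ops to float16.
# keep_io_types=True leaves the graph inputs/outputs as float32, so the calling
# code does not have to change its numpy dtypes. (Illustrative choice.)
model_fp32 = onnx.load("unet.onnx")
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)
onnx.save(model_fp16, "unet_fp16.onnx")
```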
With the same models and the CUDA execution provider, I get half the inference time, as expected. However, I like the portability of DirectML.
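The comparison was done with plain onnxruntime sessions, roughly like this (input shape, run count, and file name are placeholders; the DML provider comes from the onnxruntime-directml package and the CUDA provider from onnxruntime-gpu, measured in separate environments):

```python
import time
import numpy as np
import onnxruntime as ort

def time_provider(model_path, providers, n_runs=20):
    """Create a session on the given execution provider and time repeated runs."""
    sess = ort.InferenceSession(model_path, providers=providers)
    # Placeholder input; the real models take a 5D float32 volume (N, C, D, H, W).
    x = np.random.rand(1, 1, 32, 64, 64).astype(np.float32)
    input_name = sess.get_inputs()[0].name
    sess.run(None, {input_name: x})  # warm-up run
    start = time.perf_counter()
    for _ in range(n_runs):
        sess.run(None, {input_name: x})
    return (time.perf_counter() - start) / n_runs

print("DML :", time_provider("unet.onnx", ["DmlExecutionProvider"]))
print("CUDA:", time_provider("unet.onnx", ["CUDAExecutionProvider"]))
```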
I export the models as follows:
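Roughly like the sketch below; the stand-in module, input shape, and opset version are placeholders for illustration, not the exact network or values:

```python
import torch
import torch.nn as nn

# Stand-in for the trained 3D UNet: the real model is built from
# Conv3D + BatchNorm + ReLU blocks.
model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1),
    nn.BatchNorm3d(16),
    nn.ReLU(),
    nn.Conv3d(16, 1, kernel_size=3, padding=1),
).eval()

dummy_input = torch.randn(1, 1, 32, 64, 64)  # N, C, D, H, W -- placeholder shape

torch.onnx.export(
    model,
    dummy_input,
    "unet.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```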
I tried the following:
Expected behaviour
I would expect roughly half the inference time, since on the same platform and GPU I can achieve that with the CUDA provider.
Are there any other options that I could try?
Platform: Windows 11
Python: 3.11.9
onnx: 1.16
onnxruntime: 1.17 / 1.20
GPU: NVIDIA RTX 2080, 8 GB VRAM