-
Looking at https://docs.nvidia.com/cuda/cuda-math-api/modules.html#modules, there don't seem to be any operations that take operands of different datatypes, so I think assumption 1 is probably true. But TPUs could be different! Is there any information about that?
-
In general, binary operations between arrays of two types will result in one or both being cast to a common type following the documented type promotion rules. If you ever want to see exactly how a particular function is computed, you can do so by examining the jaxpr or compiled HLO using ahead-of-time compilation. For example:

```python
import jax
import jax.numpy as jnp

def f(a, b):
    return a + b

a = jnp.arange(10, dtype='float32')
b = jnp.arange(10, dtype='bfloat16')

# jaxpr: JAX-level IR
print(jax.make_jaxpr(f)(a, b))

# un-optimized HLO
print(jax.jit(f).lower(a, b).as_text())

# optimized HLO
print(jax.jit(f).lower(a, b).compile().as_text())
```
In all cases, you see that the bfloat16 input is converted to float32 before the addition is performed.
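As a side note (a minimal sketch, not part of the original answer), the promotion rules themselves can also be queried directly, without constructing any arrays:

```python
import jax.numpy as jnp

# Query the documented type promotion rules directly.
print(jnp.result_type('float32', 'bfloat16'))   # expected: float32
print(jnp.promote_types('float16', 'float32'))  # expected: float32
```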
-
I just want to know how memory is used when we apply a binary operation to two arrays of different dtypes.
For example, if we add two arrays (w1 and w2) in fp16 and fp32, what happens in GPU VRAM?
Assumption 1: the output is fp32, so w1 is upcast to fp32 and then added to w2. In this case, additional GPU VRAM is required.
Assumption 2: there is a magic function with the signature add: fp16 -> fp32 -> fp32, so no additional GPU VRAM is needed.
Which one is true?
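One way to check this yourself (a sketch, assuming JAX with a GPU backend; the names w1/w2 and the shapes below are made up for illustration) is to lower the addition ahead of time and look at the optimized HLO and the compiled memory analysis:

```python
import jax
import jax.numpy as jnp

def add(w1, w2):
    return w1 + w2

# Illustrative arrays matching the fp16 + fp32 case in the question.
w1 = jnp.ones((1024, 1024), dtype='float16')
w2 = jnp.ones((1024, 1024), dtype='float32')

compiled = jax.jit(add).lower(w1, w2).compile()

# The optimized HLO shows whether the fp16 operand is converted to fp32
# before the add (assumption 1) or consumed directly by a mixed-dtype op.
print(compiled.as_text())

# On recent JAX versions, memory_analysis() reports argument/output/temp
# buffer sizes, which hints at whether an extra fp32 copy is materialized.
# (Availability and fields may vary by JAX version and backend.)
print(compiled.memory_analysis())
```

If the HLO contains a convert from f16 to f32 feeding the add, that would support assumption 1.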