Precision of dot operation #18938
-
Hello Team,
I am computing dot products of bf16 arrays and tried passing the different `jax.lax.Precision` values (`DEFAULT`, `HIGH`, `HIGHEST`) to the operation, but I did not observe any difference when running with these different values. Could you please tell me if there is a way to perform a high-precision matmul with bf16 inputs?
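For reference, a minimal sketch of the kind of experiment being described (the shapes, seeds, and loop here are illustrative assumptions, not from the original post):

```python
import jax
import jax.numpy as jnp

# bf16 operands, as in the discussion; shapes are illustrative.
x = jax.random.normal(jax.random.PRNGKey(0), (1024, 1024), dtype=jnp.bfloat16)
y = jax.random.normal(jax.random.PRNGKey(1), (1024, 1024), dtype=jnp.bfloat16)

# Try each Precision setting and compare a sample of the output.
for precision in (jax.lax.Precision.DEFAULT,
                  jax.lax.Precision.HIGH,
                  jax.lax.Precision.HIGHEST):
    out = jnp.dot(x, y, precision=precision)
    print(precision, out.dtype, out[0, 0])
```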
-
What platform are you running your computations on?
-
Hi @jakevdp, I found this behaviour on an H100 GPU. I took a look at the optimized HLO and found that a custom call to the cuBLAS op is created with the correct precision parameters, but it has no effect on the result. Is this the correct behaviour?
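One way to reproduce this kind of inspection is to lower and compile the function, then search the optimized HLO text for the GEMM custom call (a sketch; the shapes and the `"custom-call"` filter are assumptions):

```python
import jax
import jax.numpy as jnp

def f(a, b):
    return jnp.dot(a, b, precision=jax.lax.Precision.HIGHEST)

x = jnp.ones((512, 512), dtype=jnp.bfloat16)
y = jnp.ones((512, 512), dtype=jnp.bfloat16)

# Lower, compile, and dump the optimized (post-XLA-passes) HLO.
hlo = jax.jit(f).lower(x, y).compile().as_text()

# On a CUDA backend the GEMM typically appears as a custom call
# (e.g. a "__cublas$gemm" target) carrying the precision config.
print("\n".join(line for line in hlo.splitlines() if "custom-call" in line))
```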
-
Can you say more about your goal here? You're doing bf16 dot products, which will always be done at bf16 precision. If you want to accumulate in float32, you could pass `preferred_element_type='float32'`. Is that what you have in mind?
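A small sketch of what that suggestion looks like in code (shapes and values here are illustrative):

```python
import jax.numpy as jnp

x = jnp.ones((4, 4), dtype=jnp.bfloat16)
y = jnp.ones((4, 4), dtype=jnp.bfloat16)

out_bf16 = jnp.dot(x, y)  # plain bf16 dot: bfloat16 result
out_f32 = jnp.dot(x, y, preferred_element_type=jnp.float32)  # f32 accumulation/output

print(out_bf16.dtype, out_f32.dtype)  # bfloat16 float32
```

If the inputs themselves need more precision than bf16 can carry, casting them first (e.g. `x.astype(jnp.float32)`) gives a full float32 matmul instead.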