How to checkpoint activations for backward with different dtype #21238

qGentry · 2024-05-15T12:31:31Z

Hi, I recently read Snowflake's technical report on how they trained their Arctic MoE model:
https://medium.com/snowflake/snowflake-arctic-cookbook-series-building-an-efficient-training-system-for-arctic-6658b9bdfcae
They mention that, to reduce the memory overhead of storing intermediate activations, they quantize them.

My problem is that if I simply checkpoint a function's result without passing it further through the graph, the compiler optimizes it away (which is totally expected behavior):

activations = layer1(inputs)
activations_fp8 = activations.astype(jnp.float8_e5m2) # for example fp8
activations_fp8 = jax.ad_checkpoint.checkpoint_name(activations_fp8, "dense_fp8")
activations = layer2(activations) # activations_fp8 is not passed further, so it is optimized away

I can do something like this to preserve the dependency on the lower-precision activations:

activations = layer1(inputs)
dtype = activations.dtype
activations_fp8 = activations.astype(jnp.float8_e5m2) # for example fp8
activations_fp8 = jax.ad_checkpoint.checkpoint_name(activations_fp8, "dense_fp8")
activations = activations_fp8.astype(dtype)
activations = layer2(activations)

This does work, and I see reduced memory usage compared to checkpointing the activations in their default precision, but this way I lose precision in the forward-pass computation, because layer2 now receives the quantized values.
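To make the precision loss concrete, here is a small sketch of the same round trip on a few sample values. The input values are arbitrary and chosen only for illustration. Since float8_e5m2 has 2 mantissa bits, rounding a value in its normal range can change it by up to 12.5% in relative terms.

```python
import jax.numpy as jnp

# Arbitrary sample values, chosen only for illustration.
x = jnp.array([0.1, 1.0, 1.3, 3.1415927], dtype=jnp.float32)

# Same round trip as in the snippet above: cast down to fp8, then back up.
x_rt = x.astype(jnp.float8_e5m2).astype(jnp.float32)

# Relative error introduced by the round trip, per element.
rel_err = jnp.abs(x_rt - x) / jnp.abs(x)

print("original:  ", x)
print("round trip:", x_rt)
print("rel. error:", rel_err)
```

Everything downstream of the cast in the forward pass sees `x_rt` instead of `x`, which is the precision loss described above.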

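How can I implement this properly in JAX, so that the forward pass stays in full precision and only the activations saved for the backward pass are stored in the lower-precision dtype?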
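One possible direction, as a minimal sketch rather than a confirmed answer, is `jax.custom_vjp`: the forward rule computes the output in full precision and chooses what to save as residuals, so it can save an fp8 copy of the input for the backward rule only. The layer here is a plain matmul, and the names `dense`, `dense_fwd`, and `dense_bwd` are hypothetical, introduced only for this example. This sketch does not use `jax.checkpoint` or `checkpoint_name`, and how it interacts with a rematerialization policy is not covered here.

```python
import jax
import jax.numpy as jnp


@jax.custom_vjp
def dense(x, w):
    # Hypothetical layer for illustration: a plain matmul in full precision.
    return x @ w


def dense_fwd(x, w):
    # The output is computed from the full-precision input.
    out = x @ w
    # Only the residual kept for backward is quantized (assumption: fp8 e5m2).
    residuals = (x.astype(jnp.float8_e5m2), w)
    return out, residuals


def dense_bwd(residuals, g):
    x_fp8, w = residuals
    # Dequantize the saved input; the weight gradient is therefore approximate.
    x = x_fp8.astype(g.dtype)
    dx = g @ w.T
    dw = x.T @ g
    return dx, dw


dense.defvjp(dense_fwd, dense_bwd)


def loss(x, w):
    return jnp.sum(dense(x, w) ** 2)


def loss_ref(x, w):
    # Same loss without the custom rule, for comparison.
    return jnp.sum((x @ w) ** 2)


key_x, key_w = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(key_x, (4, 8), dtype=jnp.float32)
w = jax.random.normal(key_w, (8, 3), dtype=jnp.float32)

dx, dw = jax.grad(loss, argnums=(0, 1))(x, w)
dx_ref, dw_ref = jax.grad(loss_ref, argnums=(0, 1))(x, w)

print("max |dx - dx_ref|:", jnp.max(jnp.abs(dx - dx_ref)))
print("max |dw - dw_ref|:", jnp.max(jnp.abs(dw - dw_ref)))
```

In this sketch the forward output and `dx` do not depend on the quantized copy; only `dw` picks up the fp8 rounding error, since it is the only gradient that needs the saved input.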
