-
Probably a trivial question, but I can't find the right keywords to find an answer. On one hand, JAX's JIT compilation requires static array shapes. On the other hand, attention in transformers operates on sequences of different lengths. Of course, it's possible to pad all sequences to the maximum length, but if, for example, my maximum is 4096 tokens and the real sequence is 30 tokens, then that's a huge waste of computational resources. How do people solve this dilemma?
-
Incremental padding to sizes that are either powers of 2, or multiples of some other fixed size.
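A minimal sketch of what that bucketing can look like, assuming a power-of-2 bucketing rule and a boolean mask for the padded positions; the helper names (`pad_to_bucket`, `masked_sum`) are hypothetical placeholders, not from any specific library:

```python
import jax
import jax.numpy as jnp
import numpy as np


def next_power_of_two(n: int) -> int:
    # Round a length up to the next power of two (30 -> 32, 33 -> 64, ...).
    return 1 << (int(n) - 1).bit_length()


def pad_to_bucket(tokens: np.ndarray, bucket_fn=next_power_of_two):
    """Pad a 1-D token array to its bucket length; return tokens and a mask."""
    length = tokens.shape[0]
    padded_length = bucket_fn(length)
    padded = np.zeros(padded_length, dtype=tokens.dtype)
    padded[:length] = tokens
    mask = np.zeros(padded_length, dtype=bool)
    mask[:length] = True
    return jnp.asarray(padded), jnp.asarray(mask)


@jax.jit
def masked_sum(tokens, mask):
    # Stand-in for a real model call: the mask zeroes out padded positions.
    return jnp.sum(jnp.where(mask, tokens, 0))


# A 30-token sequence is padded to 32, not 4096; any other sequence that
# falls into the same bucket reuses the compiled function without retracing.
tokens, mask = pad_to_bucket(np.arange(30, dtype=np.int32))
print(tokens.shape, masked_sum(tokens, mask))
```

The JIT-compiled function then only ever sees a small number of distinct shapes (one per bucket), so recompilation happens rarely while padding waste stays bounded.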
In general, the strategy for handling dynamic array sizes in JAX is, depending on the situation, to either dispatch a new JIT-compiled operation for each size, or to pad all entries to a maximum size. In practice, the extra calculations done in the padding strategy may not be as important as you fear, especially if you're running on an accelerator like GPU or TPU.
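A minimal sketch contrasting the two strategies, assuming a toy scoring function in place of a real transformer (which would additionally feed the mask into its attention layers):

```python
import jax
import jax.numpy as jnp


@jax.jit
def score(tokens):
    # Strategy 1: called with many different lengths, this traces and
    # compiles once per distinct input shape.
    return jnp.mean(tokens.astype(jnp.float32))


@jax.jit
def score_padded(tokens, mask):
    # Strategy 2: every call has the same (maximum) length, so there is a
    # single compilation; the mask keeps padded positions out of the result.
    x = tokens.astype(jnp.float32)
    return jnp.sum(jnp.where(mask, x, 0.0)) / jnp.sum(mask)


max_len = 4096
real = jnp.arange(30)
padded = jnp.zeros(max_len, dtype=real.dtype).at[:30].set(real)
mask = jnp.zeros(max_len, dtype=bool).at[:30].set(True)

print(score(real))                 # compiles for length 30
print(score_padded(padded, mask))  # compiles once for length 4096, same answer
```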