Replies: 2 comments
- cc @froystig
- Do you have a minimal code example that reproduces this? I'll convert this to an issue, since it seems like one. We can move the discussion there.
I was working with a transformer model in JAX and Haiku, and found that dropout greatly slows down data-parallel training. In the main training step (a rough sketch of the setup is given below):
- `sharding = jax.sharding.PositionalSharding(jax.devices())`, containing GPU:0 and GPU:1
- `train_key` is a `PRNGKeyArray`, not sharded
- `self._train_state` is a PyTree of params and opt_states, replicated with `jax.device_put(train_state, sharding.replicate())`
- `batch` is a PyTree of data and labels, sharded with `jax.device_put(batch, sharding)`
Every operation in this model (except the final loss reduction) is independent across samples in the batch, so this should be trivially data parallel.
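For concreteness, here is a minimal self-contained sketch of this kind of setup. The toy model, the placeholder shapes, and the optax optimizer are my own assumptions, not the original code:

```python
import jax
import jax.numpy as jnp
import haiku as hk
import optax

DROPOUT_RATE = 0.1

def forward(x, is_training):
    x = jax.nn.relu(hk.Linear(256)(x))
    if is_training:
        # each dropout call consumes a fresh key from hk.next_rng_key()
        x = hk.dropout(hk.next_rng_key(), DROPOUT_RATE, x)
    return hk.Linear(10)(x)

model = hk.transform(forward)
optimizer = optax.adam(1e-3)

devices = jax.devices()  # e.g. [GPU:0, GPU:1]
sharding = jax.sharding.PositionalSharding(devices)

# Placeholder data; params/opt state are replicated, the batch is sharded on axis 0.
x = jnp.ones((32, 128))
y = jnp.zeros((32,), dtype=jnp.int32)
train_key = jax.random.PRNGKey(0)

params = model.init(train_key, x, is_training=True)
opt_state = optimizer.init(params)
params, opt_state = jax.device_put((params, opt_state), sharding.replicate())
batch = {
    "x": jax.device_put(x, sharding.reshape(len(devices), 1)),
    "y": jax.device_put(y, sharding),
}

@jax.jit
def train_step(params, opt_state, key, batch):
    def loss_fn(p):
        logits = model.apply(p, key, batch["x"], is_training=True)
        labels = jax.nn.one_hot(batch["y"], 10)
        return jnp.mean(optax.softmax_cross_entropy(logits, labels))

    loss, grads = jax.value_and_grad(loss_fn)(params)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state, loss

params, opt_state, loss = train_step(params, opt_state, train_key, batch)
```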
Without `x = hk.dropout(hk.next_rng_key(), self.dropout, x)` (which boils down to a `jax.random.split` and a `jax.random.bernoulli`), everything works well (single device: 4.2 it/s, two devices: 7.5 it/s). But when dropout is enabled (it is called 20 times), I got only 5.32 it/s on two devices with `jax.config.update('jax_threefry_partitionable', True)` (I was aware of the documentation on that flag), which is far from what I expected.
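Since each dropout call is just a key split plus a Bernoulli mask, one way to narrow this down is to benchmark that pattern in isolation. Below is a small sketch under my own assumptions (the stand-in `dropout` function and the placeholder shapes are not from the original code); it enables the flag mentioned above and draws a mask over a sharded array so the resulting sharding can be inspected:

```python
import jax
import jax.numpy as jnp

# The flag mentioned above; it is meant to make threefry key derivation
# partitionable so random bits for a sharded array can be produced shard-locally.
jax.config.update('jax_threefry_partitionable', True)

def dropout(rng, rate, x):
    # roughly what hk.dropout does: sample a keep mask and rescale
    keep = jax.random.bernoulli(rng, 1.0 - rate, shape=x.shape)
    return jnp.where(keep, x / (1.0 - rate), 0.0)

devices = jax.devices()
sharding = jax.sharding.PositionalSharding(devices).reshape(len(devices), 1)
x = jax.device_put(jnp.ones((32, 128)), sharding)

key = jax.random.PRNGKey(0)
key, subkey = jax.random.split(key)  # hk.next_rng_key() performs a split like this
y = jax.jit(dropout)(subkey, 0.1, x)
print(y.sharding)  # inspect how the dropout output ends up sharded
```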
Did I miss something? Could this performance be optimized?