Replies: 2 comments
-
Additional experiments: I was suspicious that the global-norm clipping was in the wrong place, but removing it doesn't help either. Here is how I plotted the gradient norm:
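(A minimal, self-contained sketch of one way to log the global gradient norm with optax; the loss and optimizer here are placeholders, not the original snippet:)

```python
import jax
import jax.numpy as jnp
import optax

def loss_fn(params, batch):
    # Placeholder loss; in the real setup this is the converted model's loss.
    x, y = batch
    return jnp.mean((x @ params["w"] - y) ** 2)

tx = optax.chain(optax.clip_by_global_norm(1.0), optax.adam(1e-3))

@jax.jit
def train_step(params, opt_state, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Global L2 norm of the whole gradient pytree, the same quantity that
    # optax.clip_by_global_norm clips against; log/plot this per step.
    grad_norm = optax.global_norm(grads)
    updates, opt_state = tx.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, loss, grad_norm
```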
-
I found the issue: we assumed the weights were initialized deterministically, but they actually aren't.
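(A quick way to check this on the Keras side, sketched with a placeholder `build_model`; unseeded Keras initializers produce different weights on every construction:)

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Placeholder for the real Keras model.
    return tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(4,))])

tf.keras.utils.set_random_seed(0)  # seeds the Python, NumPy and TF RNGs
m1 = build_model()
tf.keras.utils.set_random_seed(0)
m2 = build_model()
# True only because of the explicit seeding above; without it the two
# initializations differ from run to run.
print(all(np.allclose(a, b) for a, b in zip(m1.get_weights(), m2.get_weights())))
```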
-
I am migrating a TensorFlow model to JAX, seeing odd behavior when trying to match the training curves, and looking for some help.
Observation:
Implementation details:
I used tf2jax to convert the Keras-implemented model and metrics; the optimizer uses optax.
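Roughly, the conversion follows the tf2jax README; `model` and the input shape below are placeholders:

```python
import numpy as np
import tensorflow as tf
import tf2jax

@tf.function
def forward(x):
    # `model` is a placeholder for the Keras model being migrated.
    return model(x, training=True)

# Convert the tf.function into a pure JAX function plus a parameter pytree.
jax_func, jax_params = tf2jax.convert(forward, np.zeros((1, 32), np.float32))

# The converted function is functional: it returns outputs and the
# (possibly updated) variables.
outputs, updated_params = jax_func(jax_params, np.zeros((1, 32), np.float32))
```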
TF train step:
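(Not the original code; a minimal sketch of a typical Keras train step with global-norm clipping, with placeholder `model` and `optimizer`:)

```python
import tensorflow as tf

loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def tf_train_step(model, optimizer, x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_obj(y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    grads, _ = tf.clip_by_global_norm(grads, 1.0)  # global-norm clipping
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```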
JAX train step:
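(Again a sketch, not the original code; `apply_fn` stands in for the tf2jax-converted forward function:)

```python
import jax
import optax

# Mirror the TF setup: global-norm clipping chained in front of the optimizer.
tx = optax.chain(optax.clip_by_global_norm(1.0), optax.adam(1e-3))

def loss_fn(params, x, y):
    # `apply_fn` is the converted forward pass; tf2jax functions return
    # (outputs, updated_variables).
    logits, _ = apply_fn(params, x)
    return optax.softmax_cross_entropy_with_integer_labels(logits, y).mean()

@jax.jit
def jax_train_step(params, opt_state, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    updates, opt_state = tx.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, loss
```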
I also tried changing loss_fn so it does not return a scaled total loss, but that doesn't seem to help or change anything.