How to do distributed training? #9375
Unanswered
ayaka14732 asked this question in Q&A

I have been searching hard to find a tutorial on distributed training in JAX (e.g. with 100 v2-8 Cloud TPUs). It seems that Ray can achieve this goal (mesh-transformer-jax, swarm-jax), but I don't quite understand how to make it work.
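For context, the multi-host pieces of JAX look roughly like the sketch below, assuming a reasonably recent JAX release. This is only an illustration, not a verified recipe for 100 separate v2-8 hosts: the coordinator address, process count, and process id are placeholders that whatever launcher you use (Ray, a script over gcloud ssh, ...) would have to fill in per process.

```python
import jax

# One Python process per TPU VM. jax.distributed.initialize() connects the
# processes; on a single TPU pod slice it can often be called with no
# arguments, but independent VMs need explicit coordinator details.
jax.distributed.initialize(
    coordinator_address="10.0.0.1:8476",  # placeholder: address of process 0
    num_processes=100,                    # placeholder: one process per v2-8
    process_id=0,                         # placeholder: unique id per process
)

print(jax.process_index(), jax.process_count())      # which host this is
print(jax.local_device_count(), jax.device_count())  # e.g. 8 local cores, 800 global
```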
Replies: 2 comments
-
Check out https://github.com/sholtodouglas/scalingExperiments for data and tensor parallelism. Pipeline parallelism through k8s/Ray is coming soon-ish, according to the repo owner.
-
Check out Distributed training with JAX & Flax: https://www.machinelearningnuggets.com/distributed-training-with-jax-and-flax/
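Both links above center on data parallelism. As a rough illustration of the pattern they describe (a minimal, hand-written sketch in plain JAX, not code taken from either resource; the toy linear model and hand-rolled SGD are placeholders), a pmapped train step looks like this:

```python
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Toy linear model standing in for a real network.
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    lr = 1e-3
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # Average gradients across all devices on all hosts.
    grads = jax.lax.pmean(grads, axis_name="devices")
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss

n = jax.local_device_count()
params = {"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))}
# Replicate parameters once per local device; each process feeds only its
# local shard of the global batch, with a leading [devices] axis.
params = jax.tree_util.tree_map(lambda p: jnp.stack([p] * n), params)
x = jnp.ones((n, 32, 8))
y = jnp.ones((n, 32, 1))
params, loss = train_step(params, x, y)
```

On a multi-host setup the pmean spans every participating device, so each process only sees its own slice of the batch but ends up with identical parameters.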
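The first reply also mentions tensor parallelism. A hedged sketch of what that looks like with the jax.sharding API in recent JAX versions (the shapes and the column-wise partitioning below are arbitrary choices for illustration):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange the local devices along a 1-D "model" axis.
mesh = Mesh(np.array(jax.devices()), axis_names=("model",))

w = jnp.ones((512, 512))
x = jnp.ones((8, 512))

# Shard the weight matrix column-wise across the mesh; keep activations replicated.
w = jax.device_put(w, NamedSharding(mesh, P(None, "model")))
x = jax.device_put(x, NamedSharding(mesh, P()))

@jax.jit
def forward(x, w):
    # XLA's SPMD partitioner inserts the collectives the sharded matmul needs.
    return x @ w

y = forward(x, w)
print(y.shape, y.sharding)
```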