Replies: 3 comments 5 replies
-
Do you know the current best practices for implementing multi-host data parallelism with JAX?
-
Just change the code from:

```python
arr = jax.device_put(arr, NamedSharding(mesh, P(*partition_spec)))
```

to:

```python
arr = jax.make_array_from_callback(
    arr.shape,
    NamedSharding(mesh, P(*partition_spec)),
    lambda idx: arr[idx],
)
```

See #20041.
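For context, here is a minimal sketch of that pattern in a multi-host, purely data-parallel setting. It assumes `jax.distributed.initialize()` has already been called on every process; the mesh axis name and array shapes are illustrative rather than taken from this thread:

```python
# Hedged sketch: build a globally sharded jax.Array from a host-local copy of
# the data. Assumes jax.distributed.initialize() was called on every process.
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One mesh axis spanning every device on every host (e.g. 2 hosts x 8 devices = 16).
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
sharding = NamedSharding(mesh, P("data"))

arr = np.arange(32 * 128, dtype=np.float32).reshape(32, 128)  # same array on every host

# Each host only materializes the shards belonging to its own addressable devices;
# `idx` is a tuple of slices into the global shape.
global_arr = jax.make_array_from_callback(arr.shape, sharding, lambda idx: arr[idx])
```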
-
If your input pipeline is fully data parallel, take a look at the docstring of: https://jax.readthedocs.io/en/latest/_autosummary/jax.make_array_from_single_device_arrays.html I am also planning to expose a helper function to do just this! Are you also asking how to shard the weights?
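A hedged sketch of the pattern that docstring describes, where each host loads only the data for its own local devices; the per-device batch size, feature dimension, and stand-in loader below are made up for illustration:

```python
# Hedged sketch of a fully data-parallel input pipeline using
# jax.make_array_from_single_device_arrays; shapes and the fake loader are illustrative.
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
sharding = NamedSharding(mesh, P("data"))

per_device_batch, feature_dim = 8, 128
global_batch = per_device_batch * jax.device_count()

# Each host produces one batch per local device (stand-in for a real data loader).
local_batches = [
    np.ones((per_device_batch, feature_dim), dtype=np.float32)
    for _ in jax.local_devices()
]
# Put each per-device batch on its local device...
local_arrays = [
    jax.device_put(b, d) for b, d in zip(local_batches, jax.local_devices())
]
# ...and stitch the addressable shards into one global jax.Array.
batch = jax.make_array_from_single_device_arrays(
    (global_batch, feature_dim), sharding, local_arrays
)
```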
-
I'm using TPUs and want to train across multiple nodes/hosts. The official docs suggest using xmap/pmap. However, I'm using the sharding API to shard across multiple local devices.
So how can we extend the sharding to accommodate a multi-node setup?
AIUI, we should be able to provide a sort of 3D sharding like (2, 8, 1) for 2 TPU hosts with 8 local devices each, DDP-style (a sketch of such a mesh follows below).
This would allow us to switch between n-way data parallelism and m-way model parallelism as outlined here.
But this doesn't seem to be the case?
Related: SO bounty thread I started here.
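A hedged sketch of how such a mesh could be expressed with the current sharding API, assuming `jax.distributed.initialize()` is called on every host; the 2x8 mesh shape, axis names, and array shapes are illustrative, not from the original post:

```python
# Hedged sketch: a 2D device mesh over 2 hosts x 8 local devices (16 devices total),
# giving 2-way data parallelism and 8-way model parallelism. Shapes are illustrative.
import jax
import numpy as np
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

jax.distributed.initialize()  # run on every host; arguments are auto-detected on Cloud TPU

devices = mesh_utils.create_device_mesh((2, 8))          # (data=2, model=8)
mesh = Mesh(devices, axis_names=("data", "model"))

batch_sharding = NamedSharding(mesh, P("data", None))    # shard the batch over 'data'
weight_sharding = NamedSharding(mesh, P(None, "model"))  # shard the weights over 'model'

batch = np.ones((32, 128), dtype=np.float32)    # same logical data on every host
weights = np.ones((128, 512), dtype=np.float32)

# On multi-host, assemble global arrays as in the replies above, not via plain device_put.
x = jax.make_array_from_callback(batch.shape, batch_sharding, lambda i: batch[i])
w = jax.make_array_from_callback(weights.shape, weight_sharding, lambda i: weights[i])

@jax.jit
def forward(x, w):
    return x @ w  # jit plus the input shardings drive the data/model-parallel layout

out = forward(x, w)
```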