The code below implements 1D tensor parallelism across multiple devices on a TPU v3-8.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Host array to shard across the device mesh.
arr = jnp.arange(32 * 4).reshape(32, 4)

# Build an (n_devices, 1) mesh and shard `arr` along its first axis.
n_devices = jax.device_count()
mesh_shape = [n_devices, 1]
axis_names = ('a1', 'a2')
partition_spec = ('a1', 'a2')

devices = mesh_utils.create_device_mesh(mesh_shape)
mesh = Mesh(devices, axis_names=axis_names)
arr = jax.device_put(arr, NamedSharding(mesh, P(*partition_spec)))
jax.debug.visualize_array_sharding(arr)
```

However, when attempting to run the same code on a TPU v4-32 across 4 hosts, it didn't work as expected. I encountered the following error:

I wonder if the problem is due to
See the docs here if you want to do fully data parallel input loading and how to create an Array for that: https://jax.readthedocs.io/en/latest/_autosummary/jax.make_array_from_single_device_arrays.html
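For that route, here is a minimal sketch of assembling a global jax.Array from per-host shards with jax.make_array_from_single_device_arrays. It assumes the same (32, 4) array and the ('a1', 'a2') mesh layout from the question:

```python
import jax
import numpy as np
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# The full array exists as a host (NumPy) copy on every process.
global_shape = (32, 4)
global_data = np.arange(32 * 4).reshape(global_shape)

# Same mesh layout as the question: (n_devices, 1), sharded over 'a1'.
devices = mesh_utils.create_device_mesh((jax.device_count(), 1))
mesh = Mesh(devices, axis_names=('a1', 'a2'))
sharding = NamedSharding(mesh, P('a1', 'a2'))

# Each process places only the slices owned by its own local devices.
local_shards = [
    jax.device_put(global_data[index], device)
    for device, index in sharding.addressable_devices_indices_map(global_shape).items()
]

# Stitch the per-device pieces into one global, sharded jax.Array.
arr = jax.make_array_from_single_device_arrays(global_shape, sharding, local_shards)
```

Because each host only touches the slices belonging to its addressable devices, this pattern works on multi-host topologies where a plain device_put of the whole host-local array does not.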
Hi!
So the problem here is that device_put cannot transfer across hosts (we know about this and we are looking into improving the situation here). On a single host it works out, as you know, but it will fail on multiple hosts.
A better thing to use here is jax.make_array_from_callback, because the input on every host is the same, i.e. arr = jnp.arange(32*4).reshape(32, 4). make_array_from_callback will carve out the shards that each device needs on that host. Here is how the code could look:
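A minimal sketch of that, assuming the same (32, 4) input and the ('a1', 'a2') mesh from the question:

```python
import jax
import numpy as np
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# The same input is available as a host (NumPy) copy on every process.
global_shape = (32, 4)
global_data = np.arange(32 * 4).reshape(global_shape)

# Same mesh and partition spec as the question.
n_devices = jax.device_count()
devices = mesh_utils.create_device_mesh((n_devices, 1))
mesh = Mesh(devices, axis_names=('a1', 'a2'))
sharding = NamedSharding(mesh, P('a1', 'a2'))

# The callback receives the index (a tuple of slices) of one shard and
# returns just that slice; it is only invoked for the shards needed by
# this host's local devices.
def cb(index):
    return global_data[index]

arr = jax.make_array_from_callback(global_shape, sharding, cb)

# Each process holds only its addressable shards of the global array.
print(arr.shape, [s.data.shape for s in arr.addressable_shards])
```

The resulting arr is a global jax.Array sharded over every device in the mesh, even though no host ever transfers data to another host's devices.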