Automatic parallel matmul? #8538
-
Hi. Based on the XLA Known Issues, an XLA program runs on exactly one device; on TPU, this means a jitted program runs on a single core.

A reinforcement learning algorithm consists of asking an agent for an action and updating the environment based on the agent's response. The training dataset is created on the fly, so there is no single training dataset that can be scattered to all cores and trained on in parallel. For performance, it is possible to jit the whole training function. The problem is: since the whole training function is jitted, it will only run on a single core. If the neural network is big, we want to utilize all the TPU cores in the part of the algorithm where the agent is asked (inference). This problem could be solved if the matmuls inside the jitted function were automatically parallelized across the TPU cores.

Can I kindly ask you if this is possible? Or how would you solve this problem? Thank you.
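To make the setup concrete, here is a minimal sketch of the pattern being described (the names `policy_apply`, `env_step`, and `train_step` are hypothetical placeholders, not code from this thread):

```python
import jax
import jax.numpy as jnp

def policy_apply(params, obs):
    # Hypothetical "big" network: one large matmul stands in for inference.
    return jnp.tanh(obs @ params)

def env_step(env_state, action):
    # Hypothetical environment update: the training data is produced on the fly.
    return env_state + action

@jax.jit
def train_step(params, env_state):
    # The whole step (inference + environment update + parameter update) is one
    # jitted XLA program, so it runs on exactly one device; the matmul inside
    # policy_apply is not spread across TPU cores automatically.
    obs = env_state
    action = policy_apply(params, obs)
    next_state = env_step(env_state, action)
    loss_fn = lambda p: jnp.mean((policy_apply(p, next_state) - action) ** 2)
    grads = jax.grad(loss_fn)(params)
    return params - 1e-3 * grads, next_state
```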
-
For running multiple NNs in parallel, you could use `pmap` as you noted. I believe it will `jit` under the hood automatically (see #5681 (comment)). More info on how to use `pmap`: https://colab.research.google.com/github/google/jax/blob/main/cloud_tpu_colabs/Pmap_Cookbook.ipynb
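A minimal sketch of the `pmap` route, assuming a data-parallel setup where the parameters are replicated and the inference batch is split across the local devices (the names and shapes below are made up for illustration):

```python
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()

# Hypothetical parameters and batch: one network applied to a batch whose
# leading dimension is split evenly across devices.
params = jnp.ones((128, 128))
batch = jnp.ones((n_dev * 8, 128))

# Replicate the parameters on every device and shard the batch along axis 0.
replicated_params = jax.device_put_replicated(params, jax.local_devices())
sharded_batch = batch.reshape(n_dev, -1, 128)

@jax.pmap
def apply_net(p, x):
    # Each device runs the same compiled computation on its own shard.
    return jnp.tanh(x @ p)

out = apply_net(replicated_params, sharded_batch)  # shape: (n_dev, 8, 128)
```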
If you want to parallelize a single matmul over multiple cores, then use `xmap` or `pjit`: https://jax.readthedocs.io/en/latest/notebooks/xmap_tutorial.html, https://jax.readthedocs.io/en/latest/jax-101/08-pjit.html