xmap + CPU #11236
Replies: 1 comment
-
(1) One XLA CPU device is not necessarily attached to a particular physical CPU. For example, you can create 128 XLA CPU devices even if only 4 physical CPUs are available.
Yes. In a multi-threaded process, all of the process's threads share the same memory and open files.
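A minimal sketch of point (1) (my own illustration, not from the thread): the device count comes from the flag and is independent of the physical core count, so the flag has to be set before JAX is first imported.

```python
import os
# Must be set before the first `import jax` for the flag to take effect.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=128"

import jax

print(jax.device_count())   # 128, even on a 4-core machine
print(jax.devices()[:2])    # e.g. [CpuDevice(id=0), CpuDevice(id=1)]
```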
(2) I believe the only way is to load the data in a single thread and shard it to each device.
(3) Assuming there are 128 XLA CPU devices, you can create a random key in a single thread, split it into 128 keys, shard them to each device, and pass them into the xmapped function. Alternatively, you can use …
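As a hedged sketch of point (3) (my own illustration, written with `pmap` rather than `xmap`, since it uses the same shard-the-leading-axis pattern):

```python
import jax

# Assumes XLA_FLAGS="--xla_force_host_platform_device_count=128" was set
# before importing jax, so jax reports 128 CPU devices.
n_dev = jax.local_device_count()

# One key created on the host thread, split into one key per device.
key = jax.random.PRNGKey(0)
keys = jax.random.split(key, n_dev)            # shape (n_dev, 2)

# pmap shards the leading axis across devices: each device sees its own key.
samples = jax.pmap(lambda k: jax.random.normal(k, (4,)))(keys)
print(samples.shape)                           # (n_dev, 4)
```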
-
I have some questions about `xmap` on the CPU with `--xla_force_host_platform_device_count`. Let's assume there is a single host with N CPUs and N JAX CPU devices set via `--xla_force_host_platform_device_count`.

My current understanding is that in a single-host setup there will be one master process that spawns separate threads for each device. Do those threads have a shared memory space? If so, is that shared memory space used to make cross-device communication cheaper? For example, say you need to do an all-reduce among the devices. One way is to copy the data from each thread's memory space to each other thread's memory. Another way is to have one shared memory pool for all the threads and avoid any copying.
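(For reference, the all-reduce in question looks like this at the JAX level; a sketch of my own using `pmap` with `lax.psum` over the forced CPU devices, with the memory behavior underneath being exactly what I am asking about.)

```python
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()
x = jnp.arange(n_dev, dtype=jnp.float32)       # one scalar per device

# Collective all-reduce (sum) across the forced CPU devices.
total = jax.pmap(lambda v: jax.lax.psum(v, axis_name="i"), axis_name="i")(x)
print(total)                                   # every entry equals the sum
```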
I'm trying to use `tf.data.Dataset`s to load and feed data to my CPU device threads. One strategy would be to create a different data loader per thread by giving thread-unique seeds to the data shufflers. Then the data wouldn't have to be communicated across 'device' boundaries. Is that possible with a single-host multi-CPU-device setup? Or is the only way to load data on a single thread and then shard it and communicate it out to each device?

Relatedly, is there some way to get the device ID in the single-host multi-CPU-device setup? I would like to use it to seed different dataset loaders. Or am I just misunderstanding the programming model?
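(For what it's worth, each JAX device object exposes an integer `id`; the following is a hypothetical way to use it for per-device seeding or `tf.data` sharding, not something confirmed in this thread.)

```python
import jax

# Enumerate the devices visible to this host; each has a stable integer id.
for d in jax.local_devices():
    print(d.id, d.platform)    # 0 cpu, 1 cpu, ...

# Hypothetical use for per-device data pipelines, e.g. with tf.data:
#   dataset = dataset.shard(num_shards=jax.local_device_count(), index=d.id)
```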