xmap + CPU #11236
Replies: 1 comment
-
(1) One XLA CPU device is not necessarily attached to a particular physical CPU. For example, you can create 128 XLA CPU devices even if only 4 physical CPUs are available.
Yes. In a multi-threaded process, all of the process's threads share the same memory and open files.
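A minimal sketch of point (1) (my own illustration, not from the thread): the device count comes from the flag and is independent of the physical core count, so the flag has to be set before JAX is first imported.

```python
import os
# Must be set before the first `import jax` for the flag to take effect.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=128"

import jax

print(jax.device_count())   # 128, even on a 4-core machine
print(jax.devices()[:2])    # e.g. [CpuDevice(id=0), CpuDevice(id=1)]
```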
(2) I believe the only way is to load the data in a single thread and shard it to each device.
(3) Assuming there are 128 XLA CPU devices, you can create a random key in a single thread, split it into 128 keys, shard them to each device, and pass them into the xmapped function. Alternatively, you can use …
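As a hedged sketch of point (3) (my own illustration, written with `pmap` rather than `xmap`, since it uses the same shard-the-leading-axis pattern):

```python
import jax

# Assumes XLA_FLAGS="--xla_force_host_platform_device_count=128" was set
# before importing jax, so jax reports 128 CPU devices.
n_dev = jax.local_device_count()

# One key created on the host thread, split into one key per device.
key = jax.random.PRNGKey(0)
keys = jax.random.split(key, n_dev)            # shape (n_dev, 2)

# pmap shards the leading axis across devices: each device sees its own key.
samples = jax.pmap(lambda k: jax.random.normal(k, (4,)))(keys)
print(samples.shape)                           # (n_dev, 4)
```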
-
I have some questions about `xmap` on the CPU with `--xla_force_host_platform_device_count`. Let's assume there is a single host with N CPUs and N JAX CPU devices set via `--xla_force_host_platform_device_count`.

My current understanding is that in a single-host setup there will be one master process that spawns separate threads for each device. Do those threads have a shared memory space? If so, is that shared memory space used to make cross-device communication cheaper? For example, say you need to do an all-reduce among the devices. One way is to copy the data from each thread's memory space to each other thread's memory. Another way is to have one shared memory pool for all the threads and avoid any copying.
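(For reference, the all-reduce in question looks like this at the JAX level; a sketch of my own using `pmap` with `lax.psum` over the forced CPU devices, with the memory behavior underneath being exactly what I am asking about.)

```python
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()
x = jnp.arange(n_dev, dtype=jnp.float32)       # one scalar per device

# Collective all-reduce (sum) across the forced CPU devices.
total = jax.pmap(lambda v: jax.lax.psum(v, axis_name="i"), axis_name="i")(x)
print(total)                                   # every entry equals the sum
```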
I'm trying to use `tf.data.Dataset`s to load and feed data to my CPU device threads. One strategy would be to create a different data loader per thread by giving thread-unique seeds to the data shufflers. Then the data wouldn't have to be communicated across 'device' boundaries. Is that possible with a single-host multi-CPU-device setup? Or is the only way to load data on a single thread and then shard it and communicate it out to each device?

Relatedly, is there some way to get the device ID in the single-host multi-CPU-device setup? I would like to use it to seed different dataset loaders. Or am I just misunderstanding the programming model?
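(For what it's worth, each JAX device object exposes an integer `id`; the following is a hypothetical way to use it for per-device seeding or `tf.data` sharding, not something confirmed in this thread.)

```python
import jax

# Enumerate the devices visible to this host; each has a stable integer id.
for d in jax.local_devices():
    print(d.id, d.platform)    # 0 cpu, 1 cpu, ...

# Hypothetical use for per-device data pipelines, e.g. with tf.data:
#   dataset = dataset.shard(num_shards=jax.local_device_count(), index=d.id)
```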