SPMD Parallelism on CPU -- Question concerning Best Practice. #15174

danielkelshaw · 2023-03-23T18:54:24Z

Hey, I'm currently doing some work where I need to run a function over a variety of initial conditions - I'm looking for the most efficient way to do this.

I have a function of the form:

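from typing import Any

import jax

PyTree = Any  # placeholder alias; the post itself does not define PyTree

def p_compute_geodesic(initial_condition: jax.Array) -> tuple[PyTree, PyTree]: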
    """Brief overview of the function:

    This function runs an optimisation procedure (Newton-Raphson) to try and minimise some residual.
    - A jax.lax.while_loop is used to determine stop conditions for the optimisation.
    - The residual solves an initial value problem - this is done using a jax.lax.fori_loop.

    The optimised state, and auxiliary information about the optimisation are returned as PyTrees.

    Note: for simplicity I am demonstrating this as a partial application (hence p_), omitting static inputs.
    """

    ...

The function is largely a wrapper around a jax.lax.while_loop, so wrapping it in jax.jit gains little in this case: the condition and body functions are already compiled as part of the loop.

Attempt 01 :: `jax.vmap`

My first port of call was jax.vmap, to run this in SIMD fashion. However, inspecting CPU usage via htop makes it clear that not all cores are being used. The load barely increases and is nowhere near full utilisation of my 16-core machine.

a, b = jax.vmap(p_compute_geodesic)(initial_conditions)

I understand that vmap will take as long as the slowest function call, this is something I'm happy to live with.
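As a minimal sketch of this attempt, the snippet below vmaps a toy Newton-Raphson solver built on jax.lax.while_loop. The function `newton_sqrt` and its tolerance are hypothetical stand-ins for `p_compute_geodesic`, which is not shown in this post. Under vmap the batched loop keeps iterating until every element's stop condition is met, which is why the whole batch takes as long as its slowest member.

```python
import jax
import jax.numpy as jnp


def newton_sqrt(a):
    """Hypothetical stand-in for p_compute_geodesic: Newton-Raphson for x**2 = a."""

    def cond(state):
        x, i = state
        # Continue while the relative residual is large and the iteration cap is not hit.
        return (jnp.abs(x * x - a) > 1e-4 * a) & (i < 50)

    def body(state):
        x, i = state
        return 0.5 * (x + a / x), i + 1

    x, n_iter = jax.lax.while_loop(cond, body, (a, 0))
    return x, n_iter


initial_conditions = jnp.linspace(1.0, 100.0, 96)

# One batched while_loop: it runs until the slowest element has converged.
roots, n_iters = jax.vmap(newton_sqrt)(initial_conditions)

print(roots.shape, int(n_iters.min()), int(n_iters.max()))
```

Each element still reports its own iteration count, but the wall-clock cost of the batch is set by the largest one.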

Attempt 02 :: `jax.pmap` with `jax.vmap`

I notice that jax.device_count() reports a single CPU device, even though htop shows processing on multiple cores. I gather from other discussions that this is due to BLAS / LAPACK calls for certain functions? In order to access all available cores, I set the environment variable

XLA_FLAGS="--xla_force_host_platform_device_count=16"

This allows me to use pmap to exploit parallelism in an SPMD fashion, with the size of the axis over which pmap is applied limited to 16. In my case, I reshape my input to (16, 6, ...) in order to run the function for 96 initial conditions:

a, b = jax.pmap(jax.vmap(p_compute_geodesic))(rearranged_initial_conditions)
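For reference, a self-contained sketch of this setup might look as follows. The solver `toy_solver` is a hypothetical stand-in for `p_compute_geodesic`, and the flag is set from Python purely for illustration; it has to be set before JAX initialises its backends, so it comes before the first `import jax`.

```python
import os

# Must be set before JAX initialises, i.e. before the first `import jax`.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=16"

import jax
import jax.numpy as jnp


def toy_solver(x0):
    """Hypothetical stand-in for p_compute_geodesic: ten Newton steps towards sqrt(2)."""
    return jax.lax.fori_loop(0, 10, lambda i, x: 0.5 * (x + 2.0 / x), x0)


n_devices = jax.local_device_count()
initial_conditions = jnp.linspace(1.0, 2.0, 96)

# The leading axis must equal the number of devices; here 96 -> (16, 6).
# This reshape only works when the batch size is a multiple of n_devices.
rearranged_initial_conditions = initial_conditions.reshape(n_devices, -1)

# pmap splits the leading axis across devices, vmap batches within each device.
out = jax.pmap(jax.vmap(toy_solver))(rearranged_initial_conditions)

# Flatten back to one result per initial condition.
result = out.reshape(-1)
print(n_devices, result.shape)
```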

This runs considerably faster but brings up a few questions:

- Is there any way to use this process for an arbitrary number of initial conditions? In this example I have explicitly chosen a multiple of the number of devices visible to jax. Is this a restriction, or are there any workarounds?
- There is nothing stopping me from setting xla_force_host_platform_device_count to a number exceeding the total number of cores available on my machine. It is not clear to me how this works, or how it affects performance. If I set the number of devices to 200, I can successfully pmap over these, but how this maps to true distribution over CPUs is unclear.
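On the first question, one possible workaround (a sketch under assumptions, not necessarily the recommended approach) is to pad the batch up to the next multiple of the device count, run pmap of vmap, and discard the padded results afterwards. The helper `pmap_vmap_padded` and the solver `toy_solver` below are hypothetical names introduced only for this illustration.

```python
import os

# Must be set before JAX initialises, i.e. before the first `import jax`.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=16"

import jax
import jax.numpy as jnp


def toy_solver(x0):
    """Hypothetical stand-in for p_compute_geodesic: ten Newton steps towards sqrt(2)."""
    return jax.lax.fori_loop(0, 10, lambda i, x: 0.5 * (x + 2.0 / x), x0)


def pmap_vmap_padded(fn, xs):
    """Hypothetical helper: run pmap(vmap(fn)) over any batch size by padding."""
    n = xs.shape[0]
    n_devices = jax.local_device_count()
    per_device = -(-n // n_devices)  # ceiling division
    pad = per_device * n_devices - n

    # Repeat the last entry so that the padded rows are still valid inputs.
    pad_width = [(0, pad)] + [(0, 0)] * (xs.ndim - 1)
    padded = jnp.pad(xs, pad_width, mode="edge")
    sharded = padded.reshape(n_devices, per_device, *xs.shape[1:])

    out = jax.pmap(jax.vmap(fn))(sharded)

    # Merge the device and batch axes of every output leaf, then drop the padding.
    def unshard(y):
        return y.reshape(n_devices * per_device, *y.shape[2:])[:n]

    return jax.tree_util.tree_map(unshard, out)


initial_conditions = jnp.linspace(1.0, 2.0, 100)  # not a multiple of 16
result = pmap_vmap_padded(toy_solver, initial_conditions)
print(result.shape)
```

The padded entries are computed and then thrown away, so some work is wasted whenever the batch size is not a multiple of the device count.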

Potential use of `jax.experimental.maps.xmap`

The idea of 'easy-to-revise parallelism' is quite alluring, but at first glance it appears that I would need to re-write the internals of my function in a pretty major way? Is this something worth looking into?

TL;DR

What is the best way to maximise CPU usage when using jax? When should we use vmap, and when should we look to something more complicated such as the pmap of vmap mentioned above? How does xmap fit in here, and would it require a major re-write of the code in order to work?

Thank you for any advice you can offer, it's much appreciated!
