
MJX: Warp FFI sometimes deadlocks when donate_argnums is present in pmap #2980

@hartikainen

Description


Intro

Hey,

This bug report is light on details, and I'm aware that the JAX-Warp backend is still very experimental. I still wanted to file the issue to see if anyone has experienced something similar, and hopefully to save others some time.

My setup

mujoco==3.4.0, mujoco-mjx==3.4.0, warp-lang==1.10.1 on linux_x86_64 with 8xA100 GPUs.

What's happening? What did you expect?

We use Brax PPO to train RL policies. We've recently noticed that the jax.pmap of training_epoch in ppo/train.py hangs when donate_argnums=(0, 1) is set. What makes this hang particularly annoying is that it only appears after on the order of 10 million training steps.

The reason I think this is an MJX-Warp FFI issue is that training works fine with impl=jax and only fails with impl=warp. I've also recently observed a similar hang with the MJX-Warp FFI in a completely independent setting: there, I was accessing Warp memory outside its allocated range, which deadlocked silently in a very similar way.

A simple workaround for anyone hitting similar deadlocks/hangs is to remove the donate_argnums argument from the jax.pmap call.
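To make the workaround concrete, here is a minimal sketch. training_epoch below is a hypothetical stand-in for the Brax function (its real signature and state types differ); only the jax.pmap call pattern is the point:

```python
import jax
import jax.numpy as jnp

# Hypothetical stand-in for Brax PPO's training_epoch; the shapes and the
# body are illustrative only, not the actual Brax API.
def training_epoch(training_state, env_state):
    new_training_state = training_state + 1.0
    new_env_state = env_state * 0.5
    return new_training_state, new_env_state

# Original pattern: donating the first two arguments lets XLA reuse their
# device buffers, but with impl=warp this reportedly deadlocks after many
# training steps.
train_fn_donating = jax.pmap(training_epoch, donate_argnums=(0, 1))

# Workaround: drop donate_argnums, at the cost of keeping the input
# buffers alive (extra device memory).
train_fn_safe = jax.pmap(training_epoch)

n = jax.local_device_count()
ts = jnp.zeros((n, 4))
es = jnp.ones((n, 4))
new_ts, new_es = train_fn_safe(ts, es)
```

Note that dropping donation roughly doubles the peak memory held by these arguments, which may matter on memory-constrained setups.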

Steps for reproduction

As mentioned above, I unfortunately don't yet have a good reproduction for this. Our MuJoCo model is proprietary and the training setup is somewhat customized, which is why I haven't been able to reproduce this with open-source models or vanilla Brax.

Minimal model for reproduction

No response

Code required for reproduction

No response

Confirmations

Labels: bug