Global collective all-reduce fails on sequential calls due to hardcoded/reused transfer IDs #4549

@hs-cengel

Description

I used an LLM to help me write the issue below

Describe the bug

burn-collective's global all-reduce strategies (Tree, Centralized, Ring) use hardcoded or locally-scoped transfer IDs for the TensorDataService. When all_reduce is called multiple times in sequence (e.g., via GradientsParams::all_reduce, which iterates over each parameter tensor), the transfer IDs collide between calls, causing the data service to return the wrong tensor. This results in shape mismatches and panics.

Specifically:

Since the TensorDataService maintains a persistent WebSocket connection across calls, the second all_reduce call's expose(..., 0.into()) collides with stale state from the first call. This causes nodes to download the wrong tensor, leading to PeerSentIncoherentTensor errors or panics in broadcast_shape() / can_mut_broadcast() due to shape mismatches.

To Reproduce

  1. Set up a multi-node configuration with 2 nodes and a global orchestrator
  2. Call all_reduce more than once in sequence on different tensors (this is what GradientsParams::all_reduce does — it iterates over all parameter gradients and calls burn_collective::all_reduce for each one)
  3. The second all_reduce call panics with one of:
    • PeerSentIncoherentTensor — shape validation catches the wrong tensor
    • index out of bounds: the len is 1 but the index is 1 — in broadcast_shape() when the downloaded tensor has fewer dimensions than expected

Minimal reproduction: modify the multinode-tests node.rs to call all_reduce twice with tensors of different shapes (e.g., first [4, 8], then [16]). The second call will fail.

Note: the existing multinode-tests only call all_reduce once per session, which is why this wasn't caught.

Expected behavior

Sequential all_reduce calls should work correctly, each operating on its own tensor independently. This is required for GradientsParams::all_reduce to function, since it is the primary integration point for DDP training with custom training loops.

Suggested fix

Use a monotonically increasing AtomicU64 counter (stored on the Node struct or passed through the all-reduce functions) for transfer IDs, so each expose/download_tensor pair gets a globally unique ID across calls. The counter should persist across all all_reduce invocations for the lifetime of the collective session.
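A minimal sketch of the proposed counter. The Node struct and next_transfer_id method here are hypothetical (the real Node lives in burn-collective); only AtomicU64 and fetch_add are from the standard library:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical Node holding the per-session transfer ID counter. The
// counter lives for the lifetime of the collective session, so IDs never
// repeat across sequential all_reduce calls.
struct Node {
    transfer_id_counter: AtomicU64,
}

impl Node {
    fn new() -> Self {
        Self { transfer_id_counter: AtomicU64::new(0) }
    }

    // Returns a globally unique transfer ID for one expose/download pair.
    // fetch_add is atomic, so even concurrent strategies running on the
    // same node receive distinct IDs.
    fn next_transfer_id(&self) -> u64 {
        self.transfer_id_counter.fetch_add(1, Ordering::SeqCst)
    }
}

fn main() {
    let node = Node::new();
    // Two sequential all_reduce calls now use distinct IDs: 0, then 1.
    let first = node.next_transfer_id();
    let second = node.next_transfer_id();
    assert_ne!(first, second);
    println!("ids: {first}, {second}");
}
```

Because both the exposing node and the downloading node must agree on the ID, the counter only works if every node advances it in lockstep (same number of expose/download pairs per all_reduce), which holds for the deterministic Tree/Centralized/Ring schedules.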

For tree.rs, instead of:

data_service.expose(result.clone(), 1, 0.into()).await;
// ...
.download_tensor(child_addr.clone(), 0.into())

Use something like:

let id = node.next_transfer_id(); // AtomicU64::fetch_add(1, Ordering::SeqCst)
data_service.expose(result.clone(), 1, id.into()).await;
// ...
.download_tensor(child_addr.clone(), id.into())

Desktop:

  • OS: Linux (Ubuntu 24.04)
  • Burn version: 0.20.1
  • Tested with both NdArray and CUDA backends — same behavior on both

Additional context

The local collective (intra-process, thread-based) handles sequential all_reduce calls correctly. The bug is specific to the global collective (inter-process, WebSocket-based) used for multi-node training.

This blocks any use of GradientsParams::all_reduce with the global collective, which means DDP training via the documented pattern (custom training loop + GradientsParams::all_reduce) cannot work in multi-node setups.
