[Documentation] nixlbench ETCD barrier synchronization timing guidance needed

## Problem
nixlbench fails with ETCD barrier synchronization errors when nodes don't start within close proximity:

```
Error in barrier wait... rank 0 completed 1/2 ranks)
Failed to synchronize at start barrier
```

## Environment
- Platform: AWS SageMaker HyperPod EKS  
- Setup: 2-node H100 configuration with ETCD coordination
- Network: EFA-enabled

## Details
When using `kubectl exec` to start nixlbench on multiple nodes, if there's >3 seconds delay between node startups (or AWS credentials expire between calls), ETCD barrier synchronization fails.

## Suggestion
Add documentation guidance:
1. Nodes should start within 3 seconds of each other
2. Recommend using Kubernetes Jobs for simultaneous startup
3. Document ETCD barrier timeout settings

## Workaround
Start both nodes within 3 seconds:
```bash
kubectl exec node1 -- nixlbench ... &
sleep 1
kubectl exec node2 -- nixlbench ...
```

Or use K8s Jobs for proper simultaneous startup.

Successfully implemented with K8s StatefulSet + coordination script.

Reference: https://github.com/dmvevents/dynamo-workshop/blob/main/NIXLBENCH_TESTING.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Documentation] nixlbench ETCD barrier synchronization timing guidance needed #1022

Problem

Environment

Details

Suggestion

Workaround

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Documentation] nixlbench ETCD barrier synchronization timing guidance needed #1022

Description

Problem

Environment

Details

Suggestion

Workaround

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions