miles Docker image does not support B300 (sm_103a)

## Problem

Tried to reproduce Qwen3-14B training (#530) on 8x B300 SXM6 (275GB). The Docker image `radixark/miles:latest` doesn't work on B300 because the bundled `ptxas` doesn't know about `sm_103a`.

## What works

- Docker pull, git pull, pip install — all fine
- Weight conversion (`convert_hf_to_torch_dist.py`) — completes successfully, 14.77B params loaded and saved
- Ray cluster startup + Megatron model loading — fine

## What breaks

SGLang engine startup fails. I hit three issues in sequence:

### 1. TP memory imbalance check (workaround exists)

In colocate mode, Megatron loads first and takes GPU memory. SGLang then sees uneven memory across GPUs and refuses to start:

```
RuntimeError: The memory capacity is unbalanced.
min_per_gpu_memory=101.79, local_gpu_memory=265.55, local_gpu_memory * 0.9=238.99
```

miles sets `SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=true`, but the current SGLang version checks `SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK` instead (defaults to `true`). The old env var is deprecated and no longer controls the behavior.

Workaround: add `"SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK": "false"` to the runtime env.

### 2. CUDA graph capture fails (workaround exists)

After bypassing (1), Triton fails to compile kernels during CUDA graph capture:

```
NoTritonConfigsError: No valid triton configs. PTXASError: PTXAS error: Internal Triton PTX codegen error
```

Workaround: `--sglang-disable-cuda-graph`

### 3. ptxas does not support sm_103a (no workaround)

After bypassing (1) and (2), SGLang still crashes because core Triton kernels (e.g. `alloc_extend_kernel` in the memory allocator) also fail to compile:

```
ptxas fatal   : Value 'sm_103a' is not defined for option 'gpu-name'
```

This is the bundled ptxas inside the Triton package, not the system one. There's no script-level workaround for this.

## Environment

- 8x NVIDIA B300 SXM6 AC, 275GB each
- `radixark/miles:latest` (sha256:6e467519505da8407f8c33e3c5ec622c0cc4e6e0929102efe34f8415940caf06)
- Triton ptxas path: `/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas`

## Notes

- This blocks all models on B300, not just 14B.
- Issue (1) about the deprecated env var seems like a miles-side bug that affects all GPUs in colocate mode — worth fixing separately regardless of B300 support.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

miles Docker image does not support B300 (sm_103a) #533

Problem

What works

What breaks

1. TP memory imbalance check (workaround exists)

2. CUDA graph capture fails (workaround exists)

3. ptxas does not support sm_103a (no workaround)

Environment

Notes

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

miles Docker image does not support B300 (sm_103a) #533

Description

Problem

What works

What breaks

1. TP memory imbalance check (workaround exists)

2. CUDA graph capture fails (workaround exists)

3. ptxas does not support sm_103a (no workaround)

Environment

Notes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions