
torch.randn not supported on the 0.11.0 fork #6

@Linus467

Description

When running longer workloads, my vLLM server always errors out because torch.randn is not supported. The snippet below reproduces the failure outside the server on both 0.11.0 images, while nalanzeyu/vllm-gfx906:latest works:

docker run --rm -i \
  --device=/dev/kfd --device=/dev/dri \
  mixa3607/vllm-gfx906:0.11.0-rocm-6.3.3-tomylin890-abbe414 \
  python3 - << 'PY'
import torch

print("device:", torch.cuda.get_device_name(0))
x = torch.randn(4, 32000, device="cuda")
y, idx = x.sort(dim=-1, descending=False)
print("sort ok, shape:", y.shape)
PY
device: AMD Instinct MI50/MI60
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
torch.AcceleratorError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
docker run --rm -i \
  --device=/dev/kfd --device=/dev/dri \
  mixa3607/vllm-gfx906:0.11.0-rocm-6.3.3  \
  python3 - << 'PY'
import torch

print("device:", torch.cuda.get_device_name(0))
x = torch.randn(4, 32000, device="cuda")  # simulate logits
y, idx = x.sort(dim=-1, descending=False)
print("sort ok, shape:", y.shape)
PY
device: AMD Instinct MI50/MI60
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
torch.AcceleratorError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
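
Following the hint in the traceback, the same repro can be rerun with AMD_SERIALIZE_KERNEL=3 passed into the container, which the error message suggests for pinpointing where the kernel launch actually fails (I'm only adding the environment variable it names; I haven't pasted that output here):

docker run --rm -i \
  --device=/dev/kfd --device=/dev/dri \
  -e AMD_SERIALIZE_KERNEL=3 \
  mixa3607/vllm-gfx906:0.11.0-rocm-6.3.3 \
  python3 - << 'PY'
import torch

print("device:", torch.cuda.get_device_name(0))
x = torch.randn(4, 32000, device="cuda")
y, idx = x.sort(dim=-1, descending=False)
print("sort ok, shape:", y.shape)
PY

For comparison, the same snippet on nalanzeyu/vllm-gfx906:latest works: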

docker run --rm -i \
  --device=/dev/kfd --device=/dev/dri \
  nalanzeyu/vllm-gfx906:latest  \
  python3 - << 'PY'
import torch

print("device:", torch.cuda.get_device_name(0))
x = torch.randn(4, 32000, device="cuda")
y, idx = x.sort(dim=-1, descending=False)
print("sort ok, shape:", y.shape)
PY
device: AMD Instinct MI60 / MI50
sort ok, shape: torch.Size([4, 32000])

This op is used in some top-k -> top-p sampling step, which causes the error after prolonged runs. It happens for me when sending about 50 requests at once through LangChain; after roughly 40 of them have finished, the server errors out. A sketch of the kind of filtering I mean is below.
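
For reference, this is roughly the kind of top-k -> top-p filtering I mean. It is a minimal sketch, not vLLM's actual sampler (the function name and default parameters are made up), but it exercises the same sort kernel on a logits-shaped tensor:

import torch

def top_k_top_p_filter(logits, top_k=40, top_p=0.95):
    # top-k: keep only the k largest logits per row
    if top_k > 0:
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # top-p (nucleus): sort descending, keep the smallest prefix whose probability mass covers top_p
    sorted_logits, sorted_idx = logits.sort(dim=-1, descending=True)
    probs = sorted_logits.softmax(dim=-1)
    cum = probs.cumsum(dim=-1)
    drop = cum - probs > top_p  # tokens outside the nucleus
    sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
    # scatter the filtered values back to their original vocabulary positions
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

logits = torch.randn(4, 32000, device="cuda")  # same shape as the repro above
print("filtered:", top_k_top_p_filter(logits).shape)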
