@dev-tomek dev-tomek commented Dec 2, 2025

Addresses #5481.
Fixes RuntimeError: Native API failed. Native API returns: 20 (UR_RESULT_ERROR_DEVICE_LOST) on Triton GEMM + PostOp (add matrix) kernel benchmark int8 BMG.

Reserved memory grew with each configuration run within the benchmark, eventually resulting in an OOM under the hood.
The issue is visible on only a single BMG runner because that runner has more system RAM, which lets it pass the runtime checks and run an additional test case.
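
The pattern behind the fix can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: `run_configs`, `release_cached_device_memory`, and the `cleanup` parameter are hypothetical names, and only the `torch.xpu.empty_cache()` call comes from the change itself.

```python
from typing import Any, Callable, Iterable, List

try:  # torch is optional so the sketch also runs on machines without it
    import torch
except ImportError:  # pragma: no cover
    torch = None


def release_cached_device_memory() -> bool:
    """Release cached allocator blocks on XPU; return True if a call was made."""
    xpu = getattr(torch, "xpu", None) if torch is not None else None
    if xpu is not None and xpu.is_available():
        xpu.empty_cache()
        return True
    return False


def run_configs(
    configs: Iterable[Any],
    bench_fn: Callable[[Any], Any],
    cleanup: Callable[[], Any] = release_cached_device_memory,
) -> List[Any]:
    """Run bench_fn once per config, calling cleanup before every run.

    Hypothetical helper: cleaning up before each configuration keeps memory
    reserved by one configuration from carrying over into the next one.
    """
    results = []
    for config in configs:
        cleanup()
        results.append(bench_fn(config))
    return results
```

Passing `cleanup` as a parameter is a design choice made for this sketch so the ordering can be checked without an XPU device.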

With this change, the Triton GEMM + PostOp (add matrix) kernel benchmark int8 BMG passes on that runner.

A similar error message also appears on FlexAttention (batch_size=16) Causal Mask fwd; however, the same fix does not help there, which indicates a different issue. That investigation will continue in #5603.

@dev-tomek dev-tomek marked this pull request as ready for review December 4, 2025 13:53
@dev-tomek dev-tomek changed the title from "empty cache between each run to avoid OOM" to "Fix Triton GEMM + PostOp (add matrix) kernel benchmark int8 failure BMG" Dec 4, 2025
# Maximum across onednn=600, triton=1000
# For onednn and triton: Some configs increase performance with warmup as a step function, but some
# slowly decrease with saturation. Performance is best at 150-200ms range, but we want stable, not just best
torch.xpu.empty_cache()

What about the other benchmarks we run? Shouldn't we do the same?

Should we do it in a common place, maybe get_empty_cache_for_benchmark?
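
One possible reading of the "common place" idea is a shared wrapper that every benchmark function passes through. The sketch below is only an illustration: `with_cache_cleanup` and `default_cleanup` are hypothetical names, and it makes no claim about what `get_empty_cache_for_benchmark` currently does.

```python
import functools
from typing import Any, Callable


def default_cleanup() -> None:
    """Call torch.xpu.empty_cache() when torch and an XPU device are present."""
    try:
        import torch
    except ImportError:
        return
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        xpu.empty_cache()


def with_cache_cleanup(cleanup: Callable[[], Any] = default_cleanup):
    """Hypothetical decorator: run cleanup before each call of the wrapped function."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            cleanup()  # drop cached device memory before the benchmark body runs
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@with_cache_cleanup()
def example_benchmark(size: int) -> int:
    # Stand-in for a real benchmark body; it only returns a value here.
    return size * size
```

Applied in shared benchmark utilities, a wrapper like this would cover the other benchmarks without repeating the call in each file.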

Successfully merging this pull request may close these issues.

[BMG] Triton GEMM + PostOp (add matrix) kernel benchmark int8 failure