
Conversation

@PaliC (Contributor) commented on Aug 21, 2025:

The issue this aims to solve is described in #104

Once this is merged I will update the tritonbench suite. This PR is a bit specific to tritonbench at the moment: it does not cover everything that needs to be accounted for, just what's in the tritonbench test set right now.

Some analysis

Right now we are using OpInfo as our ground truth for testing. However, it has some pretty bogus inputs and outputs (especially with the allclose check in our testing harness). Effectively, for random or fill ops it outputs empty tensors or watermarked inputs. Some examples are below.

randint.default

[2025-08-22 15:05:16][INFO][eval.py] Looking at randint.default with
[2025-08-22 15:05:16][INFO][eval.py] args - (10, torch.Size([0, 5, 0]))
[2025-08-22 15:05:16][INFO][eval.py] kwargs - {'device': 'cuda'}
[2025-08-22 15:05:16][INFO][eval.py] reference (which is aten) output is: tensor([], device='cuda:0', size=(0, 5, 0), dtype=torch.int64)
[2025-08-22 15:05:16][INFO][eval.py] aten output is: tensor([], device='cuda:0', size=(0, 5, 0), dtype=torch.int64)

bernoulli.default

[2025-08-22 15:05:16][INFO][eval.py] Looking at bernoulli.default with
[2025-08-22 15:05:16][INFO][eval.py] args - (tensor([], device='cuda:0', size=(0, 3), dtype=torch.bfloat16),)
[2025-08-22 15:05:16][INFO][eval.py] kwargs - {}
[2025-08-22 15:05:16][INFO][eval.py] reference (which is aten) output is: tensor([], device='cuda:0', size=(0, 3), dtype=torch.bfloat16)
[2025-08-22 15:05:16][INFO][eval.py] aten output is: tensor([], device='cuda:0', size=(0, 3), dtype=torch.bfloat16)

empty_like.default

[2025-08-22 15:05:16][INFO][eval.py] Looking at empty_like.default with
[2025-08-22 15:05:16][INFO][eval.py] args - (tensor(-6.7188, device='cuda:0', dtype=torch.bfloat16),)
[2025-08-22 15:05:16][INFO][eval.py] kwargs - {}
[2025-08-22 15:05:16][INFO][eval.py] reference (which is aten) output is: -6.71875
[2025-08-22 15:05:16][INFO][eval.py] aten output is: -6.71875

[2025-08-22 15:05:16][INFO][eval.py] Looking at empty_like.default with
[2025-08-22 15:05:16][INFO][eval.py] args - (tensor([], device='cuda:0', size=(0, 5, 0), dtype=torch.bfloat16),)
[2025-08-22 15:05:16][INFO][eval.py] kwargs - {}
[2025-08-22 15:05:16][INFO][eval.py] reference (which is aten) output is: tensor([], device='cuda:0', size=(0, 5, 0), dtype=torch.bfloat16)
[2025-08-22 15:05:16][INFO][eval.py] aten output is: tensor([], device='cuda:0', size=(0, 5, 0), dtype=torch.bfloat16)

new_empty_strided.default

[2025-08-22 15:05:16][INFO][eval.py] Looking at new_empty_strided.default with
[2025-08-22 15:05:16][INFO][eval.py] args - (tensor(-6.7188, device='cuda:0', dtype=torch.bfloat16), (), ())
[2025-08-22 15:05:16][INFO][eval.py] kwargs - {}
[2025-08-22 15:05:16][INFO][eval.py] Error in allclose
[2025-08-22 15:05:16][INFO][eval.py] 
Exception raised for None:
    args: ((T([], bf16), T([], bf16),), {})
    exc: Scalars are not close!

Expected 0.0 but got -6.71875.
Absolute difference: 6.71875 (up to 0.01 allowed)
Relative difference: inf (up to 0.01 allowed)

[2025-08-22 15:05:16][INFO][eval.py] reference (which is aten) output is: -6.71875
[2025-08-22 15:05:16][INFO][eval.py] aten output is: 0.0
[2025-08-22 15:05:16][INFO][eval.py] for new_empty_strided.default is_correct=False abs_error=6.71875 rel_error=1.0
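The failure above is expected: empty-style ops return uninitialized memory, so two calls with identical arguments need not produce the same values. A minimal repro of the nondeterminism (illustrative only, not part of the harness; assumes a CUDA device as in the logs):

import torch

x = torch.tensor(-6.7188, device="cuda", dtype=torch.bfloat16)
# new_empty_strided allocates memory without initializing it, so the
# values of a and b below are arbitrary and may differ between calls.
a = x.new_empty_strided((), ())
b = x.new_empty_strided((), ())
# Comparing a and b (or either against the aten reference) with allclose
# is therefore meaningless; only shape/dtype/device/strides are defined.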

This PR allows us to skip these tests for torchbench, as our allclose check does not handle them.
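A minimal sketch of how the exclusion could be applied, with a hypothetical should_skip helper (the PR itself only adds the UNTESTABLE_OPERATORS list, shown in the diff further down):

UNTESTABLE_OPERATORS = ["empty_like", "new_empty", "new_empty_strided", "bernoulli"]

def should_skip(op_name: str) -> bool:
    # Overload names look like "empty_like.default"; match on the base name.
    return op_name.split(".")[0] in UNTESTABLE_OPERATORS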

What to do later

For PyTorch, the testing of distributions and random ops can be found in test_distributions.py and test_random.

For fill / tensor creation ops, test_tensor_creation_ops.py is where we find those tests.

We need to add this kind of testing to BackendBench.
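As a rough sketch of what that could look like for a random op (hypothetical helper, not what those PyTorch test files do verbatim), bernoulli can be tested statistically instead of elementwise:

import torch

def check_bernoulli(p: torch.Tensor, out: torch.Tensor, atol: float = 0.05) -> bool:
    # Elementwise comparison is meaningless for independent random draws;
    # instead check that the empirical mean of the samples is close to the
    # mean of the probability tensor p (only sensible for large tensors).
    return torch.isclose(out.float().mean(), p.float().mean(), atol=atol).item()

# Usage sketch:
# p = torch.full((10000,), 0.3, device="cuda")
# assert check_bernoulli(p, torch.bernoulli(p))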

@meta-cla bot added the CLA Signed label on Aug 21, 2025
@jiannanWang (Contributor) commented:
I'm ok with excluding those ops. They always fail in the tests.

My concern is about the naming of these ops. I think bernoulli is a random op, but is it appropriate to also call empty_like, new_empty, and new_empty_strided random ops? I feel there's a difference.

@PaliC (Contributor, Author) commented on Aug 21, 2025:

@jiannanWang That's fair; I guess technically it is and it isn't. Let's go with untestable.

"empty_like",
"new_empty",
"new_empty_strided",
"bernoulli",
@msaroufim (Member) commented on Aug 21, 2025:

I feel like this one is different from the others; it is testable.

@PaliC (Contributor, Author) replied:

I think fixing the error message to say we don't support them yet is correct. There is a way to test these, but it would require some custom work.

@@ -20,6 +20,13 @@
"_fft_c2c.default", # cuFFT only supports dimensions whose sizes are powers of two when computing in half precision
]

UNTESTABLE_OPERATORS = [
"empty_like",
@msaroufim (Member) commented:

We can test metadata if not the values; e.g., we expect the output to have a certain shape.
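For illustration, a minimal sketch of such a metadata-only check (hypothetical helper, not part of this PR):

import torch

def check_metadata(ref: torch.Tensor, out: torch.Tensor) -> bool:
    # The values of empty_like and friends are uninitialized, but the
    # metadata (shape, dtype, device, strides) is fully determined.
    return (
        out.shape == ref.shape
        and out.dtype == ref.dtype
        and out.device == ref.device
        and out.stride() == ref.stride()
    )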

@PaliC (Contributor, Author) replied:

Same reply as above; I'll add comments to talk about this.

@msaroufim (Member) left a comment:

See CIL. Also, if we're making claims around untestable ops I'd like a more comprehensive list; it'll just be confusing to slowly iterate on this.

@PaliC (Contributor, Author) commented on Aug 21, 2025:

@msaroufim this is the entire set of ops for torchbench + opinfo that we don't currently support correctness checking for (assuming aten is correct). After merging #92 this should become much easier if more stuff comes up as we expand our suites.

@PaliC requested review from msaroufim and jiannanWang on August 21, 2025 at 23:02
@msaroufim (Member) left a comment:

Per our offline chat, please add more details on the logs you got and how OpInfo does testing for random operators (watermarking) and memory allocation (also likely watermarking).

@PaliC merged commit 289d8c6 into meta-pytorch:main on Aug 22, 2025 (3 checks passed)
@PaliC deleted the remove_random_ops branch on August 22, 2025 at 22:13
@msaroufim (Member) commented on Aug 23, 2025:

I was kinda hoping for a bit more detail before merge. In particular, when linking to testing of randomness or creation ops, I don't feel like the PR description adequately explains the current gaps in OpInfo and how we'd go about fixing them. It's easy to conceive of examples where memory allocation is being done incorrectly, so if you're not incorrect on some obvious cases then you're more likely to be correct. Since we have merge rights into PyTorch core itself, we have the ability to go make improvements there; this issue points out something "off" about our testing but stops short of scoping out how to fix it.
