Ensure that CUDA errors don't waterfall when benchmarking by spawning an isolated process #45

@PaliC

Say we have an input that looks like
((T([512, 4096, 56, 56], f16), T([512, 4096, 56, 56], f16)), {})
For large inputs like this, the operator _adaptive_avg_pool2d_backward.default appears to succeed, but if you then call torch.cuda.synchronize() or torch.cuda.empty_cache() (or really just run another forward pass of a model/op), you hit a CUDA illegal memory access error. The same issue was seen in KernelBench and the inductor code scraper.
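A minimal repro sketch of the failure pattern, assuming a plain PyTorch environment (the two fp16 tensors of this shape are on the order of 13 GB each, so reproducing this needs a large-memory GPU):

```python
import torch

# Shapes taken from the adverse input above. The backward op itself appears
# to succeed; the illegal memory access only surfaces at the next sync point.
grad_output = torch.randn(512, 4096, 56, 56, dtype=torch.float16, device="cuda")
inp = torch.randn(512, 4096, 56, 56, dtype=torch.float16, device="cuda")

torch.ops.aten._adaptive_avg_pool2d_backward.default(grad_output, inp)

# These calls are where the CUDA illegal memory access actually shows up,
# and from here on the CUDA context is unusable for any later kernel.
torch.cuda.synchronize()
torch.cuda.empty_cache()
```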

I'm a bit worried that if some backend has kernels with a similar problem, it could crash our benchmark.

The way this was solved in KernelBench/the scraper was to spawn an isolated process for each kernel execution. While this adds overhead, it ensures that one kernel doing wacky things cannot interfere with the other kernels. As an added benefit, running multiple kernels at once (one per GPU) becomes a logical next step.
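A rough sketch of the isolated-process approach using Python's multiprocessing with the "spawn" start method (run_isolated and _run_in_child are hypothetical names, not the actual eval API; kernel_fn and its arguments must be picklable, so in practice the inputs would likely be constructed inside the child from a spec):

```python
import multiprocessing as mp

def _run_in_child(result_queue, kernel_fn, args, kwargs):
    # All CUDA work happens in the child, so a corrupted CUDA context
    # dies with the child instead of poisoning the parent process.
    try:
        import torch
        kernel_fn(*args, **kwargs)
        torch.cuda.synchronize()  # force any deferred CUDA error to surface here
        result_queue.put(("ok", None))
    except Exception as e:  # includes CUDA illegal memory access errors
        result_queue.put(("error", repr(e)))

def run_isolated(kernel_fn, args=(), kwargs=None, timeout_s=300):
    # "spawn" gives the child a fresh CUDA context (fork + CUDA is unsafe).
    ctx = mp.get_context("spawn")
    result_queue = ctx.Queue()
    proc = ctx.Process(
        target=_run_in_child,
        args=(result_queue, kernel_fn, args, kwargs or {}),
    )
    proc.start()
    proc.join(timeout=timeout_s)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return ("timeout", None)
    # A hard crash (e.g. the CUDA driver aborting the process) leaves the queue empty.
    return result_queue.get() if not result_queue.empty() else ("crashed", proc.exitcode)
```

Running one such child per GPU (e.g. by setting CUDA_VISIBLE_DEVICES per process) would then be a small step from here.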

For this task we'd want the following (in separate PRs or the same one):

  • A unit test which runs the adverse input (or a similar one) through eval, followed by a good input. This can be expected to fail in the initial PR; a sketch follows this list.
  • Adding isolated processes to eval to solve this.
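
A pytest-style sketch of what that unit test could look like (the eval entry point is stubbed with direct op calls here, since the exact API isn't shown in this issue; the test and helper names are hypothetical):

```python
import pytest
import torch

ADVERSE_SHAPE = (512, 4096, 56, 56)  # shape from the adverse input above

def _run_adverse():
    # Stand-in for running eval on the adverse input.
    g = torch.randn(*ADVERSE_SHAPE, dtype=torch.float16, device="cuda")
    x = torch.randn(*ADVERSE_SHAPE, dtype=torch.float16, device="cuda")
    torch.ops.aten._adaptive_avg_pool2d_backward.default(g, x)

def _run_good():
    # A small, well-behaved workload; .item() forces a synchronization,
    # so this fails if the CUDA context was corrupted by the previous run.
    a = torch.randn(128, 128, device="cuda")
    b = torch.randn(128, 128, device="cuda")
    (a @ b).sum().item()

@pytest.mark.xfail(reason="eval does not yet isolate kernel execution")
def test_adverse_input_does_not_poison_later_runs():
    _run_adverse()
    torch.cuda.synchronize()
    _run_good()
```

Note that if the illegal access aborts the whole test process, xfail alone won't keep the suite green, which is itself an argument for running the adverse kernel in a child process even inside the test.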
