Ensure that CUDA errors don't waterfall when benchmarking by spawning an isolated process #45

@PaliC

Say we have an input that looks like
((T([512, 4096, 56, 56], f16), T([512, 4096, 56, 56], f16)), {})
For large inputs like this, the operator _adaptive_avg_pool2d_backward.default appears to succeed, but if you then call torch.cuda.synchronize() or torch.cuda.empty_cache() (or really just run another forward pass of a model/op), you hit a CUDA illegal memory access error. The same issue was seen in KernelBench and the inductor code scraper.
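A minimal repro sketch of the failure pattern, assuming a plain PyTorch environment (the two fp16 tensors of this shape are on the order of 13 GB each, so reproducing this needs a large-memory GPU):

```python
import torch

# Shapes taken from the adverse input above. The backward op itself appears
# to succeed; the illegal memory access only surfaces at the next sync point.
grad_output = torch.randn(512, 4096, 56, 56, dtype=torch.float16, device="cuda")
inp = torch.randn(512, 4096, 56, 56, dtype=torch.float16, device="cuda")

torch.ops.aten._adaptive_avg_pool2d_backward.default(grad_output, inp)

# These calls are where the CUDA illegal memory access actually shows up,
# and from here on the CUDA context is unusable for any later kernel.
torch.cuda.synchronize()
torch.cuda.empty_cache()
```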

I'm a bit worried that if some backend has kernels with a similar problem, it could crash our benchmark.

The way this was solved in KernelBench/the scraper was to spawn an isolated process for each kernel execution. While this adds overhead, it ensures that one kernel doing wacky things cannot interfere with the other kernels. As an added benefit, running multiple kernels at once (one per GPU) becomes a logical next step.
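A rough sketch of the isolated-process approach using Python's multiprocessing with the "spawn" start method (run_isolated and _run_in_child are hypothetical names, not the actual eval API; kernel_fn and its arguments must be picklable, so in practice the inputs would likely be constructed inside the child from a spec):

```python
import multiprocessing as mp

def _run_in_child(result_queue, kernel_fn, args, kwargs):
    # All CUDA work happens in the child, so a corrupted CUDA context
    # dies with the child instead of poisoning the parent process.
    try:
        import torch
        kernel_fn(*args, **kwargs)
        torch.cuda.synchronize()  # force any deferred CUDA error to surface here
        result_queue.put(("ok", None))
    except Exception as e:  # includes CUDA illegal memory access errors
        result_queue.put(("error", repr(e)))

def run_isolated(kernel_fn, args=(), kwargs=None, timeout_s=300):
    # "spawn" gives the child a fresh CUDA context (fork + CUDA is unsafe).
    ctx = mp.get_context("spawn")
    result_queue = ctx.Queue()
    proc = ctx.Process(
        target=_run_in_child,
        args=(result_queue, kernel_fn, args, kwargs or {}),
    )
    proc.start()
    proc.join(timeout=timeout_s)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return ("timeout", None)
    # A hard crash (e.g. the CUDA driver aborting the process) leaves the queue empty.
    return result_queue.get() if not result_queue.empty() else ("crashed", proc.exitcode)
```

Running one such child per GPU (e.g. by setting CUDA_VISIBLE_DEVICES per process) would then be a small step from here.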

For this task we'd want the following (in separate PRs or the same one):

  • A unit test which runs the adverse input (or a similar one) through eval, followed by a good input. This can be expected to fail in the initial PR; a sketch follows this list.
  • Adding isolated processes to eval to solve this.
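
A pytest-style sketch of what that unit test could look like (the eval entry point is stubbed with direct op calls here, since the exact API isn't shown in this issue; the test and helper names are hypothetical):

```python
import pytest
import torch

ADVERSE_SHAPE = (512, 4096, 56, 56)  # shape from the adverse input above

def _run_adverse():
    # Stand-in for running eval on the adverse input.
    g = torch.randn(*ADVERSE_SHAPE, dtype=torch.float16, device="cuda")
    x = torch.randn(*ADVERSE_SHAPE, dtype=torch.float16, device="cuda")
    torch.ops.aten._adaptive_avg_pool2d_backward.default(g, x)

def _run_good():
    # A small, well-behaved workload; .item() forces a synchronization,
    # so this fails if the CUDA context was corrupted by the previous run.
    a = torch.randn(128, 128, device="cuda")
    b = torch.randn(128, 128, device="cuda")
    (a @ b).sum().item()

@pytest.mark.xfail(reason="eval does not yet isolate kernel execution")
def test_adverse_input_does_not_poison_later_runs():
    _run_adverse()
    torch.cuda.synchronize()
    _run_good()
```

Note that if the illegal access aborts the whole test process, xfail alone won't keep the suite green, which is itself an argument for running the adverse kernel in a child process even inside the test.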
