Replies: 1 comment 5 replies
-
Thankfully, this is very simple! They are defined to run serially in WebGPU. Each dispatch is its own usage scope, so dispatches run as if serial. We communicate this to Metal by creating a "serial" compute pass, which enforces this behavior. See #2659 for discussion of parallel compute passes - something we want, but which hasn't been implemented yet.
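To illustrate, here is a minimal wgpu sketch of the guarantee described above (device/queue initialization and pipeline creation are omitted; `pipeline_a`, `pipeline_b`, and the bind groups are hypothetical names): two dispatches recorded into the same compute pass are ordered, so the second always observes the first's writes.

```rust
// Sketch only - assumes an already-initialized wgpu `device` and `queue`,
// plus two hypothetical compute pipelines and their bind groups.
let mut encoder =
    device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
{
    let mut pass =
        encoder.begin_compute_pass(&wgpu::ComputePassDescriptor::default());

    pass.set_pipeline(&pipeline_a);
    pass.set_bind_group(0, &bind_group_a, &[]);
    pass.dispatch_workgroups(64, 1, 1); // runs first

    pass.set_pipeline(&pipeline_b);
    pass.set_bind_group(0, &bind_group_b, &[]);
    pass.dispatch_workgroups(64, 1, 1); // ordered after the first dispatch
} // pass ends when dropped
queue.submit(Some(encoder.finish()));
```

On the Metal backend this pass maps to a compute command encoder created with a serial dispatch type, so the driver will not overlap the two dispatches even when their buffer usage would permit it.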
-
Hi,
I've got a couple of compute shaders that work on two different buffers, and later another shader merges those two buffers together. I have a strong suspicion that most of those shaders (being mixed ALU/memory-bound) should be scheduled in parallel, but it looks like Mac only "parallelizes" the first ones:
For completeness, this is how it looks in Xcode's dependencies tab:
Inspecting those shaders tells me that they spend a lot of time waiting for memory (and don't actually use that many registers), making them perfect candidates for parallelization:
... but it's not happening - is there any way for me to debug why the scheduler decides one way or the other?
(Maybe there's some low-hanging fruit to check, such as an accidentally created synchronization edge that I could get rid of.)
Thanks!
Edit: of course I'm aware that passes being executed in parallel doesn't imply any performance gain (and that it's some magical code in the driver which ultimately decides what happens) - I'd just like to make sure there's no low-hanging optimization fruit here.