Description
Instruction timing currently offers only three options: "Clks normalized by wavefronts", "Clks normalized by hit count" and "Total clks". The second is the most useful of them when trying to view ALU latency, but there are often latency outliers. I assume one cause is the CU issuing instructions from other waves between two neighboring VALU instructions, but there could be other reasons too.
Currently, the only way to avoid this issue, at least partially, is to look only at the hits from "slowest in selection" or "fastest in selection". Those views still have outliers, but the outliers are much more apparent and not hidden in the average of thousands of hits. The obvious disadvantage is that these two waves might not cover all execution paths in the program, so in branch-heavy workloads a lot of the shader will just be greyed out.
Adding "lowest clks" and "median clks" of the hits as options would exclude large outliers by definition and make "Wavefront latency: selection total" more useful for displaying ALU latency and suboptimal instruction scheduling.
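To illustrate why the minimum and median would help, here is a minimal sketch with made-up numbers. The per-hit clock counts and the `summarize_hits` helper are hypothetical and are not taken from the profiler or from a real capture; they only show how a single outlier skews the average while leaving the other two statistics untouched.

```python
from statistics import mean, median


def summarize_hits(clks):
    """Return (mean, median, minimum) for a list of per-hit clock counts."""
    return mean(clks), median(clks), min(clks)


# Hypothetical data: nine hits at 4 clks, plus one hit that stalled for 400 clks
# (for example because the CU issued instructions from other waves in between).
hits = [4] * 9 + [400]

avg, med, low = summarize_hits(hits)
print(f"normalized by hit count: {avg} clks")
print(f"median of hits:          {med} clks")
print(f"lowest of hits:          {low} clks")
```

In this made-up case the hit-count average is roughly ten times the latency that nine out of ten hits actually saw, while the median and the minimum both report the typical value.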
Examples of the issues I'm currently having:
This is using the selection total option. All of the VALU instructions are displayed as having huge latency (in ALU terms); if that latency were always required, performance would be terrible.
This is fastest in selection. Clearly the compiler did a decent scheduling job here, as a lot of instructions have the minimum possible latency.
This is slowest in selection. There are clear outliers, but overall this is still more informative than the first view. It is not what I would want to use, but it is still better than the first option if the fastest wave had no coverage of this part of the program.


