Simpleperf PMU Event Ambiguity on ARM (Qualcomm) Prevents Accurate Top-Down Analysis #2187
Replies: 3 comments
-
I'm not sure what I can answer here. If anything is unclear, you can check the ARM manual. I agree, though, that ARM leaves many parts implementation defined. From the software's perspective, the kernel operates on PMU events one by one; it can't start and stop all counters at exactly the same time, which can also create some mismatch.
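One way to reduce the start/stop skew described above is to ask simpleperf to monitor the related events as a single group with `--group`, so they are scheduled in and out together. The sketch below only builds such a command line. The helper name `build_grouped_stat_command` is made up for illustration, and it is an assumption that all five events fit into the hardware counters available on the device; if they do not, the group may fail to open or not be counted.

```python
import shlex


def build_grouped_stat_command(events, workload):
    """Return a `simpleperf stat` argv that monitors `events` as one group.

    Hypothetical helper: it only assembles arguments and does not run anything.
    """
    return ["simpleperf", "stat", "--group", ",".join(events), *workload]


# Events discussed in this thread; whether they all fit in one group is
# device dependent (assumption, not verified on Snapdragon 8 Gen 3).
events = [
    "cpu-cycles",
    "instructions",
    "raw-stall-slot",
    "raw-stall-slot-frontend",
    "raw-stall-slot-backend",
]

cmd = build_grouped_stat_command(events, ["./7zzs", "b"])
print(shlex.join(cmd))
```

Grouping does not change what each event counts; it only makes the counters cover the same time window, so any remaining mismatch is more likely to come from the event definitions themselves.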
-
Follow-up: Can't compute
-
I think the instructions event should be replaced with raw-inst-spec, because instructions counts retired instructions rather than all instructions that use slots.
-
Description
I am trying to perform a standard Top-Down Microarchitecture Analysis (TMA) on a Qualcomm Snapdragon 8 Gen 3 device (Phoenix core) using simpleperf stat. The collected performance counter data contains logical contradictions that make it impossible to derive a coherent model. The core issue is that the raw-stall-slot-* event family does not behave as required for TMA, suggesting overlapping event definitions or incorrect assumptions about their scope.
Steps to Reproduce
./7zzs b) on a Qualcomm Snapdragon 8 Gen 3 device.
Expected Result
The sum of the four L1 categories should be close to 100%. The events raw-stall-slot-frontend and raw-stall-slot-backend should be mutually exclusive subsets of raw-stall-slot.
Actual Result
The collected data shows severe logical inconsistencies. Using the provided output as an example:
The following contradictions are present:
1. raw-stall-slot does not represent total wasted slots.
The fundamental TMA axiom is: Total_Slots = instructions + All_Wasted_Slots.
- Calculated (instructions + raw-stall-slot): 112,886,642,806 + 310,875,425,026 = 423,762,067,832 slots
- Theoretical (cycles * width=10): 69,656,860,191 * 10 = 696,568,601,910 slots
The calculated value is only ~60% of the theoretical maximum, indicating raw-stall-slot is not counting all wasted slots (e.g., it may exclude slots wasted by bad speculation).
2. raw-stall-slot-* events are not mutually exclusive.
The sum of the sub-components exceeds the total, which is logically impossible if they are mutually exclusive subsets of it:
- raw-stall-slot-backend + raw-stall-slot-frontend = 133,569,545,216 + 179,401,713,413 = 312,971,258,629
- This is greater than raw-stall-slot (310,875,425,026).
Request
To enable accurate performance analysis, we need:
Clarification: Official documentation for the semantics of these events on supported platforms (especially Qualcomm). Specifically:
- What does raw-stall-slot actually encompass?
- Do the sub-events (raw-stall-slot-frontend, raw-stall-slot-backend) overlap?
Guidance: A vendor-recommended formula for performing TMA using the events that simpleperf actually provides on this platform.
Improvement: If the current events are insufficient, could simpleperf be extended to provide derived metrics that correctly calculate the four TMA L1 categories without overlap?
Environment
This issue prevents any reliable architectural performance analysis on a major Android platform. Thank you for your attention to this matter.
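The two consistency checks from the Actual Result section can be reproduced with a few lines of Python. This is only a sketch: the counter values are copied from the issue text, the width of 10 slots per cycle is the assumption stated there, and the function names `slot_coverage` and `sub_event_overshoot` are made up for illustration.

```python
# Counter values copied from the issue text.
instructions = 112_886_642_806
stall_slot = 310_875_425_026
stall_slot_backend = 133_569_545_216
stall_slot_frontend = 179_401_713_413
cpu_cycles = 69_656_860_191
ISSUE_WIDTH = 10  # slots per cycle, as assumed in the issue


def slot_coverage(instructions, stall_slot, cycles, width):
    """Fraction of theoretical slots explained by instructions + stall slots."""
    return (instructions + stall_slot) / (cycles * width)


def sub_event_overshoot(frontend, backend, total):
    """Slots by which frontend + backend exceed the total stall count.

    A positive value means the sub-events cannot be disjoint subsets of total.
    """
    return frontend + backend - total


coverage = slot_coverage(instructions, stall_slot, cpu_cycles, ISSUE_WIDTH)
overshoot = sub_event_overshoot(stall_slot_frontend, stall_slot_backend, stall_slot)

print(f"coverage of theoretical slots: {coverage:.1%}")
print(f"frontend + backend - total:    {overshoot:,} slots")
```

A coverage well below 100% corresponds to the first contradiction, and a positive overshoot corresponds to the second. Neither check says which event definition is at fault; they only show that the counts cannot all hold under the assumptions above.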