Conversation

@mikhailramalho mikhailramalho commented Jun 17, 2025

Updates the SpacemiT X60 scheduling model with actual latencies measured on BPi-F3 hardware.

tl;dr: execution time on SPEC is neutral after this patch. There is a code size regression, described in issue #146407.

Changes:

  • Added 10 new latency classes
  • Updated latencies for ~30 instruction categories based on hardware measurements

Completed:

  • Basic integer ALU, min/max, saturating/averaging arithmetic
  • Carry operations, mask operations, comparisons
  • Integer/FP division (split simple/complex based on LMUL)
  • Widening operations
  • FP operations including add/sub, mul, FMA
  • FP conversions (widening/narrowing)
  • FP reductions including vfredmax/min/usum (fixed fractional LMUL latencies)
  • FP ordered reductions vfredosum (split simple/complex)
  • FP widening reductions vfwredosum/vfwredusum (split simple/complex)
  • Integer reductions
  • Mask manipulation operations
  • Permutation operations (gather/compress/slide)
  • Narrowing shifts and clips (split simple/complex)

Missing:

  • All vector load/store uops are missing their measured latency values. The values in this PR are estimates while I collect the real numbers

Performance Impact:

  • https://lnt.lukelau.me/db_default/v4/nts/674?compare_to=673
  • This change is mostly NFC
  • The two benchmarks with execution-time improvements/regressions are known to be noisy
  • There is a code size increase in two benchmarks: reviewing the generated code, we see many more vector load/store instructions

Known Issues:

  • Code size regression on two SPEC benchmarks
  • Some grouped operations use worst-case latency
  • TableGen !cond expressions not working as expected for vector single-width FMA instructions
  • All compromises I've made are documented as TODO in the code

Planned follow-up PRs:

  • Debug the code size regressions
  • Add latencies for vector loads/stores
  • Work on the TODOs introduced by this PR. This should require splitting some WriteRes groups and changing other scheduling models, but it should be NFC for them

Signed-off-by: Mikhail R. Gadelha <[email protected]>
@mikhailramalho mikhailramalho changed the title [WIP][RISCV] Update SpacemiT X60 vector scheduling model with measured latencies [RISC-V] Update SpacemiT X60 vector scheduling model with measured latencies Jun 30, 2025
@mikhailramalho mikhailramalho requested a review from topperc June 30, 2025 19:36

// Pattern of vmacc, vmadd, vmul, vmulh, etc.: e8/e16 = 4/4/5/8, e32 = 5/5/5/8,
// e64 = 7/8/16/32. We use the worst case until we can split by SEW.
// TODO: change WriteVIMulV, etc to be defined with LMULSEWSchedWrites
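As a hypothetical sketch of the TODO above (it assumes WriteVIMulV has already been converted to the LMULSEWSchedWrites convention; the specific latency values are illustrative, not measured), splitting the multiply writes by SEW could look like:

```tablegen
// Hypothetical: split WriteVIMulV by SEW so e8/e16 no longer pay the
// e64 worst-case latency.
foreach mx = SchedMxList in {
  foreach sew = SchedSEWSet<mx>.val in {
    defvar IsWorstCase = SMX60IsWorstCaseMXSEW<mx, sew, SchedMxList>.c;
    // Illustrative latencies only: 4 for e8/e16, 5 for e32, 8 for e64.
    defvar Lat = !cond(!le(sew, 16) : 4, !eq(sew, 32) : 5, true : 8);
    let Latency = Lat in
    defm "" : LMULSEWWriteResMXSEW<"WriteVIMulV", [SMX60_VIEU], mx, sew,
                                   IsWorstCase>;
  }
}
```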
Personally I kind of agree that we can make multiplication's SchedWrite SEW-dependent

@LiqinWeng LiqinWeng self-requested a review July 2, 2025 02:19

// Strided and indexed loads and stores: scale with both LMUL and EEW
foreach eew = [8, 16, 32, 64] in {
defvar EEWMultiplier = !div(eew, 8);
Hm, this seems backwards from what I expect, with larger EEWs being more expensive? I would expect this to scale with the number of elements, and thus smaller EEWs to be more expensive.
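To make the reviewer's point concrete: for a fixed VLEN and LMUL, a vector register group holds VLEN*LMUL/EEW elements, so halving the EEW doubles the element count. A small sketch (the one-cycle-per-element cost model is an illustrative assumption, not the X60's actual behavior):

```python
def strided_access_elements(vlen_bits: int, lmul: int, eew_bits: int) -> int:
    """Number of elements in one vector register group at the given EEW."""
    return (vlen_bits * lmul) // eew_bits

# With VLEN=256 and LMUL=1: e8 touches 32 elements, e64 only 4 --
# so a one-cycle-per-element strided load would make *smaller* EEWs
# more expensive, not larger ones.
for eew in (8, 16, 32, 64):
    print(eew, strided_access_elements(256, 1, eew))
```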


// Segmented loads and stores: base latency multiplied by number of fields
// TODO: These latencies are estimations and are not confirmed experimentally
foreach mx = SchedMxList in {
These don't seem right. I'd expect either LD + shuffle costing, or one-per-element costing.
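For comparison, the two costings differ roughly like this (illustrative cycle counts, not measured X60 numbers; `base_cycles` and `cycles_per_elem` are assumed parameters for the sketch):

```python
def seg_cost_nf_multiplier(base_cycles: int, nf: int) -> int:
    """The PR's estimation: base load latency scaled by the number of fields."""
    return base_cycles * nf

def seg_cost_per_element(vl: int, nf: int, cycles_per_elem: int = 1) -> int:
    """One-per-element costing: every field of every element is a uop."""
    return vl * nf * cycles_per_elem

# e.g. a 4-field segmented load with vl=8 and an assumed base of 6 cycles:
print(seg_cost_nf_multiplier(6, 4))  # 24
print(seg_cost_per_element(8, 4))    # 32
```

The two models agree only when the base latency happens to match the element count, which is why measured numbers are needed before the estimates can be trusted.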

defm "" : LMULWriteResMX<"WriteVIMinMaxX", [SMX60_VIEU], mx, IsWorstCase>;
}

let Latency = Get44816Latency<mx>.c in {
Some of these have ReleaseAtCycle, and some don't, but I don't see any obvious pattern? Am I missing something here?
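For reference, the distinction the question is about looks roughly like this in TableGen (illustrative numbers and write names only; Latency models when the result is available, while ReleaseAtCycles models how long the uop occupies the unit):

```tablegen
// Pipelined op: result after 4 cycles, but the unit can accept a new
// uop every cycle (ReleaseAtCycles defaults to 1).
let Latency = 4 in
defm "" : LMULWriteResMX<"WriteVIALUV", [SMX60_VIEU], mx, IsWorstCase>;

// Long-occupancy op: the result takes 4 cycles and the unit is also
// busy for all 4 of them.
let Latency = 4, ReleaseAtCycles = [4] in
defm "" : LMULWriteResMX<"WriteVIMulV", [SMX60_VIEU], mx, IsWorstCase>;
```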

@preames

preames commented Jul 14, 2025

I talked with Mikhail offline. I'm generally happy with the direction of this overall patch, but we need to split off smaller pieces to make them practically reviewable. Mikhail is going to post a much smaller patch, starting with the simple integer instructions. I plan to iterate back and forth with him quickly on the patch series, with quick LGTMs on the sub-parts where possible for the "obvious" stuff. We'll slow down once we get to the harder parts, so we can focus attention on the interesting questions.

mikhailramalho added a commit that referenced this pull request Jul 24, 2025
This PR adds hardware-measured latencies for all instructions defined in
Section 11 of the RVV specification: "Vector Integer Arithmetic
Instructions" to the SpacemiT-X60 scheduling model.

The code in this PR was extracted from PR #144564, so it's smaller to
review. The code is almost identical; the only change was to add
ReleaseAtCycles to all instructions modified in this patch, except for
the vmul, vdiv, and vrem ones.
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Jul 28, 2025