Skip to content

[herd] More efficient computation of atomic load X stores pairs#1735

Merged
maranget merged 1 commit intomasterfrom
opt-atomic-pairs
Mar 20, 2026
Merged

[herd] More efficient computation of atomic load X stores pairs#1735
maranget merged 1 commit intomasterfrom
opt-atomic-pairs

Conversation

@maranget
Copy link
Copy Markdown
Member

@maranget maranget commented Mar 2, 2026

We take inspiration from the efficient computation of register read-from (see PR #1704): atomic stores are ordered according to (extended by iico) program order and an atomic load is paired with the closest atomic write that follows it.

Preliminary to PR #1733.

@maranget maranget marked this pull request as ready for review March 3, 2026 14:37
Copy link
Copy Markdown
Collaborator

@HadrienRenaud HadrienRenaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks @maranget. There is an important point about interleaving reads in a comment that I think needs to be addressed before this is merged. Maybe a justification of why this is not a problem is enough.

@maranget
Copy link
Copy Markdown
Member Author

maranget commented Mar 5, 2026

For some reason CI is not running, an idea why?

@maranget
Copy link
Copy Markdown
Member Author

maranget commented Mar 5, 2026

The complete test make test-all fails on ./herd/tests/instructions/AArch64.vmsa+mte/A016.litmus...

@maranget
Copy link
Copy Markdown
Member Author

maranget commented Mar 5, 2026

In some cases, such as combining vmsa and mte, there exist (atomic) read effects that are unrelated by generalised po. As a consequence it is not possible to build a set of effects that rely on generalised po.

I have reverted to finding the closest po-after write (and not closest effect) and added a naive non-intervening load test. As long as writes (to a given location and on a given thread) are po-ordered, it works.

@maranget maranget force-pushed the opt-atomic-pairs branch 2 times, most recently from 399c3ed to c3b869a Compare March 5, 2026 23:18
Copy link
Copy Markdown
Collaborator

@HadrienRenaud HadrienRenaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @maranget, this looks very good. Sorry I am coming with other big questions

Should we check that the order is total? I think it would only need a linear pass on the store set. We could probably put this in StoreSet actually:

let check_total set =
  if not (is_empty set) then
    Set.fold
      (fun e' e ->
         if e' != e (* min element *) then
           assert (is_before_strict e e');
         e')
      set (min_elt set)
    |> ignore

let of_list li =
  let set = of_list li in
  let () = check_total set in
  set

herd/mem.ml Outdated
Comment on lines +871 to +872
module
EvtSetByPo
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: should this belong in memUtils?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a possibility, I was a bit affraid of the signature but there should be no difficulty,

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

herd/memUtils.ml Outdated
and collect_reg_loads_stores es = collect_by_loc2 es E.is_reg_load_any E.is_reg_store_any

let accumulate_loc_proc proc loc e =
IntMap.update proc @@ function
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation is wrong

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you dropped the fix on this one

@maranget
Copy link
Copy Markdown
Member Author

maranget commented Mar 6, 2026

Thanks @maranget, this looks very good. Sorry I am coming with other big questions

Should we check that the order is total? I think it would only need a linear pass on the store set. We could probably put this in StoreSet actually:

let check_total set =
  if not (is_empty set) then
    Set.fold
      (fun e' e ->
         if e' != e (* min element *) then
           assert (is_before_strict e e');
         e')
      set (min_elt set)
    |> ignore

let of_list li =
  let set = of_list li in
  let () = check_total set in
  set

First consider that two write effects performed by different instructions are ordered, provided po is a total order.
Moreover the iico relation cannot have a write effect as its source. As a consequence, if two stores to a given location are performed by the same instruction they are unrelated. Hence our "stores to a given relation are totally orderd by the generalised po" is equivalent to "any instruction performs at most one store to a given location"

In fact this sounds quite right because, if some intruction could perform two or more stores to a given location, then this instruction would be non-determinist (unless the value stored always are identical).

Hence I plan to simplify the ordering relation behind StoreSet and see what happens.

@maranget
Copy link
Copy Markdown
Member Author

maranget commented Mar 6, 2026

Hence I plan to simplify the ordering relation behind StoreSet and see what happens.

I have followed your suggestions. I have also tried to use plain po as the order behind StoreSet.

It does not work because of AArch64+ASL. In that setting, the initialisation of the program counter is iico-before all PC reads and writes in the instruction. This is quite surprising.

@HadrienRenaud
Copy link
Copy Markdown
Collaborator

It does not work because of AArch64+ASL. In that setting, the initialisation of the program counter is iico-before all PC reads and writes in the instruction. This is quite surprising.

I do not understand. The initialisation of the PC is a register write, why does it interfere here? Also why would ASL even call this function?

* to this set.
*)
module
EvtSetByPo :
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I would have kept this in mem.ml for now. EvtSetByPo's underlying assumptions seem fairly coupled to its usage within mem.ml, as it relies on a non‑total compare, which make sense in the specific context of mem.ml but seem riskier/easier to miss if exposed as a shared utility.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I can revert this easily.

@maranget
Copy link
Copy Markdown
Member Author

maranget commented Mar 6, 2026

It does not work because of AArch64+ASL. In that setting, the initialisation of the program counter is iico-before all PC reads and writes in the instruction. This is quite surprising.

I do not understand. The initialisation of the PC is a register write, why does it interfere here? Also why would ASL even call this function?

I do not understand either how a register write can be iico_data before something....

In AArch64_ASL mode, the failing StoreSet is used by match_reg_events, not by make_atomic_pairs. Here is the backtrace:

Not ordered stores (eeid=ev35 action=W1:PCq=200008) and (eeid=ev39 action=W1:PCq=200004)

Fatal: File "./herd/tests/instructions/AArch64/L027.litmus" Adios
Fatal error: exception File "herd/memUtils.ml", line 83, characters 14-20: Assertion failed
Raised at Dune__exe__MemUtils.Make.EvtSetByPo.compare in file "herd/memUtils.ml", line 83, characters 14-26
Called from Stdlib__Set.Make.add in file "set.ml", line 135, characters 18-33
Called from Dune__exe__Mem.Make.match_reg_events.(fun) in file "herd/mem.ml", line 878, characters 19-42
Called from Stdlib__Map.Make.fold in file "map.ml", line 329, characters 19-42
Called from Stdlib__Map.Make.fold in file "map.ml", line 329, characters 26-41
Called from Dune__exe__Mem.Make.do_solve_regs.(fun) in file "herd/mem.ml", lines 981-983, characters 10-18
Called from Dune__exe__Mem.Make.calculate_rf_with_cnstrnts in file "herd/mem.ml", line 2339, characters 12-36
Called from Stdlib__List.fold_left in file "list.ml", line 123, characters 24-34
Called from Dune__exe__Top_herd.Make.run in file "herd/top_herd.ml", line 478, characters 12-53
Re-raised at Dune__exe__Top_herd.Make.run in file "herd/top_herd.ml", line 483, characters 12-19
Called from Dune__exe__RunTest.Make.run in file "herd/runTest.ml", line 92, characters 8-29
Called from Misc.input_protect_gen in file "lib/misc.ml", line 487, characters 8-14
Re-raised at Misc.input_protect_gen in file "lib/misc.ml", line 487, characters 41-48
Called from Dune__exe__Herd.(fun) in file "herd/herd.ml", line 793, characters 12-31
Re-raised at Dune__exe__Herd.(fun) in file "herd/herd.ml", line 817, characters 11-18
Called from Dune__exe__Herd in file "herd/herd.ml", lines 791-818, characters 4-27

@maranget
Copy link
Copy Markdown
Member Author

maranget commented Mar 6, 2026

And here the image of the failing candidate for a much simpler test;

AArch64 L061
(* branch to instruction address with .section name *)

{ 0:X0 = 1; }
  P0            ;
  B .text       ;
  ADD W0, W0, #1;
.text: NOP      ;

A.pdf

This image was produced by the compare function of EvtSetByPo that uses E.po_strict in place of is_before_strict, as this was the case before the last forced push.

@HadrienRenaud
Copy link
Copy Markdown
Collaborator

Thank you, this makes sense. I'll have a look soon

@maranget
Copy link
Copy Markdown
Member Author

maranget commented Mar 6, 2026

Commit cb2e0cd changes some of iico_data into iico_order, which may be more appropriate. Hopefully this change is of aesthetic nature. I am not sure we should do it...

@maranget
Copy link
Copy Markdown
Member Author

maranget commented Mar 6, 2026

By the way something has changed with PR #1704. Consider this ambiguous test:

AArch64 LDPSame
{
int t[2]={1,2};
0:X0=t;
}
  P0           ;
LDP W1,W1,[X0] ;
locations [0:X1;]

With current master, we have:

% herd7 LDPSame.litmus 
Not ordered stores (eeid=f action=W0:X1q=S2) and (eeid=g action=W0:X1q=S4)

Fatal: File "LDPSame.litmus" Adios
Fatal error: exception File "herd/mem.ml", line 882, characters 6-12: Assertion failed

While before PR #1704, we had:

% herd7 -version LDPSame.litmus 
7.58+1, Rev: 238a08e757f7c4063dd232294f0df835daaa403d
% herd7 LDPSame.litmus 
Warning: File "LDPSame.litmus": Ambiguous po for register 0:X1

@HadrienRenaud
Copy link
Copy Markdown
Collaborator

Thanks @maranget for the bug report. I'll have a look, hopefully I'll have a PR today.

@HadrienRenaud
Copy link
Copy Markdown
Collaborator

I've picked a few bits of this PR into #1744, and fixed some problems that you were having in this PR.
I would suggest merging #1744 first, then rebasing this one on top, because #1744 is a fix when this one is an improvement. If you want to merge #1735, or part of it, before #1744, then I can rebase #1744 on top of master.

To summarize the problems found (and fixed):

  1. There was a iico_data starting from a register write because this register write was read-from inside the same instruction. We now require Intra-Instruction Data Dependencies to start from a register read or a memory read.
  2. There was 2 register writes to the same location because we needed to initialize PC inside the instruction if it was needed. We now just get rid of the PC events during the ASL to AArch64 translation.

HadrienRenaud added a commit that referenced this pull request Mar 16, 2026
[herd,asl] Fix ambiguous PC writes

This PR cherry picks a commit from #1735 that introduces some checks on the coherence of writes, and fixes the problems highlighted from those checks.

## PC handling in ASL+herd

The aarch64 handwritten semantics in herd handle the program counter differently from the ASL code: the ASL code has an explicit PC register, whereas the handwritten semantics rely on po and BCC branching events.

We had introduced in #711 an AArch64 PC register in ASL to support the ASL code that uses it.
This implementation wrote 2 times to the PC:
1. First at the beginning of the instruction to correctly initialize the PC value
2. Secondly when the branching is committed and we actually increments the PC

This poses a few problems, mainly because the current herd algorithm does not handle multiple writes at the same register in a single instruction.

We thus remove the translation of the PC accesses, replaced here by a BCC commit event in case of a write. This leaves implicit the PC reads.

## Intra-Instruction Dependencies starting from a register write

The PC handling described above had a strange graph, where a `iico_data` was starting at a register write. This is because this register write was read-from in the same instruction.

(Precisely, the first write to PC was read from to query the current PC and then increment it, resulting in a graph like [the one](https://github.com/user-attachments/files/25799596/A.pdf) shown by Luc in [a comment](#1735 (comment)) on #1704).

We simply add guards on the Intra-Instruction Data dependencies to force them to start at a register read or at a memory read.
@maranget maranget force-pushed the opt-atomic-pairs branch 2 times, most recently from b4bca44 to 31a0360 Compare March 19, 2026 15:56
@maranget
Copy link
Copy Markdown
Member Author

maranget commented Mar 19, 2026

Hi @HadrienRenaud . At last, we may be ready. Would you review?

Copy link
Copy Markdown
Collaborator

@HadrienRenaud HadrienRenaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good, thanks @maranget

We take inspiration from the efficient computation of register
read-from (see PR #1704): atomic stores by a given thread and
to a given location are ordered according to (extended by iico)
program order and an atomic load is paired
with the closest atomic store that follows it.
Additionaly it is checked that there is no atomic load
in-between. That is, the next effect in (generalised) po
indeed is a write.
@maranget maranget merged commit ce8f8ce into master Mar 20, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants