[herd] More efficient computation of atomic load X stores pairs#1735
02e15d8 to
9a98003
Compare
3676f0e to
2689dce
Compare
HadrienRenaud
left a comment
This looks good to me, thanks @maranget. There is an important point about interleaving reads in a comment that I think needs to be addressed before this is merged. Maybe a justification of why this is not a problem is enough.
2689dce to
eada4fa
Compare
For some reason CI is not running, any idea why?
The complete test |
eada4fa to
6b2ef26
Compare
In some cases, such as combining vmsa and mte, there exist (atomic) read effects that are unrelated by generalised po. As a consequence, it is not possible to build a set of effects that relies on generalised po. I have reverted to finding the closest po-after write (and not the closest effect) and added a naive check that no load intervenes. It works as long as writes (to a given location and on a given thread) are po-ordered.
399c3ed to
c3b869a
Compare
HadrienRenaud
left a comment
Thanks @maranget, this looks very good. Sorry I am coming with other big questions
Should we check that the order is total? I think it would only need a linear pass on the store set. We could probably put this in StoreSet actually:
let check_total set =
  if not (is_empty set) then
    Set.fold
      (fun e' e ->
        if e' != e (* min element *) then
          assert (is_before_strict e e');
        e')
      set (min_elt set)
    |> ignore

let of_list li =
  let set = of_list li in
  let () = check_total set in
  set
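For illustration only, here is a self-contained sketch of the same linear pass using the standard library `Set`. The `event` record, its `po_index` field and the `is_total` name are hypothetical stand-ins, not herd's actual types; the check returns a boolean instead of asserting so that it can be exercised directly.

```ocaml
(* Hypothetical event: po_index stands in for the position in
   (generalised) program order; this is not herd's event type. *)
type event = { id : int; po_index : int }

(* Strict "before" relation: events with equal po_index are unrelated. *)
let is_before_strict e1 e2 = e1.po_index < e2.po_index

(* The set itself needs a total compare, so ties are broken by id. *)
module StoreSet = Set.Make (struct
  type t = event

  let compare e1 e2 =
    match Int.compare e1.po_index e2.po_index with
    | 0 -> Int.compare e1.id e2.id
    | c -> c
end)

(* One linear pass in increasing order: every element must be strictly
   after its predecessor. Because is_before_strict is transitive, this
   is enough to conclude that all pairs are ordered. *)
let is_total set =
  let ok, _ =
    StoreSet.fold
      (fun e (ok, prev) ->
        let ok =
          match prev with
          | None -> ok
          | Some p -> ok && is_before_strict p e
        in
        (ok, Some e))
      set (true, None)
  in
  ok
```

Using an optional predecessor as the accumulator avoids the special case for the minimal element, at the cost of one allocation per step.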
herd/mem.ml
Outdated
| module | ||
| EvtSetByPo |
Question: should this belong in memUtils?
That's a possibility. I was a bit afraid of the signature, but there should be no difficulty.
herd/memUtils.ml
Outdated
| and collect_reg_loads_stores es = collect_by_loc2 es E.is_reg_load_any E.is_reg_store_any | ||
|
|
||
| let accumulate_loc_proc proc loc e = | ||
| IntMap.update proc @@ function |
nit: indentation is wrong
I think you dropped the fix on this one
First consider that two write effects performed by different instructions are ordered, provided po is a total order. In fact this sounds quite right because, if some instruction could perform two or more stores to a given location, then this instruction would be non-deterministic (unless the values stored are always identical). Hence I plan to simplify the ordering relation behind
I have followed your suggestions. I have also tried to use plain po as the order behind It does not work because of AArch64+ASL. In that setting, the initialisation of the program counter is iico-before all PC reads and writes in the instruction. This is quite surprising. |
d67f208 to
b2ceb79
Compare
I do not understand. The initialisation of the PC is a register write, why does it interfere here? Also why would ASL even call this function? |
herd/memUtils.mli
Outdated
| * to this set. | ||
| *) | ||
| module | ||
| EvtSetByPo : |
FWIW, I would have kept this in mem.ml for now. EvtSetByPo's underlying assumptions seem fairly coupled to its usage within mem.ml, as it relies on a non-total compare, which makes sense in the specific context of mem.ml but seems riskier and easier to miss if exposed as a shared utility.
Ok, I can revert this easily.
I do not understand either how a register write can be iico_data before something.... In AArch64_ASL mode, the failing StoreSet is used by |
And here the image of the failing candidate for a much simpler test; This image was produced by the |
Thank you, this makes sense. I'll have a look soon |
Commit cb2e0cd changes some of iico_data into iico_order, which may be more appropriate. Hopefully this change is of aesthetic nature. I am not sure we should do it... |
By the way something has changed with PR #1704. Consider this ambiguous test: With current master, we have: While before PR #1704, we had: |
Thanks @maranget for the bug report. I'll have a look, hopefully I'll have a PR today. |
I've picked a few bits of this PR into #1744, and fixed some problems that you were having in this PR. To summarize the problems found (and fixed):
[herd,asl] Fix ambiguous PC writes

This PR cherry-picks a commit from #1735 that introduces some checks on the coherence of writes, and fixes the problems highlighted by those checks.

## PC handling in ASL+herd

The AArch64 handwritten semantics in herd handle the program counter differently from the ASL code: the ASL code has an explicit PC register, whereas the handwritten semantics rely on po and BCC branching events. In #711 we introduced an AArch64 PC register in ASL to support the ASL code that uses it. This implementation wrote to the PC twice:

1. First, at the beginning of the instruction, to correctly initialize the PC value.
2. Second, when the branching is committed and we actually increment the PC.

This poses a few problems, mainly because the current herd algorithm does not handle multiple writes to the same register in a single instruction. We thus remove the translation of the PC accesses, replaced here by a BCC commit event in case of a write. This leaves the PC reads implicit.

## Intra-Instruction Dependencies starting from a register write

The PC handling described above produced a strange graph, where an `iico_data` started at a register write. This is because this register write was read from in the same instruction. (Precisely, the first write to PC was read from to query the current PC and then increment it, resulting in a graph like [the one](https://github.com/user-attachments/files/25799596/A.pdf) shown by Luc in [a comment](#1735 (comment)) on #1704.) We simply add guards on the Intra-Instruction Data dependencies to force them to start at a register read or at a memory read.
b4bca44 to
31a0360
Compare
Hi @HadrienRenaud . At last, we may be ready. Would you review? |
HadrienRenaud
left a comment
This looks very good, thanks @maranget
31a0360 to
d2aec4f
Compare
We take inspiration from the efficient computation of register read-from (see PR #1704): atomic stores by a given thread and to a given location are ordered according to program order (extended by iico), and an atomic load is paired with the closest atomic store that follows it. Additionally, it is checked that there is no atomic load in between. That is, the next effect in (generalised) po is indeed a write.
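The pairing rule described above can be sketched as follows. This is a minimal illustration and not the code in herd: the `eff` record, the integer `po` field standing in for generalised program order, and the function name `pair_loads_with_stores` are all assumptions. It also assumes that the effects given belong to a single thread and location and that their `po` values are pairwise distinct.

```ocaml
(* Hypothetical model of atomic effects of one thread on one location. *)
type kind = Load | Store

type eff = { name : string; kind : kind; po : int }

(* Sort by po, then pair a load with the effect that immediately
   follows it when that effect is a store. A load followed by another
   load is left unpaired: its closest store is not the next effect. *)
let pair_loads_with_stores effs =
  let sorted = List.sort (fun a b -> Int.compare a.po b.po) effs in
  let rec go acc = function
    | ({ kind = Load; _ } as l) :: ({ kind = Store; _ } as s) :: rest ->
        go ((l.name, s.name) :: acc) rest
    | _ :: rest -> go acc rest
    | [] -> List.rev acc
  in
  go [] sorted
```

After sorting, the scan is a single linear pass, so the cost is dominated by the sort rather than by examining every load and store combination.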
d2aec4f to
af44a24
Compare
We take inspiration from the efficient computation of register read-from (see PR #1704): atomic stores are ordered according to (extended by iico) program order and an atomic load is paired with the closest atomic write that follows it.
Preliminary to PR #1733.