Skip to content

Bug: recursive dups cause race conditions #21

@Lorenzobattistela

Description

@Lorenzobattistela

This issue is related to the multi threaded implementation of HVM4 runtime.

The minimal example to reproduce the issue is the following program:

@c2   = λs. !S0&C=s; λx0.S0₀(S0₁(x0))
@k2   = λs. !S0&K=s; λx0.S0₀(S0₁(x0))
@main = @c2(@k2)

The manual full reduction of the term can be found here: https://gist.github.com/Lorenzobattistela/89c6d6ff07f01402a92f2f93540732e7

The bug is caused by the following term structure that is found middle reduction:

!A&K = G₁ ;
!B&K = x0 ;
!G&C = λy0.&C{A₀, B₀}(&C{A₁, B₁}(y0)) ;
λx0.G₀

Note how the G dup refers to the A dup, which itself refers to G. So we have a recursive dup structure, where one dup needs the other to be resolved.

In the single-threaded environment, this is not a problem as shown in the manual reduction, because the dups resolve correctly with the lazy evaluation. However, in a multi threaded environment, two different threads pick up each of this DUP's (example, thread A picks up the G dup, thread B picks up the A dup that was enqueued as a task and stolen by thread B).

Initial state of the heap:

heap[G_loc] = term_loc
heap[A_loc] = DP1 to G_loc
...

So, thread A will do heap_take in the dup G location and set it to 0.

heap[G_loc] = 0 // acquired by thread A
heap[A_loc] = DP1 to G_loc
...

And then, thread B steals a task and finds the A dup, so it does heap_take on it and sets it to 0.

heap[G_loc] = 0 // acquired by thread A
heap[A_loc] = 0 // acquired by thread B
...

However, now, thread A reaches a point where it needs the A dup to continue G's reduction. So it tries to heap_take there. It finds a 0, and spin waits there.

Finally, thread B reaches a point where it needs the G value to continue A's reduction. Tries to heap_take there, and also finds 0, spin waits there.

The general problem can be stated as:

-> Thread A starts reducing a term. At some point in that term, it finds a dup, DUP_A. It enqueues other work
-> Thread B steals work from A's queue and start reducing. At some point, it finds DUP_B and starts reducing it
-> While reducing DUP_A, thread A finds a DP0 or DP1 of DUP_B. It tries to heap_take on it and receives 0 because thread B is working on it, so spin wait until someone set the value.
-> Now thread B is reducing, but it now finds a DP0 or DP1 of DUP_A, tries heap_take and receives 0, because thread A is working on it, so spin wait.

The problem was confirmed by implementing a global deadlock on DUP's, which solved the issue (but is slow and not a complete / correct solution)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions