
codegen: crash / incorrect answer with 128 threads #36

@jzxia

Description

The code generated by src/codegen/broutine.jl crashes or produces wrong answers when Julia runs with 128 threads. So far, neither error has occurred with 64 or fewer threads.

I ran the tests on a computer running Ubuntu 20.04.2 LTS with the following hardware topology:

julia> using Hwloc

julia> topology_info()
Machine: 1 (503.78 GB)
 Package: 2 (251.81 GB)
  Group: 8 (62.87 GB)
   NUMANode: 8 (62.87 GB)
    L3Cache: 32 (16.0 MB)
     L2Cache: 128 (512.0 kB)
      L1Cache: 128 (32.0 kB)
       Core: 128
        PU: 256

I used the following code (or a slight variation of it) for the test:

using Test
using BenchmarkTools
using LinearAlgebra
using BQCESubroutine
using YaoLocations

Threads.nthreads()  # 128 in the failing runs

@testset "N=$N" for N in [15, 20]
    st = rand(Float64, 1 << N)
    loc = 1                      # also tested with loc = N
    locs = BQCESubroutine.Locations(loc)
    # Val(:X) path vs. explicit 2x2 matrix path; both should apply Pauli X
    st0 = broutine!(copy(st), Val(:X), locs)
    st1 = broutine!(copy(st), [0 1; 1 0], locs)
    println("|err| = ", norm(st0 - st1))
    @test st0 ≈ st1
end;
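To cross-check either code path independently of BQCESubroutine, a plain serial kernel can serve as ground truth. The sketch below is hypothetical (`apply_1q_serial!` and `dense_1q` are not part of the package) and assumes that `loc = 1` addresses the least significant bit of the amplitude index, as in Yao's little-endian convention. It is validated against the dense operator built with `kron`.

```julia
using LinearAlgebra

# Hypothetical serial reference kernel (not part of BQCESubroutine).
# Applies the 2x2 matrix U to qubit `loc` of the state vector `st`,
# ASSUMING loc = 1 is the least significant bit of the (0-based) index.
function apply_1q_serial!(st::AbstractVector, U::AbstractMatrix, loc::Int)
    step = 1 << (loc - 1)                   # index distance between paired amplitudes
    for base in 0:(2 * step):(length(st) - 1)
        for k in 0:(step - 1)
            i0 = base + k + 1               # 1-based index whose bit (loc - 1) is 0
            i1 = i0 + step                  # partner index whose bit (loc - 1) is 1
            a, b = st[i0], st[i1]
            st[i0] = U[1, 1] * a + U[1, 2] * b
            st[i1] = U[2, 1] * a + U[2, 2] * b
        end
    end
    return st
end

# Dense reference: with Julia's kron the last factor varies fastest,
# so U sits between identities of size 2^(N - loc) and 2^(loc - 1).
dense_1q(U, N, loc) = kron(Matrix(I, 1 << (N - loc), 1 << (N - loc)), U,
                           Matrix(I, 1 << (loc - 1), 1 << (loc - 1)))

N, loc = 10, 3
st = rand(Float64, 1 << N)
X = [0 1; 1 0]
@assert apply_1q_serial!(copy(st), X, loc) ≈ dense_1q(X, N, loc) * st
```

Because this kernel uses no threads, a mismatch between it and a 128-thread result would point at the threaded code path rather than at the kernel arithmetic.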

I ran the test for the following cases:

  • loc=1, old codegen using Threads.@threads
  • loc=N, old codegen using Threads.@threads
  • loc=N, new codegen using @batch from Polyester
  • loc=N, new codegen using Threads.@threads

Here, "old codegen" means that the following lines of src/codegen/broutine.jl are commented out, so that bsubspace is used; "new codegen" means that these lines are kept, so that threaded_subspace_loop_2x2_nontrivial is called.

if n == 1
    push!(ret.args, threaded_subspace_loop_2x2_nontrivial(f_kernel, ctx, brt))
    return ret
end
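For comparison, here is what a race-free threaded 2x2 loop can look like. This is a hypothetical sketch using `Threads.@threads`, not the code that `threaded_subspace_loop_2x2_nontrivial` generates. Each iteration `j` owns exactly one amplitude pair, obtained by inserting a 0 bit at position `loc - 1` of `j`, so no two tasks read or write the same element regardless of the thread count. The same least-significant-bit convention for `loc` is assumed.

```julia
using Base.Threads
using LinearAlgebra

# Hypothetical threaded kernel (NOT BQCESubroutine's generated code).
# Iteration j handles exactly one amplitude pair, so tasks never overlap.
# ASSUMES loc = 1 is the least significant bit of the (0-based) index.
function apply_1q_threaded!(st::AbstractVector, U::AbstractMatrix, loc::Int)
    npairs = length(st) >> 1
    step = 1 << (loc - 1)
    mask = step - 1
    @threads for j in 0:(npairs - 1)
        # Insert a 0 bit at position loc - 1 of j: lower index of the pair.
        i0 = ((j >> (loc - 1)) << loc) | (j & mask)
        i1 = i0 | step                      # partner index with that bit set
        a, b = st[i0 + 1], st[i1 + 1]       # no @inbounds: bad indices throw
        st[i0 + 1] = U[1, 1] * a + U[1, 2] * b
        st[i1 + 1] = U[2, 1] * a + U[2, 2] * b
    end
    return st
end

# Check against the dense operator (kron: last factor varies fastest).
N, loc = 10, 5
st = rand(Float64, 1 << N)
X = [0 1; 1 0]
dense = kron(Matrix(I, 1 << (N - loc), 1 << (N - loc)), X,
             Matrix(I, 1 << (loc - 1), 1 << (loc - 1)))
@assert apply_1q_threaded!(copy(st), X, loc) ≈ dense * st
```

Omitting `@inbounds` costs some speed, but an out-of-range index then raises a `BoundsError` instead of reading arbitrary memory.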

The results are listed below. The errors occur in roughly one third of all trials, and none have appeared so far with 64 or fewer threads.
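Because the failures are intermittent, a repeated-trial harness gives a more reliable signal than a single run. The sketch below is hypothetical: `failure_rate` takes a candidate function `f` and a reference `g`, feeds both a copy of the same random vector, and returns the fraction of trials whose results are not approximately equal. A segmentation fault still aborts the whole Julia process, so this only measures the incorrect-answer case.

```julia
# Hypothetical stress harness: fraction of trials in which the candidate `f`
# and the reference `g` disagree on the same random input of length 2^N.
function failure_rate(f, g, N::Int; ntrials::Int = 100)
    nfail = 0
    for _ in 1:ntrials
        st = rand(Float64, 1 << N)
        # Each function gets its own copy, since in-place kernels mutate it.
        isapprox(f(copy(st)), g(copy(st))) || (nfail += 1)
    end
    return nfail / ntrials
end

# Sanity checks with ordinary Base functions:
failure_rate(reverse!, reverse, 8)          # → 0.0 (identical results)
failure_rate(s -> s .+ 1, identity, 8)      # → 1.0 (always different)
```

With the reproducer above, `f` and `g` would be, for example, `s -> broutine!(s, Val(:X), locs)` and `s -> broutine!(s, [0 1; 1 0], locs)`.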

  • (crash) loc=1, old codegen using Threads.@threads
(base) visitor@delta106:~/julia_xjz/BQCESubroutine.jl$ julia --project=@.
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.2 (2021-07-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Test

julia> using BenchmarkTools

julia> using LinearAlgebra

julia> using BQCESubroutine

julia> using YaoLocations

julia> using BQCESubroutine: threaded_basic_broutine!

julia> @testset "N=$N" for N in [15, 20]
           #@testset "i=$i" for i in 1:N
           #for i in 1:N
           #for j in 1:1000
           for i in 1:1
               st = rand(Float64, 1<<N);
               locs = BQCESubroutine.Locations(i);
               st0 = broutine!(copy(st), Val(:X), locs);
               st1 = broutine!(copy(st), [0 1; 1 0], locs);
               println("|err| = ", norm(st0-st1))
               @test st0 ≈ st1
           end
           #end
       end;

signal (11): Segmentation fault
in expression starting at REPL[7]:1
unsafe_load at ./pointer.jl:105 [inlined]
unsafe_load at ./pointer.jl:105 [inlined]
macro expansion at /home/visitor/.julia/packages/StrideArraysCore/skpQT/src/ptr_array.jl:177 [inlined]
pload at /home/visitor/.julia/packages/StrideArraysCore/skpQT/src/ptr_array.jl:177 [inlined]
getindex at /home/visitor/.julia/packages/StrideArraysCore/skpQT/src/ptr_array.jl:331 [inlined]
macro expansion at /home/visitor/julia_xjz/BQCESubroutine.jl/src/codegen/broutine.jl:315 [inlined]
#90 at /home/visitor/.julia/packages/Polyester/7cr0U/src/closure.jl:223 [inlined]
BatchClosure at /home/visitor/.julia/packages/Polyester/7cr0U/src/batch.jl:8
unknown function (ip: 0x7f0c580868f0)
_call at /home/visitor/.julia/packages/ThreadingUtilities/IkkvN/src/threadtasks.jl:11 [inlined]
ThreadTask at /home/visitor/.julia/packages/ThreadingUtilities/IkkvN/src/threadtasks.jl:29
unknown function (ip: 0x7f0c5808d9cc)

signal (11): Segmentation fault
in expression starting at REPL[7]:1
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
unsafe_load at ./pointer.jl:105 [inlined]
unsafe_load at ./pointer.jl:105 [inlined]
macro expansion at /home/visitor/.julia/packages/StrideArraysCore/skpQT/src/ptr_array.jl:177 [inlined]
pload at /home/visitor/.julia/packages/StrideArraysCore/skpQT/src/ptr_array.jl:177 [inlined]
getindex at /home/visitor/.julia/packages/StrideArraysCore/skpQT/src/ptr_array.jl:331 [inlined]
macro expansion at /home/visitor/julia_xjz/BQCESubroutine.jl/src/codegen/broutine.jl:315 [inlined]
#90 at /home/visitor/.julia/packages/Polyester/7cr0U/src/closure.jl:223 [inlined]
BatchClosure at /home/visitor/.julia/packages/Polyester/7cr0U/src/batch.jl:8
unknown function (ip: 0x7f0c580868f0)
_call at /home/visitor/.julia/packages/ThreadingUtilities/IkkvN/src/threadtasks.jl:11 [inlined]
ThreadTask at /home/visitor/.julia/packages/ThreadingUtilities/IkkvN/src/threadtasks.jl:29
unknown function (ip: 0x7f0c5808d9cc)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:839
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:839
unknown function (ip: (nil))
Allocations: 10465430 (Pool: 10461946; Big: 3484); GC: 10
unknown function (ip: (nil))
Allocations: 10465430 (Pool: 10461946; Big: 3484); GC: 10
Segmentation fault (core dumped)
  • (incorrect answer) loc=1, old codegen using Threads.@threads
(base) visitor@delta106:~/julia_xjz/BQCESubroutine.jl$ julia --project=@.

julia> using Test

julia> using BenchmarkTools

julia> using LinearAlgebra

julia> using BQCESubroutine

julia> using YaoLocations

julia> using BQCESubroutine: threaded_basic_broutine!

julia> Threads.nthreads()
128

julia> @testset "N=$N" for N in [15, 20]
           #@testset "i=$i" for i in 1:N
           #for i in 1:N
           #for j in 1:1000
           for i in 1:1
               st = rand(Float64, 1<<N);
               locs = BQCESubroutine.Locations(i);
               st0 = broutine!(copy(st), Val(:X), locs);
               st1 = broutine!(copy(st), [0 1; 1 0], locs);
               println("|err| = ", norm(st0-st1))
               @test st0 ≈ st1
           end
           #end
       end;
|err| = 9.591304821338616
N=15: Test Failed at REPL[8]:11
  Expression: st0 ≈ st1
   Evaluated: [0.9892420967764597, 0.26037900123707414, 0.614994982713237, 0.20759717205479533, 0.3126703177619974, 0.18078785290089638, 0.7422001386059047, 0.7726755538057188, 0.3277775066108153, 0.5181144753668747  …  0.5338442075110978, 0.8575211492346384, 0.9954840790239925, 0.6424407507078336, 0.7940770595462205, 0.053890792175115054, 0.9595014083141846, 0.8423338613101816, 0.5532812445454995, 0.42973496521957366] ≈ [0.9892420967764597, 0.26037900123707414, 0.614994982713237, 0.20759717205479533, 0.3126703177619974, 0.18078785290089638, 0.7422001386059047, 0.7726755538057188, 0.3277775066108153, 0.5181144753668747  …  0.5338442075110978, 0.8575211492346384, 0.9954840790239925, 0.6424407507078336, 0.7940770595462205, 0.053890792175115054, 0.9595014083141846, 0.8423338613101816, 0.5532812445454995, 0.42973496521957366]
Stacktrace:
  [1] macro expansion
    @ ./REPL[8]:11 [inlined]
  [2] top-level scope
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1226 [inlined]
  [3] top-level scope
    @ ./REPL[8]:0
  [4] eval
    @ ./boot.jl:360 [inlined]
  [5] eval_user_input(ast::Any, backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:139
  [6] repl_backend_loop(backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:200
  [7] start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:185
  [8] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:317
  [9] run_repl(repl::REPL.AbstractREPL, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:305
 [10] (::Base.var"#874#876"{Bool, Bool, Bool})(REPL::Module)
    @ Base ./client.jl:387
 [11] #invokelatest#2
    @ ./essentials.jl:708 [inlined]
 [12] invokelatest
    @ ./essentials.jl:706 [inlined]
 [13] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
    @ Base ./client.jl:372
 [14] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:302
 [15] _start()
    @ Base ./client.jl:485
Test Summary: | Fail  Total
N=15          |    1      1
Test Summary: | Fail  Total
N=15          |    1      1
ERROR: Some tests did not pass: 0 passed, 1 failed, 0 errored, 0 broken.

caused by: Some tests did not pass: 0 passed, 1 failed, 0 errored, 0 broken.

julia>
  • (crash) loc=N, old codegen using Threads.@threads
...
signal (11): Segmentation fault
in expression starting at REPL[8]:1
unsafe_load at ./pointer.jl:105 [inlined]
unsafe_load at ./pointer.jl:105 [inlined]
...
  • (crash) loc=N, new codegen using Threads.@threads
    ditto

  • (crash) loc=N, new codegen using @batch from Polyester
    ditto

  • (incorrect answer) loc=N, new codegen using @batch from Polyester

(base) visitor@delta106:~/julia_xjz/BQCESubroutine.jl$ julia --project=@.

julia> using BQCESubroutine
[ Info: Precompiling BQCESubroutine [29e2bfda-5ba7-471c-9125-afac425f1f80]

julia> using Test

julia> using BenchmarkTools

julia> using LinearAlgebra

julia> using BQCESubroutine

julia> using YaoLocations

julia> Threads.nthreads()
128

julia> @testset "N=$N" for N in [15, 20]
               st = rand(Float64, 1<<N);
               locs = BQCESubroutine.Locations(N);
               st0 = broutine!(copy(st), Val(:X), locs);
               st1 = broutine!(copy(st), [0 1; 1 0], locs);
               println("|err| = ", norm(st0-st1))
               @test st0 ≈ st1
       end;
threaded_subspace_loop_2x2_nontrivial
|err| = 16.047140931650585
N=15: Test Failed at REPL[8]:7
  Expression: st0 ≈ st1
   Evaluated: [0.5331964447622937, 0.6840490894483715, 0.2992315961195635, 0.2788357425851684, 0.8245955857174441, 0.34661593647558275, 0.13788131297975648, 0.4132599933839103, 0.10438664295039812, 0.6052680657151797  …  0.01005720357114237, 0.40938335588275665, 0.13120408445874276, 0.21412778340666128, 0.23683502279509216, 0.4887433118091513, 0.43142024877557206, 0.4821280787877209, 0.5761057194395589, 0.7531886577130373] ≈ [0.5331964447622937, 0.6840490894483715, 0.2992315961195635, 0.2788357425851684, 0.8245955857174441, 0.34661593647558275, 0.13788131297975648, 0.4132599933839103, 0.10438664295039812, 0.6052680657151797  …  0.01005720357114237, 0.40938335588275665, 0.13120408445874276, 0.21412778340666128, 0.23683502279509216, 0.4887433118091513, 0.43142024877557206, 0.4821280787877209, 0.5761057194395589, 0.7531886577130373]
Stacktrace:
  [1] macro expansion
    @ ./REPL[8]:7 [inlined]
  [2] top-level scope
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1226 [inlined]
  [3] top-level scope
    @ ./REPL[8]:0
  [4] eval
    @ ./boot.jl:360 [inlined]
  [5] eval_user_input(ast::Any, backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:139
  [6] repl_backend_loop(backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:200
  [7] start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:185
  [8] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:317
  [9] run_repl(repl::REPL.AbstractREPL, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:305
 [10] (::Base.var"#874#876"{Bool, Bool, Bool})(REPL::Module)
    @ Base ./client.jl:387
 [11] #invokelatest#2
    @ ./essentials.jl:708 [inlined]
 [12] invokelatest
    @ ./essentials.jl:706 [inlined]
 [13] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
    @ Base ./client.jl:372
 [14] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:302
 [15] _start()
    @ Base ./client.jl:485
Test Summary: | Fail  Total
N=15          |    1      1
Test Summary: | Fail  Total
N=15          |    1      1
ERROR: Some tests did not pass: 0 passed, 1 failed, 0 errored, 0 broken.

caused by: Some tests did not pass: 0 passed, 1 failed, 0 errored, 0 broken.

julia>
