
Conversation

heliosdrm
Collaborator

Fixes #169 using Channel.

@heliosdrm
Collaborator Author

This is an alternative to #170.

@heliosdrm heliosdrm marked this pull request as draft April 12, 2025 08:15
@heliosdrm
Collaborator Author

heliosdrm commented Apr 12, 2025

I can't tell why the tests are failing: the error happens in matrices/distance_matrix.jl, which has not been modified. It's the same error seen in #170, although the modifications made there are quite different from the ones here. I suspect that the problem might come from elsewhere. Can the current main be retested, to see if the error now happens even without any changes?

@heliosdrm heliosdrm marked this pull request as ready for review April 13, 2025 13:58
@heliosdrm
Collaborator Author

I think I have a clue about what's happening in the failing CI check. The errors are reported in test/rmatrix_analytic.jl:16 and test/windowtest.jl:8, which eventually hit this:

[2] _distancematrix(x::StateSpaceSet{2, Float64, SVector{2, Float64}}, metric::Euclidean, ::Val{false})
      @ RecurrenceAnalysis ~/work/RecurrenceAnalysis.jl/RecurrenceAnalysis.jl/src/matrices/distance_matrix.jl:103

Now, that line is (as part of a loop):

@inbounds d[i, j] = evaluate(metric, x[i], x[j])

And the innermost error is reported to happen in a call to `setindex!`, with the following message:

LoadError: MethodError: Cannot `convert` an object of type Float64 to an object of type SVector{2, Float64}

This is as if evaluate(metric, x[i], x[j]) returned a Float64 (which it does when I run the test locally), but it were assigned to an array of SVector{2, Float64} instead of an array of Float64 (or a compatible type).

Here's where I am lost, because d is defined in src/matrices/distance_matrix.jl:100 as:

d = zeros(eltype(x), length(x), length(x))

When I debug it locally, at this point eltype(x) returns Float64, i.e. d is a matrix of floats, and the test runs fine. On the other hand, eltype(typeof(x)) returns SVector{2, Float64}. This makes me think that there has been some change in the dependencies (which are not updated in my local Manifest; I didn't dig further) that makes eltype(x) return SVector{2, Float64} as well.
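For illustration, this is roughly what I see in a local session (with the older StateSpaceSets resolved in my Manifest; x here just stands in for the data used by the tests):

using StateSpaceSets
x = StateSpaceSet(rand(100, 2))
eltype(x)          # Float64 locally (old behavior)
eltype(typeof(x))  # SVector{2, Float64}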

@Datseris: are you aware of such a change? If that's the case, maybe it was considered a bug fix: normally eltype(x) returns the same as eltype(typeof(x)), which should be "the type of elements generated by iterating a collection" according to the Julia docs. So it was not marked as a breaking (major) version change, but it actually breaks this code.

@heliosdrm
Collaborator Author

Yes, I think I was on the right track. In version 2.4 of StateSpaceSets.jl:

Fixed a critical bug where eltype(ssset) would return the element type of the stored points (e.g., Float64), instead of the type of the points themselves (e.g., SVector{2, Float64}). To be consistent with the Array interface and base Julia the correct return value is the type of the points themselves. Use eltype(eltype(ssset)) to obtain the number type of the points.

But why doesn't the last change I attempted in order to test this work? In the compat section of Project.toml:

StateSpaceSets = "2.0, 2.1, 2.2, 2.3"

But in the CI report:

[40b095a5] StateSpaceSets v2.4.0

@heliosdrm
Collaborator Author

Ok, now the test is passing, but I'm still a bit concerned about the robustness of the fix. The last change is in a method with the signature:

_distancematrix(x::Array_or_SSSet, metric::Metric, ::Val{false})

where Array_or_SSSet is defined as Union{AbstractArray{<:Real}, AbstractStateSpaceSet}.

There I have changed eltype(x) to eltype(eltype(x)), as recommended in the changelog of StateSpaceSets.jl. It turns out that this still works when the input is an array, although in that case it should be eltype(x), because eltype(T<:Real) == T. So, in the end there is no need to change anything else, but I feel that this is just by chance, and it might make future unforeseen errors difficult to debug.
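For instance, a quick check with both kinds of input (just a sketch, assuming StateSpaceSets ≥ 2.4; A and S are illustrative names):

using StateSpaceSets
A = rand(100, 2)       # AbstractArray{<:Real}
S = StateSpaceSet(A)   # AbstractStateSpaceSet
eltype(eltype(A))      # Float64, because eltype(Float64) == Float64
eltype(eltype(S))      # Float64, via eltype(SVector{2, Float64})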

@heliosdrm
Collaborator Author

Also, I have not checked the impact of these changes on performance, especially in the threaded vs. "normal" versions, which in #170 suffered a regression according to @asinghvi17.

@Datseris
Member

Datseris commented Apr 13, 2025

There I have changed eltype(x) to eltype(eltype(x)), as recommended in the changelog of StateSpaceSets.jl. It turns out that this still works when the input is an array, although in that case it should be eltype(x), because eltype(T<:Real) == T. So, in the end there is no need to change anything else, but I feel that this is just by chance, and it might make future unforeseen errors difficult to debug.

I am sorry it took a while to answer. Yes, you have found the problem correctly: I have fixed the eltype(ssset) behavior in the StateSpaceSets.jl package. This change has disrupted the JuliaDynamics ecosystem a bit, and I am slowly putting out the fires...

Regarding the concern you state in the message above: I don't think you should be worried about it. Since eltype is part of Julia's core API, there is no way it can break in the future; eltype(T) == T for T <: Real will always be the case. So you are completely fine writing eltype(eltype(matrix_or_ssset)) throughout the codebase. It will always give you the numeric type. This is robust.
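For example:

eltype(Float64) === Float64                # for number types, eltype is the type itself
eltype(eltype(rand(3, 3))) === Float64     # so the double eltype is harmless for plain arrays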

@asinghvi17 do you mind reviewing here and discussing the differences between this PR and yours? I have only seen your own comments on your PR saying that the PR's threaded version is actually drastically slower (and hence something must not be right?).

@Datseris
Member

@heliosdrm do you have any performance problems with this PR? Is the threaded version faster, as it should be?

@heliosdrm
Collaborator Author

@heliosdrm do you have any performance problems with this PR? Is the threaded version faster, as it should be?

I haven't checked yet. This weekend I only had access to a 10+ year-old computer, and I think that benchmarks made with it would not be representative of "real world" performance. I'll try today or in the coming days. Would benchmarks/parallel_vs_serial.jl be the right script to run?

@Datseris
Member

Yes. But you could also just run the same function after starting Julia with 1 thread and with the maximum number of threads.
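For example, something like this (just a sketch; use whatever call you actually want to time, I'm taking distancematrix as an example):

# start Julia once with `julia -t 1 --project` and once with `julia -t auto --project`
using BenchmarkTools, RecurrenceAnalysis
x = rand(2000)
@btime distancematrix($x)   # compare the timing between the two sessions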

@heliosdrm
Collaborator Author

I tried with the script, although it is very outdated and needed a bit of a hack: using an old version of DynamicalSystemsBase to create the data (that should not be relevant for what we are comparing here). The results are not great: the parallel version is only a bit faster once garbage collection kicks in (N > 1000), with a very narrow improvement (between 6% and 20% of the time).

This is the output:

I am using 20 threads.

For N = 101...
Lower triangle only
Speedups:
    3D=0.93
    1D=0.535
Raw timings:
    3D serial:   134.600 μs (0.00% GC)
    3D parallel: 144.800 μs (0.00% GC)
    1D serial:   84.600 μs (0.00% GC)
    1D parallel: 158.000 μs (0.00% GC)

Full matrix
Speedups:
    3D=0.375
    1D=0.342
Raw timings:
    3D serial:   113.200 μs (0.00% GC)
    3D parallel: 301.500 μs (0.00% GC)
    1D serial:   106.000 μs (0.00% GC)
    1D parallel: 310.000 μs (0.00% GC)


For N = 317...
Lower triangle only
Speedups:
    3D=0.766
    1D=0.675
Raw timings:
    3D serial:   874.050 μs (0.00% GC)
    3D parallel: 1.140 ms (0.00% GC)
    1D serial:   614.000 μs (0.00% GC)
    1D parallel: 910.150 μs (0.00% GC)

Full matrix
Speedups:
    3D=0.647
    1D=0.624
Raw timings:
    3D serial:   1.550 ms (0.00% GC)
    3D parallel: 2.394 ms (0.00% GC)
    1D serial:   1.306 ms (0.00% GC)
    1D parallel: 2.091 ms (0.00% GC)


For N = 1001...
Lower triangle only
Speedups:
    3D=1.08
    1D=1.1
Raw timings:
    3D serial:   5.830 ms (0.00% GC)
    3D parallel: 5.385 ms (0.00% GC)
    1D serial:   5.143 ms (0.00% GC)
    1D parallel: 4.691 ms (0.00% GC)

Full matrix
Speedups:
    3D=0.957
    1D=0.985
Raw timings:
    3D serial:   8.471 ms (0.00% GC)
    3D parallel: 8.850 ms (0.00% GC)
    1D serial:   7.541 ms (0.00% GC)
    1D parallel: 7.659 ms (0.00% GC)


For N = 3163...
Lower triangle only
Speedups:
    3D=1.13
    1D=1.06
Raw timings:
    3D serial:   47.096 ms (9.61% GC)
    3D parallel: 41.819 ms (13.70% GC)
    1D serial:   51.045 ms (9.41% GC)
    1D parallel: 48.038 ms (13.58% GC)

Full matrix
Speedups:
    3D=1.14
    1D=1.14
Raw timings:
    3D serial:   70.604 ms (6.47% GC)
    3D parallel: 62.076 ms (10.80% GC)
    1D serial:   74.813 ms (7.50% GC)
    1D parallel: 65.794 ms (11.59% GC)


For N = 10001...
Lower triangle only
Speedups:
    3D=1.2
    1D=1.17
Raw timings:
    3D serial:   510.174 ms (12.84% GC)
    3D parallel: 426.830 ms (8.93% GC)
    1D serial:   615.997 ms (13.96% GC)
    1D parallel: 526.019 ms (8.13% GC)

Full matrix
Speedups:
    3D=1.07
    1D=1.13
Raw timings:
    3D serial:   741.330 ms (10.13% GC)
    3D parallel: 695.768 ms (16.54% GC)
    1D serial:   945.038 ms (14.17% GC)
    1D parallel: 832.708 ms (19.12% GC)


For N = 31624...
Lower triangle only
Speedups:
    3D=1.1
    1D=1.06
Raw timings:
    3D serial:   6.428 s (10.65% GC)
    3D parallel: 5.851 s (12.02% GC)
    1D serial:   7.147 s (9.71% GC)
    1D parallel: 6.761 s (13.14% GC)

Full matrix
Speedups:
    3D=1.17
    1D=1.09
Raw timings:
    3D serial:   10.529 s (10.78% GC)
    3D parallel: 9.037 s (12.66% GC)
    1D serial:   10.801 s (9.00% GC)
    1D parallel: 9.919 s (13.36% GC)

Julia Version 1.10.8
Commit 4c16ff44be (2025-01-22 10:06 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, alderlake)
Threads: 20 default, 0 interactive, 10 GC (on 20 virtual cores)
Environment:
JULIA_DEPOT_PATH = D:\.julia
JULIA_EDITOR = code

@Datseris
Member

Oh... For reference, if you run the same script in the master branch, what do you get?

@heliosdrm
Collaborator Author

In main (applying only the eltype fix, to make it work with the same dependencies) the results are better, although still of the same order of magnitude. I expected that, because this PR just does the same as the current code, but with more operations (taking vectors from the channel and putting them back). In my opinion this can be an acceptable quick fix for #169, at the expense of a small slowdown, while @asinghvi17 sorts out whether the alternative approach based on tasks can be adjusted to improve performance.

This is the output in main (same machine and dependencies as before):

For N = 101...
Lower triangle only
Speedups:
    3D=0.565
    1D=0.556
Raw timings:
    3D serial:   92.000 μs (0.00% GC)
    3D parallel: 162.850 μs (0.00% GC)
    1D serial:   85.300 μs (0.00% GC)
    1D parallel: 153.400 μs (0.00% GC)

Full matrix
Speedups:
    3D=0.538
    1D=0.532
Raw timings:
    3D serial:   120.800 μs (0.00% GC)
    3D parallel: 224.400 μs (0.00% GC)
    1D serial:   117.300 μs (0.00% GC)
    1D parallel: 220.350 μs (0.00% GC)


For N = 317...
Lower triangle only
Speedups:
    3D=0.799
    1D=0.819
Raw timings:
    3D serial:   505.000 μs (0.00% GC)
    3D parallel: 631.700 μs (0.00% GC)
    1D serial:   389.100 μs (0.00% GC)
    1D parallel: 475.300 μs (0.00% GC)

Full matrix
Speedups:
    3D=0.942
    1D=0.939
Raw timings:
    3D serial:   749.100 μs (0.00% GC)
    3D parallel: 795.500 μs (0.00% GC)
    1D serial:   656.400 μs (0.00% GC)
    1D parallel: 699.100 μs (0.00% GC)


For N = 1001...
Lower triangle only
Speedups:
    3D=1.28
    1D=1.28
Raw timings:
    3D serial:   3.512 ms (0.00% GC)
    3D parallel: 2.749 ms (0.00% GC)
    1D serial:   3.816 ms (0.00% GC)
    1D parallel: 2.989 ms (0.00% GC)

Full matrix
Speedups:
    3D=1.42
    1D=1.34
Raw timings:
    3D serial:   6.530 ms (0.00% GC)
    3D parallel: 4.593 ms (0.00% GC)
    1D serial:   7.183 ms (0.00% GC)
    1D parallel: 5.367 ms (0.00% GC)


For N = 3163...
Lower triangle only
Speedups:
    3D=1.34
    1D=1.11
Raw timings:
    3D serial:   46.260 ms (9.49% GC)
    3D parallel: 34.615 ms (0.00% GC)
    1D serial:   48.628 ms (10.33% GC)
    1D parallel: 43.894 ms (0.00% GC)

Full matrix
Speedups:
    3D=1.43
    1D=1.53
Raw timings:
    3D serial:   70.700 ms (5.72% GC)
    3D parallel: 49.296 ms (0.00% GC)
    1D serial:   76.450 ms (4.78% GC)
    1D parallel: 50.077 ms (0.00% GC)


For N = 10001...
Lower triangle only
Speedups:
    3D=1.18
    1D=1.32
Raw timings:
    3D serial:   504.686 ms (9.21% GC)
    3D parallel: 427.905 ms (9.25% GC)
    1D serial:   606.391 ms (5.74% GC)
    1D parallel: 459.401 ms (8.10% GC)

Full matrix
Speedups:
    3D=1.43
    1D=1.39
Raw timings:
    3D serial:   732.634 ms (4.74% GC)
    3D parallel: 512.699 ms (16.41% GC)
    1D serial:   921.632 ms (8.01% GC)
    1D parallel: 660.974 ms (15.84% GC)


For N = 31624...
Lower triangle only
Speedups:
    3D=1.27
    1D=1.32
Raw timings:
    3D serial:   6.335 s (8.06% GC)
    3D parallel: 4.983 s (16.14% GC)
    1D serial:   7.568 s (11.04% GC)
    1D parallel: 5.753 s (12.96% GC)

Full matrix
Speedups:
    3D=1.59
    1D=1.38
Raw timings:
    3D serial:   9.898 s (7.56% GC)
    3D parallel: 6.236 s (14.83% GC)
    1D serial:   10.382 s (8.19% GC)
    1D parallel: 7.535 s (14.34% GC)

@Datseris
Member

Let's wait for @asinghvi17's input first.

@heliosdrm
Collaborator Author

The last commit makes the code look much more like it was before: I have used Channels just to replace the call to Threads.threadid, thus ensuring that no thread uses the same vectors as any other running at the same time. With this change, the benchmark also gives results more similar to the main branch.
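The pattern is roughly the following (a minimal sketch of the idea with made-up names, not the actual code of the PR):

using Base.Threads
# Pre-allocated scratch vectors live in a Channel; each threaded iteration takes one,
# uses it, and puts it back, so concurrent iterations never share a buffer
# (instead of indexing a buffer array by threadid()).
function pairwise_dist(x::Vector{Vector{Float64}})
    n = length(x)
    d = zeros(n, n)
    pool = Channel{Vector{Float64}}(nthreads())
    for _ in 1:nthreads()
        put!(pool, zeros(length(first(x))))
    end
    @threads for j in 1:n
        buf = take!(pool)            # exclusive use while this iteration runs
        for i in 1:n
            buf .= x[i] .- x[j]
            d[i, j] = sqrt(sum(abs2, buf))
        end
        put!(pool, buf)              # return the buffer to the pool
    end
    return d
end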

Although, to be fair, after re-running the benchmark script several times I see that the timings are quite variable, and the regression between the two versions that I reported before is within the range of that variability.

@Datseris
Member

@heliosdrm thanks for doing this. So if I understand correctly, now, with also using reduce, this PR has the same performance as the current main branch? And we can go ahead, merge it, and release a new version (patch release)?

@Datseris
Member

@asinghvi17 perhaps you also want to have a final look?

@heliosdrm
Collaborator Author

@heliosdrm thanks for doing this. So if I understand correctly, now, with also using reduce, this PR has the same performance as the current main branch? And we can go ahead, merge it, and release a new version (patch release)?

That is how I see it, yes.

Member

@asinghvi17 left a comment


Looks good to me, thanks @heliosdrm!

@heliosdrm
Collaborator Author

@Datseris : ready to merge?

@Datseris
Member

Yes, sorry for the delay, I am a bit swamped! Please give me a couple of days and I will merge this in!

@Datseris Datseris merged commit 318f0aa into main May 19, 2025
2 checks passed
@Datseris Datseris deleted the fix_threads branch May 19, 2025 08:09
Successfully merging this pull request may close these issues.

Likely erroneous use of Threads.nthreads and Threads.threadid