- 
                Notifications
    You must be signed in to change notification settings 
- Fork 55
[WIP] Vectorize fusiontree manipulations #261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| ((trivialtuple..., N + 1), ())) | ||
| end | ||
|  | ||
| # TODO: is this piracy? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think this is bordering on type piracy, so we should probably just give this method a dedicated name and then simply call that one.
| I think a first general design remark is that now all the manipulations on the      trees_src = fusiontrees(fs_src)
    trees_dst = fusiontrees(fs_dst)
    indexmap = Dict(f => ind for (ind, f) in enumerate(trees_dst))If you know start composing elementary operations to make more complex ones (e.g.  It might thus be a better idea to store the blocksectors and associated fusion tree pairs (and there index mapping) within the  Of course, this way, the  Happy to discuss when you have time. | 
4e40fc3    to
    50db028      
    Compare
  
    | Codecov Report❌ Patch coverage is  Additional details and impacted files@@            Coverage Diff             @@
##           master     #261      +/-   ##
==========================================
- Coverage   82.85%   82.73%   -0.13%     
==========================================
  Files          44       45       +1     
  Lines        5757     6139     +382     
==========================================
+ Hits         4770     5079     +309     
- Misses        987     1060      +73     ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
 | 
| So I compared the performances with  TensorKit v0.14.9 | 
| As a small update here, I think I further optimized the implementations slightly, however there definitely is still some room for improvement. In particular the  I'll try and run some profilers next week to see what is now actually the bottleneck, to verify what needs further improvements. In particular, I can try and reduce some allocations by trying to re-use a bunch of the unitary matrices, or we can try and cache the  | 
| BTW does @ogauthe have a MWE of what's causing the slowdown that could be shared, perhaps to run as part of a perf regression testsuite 👀 ? | 
| using TensorKit
using TensorKit: treepermuter
Vsrc = (Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1) ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1) ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1)' ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1)' ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (1, 1) => 1)) ← (Rep[ℤ₂ × SU₂]((0, 0) => 1, (1, 1) => 1) ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1) ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1) ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1)' ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1)')
Vdst = (Rep[ℤ₂ × SU₂]((0, 0) => 1, (1, 1) => 1) ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (1, 1) => 1)') ← (Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1)' ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1) ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1)' ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1) ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1) ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1)' ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1) ⊗ Rep[ℤ₂ × SU₂]((0, 0) => 1, (0, 1) => 1, (1, 1) => 1)')
p = ((5, 6), (1, 7, 2, 8, 3, 9, 4, 10))
TensorKit.empty_globalcaches!()
@time treepermuter(Vdst, Vsrc, p);
Profile.clear()
TensorKit.empty_globalcaches!()
Profile.init(n = 10^7, delay = 0.01)
@profile treepermuter(Vdst, Vsrc, p);Here's a relevant case I took from the code at some point. It's really just a matter of taking one of the peps contractions and looking for the dominant treepermuter cost, which you can typically identify by adding  | 
| @kshyatt here is a slightly larger example. @lkdvos minimal case should appear at some point. The timing was done using this branch as of August 15th. using TensorOperations: @tensor
using TensorKit
function proj_braket(rD)
    fs = map(s -> sign(frobeniusschur(s)), sectors(rD))
    !(all(fs .== 1) || all(fs .== -1)) && return error("Cannot handle quaternionic irreps")
    rD2 = fuse(rD, rD')
    isoD2 = isomorphism(rD2, rD ⊗ rD')
    permuted_iso = flip(permute(isoD2', ((3,), (2, 1))), (1,))
    proj2_even_fullspace = isoD2 + permuted_iso
    proj2_odd_fullspace = isoD2 - permuted_iso
    proj2_even = rightnull(proj2_odd_fullspace; alg=SVD(), atol=1e-12)
    proj2_odd = rightnull(proj2_even_fullspace; alg=SVD(), atol=1e-12)
    return proj2_even, proj2_odd
end
function project_Z2(rd, rD)
    tket = randn(rd ← rD ⊗ rD ⊗ rD' ⊗ rD')
    pd_even_oddd = proj_braket(rd)
    projN = map(t -> permute(t, ((2, 3), (1,))), proj_braket(rD))
    projE = map(t -> permute(t, ((2, 3), (1,))), proj_braket(rD))
    projS = map(t -> permute(t, ((2, 3), (1,))), proj_braket(rD'))
    projW = map(t -> permute(t, ((2, 3), (1,))), proj_braket(rD'))
    t_double = permute(tket' ⊗ tket, ((5, 6), (1, 7, 2, 8, 3, 9, 4, 10)))
    for i_d in 1:2
        projectedd = permute(pd_even_oddd[i_d] * t_double, ((1, 4, 5, 6, 7, 8, 9), (2, 3)))
        for iN in 1:2
            @tensor projectedN[d2, D2N, ketW, braW, ketS, braS; ketE, braE] :=
                projectedd[d2, ketE, braE, ketS, braS, ketW, braW; ketN, braN] *
                projN[iN][ketN, braN, D2N]
            for iE in 1:2
                @tensor projectedNE[d2, D2N, D2E, ketW, braW; ketS, braS] :=
                    projectedN[d2, D2N, ketS, braS, ketW, braW; ketE, braE] *
                    projE[iE][ketE, braE, D2E]
                for iS in 1:2
                    iW = mod1(i_d + iN + iE + iS + 1, 2)
                    @tensor projected[d2, D2N, D2E, D2S, D2W] :=
                        projectedNE[d2, D2N, D2E, ketW, braW; ketS, braS] *
                        projS[iS][ketS, braS, D2S] *
                        projW[iW][ketW, braW, D2W]
                end
            end
        end
    end
    return 1
end
rd = Rep[ℤ₂ × SU₂]((0, 0) => 1, (1, 1) => 1)
rD7 = Rep[ℤ₂ × SU₂]((0, 0)=>1, (0, 1)=>1, (1, 1)=>1)
rD11 = Rep[ℤ₂ × SU₂]((0, 0) => 2, (0, 1) => 1, (1, 1) => 2)
rD16 = Rep[ℤ₂ × SU₂]((0, 0) => 2, (0, 1) => 1, (1, 1) => 2, (0, 2) => 1)
@time project_Z2(rd, rD7)
@time project_Z2(rd, rD7)
@time project_Z2(rd, rD11)
@time project_Z2(rd, rD11)
# 1st run D = 7
@time project_Z2(rd, rD7)
# 400.344116 seconds (1.39 G allocations: 465.458 GiB, 7.96% gc time, 24.47% compilation time)
# 2nd run D = 7
@time project_Z2(rd, rD7)
#5.408577 seconds (20.24 M allocations: 2.738 GiB, 5.07% gc time, 5.34% compilation time)
# 1st run D = 11  (*after D = 7*)
@time project_Z2(rd, rD11)
# 319.394331 seconds (1.36 G allocations: 482.086 GiB, 8.63% gc time, 0.00% compilation time)
# 2nd run D = 11
@time project_Z2(rd, rD11)
#  7.989900 seconds (20.06 M allocations: 4.710 GiB, 4.31% gc time) | 
Given the speed-up of the permutations, the bottleneck has now migrated to the actual computation of the treetransformers. (mostly for tensors with many legs again)
@ogauthe has reported cases where the first run takes ~2000seconds, while the second run is ~5seconds.
While I haven't really benchmarked or profiled myself yet (shame on me!), I immediately remembered that we are not really composing the fusion tree manipulations in a very optimal way, where we are effectively manually writing out the matrix multiplications on the one hand, but also the recursive implementation means that we are recomputing a lot of transformations that we would otherwise already have encountered. This never really showed up since these computations are somewhat fast and cached, but for large amount of fusiontrees these loops become prohibitively expensive.
This PR aims to address these issues in a very similar fashion as the previous optimization run: we consider the set of all fusion trees with equal uncoupled charges, and act on these as a block. If my back-of-the-envelope calculations are correct, if there are$\sim N^L$  operations due to the recursive nature, while now this is $\sim L N^3$ .
Nsuch fusion trees and we are doingLsubsteps in the manipulation, we were previously doingBefore I take the time to properly finish this up, this might be good to have a proper review about some design choices. (including, but not limited to, the names). For example, there are now a bunch of "scalar" fusion tree manipulation methods that are more or less obsolete, in the sense that they more or less only make sense in the
UniqueFusioncase. We could consider removing them, but that becomes a bit larger of a change.Additionally, I took the time to canonicalize the calling signature of many of these functions, consistently making use of
Index2Tupleetc. I also fixed a TODO where we swapped the order ofbraidarguments for fusiontrees compared to for tensors. (changed the fusiontree one, since I think this is the least invasive)Obviously this still requires more tests etc as well.