<div align="center">
<h1>Tullio.jl</h1>
[](https://github.com/mcabbott/Tullio.jl/actions?query=workflow%3ACI)
[](https://buildkite.com/julialang/tullio-dot-jl)

But it also co-operates with various other packages, provided they are loaded before the macro call:

* It uses [`LoopVectorization.@avx`](https://github.com/chriselrod/LoopVectorization.jl) to speed many things up. (Disable with `avx=false`.) On a good day this will match the speed of OpenBLAS for matrix multiplication.
* It uses [`KernelAbstractions.@kernel`](https://github.com/JuliaGPU/KernelAbstractions.jl) to make a GPU version. (Disable with `cuda=false`.) This is somewhat experimental, and may not be fast.
* It uses [`TensorOperations.@tensor`](https://github.com/Jutho/TensorOperations.jl) on expressions which this understands. (Disable with `tensor=false`.) These must be Einstein-convention contractions of one term; none of the examples above qualify.
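
These switches are keyword options of the macro itself. A minimal sketch of how they are written (the arrays `A` and `B` here are placeholders, not from the examples above):

```julia
using Tullio

A = rand(10, 20); B = rand(20, 30);

# Each backend can be switched off independently,
# whether or not the corresponding package is loaded:
@tullio avx=false C[i,k] := A[i,j] * B[j,k]              # skip LoopVectorization
@tullio tensor=false cuda=false C2[i,k] := A[i,j] * B[j,k]  # skip the others
```
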
The macro also tries to provide a gradient for use with [Tracker](https://github.com/FluxML/Tracker.jl) or [Zygote](https://github.com/FluxML/Zygote.jl). <!-- or [ReverseDiff](https://github.com/JuliaDiff/ReverseDiff.jl). -->
(Disable with `grad=false`, or `nograd=A`.) This is done in one of two ways:
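
Either way, usage looks the same from outside. A sketch with Zygote (the function name `mul` is hypothetical):

```julia
using Tullio, Zygote

mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

A = rand(3, 4); B = rand(4, 5);
ΔA, ΔB = Zygote.gradient((A, B) -> sum(mul(A, B)), A, B)
ΔA ≈ ones(3, 5) * B'   # true: the gradient of sum(A * B) with respect to A
```
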
And `verbose=2` will print everything.
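
A sketch of this option in use (the arrays here are placeholders):

```julia
using Tullio

A = rand(3, 3); B = rand(3, 3);

# verbose=2 prints everything the macro generates for this expression,
# including the loops and the gradient definition:
@tullio verbose=2 C[i,k] := A[i,j] * B[j,k]
```
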
<details><summary><b>Notation</b></summary>
Index notation for some simple functions:

```julia
using Pkg; Pkg.add("Tullio")
using Tullio, Test
M = rand(1:20, 3, 7)

@tullio S[1,c] := M[r,c]   # sum over r ∈ 1:3, for each c ∈ 1:7
@test S == sum(M, dims=1)

@tullio Q[ρ,c] := M[ρ,c] + sqrt(S[1,c])   # loop over ρ & c, no sum -- broadcasting
@test Q ≈ M .+ sqrt.(S)

mult(M,Q) = @tullio P[x,y] := M[x,c] * Q[y,c]   # sum over c ∈ 1:7 -- matrix multiplication
@test mult(M,Q) ≈ M * transpose(Q)

R = [rand(Int8, 3, 4) for δ in 1:5]

@tullio T[j,i,δ] := R[δ][i,j] + 10im   # three nested loops -- concatenation
@test T == permutedims(cat(R...; dims=3), (2,1,3)) .+ 10im

@tullio (max) X[i] := abs2(T[j,i,δ])   # reduce using max, over j and δ
@test X == dropdims(maximum(abs2, T, dims=(1,3)), dims=(1,3))

dbl!(M, S) = @tullio M[r,c] = 2 * S[1,c]   # write into existing matrix, M .= 2 .* S
dbl!(M, S)
@test all(M[r,c] == 2*S[1,c] for r ∈ 1:3, c ∈ 1:7)
```

More complicated examples:

```julia
using Tullio
A = [abs2(i - 11) for i in 1:21]
```

</details>
<details><summary><b>Fast & slow</b></summary>

When used with LoopVectorization, on straightforward matrix multiplication of real numbers, `@tullio` tends to be about as fast as OpenBLAS, depending on the size, and on your computer.

Here's a speed comparison on mine: [v2.5](https://github.com/mcabbott/Tullio.jl/blob/master/benchmarks/02/matmul-0.2.5-Float64-1.5.0.png).

This race is a useful diagnostic, but isn't really the goal. There is little point in avoiding BLAS libraries if you want precisely what they are optimised to give you. One of the things `@tullio` is often very fast at is weird tensor contractions, for which you would otherwise need `permutedims`:

```julia
using Tullio, LoopVectorization, NNlib, BenchmarkTools

# Batched matmul, with batch index first in B:
bmm_rev(A, B) = @tullio C[i,k,b] := A[i,j,b] * B[b,k,j]   # (sum over j)

A = randn(20,30,500); B = randn(500,40,30);
bmm_rev(A, B) ≈ NNlib.batched_mul(A, permutedims(B, (3,2,1)))   # true

@btime bmm_rev($A, $B);                                    # 317.526 μs, same speed as un-permuted bmm
@btime NNlib.batched_mul($A, permutedims($B, (3,2,1)));    # 1.478 ms, with MKL
```

Another thing Tullio can be very fast at is broadcast-reductions, where it can avoid large allocations. Here LoopVectorization is speeding up `log`, and Tullio is handling tiled memory access and multi-threading:
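
A minimal sketch of such a reduction (the array names and sizes here are illustrative, not the original benchmark):

```julia
using Tullio, LoopVectorization, BenchmarkTools

X = rand(1000, 1000); Y = rand(1000, 1000);

# Complete reduction to a scalar, without first materialising X .* log.(Y')
sum_log(X, Y) = @tullio s := X[i,j] * log(Y[j,i])

sum_log(X, Y) ≈ sum(X .* log.(Y'))   # true, but the right-hand side allocates a full matrix
@btime sum_log($X, $Y);
@btime sum($X .* log.($Y'));
```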