
Commit cf5e2aa ("readme") by Michael Abbott
1 parent: a89869e

File tree: 1 file changed, +66 -25 lines

README.md

@@ -1,8 +1,8 @@
<div align="center">
<h1>Tullio.jl</h1>

[![GitHub CI](https://img.shields.io/github/workflow/status/mcabbott/Tullio.jl/CI?logo=github)](https://github.com/mcabbott/Tullio.jl/actions?query=workflow%3ACI)
[![Buildkite GPU CI](https://img.shields.io/buildkite/7f7fec35c774174a59cf616fc6e1711c70e94c088248088758?color=eee&label=gpu&logo=nvidia)](https://buildkite.com/julialang/tullio-dot-jl)
[![Tag Version](https://img.shields.io/github/v/tag/mcabbott/Tullio.jl?color=red&logo=data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4KPHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB3aWR0aD0iMzI1cHQiIGhlaWdodD0iMzAwcHQiIHZpZXdCb3g9IjAgMCAzMjUgMzAwIiB2ZXJzaW9uPSIxLjEiPgo8ZyBpZD0ic3VyZmFjZTkxIj4KPHBhdGggc3R5bGU9IiBzdHJva2U6bm9uZTtmaWxsLXJ1bGU6bm9uemVybztmaWxsOnJnYig3OS42JSwyMy41JSwyMCUpO2ZpbGwtb3BhY2l0eToxOyIgZD0iTSAxNTAuODk4NDM4IDIyNSBDIDE1MC44OTg0MzggMjY2LjQyMTg3NSAxMTcuMzIwMzEyIDMwMCA3NS44OTg0MzggMzAwIEMgMzQuNDc2NTYyIDMwMCAwLjg5ODQzOCAyNjYuNDIxODc1IDAuODk4NDM4IDIyNSBDIDAuODk4NDM4IDE4My41NzgxMjUgMzQuNDc2NTYyIDE1MCA3NS44OTg0MzggMTUwIEMgMTE3LjMyMDMxMiAxNTAgMTUwLjg5ODQzOCAxODMuNTc4MTI1IDE1MC44OTg0MzggMjI1ICIvPgo8cGF0aCBzdHlsZT0iIHN0cm9rZTpub25lO2ZpbGwtcnVsZTpub256ZXJvO2ZpbGw6cmdiKDIyJSw1OS42JSwxNC45JSk7ZmlsbC1vcGFjaXR5OjE7IiBkPSJNIDIzNy41IDc1IEMgMjM3LjUgMTE2LjQyMTg3NSAyMDMuOTIxODc1IDE1MCAxNjIuNSAxNTAgQyAxMjEuMDc4MTI1IDE1MCA4Ny41IDExNi40MjE4NzUgODcuNSA3NSBDIDg3LjUgMzMuNTc4MTI1IDEyMS4wNzgxMjUgMCAxNjIuNSAwIEMgMjAzLjkyMTg3NSAwIDIzNy41IDMzLjU3ODEyNSAyMzcuNSA3NSAiLz4KPHBhdGggc3R5bGU9IiBzdHJva2U6bm9uZTtmaWxsLXJ1bGU6bm9uemVybztmaWxsOnJnYig1OC40JSwzNC41JSw2OS44JSk7ZmlsbC1vcGFjaXR5OjE7IiBkPSJNIDMyNC4xMDE1NjIgMjI1IEMgMzI0LjEwMTU2MiAyNjYuNDIxODc1IDI5MC41MjM0MzggMzAwIDI0OS4xMDE1NjIgMzAwIEMgMjA3LjY3OTY4OCAzMDAgMTc0LjEwMTU2MiAyNjYuNDIxODc1IDE3NC4xMDE1NjIgMjI1IEMgMTc0LjEwMTU2MiAxODMuNTc4MTI1IDIwNy42Nzk2ODggMTUwIDI0OS4xMDE1NjIgMTUwIEMgMjkwLjUyMzQzOCAxNTAgMzI0LjEwMTU2MiAxODMuNTc4MTI1IDMyNC4xMDE1NjIgMjI1ICIvPgo8L2c+Cjwvc3ZnPgo=)](https://github.com/mcabbott/Tullio.jl/releases)
</div>

@@ -25,10 +25,10 @@ But it also co-operates with various other packages, provided they are loaded be

* It uses [`LoopVectorization.@avx`](https://github.com/chriselrod/LoopVectorization.jl) to speed many things up. (Disable with `avx=false`.) On a good day this will match the speed of OpenBLAS for matrix multiplication.

* It uses [`KernelAbstractions.@kernel`](https://github.com/JuliaGPU/KernelAbstractions.jl) to make a GPU version. (Disable with `cuda=false`.) This is somewhat experimental, and may not be fast.

* It uses [`TensorOperations.@tensor`](https://github.com/Jutho/TensorOperations.jl) on expressions which this understands. (Disable with `tensor=false`.) These must be Einstein-convention contractions of one term; none of the examples above qualify.
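
Each of these can be switched off by the keyword shown, written alongside the expression. A minimal sketch of what that looks like, with placeholder arrays (not from the README):

```julia
using Tullio  # the packages above must already be loaded to take effect

A, B = rand(10, 10), rand(10, 10)

# Ordinary call, using whichever helpers are loaded & applicable:
@tullio C[i,k] := A[i,j] * B[j,k]

# The same contraction with those code paths switched off:
@tullio C2[i,k] := A[i,j] * B[j,k] avx=false tensor=false cuda=false
```
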
The macro also tries to provide a gradient for use with [Tracker](https://github.com/FluxML/Tracker.jl) or [Zygote](https://github.com/FluxML/Zygote.jl). <!-- or [ReverseDiff](https://github.com/JuliaDiff/ReverseDiff.jl). -->
(Disable with `grad=false`, or `nograd=A`.) This is done in one of two ways:
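Whichever way is chosen, usage is the same; a minimal sketch, assuming Zygote and a made-up function `f`:

```julia
using Tullio, Zygote

f(A) = @tullio s := A[i,j]^2  # scalar reduction: the sum of squares

A = rand(3, 3)
Zygote.gradient(f, A)[1] ≈ 2 .* A  # true, via the gradient the macro defines
```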

@@ -63,8 +63,38 @@ And `verbose=2` will print everything.

<details><summary><b>Notation</b></summary>

Index notation for some simple functions:

```julia
using Pkg; Pkg.add("Tullio")
using Tullio, Test
M = rand(1:20, 3, 7)

@tullio S[1,c] := M[r,c] # sum over r ∈ 1:3, for each c ∈ 1:7
@test S == sum(M, dims=1)

@tullio Q[ρ,c] := M[ρ,c] + sqrt(S[1,c]) # loop over ρ & c, no sum -- broadcasting
@test Q ≈ M .+ sqrt.(S)

mult(M,Q) = @tullio P[x,y] := M[x,c] * Q[y,c] # sum over c ∈ 1:7 -- matrix multiplication
@test mult(M,Q) ≈ M * transpose(Q)

R = [rand(Int8, 3, 4) for δ in 1:5]

@tullio T[j,i,δ] := R[δ][i,j] + 10im # three nested loops -- concatenation
@test T == permutedims(cat(R...; dims=3), (2,1,3)) .+ 10im

@tullio (max) X[i] := abs2(T[j,i,δ]) # reduce using max, over j and δ
@test X == dropdims(maximum(abs2, T, dims=(1,3)), dims=(1,3))

dbl!(M, S) = @tullio M[r,c] = 2 * S[1,c] # write into existing matrix, M .= 2 .* S
dbl!(M, S)
@test all(M[r,c] == 2*S[1,c] for r ∈ 1:3, c ∈ 1:7)
```

More complicated examples:

```julia
using Tullio
A = [abs2(i - 11) for i in 1:21]

@@ -114,38 +144,32 @@ using NamedDims, AxisKeys # Dimension names, plus pretty printing:
</details>
<details><summary><b>Fast & slow</b></summary>

When used with LoopVectorization, on straightforward matrix multiplication of real numbers,
`@tullio` tends to be about as fast as OpenBLAS, depending on the size, and on your computer.
Here's a speed comparison on mine: [v2.5](https://github.com/mcabbott/Tullio.jl/blob/master/benchmarks/02/matmul-0.2.5-Float64-1.5.0.png).

This race is a useful diagnostic, but isn't really the goal. There is little point in avoiding
BLAS libraries if you want precisely what they are optimised to give you.
One of the things `@tullio` is often very fast at is weird tensor contractions,
for which you would otherwise need `permutedims`:

```julia
using Tullio, LoopVectorization, NNlib, BenchmarkTools

# Batched matmul, with batch index first in B:
bmm_rev(A, B) = @tullio C[i,k,b] := A[i,j,b] * B[b,k,j] # (sum over j)

A = randn(20,30,500); B = randn(500,40,30);
bmm_rev(A, B) ≈ NNlib.batched_mul(A, permutedims(B, (3,2,1))) # true

@btime bmm_rev($A, $B); # 317.526 μs, same speed as un-permuted
@btime NNlib.batched_mul($A, permutedims($B, (3,2,1))); # 1.478 ms, with MKL
```

Complex numbers aren't handled by LoopVectorization, so will be much slower.
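The notation itself still works for complex arrays; it simply won't take the fast path. A sketch, with made-up arrays and timings omitted:

```julia
using Tullio, LoopVectorization

Zc = randn(ComplexF64, 50, 50); Wc = randn(ComplexF64, 50, 50);

@tullio Pc[i,k] := Zc[i,j] * Wc[j,k]  # runs, but without the @avx speedup
Pc ≈ Zc * Wc  # true
```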

Chained multiplication is also very slow, because it doesn't know there's a better
algorithm. Here it just makes 4 loops, instead of multiplying sequentially,
`30^4` instead of `2 * 30^3` operations:

```julia
@@ -154,6 +178,20 @@ M1, M2, M3 = randn(30,30), randn(30,30), randn(30,30);
@btime @tullio M4[i,l] := $M1[i,j] * $M2[j,k] * $M3[k,l]; # 30.401 μs
```
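
For comparison, multiplying sequentially means two pairwise contractions; a sketch, with `mul2` a made-up helper:

```julia
mul2(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]  # one pairwise product

M4 = mul2(mul2(M1, M2), M3)  # 2 * 30^3 multiplications, same result up to rounding
```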

Another thing Tullio can be very fast at is broadcast reductions, where it can avoid large allocations. Here LoopVectorization is speeding up `log`, and Tullio is handling tiled memory access and multi-threading:

```julia
sum_opp(X, Y=X) = @tullio s := X[i,j] * log(Y[j,i])
sum_part(X, Y=X) = @tullio S[i] := X[i,j] * log(Y[j,i])

X = rand(1000,1000);
@btime sum_opp($X) # 499.814 μs (93 allocations: 3.97 KiB)
@btime sum($X .* log.(transpose($X))) # 8.759 ms (2 allocations: 7.63 MiB)

@btime sum_part($X)' # 1.599 ms (not the same computer!)
@btime sum($X .* log.(transpose($X)), dims=2) # 13.292 ms
```
At present indices using `pad`, `clamp` or `mod` are also slow. These result in extra
checks or operations at every iteration, not just around the edges:

@@ -163,15 +201,18 @@ conv2(x,k) = @tullio y[i+_, j+_] := x[2i-a, 2j-b] * k[a,b]
conv3(x,k) = @tullio y[i+_, j+_] := x[pad(i-a,3), pad(j-b,3)] * k[a,b]

x100 = rand(100,100); k7 = randn(7,7);
@btime conv1($x100, $k7); # 25.574 μs
@btime conv2($x100, $k7); # 44.590 μs
@btime conv3($x100, $k7); # 86.228 μs

using Flux
x104 = reshape(x100,(100,100,1,1)); k74 = reshape(k7,(7,7,1,1));
conv1(x100, k7) ≈ @btime CrossCor($k74, false)($x104) # 586.694 μs
conv2(x100, k7) ≈ @btime Conv($k74, false, stride=2)($x104) # 901.573 μs
conv3(x100, k7) ≈ @btime Conv($k74, false, pad=3)($x104) # 932.658 μs

using DSP
@btime DSP.conv($x100, $k7); # 198.331 μs
```

</details>
