
Commit 1421459

Fix several typos (#482)
Signed-off-by: Alexander Seiler <[email protected]>
1 parent 633f353 commit 1421459

19 files changed, with 30 additions and 30 deletions.

docs/src/devdocs/loopset_structure.md

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ References to arrays are represented with an `ArrayReferenceMeta` data structure
julia> LoopVectorization.operations(lsAmulB)[3].ref
LoopVectorization.ArrayReferenceMeta(LoopVectorization.ArrayReference(:A, [:m, :k], Int8[0, 0]), Bool[1, 1], Symbol("##vptr##_A"))
```
- It contains the name of the parent array (`:A`), the indicies `[:m,:k]`, and a boolean vector (`Bool[1, 1]`) indicating whether these indices are loop iterables. Note that the optimizer assumes arrays are column-major, and thus that it is efficient to read contiguous elements from the first index. In lower level terms, it means that [high-throughput vmov](https://www.felixcloutier.com/x86/movupd) instructions can be used rather than [low-throughput](https://www.felixcloutier.com/x86/vgatherdpd:vgatherqpd) [gathers](https://www.felixcloutier.com/x86/vgatherqps:vgatherqpd). Similar story for storing elements.
+ It contains the name of the parent array (`:A`), the indices `[:m,:k]`, and a boolean vector (`Bool[1, 1]`) indicating whether these indices are loop iterables. Note that the optimizer assumes arrays are column-major, and thus that it is efficient to read contiguous elements from the first index. In lower level terms, it means that [high-throughput vmov](https://www.felixcloutier.com/x86/movupd) instructions can be used rather than [low-throughput](https://www.felixcloutier.com/x86/vgatherdpd:vgatherqpd) [gathers](https://www.felixcloutier.com/x86/vgatherqps:vgatherqpd). Similar story for storing elements.

When no axis has unit stride, the first given index will be the dummy `Symbol("##DISCONTIGUOUSSUBARRAY##")`.

!!! warning
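
(Aside, not part of this commit: the `lsAmulB` object referenced in the hunk above is built in the devdocs from a quoted loop expression. A rough sketch of that kind of inspection follows; `LoopVectorization.LoopSet` is an internal, unexported constructor, and the operation index `3` is taken from the devdocs example and may differ for a differently written expression.)

```julia
using LoopVectorization

# Hypothetical reconstruction: build a LoopSet from a quoted matmul-style loop
# and inspect one operation's array reference (an ArrayReferenceMeta).
AmulBq = :(for m in 1:M, k in 1:K, n in 1:N
    C[m, n] += A[m, k] * B[k, n]
end);

lsAmulB = LoopVectorization.LoopSet(AmulBq);
LoopVectorization.operations(lsAmulB)        # vector of load/compute/store operations
LoopVectorization.operations(lsAmulB)[3].ref # the reference into A, as quoted above
```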

docs/src/examples/array_interface.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ LoopVectorization uses [ArrayInterface.jl](https://github.com/SciML/ArrayInterfa
that wasn't optimized by `LoopVectorization`, but instead simply had `@inbounds @fastmath` applied to the loop. This can often still yield reasonable to good performance, saving you from having to write more than one version of the loop
to get good performance and correct behavior just because the array types happen to be different.

- By supporting the interface, using `LoopVectorization` can simplify implementing many operations like matrix multiply while still getting good performance. For example, instead of [a few hundred lines of code](https://github.com/JuliaArrays/StaticArrays.jl/blob/0e431022954f0207eeb2c4f661b9f76936105c8a/src/matrix_multiply.jl#L4) to define matix multiplication in `StaticArrays`, one could simply write:
+ By supporting the interface, using `LoopVectorization` can simplify implementing many operations like matrix multiply while still getting good performance. For example, instead of [a few hundred lines of code](https://github.com/JuliaArrays/StaticArrays.jl/blob/0e431022954f0207eeb2c4f661b9f76936105c8a/src/matrix_multiply.jl#L4) to define matrix multiplication in `StaticArrays`, one could simply write:
```julia
using StaticArrays, LoopVectorization
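
(The diff context cuts the docs' example off right after its `using` line. Purely as a hedged illustration of the kind of kernel that sentence refers to, and not the docs' actual code, a generic `@turbo` matrix-multiply loop might look like the sketch below; the name `mygemm!` is made up here.)

```julia
using LoopVectorization

# Illustrative sketch: a plain triple-loop GEMM kernel, C = A * B,
# with @turbo handling vectorization, unrolling, and remainder masking.
function mygemm!(C, A, B)
    @turbo for n in axes(B, 2), m in axes(A, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end
```

It would be called as `mygemm!(C, A, B)` with a preallocated `C`; the docs' actual `StaticArrays` version is not shown in this hunk, so treat this only as the general shape of the approach.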

docs/src/examples/dot_product.md

Lines changed: 2 additions & 2 deletions
@@ -25,7 +25,7 @@ Thus, in 4 clock cycles, we can do up to 8 loads. But each `fma` requires 2 load

Double precision benchmarks pitting Julia's builtin dot product, and code compiled with a variety of compilers:
![dot](https://raw.githubusercontent.com/JuliaSIMD/LoopVectorization.jl/docsassets/docs/src/assets/bench_dot_v2.svg)
- What we just described is the core of the approach used by all these compilers. The variation in results is explained mostly by how they handle vectors with lengths that are not an integer multiple of `W`. I ran these on a computer with AVX512 so that `W = 8`. LLVM, the backend compiler of both Julia and Clang, shows rapid performance degredation as `N % 4W` increases, where `N` is the length of the vectors.
+ What we just described is the core of the approach used by all these compilers. The variation in results is explained mostly by how they handle vectors with lengths that are not an integer multiple of `W`. I ran these on a computer with AVX512 so that `W = 8`. LLVM, the backend compiler of both Julia and Clang, shows rapid performance degradation as `N % 4W` increases, where `N` is the length of the vectors.
This is because, to handle the remainder, it uses a scalar loop that runs as written: multiply and add single elements, one after the other.

Initially, GCC (gfortran) stumbled in throughput, because it does not use separate accumulation vectors by default except on Power, even with `-funroll-loops`.
@@ -36,7 +36,7 @@ LoopVectorization uses `if/ifelse` checks to determine how many extra vectors ar

Neither GCC nor LLVM use masks (without LoopVectorization's assitance).

- I am not certain, but I believe Intel and GCC check for the vector's alignment, and align them if neccessary. Julia guarantees that the start of arrays beyond a certain size are aligned, so this is not an optimization I have implemented. But it may be worthwhile for handling large matrices with a number of rows that isn't an integer multiple of `W`. For such matrices, the first column may be aligned, but the next will not be.
+ I am not certain, but I believe Intel and GCC check for the vector's alignment, and align them if necessary. Julia guarantees that the start of arrays beyond a certain size are aligned, so this is not an optimization I have implemented. But it may be worthwhile for handling large matrices with a number of rows that isn't an integer multiple of `W`. For such matrices, the first column may be aligned, but the next will not be.

## Dot-Self

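
(As a hedged aside, not taken from the docs being patched: the kind of kernel these dot-product benchmarks compare can be written with LoopVectorization roughly as follows; `mydot` is an illustrative name.)

```julia
using LoopVectorization

# Illustrative sketch of a vectorized dot product. Per the surrounding docs,
# @turbo unrolls the loop with separate accumulators and handles the remainder
# with masks rather than falling back to a scalar tail loop.
function mydot(a, b)
    s = zero(eltype(a))
    @turbo for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end
```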

docs/src/examples/matrix_multiplication.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ Letting all three matrices be square and `Size` x `Size`, we attain the followin
This is classic GEMM, `𝐂 = 𝐀 * 𝐁`. GFortran's intrinsic `matmul` function does fairly well. But all the compilers are well behind LoopVectorization here, which falls behind MKL's `gemm` beyond 70x70 or so. The problem imposed by alignment is also striking: performance is much higher when the sizes are integer multiplies of 8. Padding arrays so that each column is aligned regardless of the number of rows can thus be very profitable. [PaddedMatrices.jl](https://github.com/JuliaSIMD/PaddedMatrices.jl) offers just such arrays in Julia. I believe that is also what the [-pad](https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-pad-qpad) compiler flag does when using Intel's compilers.

![AmulBt](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/bench_AmulBt_v2.svg)
- The optimal pattern for `𝐂 = 𝐀 * 𝐁ᵀ` is almost identical to that for `𝐂 = 𝐀 * 𝐁`. Yet, gfortran's `matmul` instrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for `𝐁ᵀ` and creating the ecplicit copy.
+ The optimal pattern for `𝐂 = 𝐀 * 𝐁ᵀ` is almost identical to that for `𝐂 = 𝐀 * 𝐁`. Yet, gfortran's `matmul` intrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for `𝐁ᵀ` and creating the explicit copy.

ifort did equally well whethor or not `𝐁` was transposed, while LoopVectorization's performance degraded slightly faster as a function of size in the transposed case, because strides between memory accesses are larger when `𝐁` is transposed. But it still performed best of all the compiled loops over this size range, losing out to MKL and eventually OpenBLAS.
icc interestingly does better when it is transposed.

docs/src/examples/multithreading.md

Lines changed: 2 additions & 2 deletions
@@ -188,7 +188,7 @@ end
```
![complexdot3](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/threadedcomplexdot3product.svg)

- When testing on my laptop, the `C` implentation ultimately won, but I will need to investigate further to tell whether this benchmark benefits from hyperthreading,
+ When testing on my laptop, the `C` implementation ultimately won, but I will need to investigate further to tell whether this benchmark benefits from hyperthreading,
or if it's because LoopVectorization's memory access patterns are less friendly.
I plan to work on cache-level blocking to increase memory friendliness eventually, and will likely also allow it to take advantage of hyperthreading/simultaneous multithreading, although I'd prefer a few motivating test problems to look at first. Note that a single core of this CPU is capable of exceeding 100 GFLOPS of double precision compute. The execution units are spending most of their time idle. So the question of whether hypthreading helps may be one of whether or not we are memory-limited.

@@ -218,7 +218,7 @@ julia> doubles_per_l2 = (2 ^ 20) ÷ 8
julia> total_doubles_in_l2 = doubles_per_l2 * (Sys.CPU_THREADS ÷ 2) # doubles_per_l2 * 18
2359296

- julia> doubles_per_mat = total_doubles_in_l2 ÷ 3 # divide up amoung 3 matrices
+ julia> doubles_per_mat = total_doubles_in_l2 ÷ 3 # divide up among 3 matrices
786432

julia> sqrt(ans)
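
(Restating the arithmetic from this REPL transcript as a self-contained sketch, assuming, as the hunk does, 1 MiB of L2 per core and `Sys.CPU_THREADS ÷ 2 == 18` physical cores:)

```julia
# Back-of-the-envelope cache-blocking estimate mirroring the hunk above.
doubles_per_l2 = (2^20) ÷ 8                 # 131072 Float64s fit in 1 MiB of L2
total_doubles_in_l2 = doubles_per_l2 * 18   # 2359296 across 18 cores
doubles_per_mat = total_doubles_in_l2 ÷ 3   # 786432, divided up among 3 matrices
side = sqrt(doubles_per_mat)                # ≈ 886.8, side length of square blocks
```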

docs/src/examples/special_functions.md

Lines changed: 2 additions & 2 deletions
@@ -16,12 +16,12 @@ end
While Intel's proprietary compilers do the best, LoopVectorization performs very well among open source alternatives. A complicating
factor to the above benchmark is that in accessing the diagonals, we are not accessing contiguous elements. A benchmark
simply exponentiating a vector shows that `gcc` also has efficient special function vectorization, but that the autovectorizer
- disagrees with the discontiguous memory acesses:
+ disagrees with the discontiguous memory accesses:

![selfdot](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/bench_exp_v2.svg)

The similar performance between `gfortran` and `LoopVectorization` at multiples of 8 is no fluke: on Linux systems with a recent GLIBC, SLEEFPirates.jl --
which LoopVectorization depends on to vectorize these special functions -- looks for the GNU vector library and uses these functions
if available. Otherwise, it will use native Julia implementations that tend to be slower. As the modulus of vector length and vector width (8, on the
- host system thanks to AVX512) increases, `gfortran` shows the performance degredation pattern typical of LLVM-vectorized code.
+ host system thanks to AVX512) increases, `gfortran` shows the performance degradation pattern typical of LLVM-vectorized code.

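
(For reference, a hedged sketch, not from the docs being patched, of the simple vector-exponentiation kernel this paragraph benchmarks; `vexp!` is an illustrative name.)

```julia
using LoopVectorization

# Illustrative sketch: exponentiate a vector elementwise. Per the docs above,
# on Linux with a recent GLIBC, SLEEFPirates.jl (a LoopVectorization dependency)
# can dispatch exp to the GNU vector math library; otherwise it falls back to
# native Julia implementations.
function vexp!(y, x)
    @turbo for i in eachindex(y, x)
        y[i] = exp(x[i])
    end
    return y
end
```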

src/codegen/lower_compute.jl

Lines changed: 1 addition & 1 deletion
@@ -529,7 +529,7 @@ function lower_compute!(
parents_op = parents(op)
nparents = length(parents_op)
# __u₂max = ls.unrollspecification.u₂
- # TODO: perhaps allos for swithcing unrolled axis again
+ # TODO: perhaps allow for switching unrolled axis again
mvar, u₁unrolledsym, u₂unrolledsym =
variable_name_and_unrolled(op, u₁loopsym, u₂loopsym, vloopsym, suffix, ls)
opunrolled = u₁unrolledsym || isu₁unrolled(op)

src/codegen/lower_load.jl

Lines changed: 3 additions & 3 deletions
@@ -157,15 +157,15 @@ function pushbroadcast!(q::Expr, mvar::Symbol)
)
end

- function child_cost_untill_vectorized(op::Operation)
+ function child_cost_until_vectorized(op::Operation)
isvectorized(op) && return 0.0
c = 0.0
for child ∈ children(op)
if (!isvectorized(child) & iscompute(child))
# FIXME: can double count
c +=
instruction_cost(instruction(child)).scalar_reciprocal_throughput +
- child_cost_untill_vectorized(child)
+ child_cost_until_vectorized(child)
end
end
c
@@ -174,7 +174,7 @@ function vectorization_profitable(op::Operation)
# if op is vectorized itself, return true
isvectorized(op) && return true
# otherwise, check if descendents until hitting a vectorized portion are expensive enough
- child_cost_untill_vectorized(op) ≥ 5
+ child_cost_until_vectorized(op) ≥ 5
end

function lower_load_no_optranslation!(

src/codegen/lower_store.jl

Lines changed: 2 additions & 2 deletions
@@ -377,7 +377,7 @@ function lower_tiled_store!(
inds_calc_by_ptr_offset = indices_calculated_by_pointer_offsets(ls, op.ref)

if donot_tile_store(ls, op, reductfunc, u₂)
- # If we have a reductfunc, we're using a reducing store instead of a contiuguous or shuffle store anyway
+ # If we have a reductfunc, we're using a reducing store instead of a contiguous or shuffle store anyway
# so no benefit to being able to handle that case here, vs just calling the default `lower_store!` method
@unpack u₁, u₂max = ua
for t ∈ 0:u₂-1
@@ -408,7 +408,7 @@ function lower_tiled_store!(
u = Core.ifelse(isu₁, u₁, 1)
tup = Expr(:tuple)
for t ∈ 0:u₂-1
- # tiled stores cannot be loop values, as they're necessarilly
+ # tiled stores cannot be loop values, as they're necessarily
# functions of at least two loops, meaning we do not need to handle them here.
push!(tup.args, Symbol(variable_name(opp, ifelse(isu₂, t, -1)), '_', u))
end

src/codegen/lower_threads.jl

Lines changed: 3 additions & 3 deletions
@@ -977,10 +977,10 @@ function avx_threads_expr(
LPSYM::Expr
)
valid_thread_loop, ua, c = valid_thread_loops(ls)
- num_candiates = sum(valid_thread_loop)
- if (num_candiates == 0) || (nt ≤ 1) # it was called from `avx_body` but now `nt` was set to `1`
+ num_candidates = sum(valid_thread_loop)
+ if (num_candidates == 0) || (nt ≤ 1) # it was called from `avx_body` but now `nt` was set to `1`
avx_body(ls, UNROLL)
- elseif (num_candiates == 1) || (nt ≤ 3)
+ elseif (num_candidates == 1) || (nt ≤ 3)
thread_one_loops_expr(
ls,
ua,
