lectures/software_engineering/need_for_speed.md
@@ -131,15 +131,15 @@ For example, we can't at present add an integer and a string in Julia (i.e. `100
This is sensible behavior, but if you want to change it there's nothing to stop you.
```{code-cell} none
import Base: + # enables adding methods to the + function
+(x::Integer, y::String) = x + parse(Int, y)
@show +(100, "100")
@show 100 + "100"; # equivalent
```
The above code is not executed to avoid any chance of a [method invalidation](https://julialang.org/blog/2020/08/invalidations/), which is a source of compile-time latency.
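Extending `Base.+` for argument types you do not own (here `Integer` and `String`, both from `Base`) is known as type piracy, which is exactly the pattern that risks invalidations. As a sketch (not part of the original lecture), extending `+` for a type you define yourself is the safer idiom:

```{code-cell} julia
# Hypothetical example: adding a method of + for a type you own is not type piracy
struct Dollars
    amount::Int
end
import Base: +
+(x::Dollars, y::Dollars) = Dollars(x.amount + y.amount)
@show Dollars(3) + Dollars(4);
```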
### Understanding the Compilation Process
We can now be a little bit clearer about what happens when you call a function on given types.
@@ -535,8 +535,9 @@ To illustrate, consider this code, where `b` is global
```{code-cell} julia
b = 1.0
function g(a)
    global b
    tmp = a
    for i in 1:1_000_000
        tmp = tmp + a + b
    end
    return tmp
end
```
@@ -560,24 +561,26 @@ If we eliminate the global variable like so
```{code-cell} julia
function g(a, b)
    tmp = a
    for i in 1:1_000_000
        tmp = tmp + a + b
    end
    return tmp
end
```
then execution speed improves dramatically. Furthermore, the number of allocations has dropped to zero.
```{code-cell} julia
@btime g(1.0, 1.0)
```
Note that if you called `@time` instead, the first call would be slower, as it would include the time needed to compile the function. By contrast, `@btime` discards the first call and runs the function multiple times.
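To see this (a sketch, not in the original lecture), call the function with a new argument type twice and time each call:

```{code-cell} julia
@time g(1, 1)  # first call with Int arguments: includes JIT compilation
@time g(1, 1)  # already compiled for Int arguments: far faster
```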
More information is available with `@benchmark`:
```{code-cell} julia
@benchmark g(1.0, 1.0)
```
Also, the machine code is simple and clean
@@ -588,20 +591,19 @@ Also, the machine code is simple and clean
Now the compiler is certain of types throughout execution of the function and hence can optimize accordingly.
If global variables are strictly needed (and they almost never are) then you can declare them with `const`, which tells Julia that the type will never change (though the value can). For example,
```{code-cell} julia
const b_const = 1.0
function g_const(a)
    global b_const
    tmp = a
    for i in 1:1_000_000
        tmp = tmp + a + b_const
    end
    return tmp
end
@btime g_const(1)
```
Now the compiler can again generate efficient machine code.
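As a quick check (a sketch, not part of the original lecture), the generated code can be inspected just as before:

```{code-cell} julia
@code_native g_const(1)
```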
@@ -626,7 +628,7 @@ As we'll see, the last of these options gives us the best performance, while sti
Here's the untyped case
```{code-cell} julia
struct Foo_any
    a
end
```
```{code-cell} julia
struct Foo_abstract
    a::Real
end
```
Finally, here's the parametrically typed case (where the `{T <: Real}` is not necessary for performance, and could simply be `{T}`)
```{code-cell} julia
struct Foo_concrete{T <: Real}
    a::T
end
```
Now we generate instances
```{code-cell} julia
fg = Foo_any(1.0)
fa = Foo_abstract(1.0)
fc = Foo_concrete(1.0)
```
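One way to see the difference (a sketch, not in the original lecture) is to ask whether the type of the field `a` is concrete in each case:

```{code-cell} julia
@show isconcretetype(fieldtype(Foo_any, :a))                 # Any: not concrete
@show isconcretetype(fieldtype(Foo_abstract, :a))            # Real: not concrete
@show isconcretetype(fieldtype(Foo_concrete{Float64}, :a));  # Float64: concrete
```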
@@ -669,14 +671,15 @@ Here's a function that uses the field `a` of our objects
```{code-cell} julia
function f(foo)
    tmp = foo.a
    for i in 1:1_000_000
        tmp = i + foo.a
    end
    return tmp
end
```
Let's try timing our code, starting with the case without any constraints:
```{code-cell} julia
@btime f($fg)
```
@@ -690,7 +693,7 @@ Here's the nasty looking machine code
```{code-cell} julia
@code_native f(fg)
```
The abstract case is almost identical,
```{code-cell} julia
@btime f($fa)
```
@@ -706,14 +709,18 @@ Finally, let's look at the parametrically typed version
```{code-cell} julia
@btime f($fc)
```
This is improbably small: a runtime of 1-2 nanoseconds without any allocations suggests that no computation really took place.
A hint is in the simplicity of the corresponding machine code
```{code-cell} julia
@code_native f(fc)
```
This machine code has none of the hallmark assembly instructions associated with a loop; in particular, loops (e.g. `for` in Julia) typically end up as conditional jumps in the machine code (e.g. `jne`).
Here, the compiler was smart enough to realize that only the final step of the loop matters, i.e. it could generate the equivalent of `f(foo) = 1_000_000 + foo.a`, return that value directly, and skip the loop entirely. These sorts of optimizations are only possible if the compiler can use a great deal of information about the types.
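To see this more directly (a sketch, not in the original lecture), the LLVM IR, an intermediate representation that is often easier to read than assembly, shows the same thing:

```{code-cell} julia
@code_llvm f(fc)
```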
Finally, note that if we compile a slightly different version of the function, which doesn't actually return the value
```{code-cell} julia
function f_no_return(foo)
    for i in 1:1_000_000
        tmp = i + foo.a
    end
end
@code_native f_no_return(fc)
```
We see that the code is even simpler. In effect, the compiler figured out that because `tmp` is not returned, and because reading `foo.a` can have no side effects (it knows the type of `a`), it doesn't need to execute any of the code in the function body.
### Type Inference
Consider the following function, which essentially does the same job as Julia's `sum()` function but acts only on floating point data
@@ -758,22 +750,45 @@ end
Calls to this function run very quickly
```{code-cell} julia
x_range = range(0, 1, length = 100_000)
x = collect(x_range)
typeof(x)
```
```{code-cell} julia
@btime sum_float_array($x)
```
When Julia compiles this function, it knows that the data passed in as `x` will be an array of 64-bit floats. Hence it's known to the compiler that the relevant method for `+` is always addition of floating point numbers.
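One way to confirm this (a sketch, not in the original lecture) is `@code_warntype`, which reports the types the compiler infers for each variable; nothing here should be flagged as `Any`:

```{code-cell} julia
@code_warntype sum_float_array(x)
```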
But consider a version without that type annotation
```{code-cell} julia
function sum_array(x)
    sum = 0.0
    for i in eachindex(x)
        sum += x[i]
    end
    return sum
end
@btime sum_array($x)
```
Note that this has the same running time as the version with explicit types. In Julia, there is (almost) never a performance gain from declaring types, and if anything they can make things worse by limiting the potential for specialized algorithms. See {doc}`generic programming <../more_julia/generic_programming>` for more.
As an example within Julia code, look at the built-in `sum` for arrays
```{code-cell} julia
@btime sum($x)
```
Versus the underlying range
```{code-cell} julia
@btime sum($x_range)
```
Note that the difference in speed is enormous, suggesting it is better to keep data in its more structured form for as long as possible (the sum of a range can be computed in closed form, with no loop at all). You can check the underlying source with `@which sum(x_range)` to see the specialized algorithm used.
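For example, running the check suggested above:

```{code-cell} julia
@which sum(x_range)
```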