Beyond Numpy
============

Back to Python
--------------

You've almost reached the end of the book and, hopefully, you've learned that
Numpy is a very versatile and powerful library. In the meantime, however, you
have to remember that Python is also quite a powerful tool. In fact, in a few
specific cases, it might be even more powerful than Numpy. Let's consider, for
example, an interesting exercise that has been proposed by Tucker Balch in his
`Coursera's Computational Investing
<https://www.coursera.org/learn/computational-investing>`_ course. The exercise
can be written as:

Write the most succinct code possible to compute all "legal" allocations to 4
stocks such that:

* The allocations are in 1.0 chunks, and the allocations sum to 10.0
* Only "pure" NumPy is allowed (no external libraries)
* Can you do it without a "for"?

`Yaser Martinez <http://yasermartinez.com/blog/index.html>`_ collected the
different answers from the community and the proposed solutions yield
surprising results. But let's start with the most obvious Python solution:

.. code:: python

   def solution_1():
       # Brute force
       # 14641 (=11*11*11*11) iterations & tests
       Z = []
       for i in range(11):
           for j in range(11):
               for k in range(11):
                   for l in range(11):
                       if i + j + k + l == 10:
                           Z.append((i, j, k, l))
       return Z

This is the slowest solution because it requires four loops and, more
importantly, it tests all 14641 (=11*11*11*11) combinations of 4 integers
between 0 and 10 in order to retain only the combinations whose sum is 10. We
can of course get rid of the four loops using itertools, but the code remains
slow:

.. code:: python

   import itertools as it

   def solution_2():
       # Itertools
       # 14641 (=11*11*11*11) iterations & tests
       return [(i, j, k, l)
               for i, j, k, l in it.product(range(11), repeat=4)
               if i + j + k + l == 10]
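
A different way to avoid the test entirely is the classic "stars and bars"
argument: an allocation of 10 unit chunks to 4 stocks corresponds to a choice
of 3 separator positions among 13 slots. This is a sketch under that
combinatorial assumption, not one of the community answers collected for the
exercise:

.. code:: python

   import itertools as it

   def solution_2_bis():
       # Stars and bars: 10 unit chunks ("stars") plus 3 separators
       # ("bars") occupy 13 slots; each choice of 3 bar positions
       # b1 < b2 < b3 encodes one allocation, so no test is needed.
       return [(b1, b2 - b1 - 1, b3 - b2 - 1, 12 - b3)
               for b1, b2, b3 in it.combinations(range(13), 3)]

Since every choice of separators is legal, this enumerates exactly the valid
allocations and nothing more.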

One of the best solutions, proposed by Nick Popplas, takes advantage of the
fact that we can write intelligent nested loops that directly build each
tuple, without any test, as shown below:

.. code:: python

   def solution_3():
       return [(a, b, c, (10 - a - b - c))
               for a in range(11)
               for b in range(11 - a)
               for c in range(11 - a - b)]

The best Numpy solution, by Yaser Martinez, uses a different strategy with a
restricted set of tests:

.. code:: python

   import numpy as np

   def solution_4():
       X123 = np.indices((11, 11, 11)).reshape(3, 11*11*11)
       X4 = 10 - X123.sum(axis=0)
       return np.vstack((X123, X4)).T[X4 > -1]

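Before benchmarking, a quick sanity check (not part of the original exercise)
is to verify that the different strategies enumerate exactly the same
allocations. Restating two of them for self-containment:

.. code:: python

   import itertools as it
   import numpy as np

   def solution_2():
       # Filtering approach
       return [(i, j, k, l)
               for i, j, k, l in it.product(range(11), repeat=4)
               if i + j + k + l == 10]

   def solution_4():
       # Vectorized approach
       X123 = np.indices((11, 11, 11)).reshape(3, 11*11*11)
       X4 = 10 - X123.sum(axis=0)
       return np.vstack((X123, X4)).T[X4 > -1]

   # Both must produce the same set of allocations
   assert set(solution_2()) == set(map(tuple, solution_4()))
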
If we benchmark these methods, we get:

.. code:: pycon

   >>> timeit("solution_1()", globals())
   100 loops, best of 3: 1.9 msec per loop
   >>> timeit("solution_2()", globals())
   100 loops, best of 3: 1.67 msec per loop
   >>> timeit("solution_3()", globals())
   1000 loops, best of 3: 60.4 usec per loop
   >>> timeit("solution_4()", globals())
   1000 loops, best of 3: 54.4 usec per loop

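Note that the `timeit` used in these sessions is not the standard library's
`timeit.timeit` (whose signature takes a `globals` keyword argument); it is a
small convenience wrapper. A minimal sketch of such a wrapper, with the name
and output format chosen to match the transcripts above, could be:

.. code:: python

   from timeit import Timer

   def timeit(stmt, globals=None):
       # Pick a loop count so that one measurement is long enough
       # to be meaningful, then report the best of three runs.
       timer = Timer(stmt, globals=globals)
       number, _ = timer.autorange()
       best = min(timer.repeat(repeat=3, number=number)) / number
       # Choose a human-friendly time unit
       for unit, scale in [("sec", 1), ("msec", 1e-3),
                           ("usec", 1e-6), ("nsec", 1e-9)]:
           if best >= scale:
               break
       print("%d loops, best of 3: %.3g %s per loop"
             % (number, best / scale, unit))

`Timer.autorange` requires Python 3.6 or later; on earlier versions a manual
calibration loop would be needed instead.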
The Numpy solution is the fastest, but the pure Python solution is comparable.
Now, let me introduce a small modification to the Python solution:

.. code:: python

   def solution_3_bis():
       return ((a, b, c, (10 - a - b - c))
               for a in range(11)
               for b in range(11 - a)
               for c in range(11 - a - b))

If we benchmark it, we get:

.. code:: pycon

   >>> timeit("solution_3_bis()", globals())
   10000 loops, best of 3: 0.643 usec per loop

You read that right: we have gained a factor of 100 just by replacing square
brackets with parentheses. How is that possible? The explanation can be found
by looking at the type of the returned object:

.. code:: pycon

   >>> print(type(solution_3()))
   <class 'list'>
   >>> print(type(solution_3_bis()))
   <class 'generator'>

`solution_3_bis()` returns a generator that can be used to generate the full
list or to iterate over all the different elements. In any case, the huge
speedup comes from not instantiating the full list, and it is thus important
to ask yourself whether you need an actual instance of your result or whether
a simple generator might do the job.
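
To make the distinction concrete, here is a short illustration (using the same
comprehension) of how a generator defers the work until elements are actually
requested:

.. code:: python

   def solution_3_bis():
       return ((a, b, c, (10 - a - b - c))
               for a in range(11)
               for b in range(11 - a)
               for c in range(11 - a - b))

   gen = solution_3_bis()   # instantaneous: nothing is computed yet
   first = next(gen)        # elements are produced one at a time...
   rest = list(gen)         # ...or materialized all at once if needed
   print(first, len(rest))  # (0, 0, 0, 10) 285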


Friends of Numpy
----------------

Beyond Numpy, there are several other Python packages that are worth a look
because they address similar yet different classes of problems using different
technologies (compilation, virtual machines, just-in-time compilation, GPUs,
compression, etc.). Depending on your specific problem and your hardware, one
package may be better suited than another. Let's illustrate their usage using
a very simple example where we want to compute an expression based on two
float vectors:

.. code:: python

   import numpy as np

   a = np.random.uniform(0, 1, 1000).astype(np.float32)
   b = np.random.uniform(0, 1, 1000).astype(np.float32)
   c = 2*a + 3*b
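
Part of what the packages below optimize away is that, conceptually, each
operation in this one-liner produces an intermediate array: `2*a` and `3*b`
are each computed into a temporary before the final sum (recent Numpy versions
can sometimes elide such temporaries, but the general point stands). A small
illustration of that decomposition, and of how an in-place operation can save
one temporary:

.. code:: python

   import numpy as np

   a = np.random.uniform(0, 1, 1000).astype(np.float32)
   b = np.random.uniform(0, 1, 1000).astype(np.float32)

   # What "2*a + 3*b" does behind the scenes:
   t1 = 2 * a       # first temporary array
   t2 = 3 * b       # second temporary array
   c = t1 + t2      # final result

   # In-place addition avoids allocating a third array:
   d = 2 * a
   d += 3 * b
   assert np.allclose(c, d)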


NumExpr
+++++++

The `numexpr <https://github.com/pydata/numexpr/wiki/Numexpr-Users-Guide>`_
package supplies routines for the fast evaluation of array expressions
elementwise, using a vector-based virtual machine. It is comparable to SciPy's
weave package, but does not require a separate compile step for C or C++ code.

.. code:: python

   import numpy as np
   import numexpr as ne

   a = np.random.uniform(0, 1, 1000).astype(np.float32)
   b = np.random.uniform(0, 1, 1000).astype(np.float32)
   c = ne.evaluate("2*a + 3*b")


Cython
++++++

`Cython <http://cython.org>`_ is an optimising static compiler for both the
Python programming language and the extended Cython programming language
(based on Pyrex). It makes writing C extensions for Python as easy as Python
itself.

.. code:: python

   # Cython code, to be compiled as a .pyx extension module
   import numpy as np
   cimport numpy as np

   def evaluate(np.ndarray a, np.ndarray b):
       cdef int i
       cdef np.ndarray c = np.zeros_like(a)
       for i in range(a.size):
           c[i] = 2*a[i] + 3*b[i]
       return c

   a = np.random.uniform(0, 1, 1000).astype(np.float32)
   b = np.random.uniform(0, 1, 1000).astype(np.float32)
   c = evaluate(a, b)


Numba
+++++

`Numba <http://numba.pydata.org>`_ gives you the power to speed up your
applications with high performance functions written directly in Python. With
a few annotations, array-oriented and math-heavy Python code can be
just-in-time compiled to native machine instructions, similar in performance
to C, C++ and Fortran, without having to switch languages or Python
interpreters.

.. code:: python

   from numba import jit
   import numpy as np

   @jit
   def evaluate(a, b):
       c = np.zeros_like(a)
       for i in range(a.size):
           c[i] = 2*a[i] + 3*b[i]
       return c

   a = np.random.uniform(0, 1, 1000).astype(np.float32)
   b = np.random.uniform(0, 1, 1000).astype(np.float32)
   c = evaluate(a, b)


Theano
++++++

`Theano <http://www.deeplearning.net/software/theano/>`_ is a Python library
that allows you to define, optimize, and evaluate mathematical expressions
involving multi-dimensional arrays efficiently. Theano features tight
integration with NumPy, transparent use of a GPU, efficient symbolic
differentiation, speed and stability optimizations, dynamic C code generation
and extensive unit-testing and self-verification.

.. code:: python

   import numpy as np
   import theano.tensor as T
   from theano import function

   x = T.fvector('x')
   y = T.fvector('y')
   z = 2*x + 3*y
   f = function([x, y], z)

   a = np.random.uniform(0, 1, 1000).astype(np.float32)
   b = np.random.uniform(0, 1, 1000).astype(np.float32)
   c = f(a, b)


PyCUDA
++++++

`PyCUDA <http://mathema.tician.de/software/pycuda>`_ lets you access Nvidia's
CUDA parallel computation API from Python.

.. code:: python

   import numpy as np
   import pycuda.autoinit
   import pycuda.driver as drv
   from pycuda.compiler import SourceModule

   mod = SourceModule("""
       __global__ void evaluate(float *c, float *a, float *b)
       {
           const int i = threadIdx.x;
           c[i] = 2*a[i] + 3*b[i];
       }
   """)

   evaluate = mod.get_function("evaluate")

   a = np.random.uniform(0, 1, 1000).astype(np.float32)
   b = np.random.uniform(0, 1, 1000).astype(np.float32)
   c = np.zeros_like(a)

   # One thread per element: the block size must cover the whole vector
   evaluate(drv.Out(c), drv.In(a), drv.In(b),
            block=(1000, 1, 1), grid=(1, 1))


PyOpenCL
++++++++

`PyOpenCL <http://mathema.tician.de/software/pyopencl>`_ lets you access GPUs
and other massively parallel compute devices from Python.

.. code:: python

   import numpy as np
   import pyopencl as cl

   a = np.random.uniform(0, 1, 1000).astype(np.float32)
   b = np.random.uniform(0, 1, 1000).astype(np.float32)
   c = np.empty_like(a)

   ctx = cl.create_some_context()
   queue = cl.CommandQueue(ctx)

   mf = cl.mem_flags
   gpu_a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
   gpu_b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)

   evaluate = cl.Program(ctx, """
       __kernel void evaluate(__global const float *gpu_a,
                              __global const float *gpu_b,
                              __global       float *gpu_c)
       {
           int gid = get_global_id(0);
           gpu_c[gid] = 2*gpu_a[gid] + 3*gpu_b[gid];
       }
   """).build()

   gpu_c = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)
   evaluate.evaluate(queue, a.shape, None, gpu_a, gpu_b, gpu_c)
   cl.enqueue_copy(queue, c, gpu_c)


Scipy & friends
---------------

Here is a very short list of packages that are well-maintained, well tested and
may simplify your scientific life (depending on your domain). There are of
some spare time. For an extensive list, have a look at the `Awesome python list
Conclusion
----------

Numpy is a very versatile library, but that does not mean you have to use it
in every situation. In this chapter, we've seen some alternatives (including
Python itself) that are worth a look. As always, the choice belongs to you.
You have to consider what is the best solution for you in terms of development
time, computation time and maintenance effort. On the one hand, if you design
your own solution, you'll have to test it and maintain it but, in exchange,
you'll be free to design it the way you want. On the other hand, if you decide
to rely on a third-party package, you'll save time in development and benefit
from community support, even though you might have to adapt the package to
your specific needs. The choice is up to you.