A talk demonstrating how to dramatically speed up data processing and numerical computation in Python. The running example is outlier detection on a large time-series weather dataset (ISD): the time to detect outliers in 600 GB of data in Python drops from 28 days to 38 minutes.
Topics covered:
- motivations for fast numerical processing in Python
- why Python is a slow programming language
- fast numerical processing in numpy - vectorisation
- using numba to optimise non-vectorised code
- parallelising computation using joblib
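To give a flavour of the vectorisation topic listed above, here is a minimal sketch that compares a pure-Python loop with the equivalent whole-array numpy operations. The z-score rule, the function names, and the sample data are assumptions made for illustration only; this summary does not state which outlier detection method the talk actually uses.

```python
import numpy as np


def zscore_outliers_loop(values, threshold=3.0):
    """Pure-Python baseline: flag values more than `threshold`
    standard deviations away from the mean (hypothetical rule)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [abs(v - mean) > threshold * std for v in values]


def zscore_outliers_vectorised(values, threshold=3.0):
    """Same rule written as whole-array numpy operations, so the
    per-element work runs in compiled code instead of the interpreter."""
    arr = np.asarray(values, dtype=np.float64)
    mean = arr.mean()
    std = arr.std()  # population standard deviation (ddof=0), as in the loop
    return np.abs(arr - mean) > threshold * std


# Made-up readings: twenty identical values plus one obvious spike.
readings = [10.0] * 20 + [100.0]
mask = zscore_outliers_vectorised(readings)
print(mask.nonzero()[0])  # indices of the flagged readings
```

Both functions apply the same rule; the vectorised version simply hands the element-wise arithmetic to numpy. Code that cannot easily be written as array operations is where the numba and joblib topics come in.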
You can also run the presentation on a local web server. Clone this repository and run the presentation like so:
npm install
grunt serve
The presentation can now be accessed on localhost:8080. Note that this web application is configured to bind to hostname 0.0.0.0, which means that once the Grunt server is running, it will be accessible from external hosts as well (using the current host's public IP address).