Home

Planning

No prior R experience is required, but we expect some familiarity with Python

We do require some basic programming experience (say, equivalent to some hypothetical "Programming 101"), but it doesn't have to be specifically in R/Python.
Should focus on hands-on work rather than lectures plus separate exercises (see the CodeRefinery approach)
Presentation technology bikeshedding

If this were a Python-only course, Jupyter notebooks would be the obvious choice. But what about R users? Jupyter isn't that popular there. R users tend to use RStudio, which provides "R Markdown" documents that support the same kind of "literate programming" as Jupyter notebooks.

Notes

Key topics:

IO, data storage formats (local disks, scratch, ...)
Comparison of the types of tools/libraries available for different tasks
Filesystems (what we have available)?
matplotlib/ggplot
Optimizing memory usage
Parallelization - split, apply, combine, array jobs

Secondary topics:

profiling
slurm scripts/slurm history/array jobs
memory/object models
seff

Python specific

Should use Python 3.x (http://python3statement.org/). Python for Data Analysis, 2nd edition (Wes McKinney's pandas book) also uses Python 3.

R specific

How much do we want to teach Hadleyverse (tidyverse) stuff vs. out-of-the-box (base) R stuff?

ggplot at least is IMHO quite a lot better than the built-in plotting and widely used.

Themes

Unlike the outline, these are the big lessons people should learn via the things we teach.

use the right tools, data structures, and libraries
automation of workflows. Don't do everything manually
use good file formats
good development environments, IDEs, ...
profiling (and less debugging)

Outline

The general idea is that we do the same workshop/session/lecture/whatever twice, once with R and once with Python. That allows us to reuse lecture materials for both courses and share improvements.

Day 1

Introduction
- What does the course cover?
- Data Frames
  - What kind of data structure is it? Compare to the other usual suspects, lists, dicts, N-d arrays.
    - Special features: Categories/Factors, missing values
  - Useful for tabular data (CSV files, some similarities with RDBMS)
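The data frame features listed above (categories/factors and missing values) can be shown in a few lines. This is a minimal sketch using pandas; the column names and values are made up for illustration and are not part of the course material.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column, one numeric column
# with a missing value (NaN).
df = pd.DataFrame({
    "species": pd.Categorical(["cat", "dog", "cat", "bird"]),
    "weight_kg": [4.2, np.nan, 3.9, 0.1],
})

# Each column has its own dtype, unlike a plain N-d array.
print(df.dtypes)

# A categorical column stores a small set of labels (like an R factor).
print(df["species"].cat.categories)

# Missing values are tracked explicitly and skipped by default in reductions.
print(df["weight_kg"].isna().sum())
print(df["weight_kg"].mean())

# Grouping by the categorical column gives one summary value per label.
print(df.groupby("species", observed=True)["weight_kg"].mean())
```

The R counterpart would use `data.frame`, `factor`, and `NA`, so the same example should translate almost line by line.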
Get people set up
- Start an RStudio / Jupyter notebook session on a compute node via Slurm
- ssh keys (at least for R)
Introductory exercises
- NumPy/pandas basics (or the equivalent for R)
Profiling, debugging
A few more short exercises
I/O
- HDF5 / pytables
- sqlite
- csv
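An exercise on the I/O formats above could start from a small round trip like the following. It is only a sketch: the table name, file names, and data are hypothetical, and HDF5/PyTables is left out because it needs an extra dependency.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical toy table.
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    # CSV: plain text, portable, but types are re-inferred on every read.
    csv_path = os.path.join(tmp, "scores.csv")
    df.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    # SQLite: a single-file database; rows can be filtered before loading.
    db_path = os.path.join(tmp, "scores.sqlite")
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql("scores", conn, index=False)
        from_sql = pd.read_sql_query(
            "SELECT id, score FROM scores WHERE score > 1.0", conn
        )
    finally:
        conn.close()

print(from_csv)
print(from_sql)
```

The point for participants is that SQLite lets the query select rows before they reach memory, whereas the CSV file has to be read in full.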
Even more exercises

Day 2

Maybe move part of I/O from day 1 here?
Split-apply-combine
- Motivation, why is this a common and useful workflow?
- Running on a parallel batch system
  - Small problem: Everything in one process
  - Medium: Apply part in parallel using multiprocessing or other simple technique.
  - Large: Apply part in parallel using slurm array jobs, and using job dependencies to correctly order the split, apply, and combine phases.
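The small and medium cases above can share one piece of code if the apply step is passed in as a mapping function. The sketch below assumes a hypothetical table with `site` and `value` columns; `summarize`, `split_apply_combine`, and `make_example` are illustrative names, not existing course code.

```python
import multiprocessing as mp

import pandas as pd


def summarize(chunk):
    """Apply step: reduce one group to a single summary row."""
    return {
        "site": chunk["site"].iloc[0],
        "n": len(chunk),
        "mean_value": chunk["value"].mean(),
    }


def split_apply_combine(df, mapper=map):
    """Split by 'site', apply summarize() via mapper, combine the rows."""
    chunks = [group for _, group in df.groupby("site")]  # split
    rows = list(mapper(summarize, chunks))               # apply
    return pd.DataFrame(rows).set_index("site")          # combine


def make_example():
    """Hypothetical toy data with three groups."""
    return pd.DataFrame({
        "site": ["a", "a", "b", "b", "b", "c"],
        "value": [1.0, 3.0, 2.0, 4.0, 6.0, 10.0],
    })


if __name__ == "__main__":
    df = make_example()

    # Small: everything in one process, using the built-in map.
    small = split_apply_combine(df)

    # Medium: the same apply step spread over worker processes.
    with mp.Pool(processes=2) as pool:
        medium = split_apply_combine(df, mapper=pool.map)

    print(small)
    print(small.equals(medium))
```

For the large case, one possible design is to write each chunk to its own file and let every Slurm array task pick its chunk from the `SLURM_ARRAY_TASK_ID` environment variable, with a dependent job running the combine step afterwards.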

Day 3

Visualization with matplotlib & ggplot
- Seaborn could be interesting too (a statistics-focused layer on top of matplotlib), but I have no personal experience with it.
- For matplotlib, we could cover tricks like using LaTeX to render math in axis labels etc.
Workflows for visualization
- repeatability is important!
- putting plotting code into scripts vs. redoing it by hand each time
- using make for managing workflows
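A repeatable plotting script along these lines might look as follows. This is a sketch only: `make_figure` and the sine data are placeholders, and it uses matplotlib's built-in mathtext for the axis labels (setting the `text.usetex` rcParam would switch to a full LaTeX installation instead).

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, works on nodes without a display

import matplotlib.pyplot as plt
import numpy as np


def make_figure(out_path):
    """Regenerate the figure from scratch and save it to out_path."""
    x = np.linspace(0.0, 2.0 * np.pi, 200)

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(x, np.sin(x))

    # Raw strings with $...$ are rendered as math by matplotlib's mathtext.
    ax.set_xlabel(r"$\theta$ (rad)")
    ax.set_ylabel(r"$\sin(\theta)$")

    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path


# In a script, call e.g. make_figure("sine.png") under an
# `if __name__ == "__main__":` guard.
```

Because the figure is produced entirely by the function, a make rule that lists the script and its input data as prerequisites can rebuild the image whenever either changes.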