Initial commit to add a Dataset-driven Narrative to the Lesson

mike-ivs · mike-ivs · commit f09c4b8e2c7b · 2021-12-02T18:02:31.000+13:00
diff --git a/_episodes/02-numpy.md b/_episodes/02-numpy.md
@@ -30,7 +30,7 @@ that can be called upon when needed.
 
 ## Loading data into Python
 
-To begin processing inflammation data, we need to load it into Python.
+To begin processing the clinical trial inflammation data, we need to load it into Python.
 We can do that using a library called
 [NumPy](http://docs.scipy.org/doc/numpy/ "NumPy Documentation"), which stands for Numerical Python.
 In general, you should use this library when you want to do fancy things with lots of numbers,
diff --git a/_episodes/03-matplotlib.md b/_episodes/03-matplotlib.md
@@ -30,9 +30,15 @@ matplotlib.pyplot.show()
 ![Heat map representing the `data` variable. Each cell is colored by value along a color gradient
 from blue to yellow.](../fig/inflammation-01-imshow.svg)
 
-Blue pixels in this heat map represent low values, while yellow pixels represent high values.  As we
-can see, inflammation rises and falls over a 40-day period.  Let's take a look at the average
-inflammation over time:
+Each row in the heat map corresponds to a patient in the clinical trial dataset, and each column
+corresponds to a day in the dataset.  Blue pixels in this heat map represent low values, while yellow
+pixels represent high values.  As we can see, the number of inflammation flare-ups across patients
+rises and falls over a 40-day period.
+
+So far, so good. This is in line with our knowledge of the clinical trial and Dr. Maverick's claims:
+the patients take their medication once their inflammation flare-ups begin; it takes around 3 weeks
+for the medication to take effect and begin reducing flare-ups; and flare-ups appear to drop to zero
+by the end of the clinical trial.  Now let's take a look at the average inflammation over time:
 
 ~~~
 ave_inflammation = numpy.mean(data, axis=0)
@@ -45,8 +51,9 @@ matplotlib.pyplot.show()
 
 Here, we have put the average inflammation per day across all patients in the variable
 `ave_inflammation`, then asked `matplotlib.pyplot` to create and display a line graph of those
-values.  The result is a roughly linear rise and fall, which is suspicious: we might instead expect
-a sharper rise and slower fall.  Let's have a look at two other statistics:
+values.  The result is a roughly linear rise and fall, in line with Dr. Maverick's claim that the
+medication takes 3 weeks to take effect.  But a good data scientist doesn't just consider the
+average of a dataset, so let's have a look at two other statistics:
 
 ~~~
 max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
@@ -64,10 +71,10 @@ matplotlib.pyplot.show()
 
 ![A line graph showing the minimum inflammation across all patients over a 40-day period.](../fig/inflammation-01-minimum.svg)
 
-The maximum value rises and falls smoothly, while the minimum seems to be a step function.  Neither
-trend seems particularly likely, so either there's a mistake in our calculations or something is
-wrong with our data.  This insight would have been difficult to reach by examining the numbers
-themselves without visualization tools.
+The maximum value rises and falls linearly, while the minimum seems to be a step function.
+Suspicious... neither trend seems particularly likely, so either there's a mistake in our
+calculations or something is wrong with our data.  This insight would have been difficult
+to reach by examining the numbers themselves without visualization tools.
 
 ### Grouping plots
 You can group similar plots in a single figure using subplots.
diff --git a/_episodes/04-lists.md b/_episodes/04-lists.md
@@ -20,8 +20,14 @@ list[2:9]), in the same way as strings and arrays."
 - "Strings are immutable (i.e., the characters in them cannot be changed)."
 ---
 
-In the previous episode, we analyzed a single file with inflammation data. Our goal, however, is to
-process all the inflammation data we have, which means that we still have eleven more files to go!
+In the previous episode, we analyzed a single file of clinical trial inflammation data. However,
+after finding some peculiar, and potentially suspicious, trends in the trial data, we ask
+Dr. Maverick if they have performed any other clinical trials. Surprisingly, they say that they
+have and provide us with 11 more CSV files for a further 11 clinical trials they have undertaken
+since the initial trial.
+
+Our goal now is to process all the inflammation data we have, which means that we still have
+eleven more files to go!
 
 The natural first step is to collect the names of all the files that we have to process. In Python,
 a list is a way to store multiple values together. In this episode, we will learn how to store
diff --git a/_episodes/05-loop.md b/_episodes/05-loop.md
@@ -21,9 +21,10 @@ inflammation dataset (`inflammation-01.csv`), which revealed some suspicious fea
 
 ![Line graphs showing average, maximum and minimum inflammation across all patients over a 40-day period.](../fig/03-loop_2_0.png)
 
-We have a dozen data sets right now, though, and more on the way.
-We want to create plots for all of our data sets with a single statement.
-To do that, we'll have to teach the computer how to repeat things.
+We have a dozen data sets right now, though, and potentially more on the way if Dr. Maverick
+can keep up their surprisingly fast clinical trial rate. We want to create plots for all of
+our data sets with a single statement. To do that, we'll have to teach the computer how to
+repeat things.
 
 An example task that we might want to repeat is accessing numbers in a list,
 which we
diff --git a/_episodes/06-files.md b/_episodes/06-files.md
@@ -107,11 +107,25 @@ inflammation-03.csv
 maximum and minimum inflammation over a 40-day period for all patients in the third
 dataset.](../fig/03-loop_49_5.png)
 
-Sure enough,
-the maxima of the first two data sets show exactly the same ramp as the first,
-and their minima show the same staircase structure;
-a different situation has been revealed in the third dataset,
-where the maxima are a bit less regular, but the minima are consistently zero.
+
+Hmmm. The plots generated for the second clinical trial file look very similar to the plots for
+the first file: their average plots show similar "noisy" rises and falls; their maxima plots
+show exactly the same linear rise and fall; and their minima plots show similar staircase
+structures.
+
+The third dataset shows much noisier average and maxima plots that are far less suspicious than
+those of the first two datasets; however, the minima plot shows that the third dataset's minima are
+consistently zero across every day of the trial. If we produce a heat map for the third data file,
+we see the following:
+
+![Heat map of the third inflammation dataset. Note that there are sporadic zero values throughout
+the entire dataset, and the last patient only has zero values over the 40-day study.](../fig/inflammation-03-imshow.svg)
+
+We can see that there are zero values sporadically distributed across all patients and days of the
+clinical trial, suggesting that there were potential issues with data collection throughout the
+trial. In addition, we can see that the last patient in the study didn't have any inflammation
+flare-ups at all throughout the trial, suggesting that they may not even suffer from arthritis.
+
 
 > ## Plotting Differences
 >
@@ -197,4 +211,34 @@ where the maxima are a bit less regular, but the minima are consistently zero.
 >{: .solution}
 {: .challenge}
 
+After spending some time investigating the heat map and statistical plots, as well as
+doing the above exercises to plot differences between datasets and to generate composite
+patient statistics, we gain some insight into the twelve clinical trial datasets.
+
+The datasets appear to fall into two categories:
+
+* seemingly "ideal" datasets that agree excellently with Dr. Maverick's claims,
+  but display suspicious maxima and minima (such as `inflammation-01.csv` and `inflammation-02.csv`)
+* "noisy" datasets that somewhat agree with Dr. Maverick's claims, but show concerning
+  data collection issues such as sporadic missing values and even an unsuitable candidate
+  making it into the clinical trial.
+
+In fact, it appears that all three of the "noisy" datasets (`inflammation-03.csv`,
+`inflammation-08.csv`, and `inflammation-11.csv`) are identical down to the last value.
+Armed with this information, we confront Dr. Maverick about the suspicious data and
+duplicated files.
+
+Dr. Maverick confesses that they fabricated the clinical data after they found out
+that the initial trial suffered from a number of issues, including unreliable data-recording and
+poor participant selection. They created fake data to prove their drug worked, and when we asked
+for more trials they tried to generate more fake sets, as well as throwing in the original
+poor-quality dataset a few times to try to make all the trials seem more "realistic".
+
+Congratulations! We've cracked the case and proven that the inflammation datasets have been
+synthetically generated (in Python, no less!).
+
+But it would be a shame to throw away the synthetic datasets that have taught us so much
+already, so we'll forgive the imaginary Dr. Maverick and continue to use the data to learn
+how to program.
+
 {% include links.md %}
diff --git a/_extras/guide.md b/_extras/guide.md
@@ -18,9 +18,9 @@ We use Python in our lessons because:
 We are using a dataset with records on inflammation from patients following an
 arthritis treatment.
 
-We make reference in the lesson that this data is somehow strange. It is strange
-because it is fabricated! The script used to generate the inflammation data
-is included as [`code/gen_inflammation.py`](../code/gen_inflammation.py).
+We mention in the lesson that this data is suspicious; it has in fact been
+synthetically generated in Python by the imaginary "Dr. Maverick"! The script used to generate
+the inflammation data is included as [`code/gen_inflammation.py`](../code/gen_inflammation.py).
 
 ## Overall
 
diff --git a/fig/inflammation-03-imshow.svg b/fig/inflammation-03-imshow.svg
diff --git a/index.md b/index.md