Commit f09c4b8

Initial commit to add a Dataset-driven Narrative to the Lesson
1 parent fad1ace commit f09c4b8

8 files changed, with 448 additions and 28 deletions

_episodes/02-numpy.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ that can be called upon when needed.

 ## Loading data into Python

-To begin processing inflammation data, we need to load it into Python.
+To begin processing the clinical trial inflammation data, we need to load it into Python.
 We can do that using a library called
 [NumPy](http://docs.scipy.org/doc/numpy/ "NumPy Documentation"), which stands for Numerical Python.
 In general, you should use this library when you want to do fancy things with lots of numbers,
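
As a minimal sketch of the load step this hunk describes, `numpy.loadtxt` can read comma-separated values into an array. The three-row CSV text below is a made-up stand-in for the lesson's data file, used only so the example runs without any files on disk.

```python
import io

import numpy

# Hypothetical stand-in for a data file: one row per patient, one column per day.
csv_text = "0,1,2,1,0\n0,2,4,2,0\n0,1,3,1,0\n"

# numpy.loadtxt accepts a file name or a file-like object;
# delimiter="," tells it that values are separated by commas.
data = numpy.loadtxt(fname=io.StringIO(csv_text), delimiter=",")

# The shape is (rows, columns), i.e. (patients, days).
print(data.shape)
```

With a real file, the `fname` argument would be the path to the CSV instead of the in-memory text.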

_episodes/03-matplotlib.md

Lines changed: 16 additions & 9 deletions
@@ -30,9 +30,15 @@ matplotlib.pyplot.show()
 ![Heat map representing the `data` variable. Each cell is colored by value along a color gradient
 from blue to yellow.](../fig/inflammation-01-imshow.svg)

-Blue pixels in this heat map represent low values, while yellow pixels represent high values. As we
-can see, inflammation rises and falls over a 40-day period. Let's take a look at the average
-inflammation over time:
+Each row in the heat map corresponds to a patient in the clinical trial dataset, and each column
+corresponds to a day in the dataset. Blue pixels in this heat map represent low values, while yellow
+pixels represent high values. As we can see, the number of inflammation flare-ups across the patients
+rises and falls over a 40-day period.
+
+So far so good. This is in line with our knowledge of the clinical trial and Dr. Maverick's claims:
+the patients take their medication once their inflammation flare-ups begin; it takes around 3 weeks
+for the medication to take effect and begin reducing flare-ups; and flare-ups appear to drop to zero
+by the end of the clinical trial. Now let's take a look at the average inflammation over time:

 ~~~
 ave_inflammation = numpy.mean(data, axis=0)
@@ -45,8 +51,9 @@ matplotlib.pyplot.show()

 Here, we have put the average inflammation per day across all patients in the variable
 `ave_inflammation`, then asked `matplotlib.pyplot` to create and display a line graph of those
-values. The result is a roughly linear rise and fall, which is suspicious: we might instead expect
-a sharper rise and slower fall. Let's have a look at two other statistics:
+values. The result is a roughly linear rise and fall, in line with Dr. Maverick's claim that the
+medication takes 3 weeks to take effect. But a good data scientist doesn't just consider the
+average of a dataset, so let's have a look at two other statistics:

 ~~~
 max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
@@ -64,10 +71,10 @@ matplotlib.pyplot.show()

 ![A line graph showing the minimum inflammation across all patients over a 40-day period.](../fig/inflammation-01-minimum.svg)

-The maximum value rises and falls smoothly, while the minimum seems to be a step function. Neither
-trend seems particularly likely, so either there's a mistake in our calculations or something is
-wrong with our data. This insight would have been difficult to reach by examining the numbers
-themselves without visualization tools.
+The maximum value rises and falls linearly, while the minimum seems to be a step function.
+Suspicious... neither trend seems particularly likely, so either there's a mistake in our
+calculations or something is wrong with our data. This insight would have been difficult
+to reach by examining the numbers themselves without visualization tools.

 ### Grouping plots
 You can group similar plots in a single figure using subplots.
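
The per-day statistics discussed in this file can also be checked numerically rather than by eye. The sketch below is illustrative only: the small array is a hypothetical stand-in for the lesson's `data`, and testing for constant step sizes is just one possible way to flag a suspiciously linear maximum, not a method taken from the lesson.

```python
import numpy

# Hypothetical 3-patient, 5-day array standing in for the lesson's `data`.
data = numpy.array([
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
])

# axis=0 collapses the patient axis, leaving one value per day.
daily_mean = numpy.mean(data, axis=0)
daily_max = numpy.max(data, axis=0)
daily_min = numpy.min(data, axis=0)

# Day-to-day changes in the maximum. A perfectly linear rise and fall
# shows up as steps that all have the same absolute size.
max_steps = numpy.diff(daily_max)
looks_linear = len(set(numpy.abs(max_steps).tolist())) == 1

print(daily_max, daily_min, looks_linear)
```

Real measurements would rarely produce identical step sizes on every day, which is why a `True` result here is worth a closer look.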

_episodes/04-lists.md

Lines changed: 8 additions & 2 deletions
@@ -20,8 +20,14 @@ list[2:9]), in the same way as strings and arrays."
 - "Strings are immutable (i.e., the characters in them cannot be changed)."
 ---

-In the previous episode, we analyzed a single file with inflammation data. Our goal, however, is to
-process all the inflammation data we have, which means that we still have eleven more files to go!
+In the previous episode, we analyzed a single file of clinical trial inflammation data. However,
+after finding some peculiar, and potentially suspicious, trends in the trial data, we ask
+Dr. Maverick if they have performed any other clinical trials. Surprisingly, they say that they
+have and provide us with 11 more CSV files for a further 11 clinical trials they have undertaken
+since the initial trial.
+
+Our goal now is to process all the inflammation data we have, which means that we still have
+eleven more files to go!

 The natural first step is to collect the names of all the files that we have to process. In Python,
 a list is a way to store multiple values together. In this episode, we will learn how to store

_episodes/05-loop.md

Lines changed: 4 additions & 3 deletions
@@ -21,9 +21,10 @@ inflammation dataset (`inflammation-01.csv`), which revealed some suspicious fea

 ![Line graphs showing average, maximum and minimum inflammation across all patients over a 40-day period.](../fig/03-loop_2_0.png)

-We have a dozen data sets right now, though, and more on the way.
-We want to create plots for all of our data sets with a single statement.
-To do that, we'll have to teach the computer how to repeat things.
+We have a dozen data sets right now, though, and potentially more on the way if Dr. Maverick
+can keep up their surprisingly fast clinical trial rate. We want to create plots for all of
+our data sets with a single statement. To do that, we'll have to teach the computer how to
+repeat things.

 An example task that we might want to repeat is accessing numbers in a list,
 which we
_episodes/06-files.md

Lines changed: 49 additions & 5 deletions
@@ -107,11 +107,25 @@ inflammation-03.csv
 maximum and minimum inflammation over a 40-day period for all patients in the third
 dataset.](../fig/03-loop_49_5.png)

-Sure enough,
-the maxima of the first two data sets show exactly the same ramp as the first,
-and their minima show the same staircase structure;
-a different situation has been revealed in the third dataset,
-where the maxima are a bit less regular, but the minima are consistently zero.
+
+Hmmm. The plots generated for the second clinical trial file look very similar to the plots for
+the first file: their average plots show similar "noisy" rises and falls; their maxima plots
+show exactly the same linear rise and fall; and their minima plots show similar staircase
+structures.
+
+The third dataset shows much noisier average and maxima plots that are far less suspicious than
+those of the first two datasets; however, the minima plot shows that the third dataset minima are
+consistently zero across every day of the trial. If we produce a heat map for the third data file
+we see the following:
+
+![Heat map of the third inflammation dataset. Note that there are sporadic zero values throughout
+the entire dataset, and the last patient only has zero values over the 40-day study.](../fig/inflammation-03-imshow.svg)
+
+We can see that there are zero values sporadically distributed across all patients and days of the
+clinical trial, suggesting that there were potential issues with data collection throughout the
+trial. In addition, we can see that the last patient in the study didn't have any inflammation
+flare-ups at all throughout the trial, suggesting that they may not even suffer from arthritis.
+

 > ## Plotting Differences
 >
@@ -197,4 +211,34 @@ where the maxima are a bit less regular, but the minima are consistently zero.
 >{: .solution}
 {: .challenge}

+After spending some time investigating the heat map and statistical plots, as well as
+doing the above exercises to plot differences between datasets and to generate composite
+patient statistics, we gain some insight into the twelve clinical trial datasets:
+
+The datasets appear to fall into two categories:
+
+* seemingly "ideal" datasets that agree excellently with Dr. Maverick's claims,
+but display suspicious maxima and minima (such as `inflammation-01.csv` and `inflammation-02.csv`)
+* "noisy" datasets that somewhat agree with Dr. Maverick's claims, but show concerning
+data collection issues such as sporadic missing values and even an unsuitable candidate
+making it into the clinical trial.
+
+In fact, it appears that all three of the "noisy" datasets (`inflammation-03.csv`,
+`inflammation-08.csv`, and `inflammation-11.csv`) are identical down to the last value.
+Armed with this information, we confront Dr. Maverick about the suspicious data and
+duplicated files.
+
+Dr. Maverick confesses that they fabricated the clinical data after they found out
+that the initial trial suffered from a number of issues, including unreliable data-recording and
+poor participant selection. They created fake data to prove their drug worked, and when we asked
+for more trials they tried to generate more fake sets, as well as throwing in the original
+poor-quality dataset a few times to try and make all the trials seem more "realistic".
+
+Congratulations! We've cracked the case and proven that the inflammation datasets have been
+synthetically generated (in Python no less!).
+
+But it would be a shame to throw away the synthetic datasets that have taught us so much
+already, so we'll forgive the imaginary Dr. Maverick and continue to use the data to learn
+how to program.
+
 {% include links.md %}
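
The narrative added to this file turns on two observations: some datasets are identical down to the last value, and one patient has only zero readings. A rough sketch of how both could be checked with NumPy follows. The dictionary keys, the tiny arrays, and the two helper functions are hypothetical and invented for illustration; they are not part of the lesson or of this commit.

```python
import itertools

import numpy

# Hypothetical in-memory stand-ins for arrays loaded from the CSV files.
datasets = {
    "trial-a.csv": numpy.array([[0, 1, 2], [0, 2, 1]]),
    "trial-b.csv": numpy.array([[0, 3, 0], [0, 0, 0]]),
    "trial-c.csv": numpy.array([[0, 3, 0], [0, 0, 0]]),
}


def find_identical_pairs(named_arrays):
    """Return (name, name) pairs whose arrays match in shape and in every value."""
    return [
        (first, second)
        for first, second in itertools.combinations(sorted(named_arrays), 2)
        if numpy.array_equal(named_arrays[first], named_arrays[second])
    ]


def all_zero_patients(array):
    """Return the row indices (patients) with no non-zero reading on any day."""
    return numpy.flatnonzero(~array.any(axis=1)).tolist()


identical = find_identical_pairs(datasets)
flat_patients = {name: all_zero_patients(arr) for name, arr in datasets.items()}

print(identical)
print(flat_patients)
```

Comparing every pair is fine for a dozen files; for a much larger collection, hashing each file's contents first would avoid the quadratic number of comparisons.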

_extras/guide.md

Lines changed: 3 additions & 3 deletions
@@ -18,9 +18,9 @@ We use Python in our lessons because:
 We are using a dataset with records on inflammation from patients following an
 arthritis treatment.

-We make reference in the lesson that this data is somehow strange. It is strange
-because it is fabricated! The script used to generate the inflammation data
-is included as [`code/gen_inflammation.py`](../code/gen_inflammation.py).
+We note in the lesson that this data is suspicious and has been
+synthetically generated in Python by the imaginary "Dr. Maverick"! The script used to generate
+the inflammation data is included as [`code/gen_inflammation.py`](../code/gen_inflammation.py).

 ## Overall
