Commit 06bb720

Merge pull request #964 from mike-ivs/gh-pages
Adding narrative to the datasets (closes #909, closes #887)
2 parents fad1ace + db3f3a5 commit 06bb720

File tree

8 files changed: +463 −34 lines


_episodes/02-numpy.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ that can be called upon when needed.
 
 ## Loading data into Python
 
-To begin processing inflammation data, we need to load it into Python.
+To begin processing the clinical trial inflammation data, we need to load it into Python.
 We can do that using a library called
 [NumPy](http://docs.scipy.org/doc/numpy/ "NumPy Documentation"), which stands for Numerical Python.
 In general, you should use this library when you want to do fancy things with lots of numbers,
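The loading step this hunk describes can be sketched roughly as below. Since the lesson's `inflammation-01.csv` is not part of this commit, the sketch first writes a tiny stand-in CSV; the `demo-inflammation.csv` name and its values are ours, not the lesson's.

```python
import numpy

# Write a tiny stand-in CSV so the example is self-contained; the lesson
# itself loads inflammation-01.csv, which is not included on this page.
with open('demo-inflammation.csv', 'w') as f:
    f.write('0,1,2,1\n0,2,3,0\n')

# numpy.loadtxt parses the comma-delimited values into a 2-D array:
# one row per patient, one column per day.
data = numpy.loadtxt(fname='demo-inflammation.csv', delimiter=',')
print(data.shape)  # (2, 4): 2 patients, 4 days
```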

_episodes/03-matplotlib.md

Lines changed: 24 additions & 13 deletions
@@ -13,8 +13,8 @@ keypoints:
 ---
 
 ## Visualizing data
-The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers," and
-the best way to develop insight is often to visualize data. Visualization deserves an entire
+The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers,"
+and the best way to develop insight is often to visualize data. Visualization deserves an entire
 lecture of its own, but we can explore a few features of Python's `matplotlib` library here. While
 there is no official plotting library, `matplotlib` is the _de facto_ standard. First, we will
 import the `pyplot` module from `matplotlib` and use two of its functions to create and display a
@@ -30,9 +30,19 @@ matplotlib.pyplot.show()
 ![Heat map representing the `data` variable. Each cell is colored by value along a color gradient
 from blue to yellow.](../fig/inflammation-01-imshow.svg)
 
-Blue pixels in this heat map represent low values, while yellow pixels represent high values. As we
-can see, inflammation rises and falls over a 40-day period. Let's take a look at the average
-inflammation over time:
+Each row in the heat map corresponds to a patient in the clinical trial dataset, and each column
+corresponds to a day in the dataset. Blue pixels in this heat map represent low values, while
+yellow pixels represent high values. As we can see, the general number of inflammation flare-ups
+for the patients rises and falls over a 40-day period.
+
+So far so good, as this is in line with our knowledge of the clinical trial and Dr. Maverick's
+claims:
+
+* the patients take their medication once their inflammation flare-ups begin
+* it takes around 3 weeks for the medication to take effect and begin reducing flare-ups
+* and flare-ups appear to drop to zero by the end of the clinical trial.
+
+Now let's take a look at the average inflammation over time:
 
 ~~~
 ave_inflammation = numpy.mean(data, axis=0)
@@ -45,8 +55,9 @@ matplotlib.pyplot.show()
 
 Here, we have put the average inflammation per day across all patients in the variable
 `ave_inflammation`, then asked `matplotlib.pyplot` to create and display a line graph of those
-values. The result is a roughly linear rise and fall, which is suspicious: we might instead expect
-a sharper rise and slower fall. Let's have a look at two other statistics:
+values. The result is a reasonably linear rise and fall, in line with Dr. Maverick's claim that
+the medication takes 3 weeks to take effect. But a good data scientist doesn't just consider the
+average of a dataset, so let's have a look at two other statistics:
 
 ~~~
 max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
@@ -64,18 +75,18 @@ matplotlib.pyplot.show()
 
 ![A line graph showing the minimum inflammation across all patients over a 40-day period.](../fig/inflammation-01-minimum.svg)
 
-The maximum value rises and falls smoothly, while the minimum seems to be a step function. Neither
-trend seems particularly likely, so either there's a mistake in our calculations or something is
-wrong with our data. This insight would have been difficult to reach by examining the numbers
-themselves without visualization tools.
+The maximum value rises and falls linearly, while the minimum seems to be a step function.
+Neither trend seems particularly likely, so either there's a mistake in our calculations or
+something is wrong with our data. This insight would have been difficult to reach by examining
+the numbers themselves without visualization tools.
 
 ### Grouping plots
 You can group similar plots in a single figure using subplots.
 This script below uses a number of new commands. The function `matplotlib.pyplot.figure()`
 creates a space into which we will place all of our plots. The parameter `figsize`
 tells Python how big to make this space. Each subplot is placed into the figure using
-its `add_subplot` [method]({{ page.root }}/reference.html#method). The `add_subplot` method takes 3
-parameters. The first denotes how many total rows of subplots there are, the second parameter
+its `add_subplot` [method]({{ page.root }}/reference.html#method). The `add_subplot` method takes
+3 parameters. The first denotes how many total rows of subplots there are, the second parameter
 refers to the total number of subplot columns, and the final parameter denotes which subplot
 your variable is referencing (left-to-right, top-to-bottom). Each subplot is stored in a
 different variable (`axes1`, `axes2`, `axes3`). Once a subplot is created, the axes can
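The `add_subplot` layout described above can be sketched roughly as below. The small `data` array is a made-up stand-in for the lesson's inflammation file, and the non-interactive `Agg` backend is used so no display is needed.

```python
import numpy
import matplotlib
matplotlib.use('Agg')          # draw off-screen; no display needed
import matplotlib.pyplot

# Small stand-in for the inflammation data (rows = patients, columns = days);
# the lesson itself loads inflammation-01.csv instead.
data = numpy.array([[0.0, 1.0, 2.0, 1.0],
                    [0.0, 2.0, 4.0, 0.0],
                    [0.0, 3.0, 6.0, 2.0]])

fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

# add_subplot(rows, cols, index): one row of three plots, filled left to right.
axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)

axes1.set_ylabel('average')
axes1.plot(numpy.mean(data, axis=0))   # per-day mean across patients

axes2.set_ylabel('max')
axes2.plot(numpy.max(data, axis=0))

axes3.set_ylabel('min')
axes3.plot(numpy.min(data, axis=0))

fig.tight_layout()
```

Note that `axis=0` collapses the patient dimension, giving one statistic per day, which is what each of the three line plots shows.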

_episodes/04-lists.md

Lines changed: 8 additions & 2 deletions
@@ -20,8 +20,14 @@ list[2:9]), in the same way as strings and arrays."
 - "Strings are immutable (i.e., the characters in them cannot be changed)."
 ---
 
-In the previous episode, we analyzed a single file with inflammation data. Our goal, however, is to
-process all the inflammation data we have, which means that we still have eleven more files to go!
+In the previous episode, we analyzed a single file of clinical trial inflammation data. However,
+after finding some peculiar and potentially suspicious trends in the trial data, we ask
+Dr. Maverick if they have performed any other clinical trials. Surprisingly, they say that they
+have and provide us with 11 more CSV files for a further 11 clinical trials they have undertaken
+since the initial trial.
+
+Our goal now is to process all the inflammation data we have, which means that we still have
+eleven more files to go!
 
 The natural first step is to collect the names of all the files that we have to process. In Python,
 a list is a way to store multiple values together. In this episode, we will learn how to store
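The list behaviour this episode introduces can be sketched roughly as below; the filenames here are illustrative stand-ins for the clinical trial files.

```python
# A list stores multiple values together, in order.
filenames = ['inflammation-01.csv', 'inflammation-02.csv', 'inflammation-03.csv']

print(len(filenames))                     # how many values the list holds
print(filenames[0])                       # indexing starts at 0
filenames.append('inflammation-04.csv')   # lists are mutable: we can add to them
print(filenames[1:3])                     # slicing works as it does for strings
```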

_episodes/05-loop.md

Lines changed: 8 additions & 5 deletions
@@ -19,11 +19,13 @@ In the episode about visualizing data,
 we wrote Python code that plots values of interest from our first
 inflammation dataset (`inflammation-01.csv`), which revealed some suspicious features in it.
 
-![Line graphs showing average, maximum and minimum inflammation across all patients over a 40-day period.](../fig/03-loop_2_0.png)
+![Line graphs showing average, maximum and minimum inflammation across all patients over a 40-day
+period.](../fig/03-loop_2_0.png)
 
-We have a dozen data sets right now, though, and more on the way.
-We want to create plots for all of our data sets with a single statement.
-To do that, we'll have to teach the computer how to repeat things.
+We have a dozen data sets right now and potentially more on the way if Dr. Maverick
+can keep up their surprisingly fast clinical trial rate. We want to create plots for all of
+our data sets with a single statement. To do that, we'll have to teach the computer how to
+repeat things.
 
 An example task that we might want to repeat is accessing numbers in a list,
 which we
@@ -148,7 +150,8 @@ for variable in collection:
 
 Using the odds example above, the loop might look like this:
 
-![Loop variable 'num' being assigned the value of each element in the list `odds` in turn and then being printed](../fig/05-loops_image_num.png)
+![Loop variable 'num' being assigned the value of each element in the list `odds` in turn and
+then being printed](../fig/05-loops_image_num.png)
 
 where each number (`num`) in the variable `odds` is looped through and printed one number after
 another. The other numbers in the diagram denote which loop cycle the number was printed in (1
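The loop over `odds` described above can be sketched as follows; the `printed` list is our addition, used only to record what the loop produces.

```python
odds = [1, 3, 5, 7]

printed = []            # record what the loop prints, for checking
for num in odds:        # num is assigned each element of odds in turn
    print(num)
    printed.append(num)
```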

_episodes/06-files.md

Lines changed: 51 additions & 5 deletions
@@ -45,6 +45,7 @@ This means we can loop over it
 to do something with each filename in turn.
 In our case,
 the "something" we want to do is generate a set of plots for each file in our inflammation dataset.
+
 If we want to start by analyzing just the first three files in alphabetical order, we can use the
 `sorted` built-in function to generate a new sorted list from the `glob.glob` output:
 
@@ -107,11 +108,26 @@ inflammation-03.csv
 maximum and minimum inflammation over a 40-day period for all patients in the third
 dataset.](../fig/03-loop_49_5.png)
 
-Sure enough,
-the maxima of the first two data sets show exactly the same ramp as the first,
-and their minima show the same staircase structure;
-a different situation has been revealed in the third dataset,
-where the maxima are a bit less regular, but the minima are consistently zero.
+The plots generated for the second clinical trial file look very similar to the plots for
+the first file: their average plots show similar "noisy" rises and falls; their maxima plots
+show exactly the same linear rise and fall; and their minima plots show similar staircase
+structures.
+
+The third dataset shows much noisier average and maxima plots that are far less suspicious than
+the first two datasets; however, the minima plot shows that the third dataset's minima are
+consistently zero across every day of the trial. If we produce a heat map for the third data file,
+we see the following:
+
+![Heat map of the third inflammation dataset. Note that there are sporadic zero values throughout
+the entire dataset, and the last patient only has zero values over the 40-day study.
+](../fig/inflammation-03-imshow.svg)
+
+We can see that there are zero values sporadically distributed across all patients and days of the
+clinical trial, suggesting that there were potential issues with data collection throughout the
+trial. In addition, we can see that the last patient in the study didn't have any inflammation
+flare-ups at all throughout the trial, suggesting that they may not even suffer from arthritis!
 
 > ## Plotting Differences
 >
@@ -197,4 +213,34 @@ where the maxima are a bit less regular, but the minima are consistently zero.
 >{: .solution}
 {: .challenge}
 
+After spending some time investigating the heat map and statistical plots, as well as
+doing the above exercises to plot differences between datasets and to generate composite
+patient statistics, we gain some insight into the twelve clinical trial datasets.
+
+The datasets appear to fall into two categories:
+
+* seemingly "ideal" datasets that agree excellently with Dr. Maverick's claims,
+but display suspicious maxima and minima (such as `inflammation-01.csv` and `inflammation-02.csv`)
+* "noisy" datasets that somewhat agree with Dr. Maverick's claims, but show concerning
+data collection issues such as sporadic missing values and even an unsuitable candidate
+making it into the clinical trial.
+
+In fact, it appears that all three of the "noisy" datasets (`inflammation-03.csv`,
+`inflammation-08.csv`, and `inflammation-11.csv`) are identical down to the last value.
+Armed with this information, we confront Dr. Maverick about the suspicious data and
+duplicated files.
+
+Dr. Maverick confesses that they fabricated the clinical data after they found out
+that the initial trial suffered from a number of issues, including unreliable data-recording and
+poor participant selection. They created fake data to prove their drug worked, and when we asked
+for more data they tried to generate more fake datasets, as well as throwing in the original
+poor-quality dataset a few times to try and make all the trials seem a bit more "realistic".
+
+Congratulations! We've investigated the inflammation data and proven that the datasets have been
+synthetically generated.
+
+But it would be a shame to throw away the synthetic datasets that have taught us so much
+already, so we'll forgive the imaginary Dr. Maverick and continue to use the data to learn
+how to program.
+
 {% include links.md %}
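The `sorted`-plus-`glob.glob` step mentioned at the top of this file's changes can be sketched as below. The `demo_data` directory and its empty stand-in files are ours, created only so the example is self-contained; the lesson assumes the real `inflammation-*.csv` files already sit on disk.

```python
import glob
import os

# Create a few empty stand-in files, deliberately in non-alphabetical order.
os.makedirs('demo_data', exist_ok=True)
for name in ['inflammation-03.csv', 'inflammation-01.csv', 'inflammation-02.csv']:
    open(os.path.join('demo_data', name), 'w').close()

# glob.glob returns matching paths in arbitrary order, so wrap it in
# sorted() to process the files alphabetically.
filenames = sorted(glob.glob(os.path.join('demo_data', 'inflammation*.csv')))
print(filenames[:3])
```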

_extras/guide.md

Lines changed: 3 additions & 3 deletions
@@ -18,9 +18,9 @@ We use Python in our lessons because:
 We are using a dataset with records on inflammation from patients following an
 arthritis treatment.
 
-We make reference in the lesson that this data is somehow strange. It is strange
-because it is fabricated! The script used to generate the inflammation data
-is included as [`code/gen_inflammation.py`](../code/gen_inflammation.py).
+We make reference in the lesson that this data is suspicious and has been
+synthetically generated in Python by the imaginary "Dr. Maverick"! The script used to generate
+the inflammation data is included as [`code/gen_inflammation.py`](../code/gen_inflammation.py).
 
 ## Overall
 