@@ -107,11 +107,25 @@ inflammation-03.csv
107107maximum and minimum inflammation over a 40-day period for all patients in the third
108108dataset.] ( ../fig/03-loop_49_5.png )
109109
110- Sure enough,
111- the maxima of the first two data sets show exactly the same ramp as the first,
112- and their minima show the same staircase structure;
113- a different situation has been revealed in the third dataset,
114- where the maxima are a bit less regular, but the minima are consistently zero.
110+
111+ Hmmm. The plots generated for the second clinical trial file look very similar to the plots for
112+ the first file: their average plots show similar "noisy" rises and falls; their maxima plots
113+ show exactly the same linear rise and fall; and their minima plots show similar staircase
114+ structures.
115+
116+ The third dataset shows much noisier average and maxima plots that are far less suspicious than
117+ those of the first two datasets; however, the minima plot shows that the third dataset's minima
118+ are consistently zero across every day of the trial. If we produce a heat map for the third data
119+ file, we see the following:
120+
121+ ![ Heat map of the third inflammation dataset. Note that there are sporadic zero values throughout
122+ the entire dataset, and the last patient has only zero values over the 40-day study.] ( ../fig/inflammation-03-imshow.svg )
123+
124+ We can see that there are zero values sporadically distributed across all patients and days of the
125+ clinical trial, suggesting that there were potential issues with data collection throughout the
126+ trial. In addition, we can see that the last patient in the study didn't have any inflammation
127+ flare-ups at all throughout the trial, suggesting that they may not even suffer from arthritis.
128+
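The two observations above (scattered zeros, and one patient with no non-zero readings) can also be checked numerically rather than by eye. The following is a minimal sketch, not part of the lesson's own code: `summarize_zeros` is a hypothetical helper name, and the small array is made-up data standing in for a real file.

```python
import numpy


def summarize_zeros(data):
    """Return (number of zero readings, indices of patients whose readings are all zero).

    `data` is a 2D array with one row per patient and one column per day.
    """
    zero_mask = (data == 0)
    # A patient (row) is "flat" if every daily reading in that row is zero.
    all_zero_patients = numpy.flatnonzero(zero_mask.all(axis=1))
    return int(zero_mask.sum()), all_zero_patients.tolist()


# Made-up example (3 patients, 4 days); this is NOT real trial data.
example = numpy.array([[0, 1, 2, 1],
                       [0, 0, 3, 0],
                       [0, 0, 0, 0]])
zero_count, flat_patients = summarize_zeros(example)
print(zero_count, flat_patients)
```

To try this on the real file, the array could be loaded with `numpy.loadtxt(fname='inflammation-03.csv', delimiter=',')`, assuming the file is in the current directory.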
115129
116130> ## Plotting Differences
117131>
@@ -197,4 +211,34 @@ where the maxima are a bit less regular, but the minima are consistently zero.
197211>{: .solution}
198212{: .challenge}
199213
214+ After spending some time investigating the heat map and statistical plots, as well as
215+ doing the above exercises to plot differences between datasets and to generate composite
216+ patient statistics, we gain some insight into the twelve clinical trial datasets:
217+
218+ The datasets appear to fall into two categories:
219+
220+ * seemingly "ideal" datasets that agree excellently with Dr. Maverick's claims,
221+ but display suspicious maxima and minima (such as `inflammation-01.csv` and `inflammation-02.csv`)
222+ * "noisy" datasets that somewhat agree with Dr. Maverick's claims, but show concerning
223+ data collection issues such as sporadic missing values and even an unsuitable candidate
224+ making it into the clinical trial.
225+
226+ In fact, it appears that all three of the "noisy" datasets (`inflammation-03.csv`,
227+ `inflammation-08.csv`, and `inflammation-11.csv`) are identical down to the last value.
228+ Armed with this information, we confront Dr. Maverick about the suspicious data and
229+ duplicated files.
230+
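A claim such as "identical down to the last value" can be verified directly by comparing the loaded arrays. The sketch below is one possible way to do it, under stated assumptions: `find_identical_groups` is a hypothetical helper, and the arrays and file names in the example are invented placeholders rather than the trial files.

```python
import numpy


def find_identical_groups(datasets):
    """Group dataset names whose arrays match in shape and in every value.

    `datasets` maps a name (for example, a filename) to a 2D numpy array.
    Only groups containing more than one name are returned.
    """
    groups = []  # each entry is (representative array, [names that match it])
    for name, data in datasets.items():
        for representative, names in groups:
            if numpy.array_equal(representative, data):
                names.append(name)
                break
        else:
            # No existing group matched, so this array starts a new group.
            groups.append((data, [name]))
    return [names for _, names in groups if len(names) > 1]


# Made-up arrays standing in for loaded files.
noisy = numpy.array([[0, 2, 0], [1, 0, 3]])
example_sets = {
    'a.csv': numpy.array([[0, 1, 2], [0, 2, 1]]),
    'b.csv': noisy,
    'c.csv': noisy.copy(),
}
duplicates = find_identical_groups(example_sets)
print(duplicates)
```

With the real data, the dictionary could be built by looping over `sorted(glob.glob('inflammation*.csv'))` and loading each file with `numpy.loadtxt(fname=name, delimiter=',')`.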
231+ Dr. Maverick confesses that they fabricated the clinical data after they found out
232+ that the initial trial suffered from a number of issues, including unreliable data recording and
233+ poor participant selection. They created fake data to prove their drug worked, and when we asked
234+ for more trials they generated more fake datasets and threw in the original
235+ poor-quality dataset a few times to try to make all the trials seem more "realistic".
236+
237+ Congratulations! We've cracked the case and proven that the inflammation datasets have been
238+ synthetically generated (in Python, no less!).
239+
240+ But it would be a shame to throw away the synthetic datasets that have taught us so much
241+ already, so we'll forgive the imaginary Dr. Maverick and continue to use the data to learn
242+ how to program.
243+
200244{% include links.md %}