# Quarto reports and slides

## Introduction

From the previous section, you have now set up a proper computational project
structure. You've added your pipeline R scripts which download, manipulate,
filter, and otherwise process "raw data" into "processed data", the latter
being the starting point of data analysis, visualization, and statistical
inference. You've also learned to write standalone scripts to do all of that.

**Honestly, what you've learned in the previous chapter on general R project
setup is everything you might need to get through a Master's project or even
a PhD. To reiterate:**

1. **Your data processing code is now fully automated and completely reproducible.**

2. **You can also generate results, figures, and summaries in an equally
automated way.**

**In principle, you don't need anything else beyond nicely organized scripts
which produce results (tables, figures, etc.) in an equally organized way.**

**In this session, you will learn about another possibility of doing
reproducible data science. Not a replacement for the previous script-based
workflows, but a complementary approach to doing the same thing. I personally
use both approaches.**

**Let's introduce the [Quarto system](https://quarto.org) for reproducible
scientific research. First, please take a moment to watch
[this wonderful presentation](https://www.youtube.com/watch?v=_f3latmOhew)**
(you can stop watching after about 15 minutes, when the discussion turns to
building websites).

---

## Quick, fast-forward exercise

If you didn't go through the previous exercises, don't worry. There's
a lot you can learn without relying on previously-defined scripts.

For instance, **download the [source file](https://github.com/bodkan/simgen/blob/main/slides_tidy-viz.qmd) for my slides on _ggplot2_.
They are written in Quarto, so try opening them in your RStudio,
click on the `"Source"` on top of your editor window, and hit the `"=> Render"`
button.**

Slides should appear! The slides (and specifically the source file
for these slides) are a great example of a "reproducible presentation":
a presentation which is actually generated by R and includes your various
plots, comments, tables, etc.

---

**Now, use the reference materials about [writing Quarto presentations](https://quarto.org/docs/presentations/)
and, together with my example slides I linked above, try to write your own
presentation, perhaps using the examples from our _ggplot2_ session (pretending
that you're creating slides for a group update!). Alternatively, use this
as a basis to create your own presentation about your own project. Even just
a couple of slides, with some basic text formatting, and 2-3 figures from
your own results is an amazing start.**

---

**In the "header" section of the Quarto source file (header is the text
between `---` and `---` on top), change `revealjs` to `html`. Click on the
`=> Render` button again and see what happens. Now you have an (again,
completely reproducible) "computational report" instead of slides!**
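
For orientation, here is a rough sketch of what such a header might look like after the change. The `title` and `author` values are placeholders I made up, and the real header of the slides may contain more options than this:

```yaml
---
title: "My ggplot2 slides"   # placeholder, not the actual title
author: "Your Name"          # placeholder
format: html                 # this line previously said: revealjs
---
```

Everything below the header stays the same. Only the output format changes.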

---

## Exercise 1: Creating a reproducible report

The previous exercises were focused on our metadata and IBD data, and turning
our disorganized pile of code into proper project structure.

In this session, we'll do a similar thing, but focus instead on the
analysis of Neanderthal proportions in a time-series of aDNA individuals from
Europe [discussed here](https://bodkan.net/simgen/tidy-viz.html#part-2-practicing-visualization-on-time-series-data).

**First create a new blank Quarto document by doing the following:**

1. Click on `File` `->` `New File` `->` `Quarto Document...`.
2. Click on `Create Empty Document`.
3. Save the file under `notebooks/neand_ancestry.qmd`.

(I like to call what we'll be creating "a notebook", because it's very
similar to a normal lab notebook.)

---

**In your new Quarto file, first make sure you have the "Source" view turned
on (top left of your editor window), at least for now. You may want to switch
to "Visual" later, when you start writing!**

---

**Paste in the following template (just to save you a lot of annoying typing):**

::: {.callout-note collapse="true" icon=false}
#### Click to reveal code to be added to your Quarto document

```{bash}
#| echo: false
cat files/repro/neand_ancestry.qmd
```

:::

---

**Click on the `"=> Render"` button on top of your editor window and see the
magic happen!**

---

**You can also check the `"Render on Save"` box on top of your editor window.
See what happens when you save the document using CTRL / CMD + S. This can
slow things down for long-running analyses, but is very convenient otherwise.**

---

**In a Quarto document (of any kind: reports, presentations, anything), this
is a very important component. It's called a "code block":**


``` {{r}}
# here is your code
```

**Whenever a Quarto document is rendered, R executes code in these code blocks!
It then includes a figure, prints the result, etc., which then becomes the
part of the resulting document. I hope you can now appreciate how useful
this is for:**

1. **Making your research more reproducible** -- the code _and_ the results
are part of a single document, which is run top to bottom, automatically!

2. **Making your research easier to do** -- this is effectively a lab notebook
of your research activity for that particular project. You can write notes,
comments, reminders, conclusions, etc.

**This allows you to avoid hunting for _"which bit of code created this
particular figure, and where is it?"_, which is very, very stressful at times,
especially close to deadlines.**
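
As a minimal sketch of such a code block, here is a chunk that draws a figure from made-up toy data. The `label` and `fig-cap` values are my own illustrative choices, not something taken from the template:

``` {{r}}
#| label: fig-squares
#| fig-cap: "Squares of the first ten integers"

# toy data, made up purely for illustration
x <- 1:10
y <- x^2

# the figure produced here is embedded in the rendered document
plot(x, y, type = "b", xlab = "x", ylab = "x squared")
```

When the document is rendered, the plot should appear right below the chunk, together with its caption.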

---

## Exercise 2: Completing your report

Now that you've rendered the document, you can see that I left you guidelines
and blanks to fill in in the Quarto template. **Using the set of exercises
on the topic of Neanderthal proportion in a time-series of aDNA individuals
from Europe
[discussed here](https://bodkan.net/simgen/tidy-viz.html#part-2-practicing-visualization-on-time-series-data),
fill in the blanks accordingly.**

**Try to get in the mindset of using this document as your lab notebook!
If you didn't manage to get through the exercises on analyzing and plotting
Neanderthal ancestry proportions at the link above, use this opportunity
to work on those exercises, this time using your Quarto document as a means
to solve them.**

**Hint:** Again, if you ever need help, here it is:

::: {.callout-note collapse="true" icon=false}
#### Click to see the completed Quarto document

You can find the complete source document
[here](https://github.com/bodkan/simgen/blob/main/files/repro/neand_ancestry_complete.qmd).

You can download the rendered version
[here](https://github.com/bodkan/simgen/blob/main/files/repro/neand_ancestry_complete.html).


:::

## Exercise 3: Adjusting code chunk options

You can see that the final report contains both the code and the results
of this code. Sometimes you don't want that, particularly when you want to
create not a document, but presentation slides like you will do in the
next exercise.

My favourites (and the only ones I personally remember) are these ones:

- Show code, but keep it folded at first (the reader has to click to reveal
it)! This is my favourite, because sometimes your supervisor doesn't want to
read code, they just want to see a figure. :)

```{r}
#| code-fold: true
```

``` {{r}}
#| code-fold: true

# here is your code which will be folded (hidden until the reader clicks)
```

- Don't show code, but show results:

``` {{r}}
#| echo: false

# here is your code which will not be shown in the report
```

- Show code, but don't evaluate it (it produces no results):

``` {{r}}
#| eval: false

# here is your code which will be shown in the report, but not run
```

**Experiment with the above-mentioned options in your report (or slides
in the next exercise).
Here's a [very useful summary](https://rpubs.com/drgregmartin/1266667)
of many more options.**
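
If you find yourself repeating the same option in every chunk, Quarto also lets you set defaults once in the document header. The following is only a sketch: the title is a placeholder, and which defaults make sense depends on your report:

```yaml
---
title: "Neanderthal ancestry"   # placeholder title
format:
  html:
    code-fold: true   # fold every code block in the HTML report
execute:
  echo: true          # show code (set to false to hide it everywhere)
  warning: false      # do not print R warnings in the report
---
```

Options given with `#|` inside an individual chunk still take precedence over these document-wide defaults.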


## Exercise 4: Creating slides

Here's my favourite aspect of Quarto. You can not only create fully reproducible
"lab notebooks", but you can also create automatically generated slides. This
is extremely useful as a reliable way to keep your presentations for group
meetings, etc. up to date.

**Create a new Quarto Document (`File` `->` `New File` `->` `Quarto Presentation...` `->` `Create Empty Document`). Then copy the entire contents
of your `reports/neand_ancestry.qmd` into this new document, and save it
as `reports/neand_ancestry_slides.qmd`.**

**Then change this one single line in the header at the top of your file,
changing `format: html` to `format: revealjs`.**

**Then click the `"Render"` button again! Observe the magic happen!**

---

It's probably obvious to you now that slides have different requirements
than documents. For one, including lots of code (or maybe any code)
isn't that useful. Showing `library(...)` calls in a presentation
doesn't make much sense either, and slightly different formatting might
be needed.

**Take a look at [this](https://quarto.org/docs/presentations/revealjs/) overview
of the Quarto slides functionality. Then edit your slides (remove unnecessary
headings/slide titles, etc.) to make them more suitable for presentation
in a meeting.**
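
To give a flavour of the slide syntax, here is a small sketch of two slides. The slide titles and bullet points are placeholders of my own. By default, each `##` heading starts a new slide:

```markdown
## Neanderthal ancestry over time

::: {.incremental}
- first point, revealed on click
- second point, revealed on the next click
:::

## Linear model

Some text shown immediately.

. . .

Text shown only after a pause.
```

The `.incremental` block reveals list items one at a time, and a line with `. . .` inserts a pause within a slide.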

**For a more practical set of tips (how to include animated slides, how
to do formatting, how to include images, etc.), you can take a look at
the source `.qmd` file for the
[introduction presentation for this workshop](https://github.com/bodkan/simgen/blob/main/slides_whoami.qmd).
You can click through them yourself interactively
[here](https://bodkan.quarto.pub/slides_whoami/).**

**Note:** Remove slides which are not useful, show only code which is
important (like the `lm` model?), and focus on figures and the `summary()`
of the linear regression results.


::: {.callout-note collapse="true" icon=false}
#### Click to see my attempt at cleaner slides

You can find the complete source document
[here](https://github.com/bodkan/simgen/blob/main/files/repro/neand_ancestry_complete_slides.qmd).

You can download the slides
[here](https://github.com/bodkan/simgen/blob/main/files/repro/neand_ancestry_complete_slides.html).


:::


## Exercise 5: Recording R session information

**The following command should be included at the end of your "Quarto reports".
When you run it, how would you read and interpret the information it provides?
What do you think is the most important information which might be missing in
case you need to pick up someone else's project or script?**

```{r}
sessionInfo()
```

**Create a new chunk at the end of your document and add this command
to it, so that its output is included every time the document is rendered.**

## Exercise 6: The entire point of this workshop

In this final exercise, I would like you to take whatever data you have,
and try to use some of what you've learned so far --- about R programming,
about _tidyverse_, about _ggplot2_ --- and create a Quarto report in which
you will put some of what you've learned into practice.

Alternatively, if you have a messy set of scripts (we all have that, even
my stuff is messy, don't worry) ready and some results already generated,
try to transform them into what would be a nice, automated, and reproducible
Quarto report.

You could also take code which you now realize could be organized in a more
structured way, perhaps like we've learned in the previous session on building
R pipelines, and transform it into a tidy step-by-step cascade of R scripts.

The sky is the limit!


# Fun fact

This entire workbook and course is
[written in Quarto](https://github.com/bodkan/simgen)! :)