Commit ca14dce

Merge branch 'main' into 263-pre-requisites
2 parents 571bd59 + 548b5e6 commit ca14dce

3 files changed: +179 -91 lines changed

s1/index.qmd

Lines changed: 167 additions & 91 deletions
@@ -45,96 +45,7 @@ execute:

 See [openstreetmap.org](https://www.openstreetmap.org/#map=19/53.80689/-1.55637) or search for other open access datasets for more ideas

-<!-- 2. Work through the transport chapter of Geocomputation with R: https://r.geocompx.org/transport.html -->

-<!-- See https://github.com/ITSLeeds/TDS/blob/master/practicals/2-software.md -->
-
-<!-- - In terms of future work in an evolving job market? -->
-
-<!-- - In terms of the kinds of problems you want to solve? -->
-
-<!-- ## Sketching research methods (in groups of 2-4, 30 minutes) -->
-
-<!-- Starting with the 1000 'desire lines' dataset of Leeds, sketch-out some research ideas that cover -->
-
-<!-- 1) Hypotheses: generate two hypotheses that are falsifiable and 2 hypotheses that are not falsifiable -->
-
-<!-- 2) Input data: draw schematic representations of additional datasets that you could use alongside the desire lines dataset, with at least one at each of these levels: -->
-
-<!-- - Zones -->
-
-<!-- - Points -->
-
-<!-- - Routes -->
-
-<!-- - Route networks -->
-
-<!-- - Individual -->
-
-<!-- What temporal and spatial resolution could each one have? -->
-
-<!-- 3) Methods: using a flow diagram (e.g. as shown below) -->
-
-```{r schematic, echo=FALSE}
-# knitr::include_graphics("https://raw.githubusercontent.com/npct/pct-team/master/flow-model/flow-diag2.png")
-```
-
-<!-- ## Practical, group computer task (30 minutes) -->
-
-<!-- Create a github account (all). See: https://github.com -->
-
-<!-- Building on the follow code chunk (but with no copy-and-pasting), create a data frame that contains the names, coffee habits and like/dislike of bus travel for everyone in your group (just 1 computer per group): -->
-
-<!-- ```{r} -->
-
-<!-- person_name = c( -->
-<!-- "robin", -->
-
-<!-- "malcolm", -->
-
-<!-- "richard" -->
-
-<!-- ) -->
-
-<!-- n_coffee = c( -->
-
-<!-- 5, -->
-
-<!-- 1, -->
-
-<!-- 0 -->
-
-<!-- ) -->
-
-<!-- like_bus_travel = c( -->
-
-<!-- TRUE, -->
-
-<!-- FALSE, -->
-
-<!-- TRUE -->
-
-<!-- ) -->
-
-<!-- personal_data = data.frame(person_name, n_coffee, like_bus_travel) -->
-
-<!-- personal_data -->
-
-<!-- ``` -->
-
-<!-- When you are complete, add your code to https://github.com/ITSLeeds/TDS/blob/master/code-r/01-person-data.R -->
-
-<!-- ## Learning outcomes -->
-
-```{r, echo=FALSE}
-# Identify available datasets and access and clean them
-# Combine datasets from multiple sources
-# Understand what machine learning is, which problems it is appropriate for compared with traditional statistical approaches, and how to implement machine learning techniques
-# Visualise and communicate the results of transport data science, and know about setting-up interactive web applications
-# Deciding when to use local computing power vs cloud services
-```
-
-<!-- - Articulate the relevance and limitations of data-centric analysis applied to transport problems, compared with other methods -->

 # Data Science foundations

@@ -241,7 +152,12 @@ crashes[[2]]

 ## Data science on real data

-To get some larger datasets, try the following (from Chapter 8 of RSRR)
+Work through the following example on road traffic data (recommended for most people) or the NTS data (for people more interested in travel survey data).
+You can do both if you have time.
+
+### UK Road Safety Data
+
+To get some larger datasets, try the following (from Chapter 8 of [RSRR](https://itsleeds.github.io/rrsrr/)):

 ::: {.panel-tabset group="language"}
 ## R
@@ -282,6 +198,166 @@ Let's go through these exercises together:

 - We'll explore this together

+### UK National Travel Survey (NTS) data
+
+<details>
+
+Note: you will need to download the modified NTS 2022 data from your Minerva module page and place it in your working directory for this section to work.
+
+```{r import_dataset}
+#| eval: false
+# Read CSV file
+NTS_data <- read.csv("NTS2022_modifieddata.csv")
+
+# Look at the column names
+names(NTS_data)
+
+# Look at the data
+head(NTS_data)
+```
+
+You should see something like this:
+
+```
+> names(NTS_data)
+ [1] "IndividualID"            "avg_trip_length"
+ [3] "avg_trip_length_weekday" "avg_trip_length_weekend"
+ [5] "total_distance"          "total_distance_weekday"
+ [7] "total_distance_weekend"  "SD_triplength"
+ [9] "sd_Total_Distance_wknd"  "sd_Total_Distance_wk"
+
+> head(NTS_data)
+  IndividualID avg_trip_length avg_trip_length_weekday avg_trip_length_weekend
+1   2023000001        4.080000                4.631579                2.333333
+2   2023000002        2.538462                2.400000                3.000000
+3   2023000003        5.916667                6.250000                5.250000
+```
+
+Visualising datasets is important when dealing with large volumes of data, as visualisations help convey complex information in an easily interpretable format. Consider the histogram plots of average trip lengths over a week in the UK.
+
+```{r avg-triplength-histogram, message=FALSE, warning=FALSE}
+#| eval: false
+# Note: This requires ggplot2 library to be loaded first
+library(tidyverse) # Tidyverse contains ggplot2 and other useful packages
+ggplot(NTS_data, aes(x = avg_trip_length)) +
+  geom_histogram(binwidth = 1, fill = "darkgrey") +
+  labs(
+    title = "Avg. Trip Length in Whole Week",
+    x = "Trip Length (km)",
+    y = "Number of Individuals"
+  ) +
+  theme_minimal() +
+  xlim(0, 50)
+```
+
+Data exploration or "exploratory data analysis" (EDA) involves examining datasets in depth to uncover underlying patterns or differences. The direction of this investigation is largely guided by the research question.
+
+Consider different histogram plots for weekdays and weekends. Can you identify any differences between them? (Clue: Check the number of individuals between 0-1 km)
+
+Think: What could be plausible reasons for such a difference?
+
+```{r avg-triplength-weekday-histogram, message=FALSE, warning=FALSE}
+#| eval: false
+ggplot(NTS_data, aes(x = avg_trip_length_weekday)) +
+  geom_histogram(binwidth = 1, fill = "darkblue") +
+  labs(
+    title = "Avg. Trip Length on Weekdays",
+    x = "Trip Length (km)",
+    y = "Number of Individuals"
+  ) +
+  theme_minimal() +
+  xlim(0, 50)
+```
+
+```{r avg-triplength-weekend-histogram, message=FALSE, warning=FALSE}
+#| eval: false
+ggplot(NTS_data, aes(x = avg_trip_length_weekend)) +
+  geom_histogram(binwidth = 1, fill = "darkred") +
+  labs(
+    title = "Avg. Trip Length on Weekends",
+    x = "Trip Length (km)",
+    y = "Number of Individuals"
+  ) +
+  theme_minimal() +
+  xlim(0, 50)
+```
+
+You can more easily compare the two histograms when they are placed in the same plot, with transparency added to the bars:
+
+```{r avg-triplength-weekday-weekend-histogram, message=FALSE, warning=FALSE}
+#| eval: false
+g_combined = ggplot() +
+  geom_histogram(data = NTS_data, aes(x = avg_trip_length_weekday),
+                 binwidth = 1, fill = "darkblue", alpha = 0.5) +
+  geom_histogram(data = NTS_data, aes(x = avg_trip_length_weekend),
+                 binwidth = 1, fill = "darkred", alpha = 0.5) +
+  labs(
+    title = "Avg. Trip Length on Weekdays (blue) and Weekends (red)",
+    x = "Trip Length (km)",
+    y = "Number of Individuals"
+  ) +
+  theme_minimal() +
+  xlim(0, 50)
+# Then 'print' the plot to show it:
+g_combined
+```
+
+You can save the plot with `ggsave()`:
+
+```{r}
+#| eval: false
+ggsave("avg_trip_length_weekday_weekend.png", plot = g_combined, width = 8, height = 6)
+```
+
+In a Quarto document (.qmd file), which you will use to write and submit your coursework, you can include the saved figure like this (we will come back to this later in the module):
+
+```
+![](avg_trip_length_weekday_weekend.png)
+```
+
+![Avg. Trip Length on Weekdays (blue) and Weekends (red)](images/avg_trip_length_weekday_weekend.png)
+
+```{r}
+#| eval: false
+#| echo: false
+# move the file:
+file.rename("avg_trip_length_weekday_weekend.png", "s1/images/avg_trip_length_weekday_weekend.png")
+```
+
+Don't they largely look the same? Can you stop here and infer that the trip length distributions for weekdays and weekends are largely similar? You might, depending on the resources at your disposal, but from an academic point of view we need to think about other potential dimensions where they could be different.
+
+Consider different histogram plots for 'Standard Deviation' of trip lengths over weekdays and weekends. Can you identify any differences between them? (Clue: Again, check the number of individuals with SD 0-2 km)
+
+Think: What could be plausible reasons for such a difference?
+
+```{r SD-triplength-weekday-histogram, message=FALSE, warning=FALSE}
+#| eval: false
+ggplot(NTS_data, aes(x = sd_Total_Distance_wk)) +
+  geom_histogram(binwidth = 0.5, fill = "darkblue") +
+  labs(
+    title = "SD of Trip Length on Weekdays",
+    x = "SD of trip length (km)",
+    y = "Number of Individuals"
+  ) +
+  theme_minimal() +
+  xlim(0, 25) + ylim(0, 1000)
+```
+
+```{r SD-triplength-weekend-histogram, message=FALSE, warning=FALSE}
+#| eval: false
+ggplot(NTS_data, aes(x = sd_Total_Distance_wknd)) +
+  geom_histogram(binwidth = 0.5, fill = "darkred") +
+  labs(
+    title = "SD of Trip Length on Weekends",
+    x = "SD of trip length (km)",
+    y = "Number of Individuals"
+  ) +
+  theme_minimal() +
+  xlim(0, 25) + ylim(0, 1000)
+```
+
+</details>
+
 # Self-study practical (1 hr)

 **Read and try to complete the exercises in Chapters 1 to 5 of the book [Reproducible Road Safety Research with R](https://itsleeds.github.io/rrsrr/).**
@@ -311,4 +387,4 @@ For details on installing packages see [here](https://docs.ropensci.org/stats19/

 - Think of a research question that you could answer with data science, and write it down in a .qmd file. Include a sketch of the data you would need to answer the question.

-- Sign-up to the Cadence platform as outlined at [itsleeds.github.io/tds/s2/#the-cadence-platform](https://itsleeds.github.io/tds/s2/#the-cadence-platform)
+- Sign-up to the Cadence platform as outlined at [itsleeds.github.io/tds/s2/#the-cadence-platform](https://itsleeds.github.io/tds/s2/#the-cadence-platform)
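
Note on the NTS section added in s1/index.qmd: the clue about individuals in the 0-1 km range can also be checked numerically instead of being read off the histograms. The following is only a minimal sketch. The data frame `nts_demo` and the helper `count_short()` are hypothetical, and the values are invented for illustration; only the column names `avg_trip_length_weekday` and `avg_trip_length_weekend` come from the diff.

```r
# Hypothetical stand-in for NTS_data: the values are invented,
# only the column names match those shown in the diff
nts_demo <- data.frame(
  avg_trip_length_weekday = c(0.5, 2.4, 6.3, 0.8, 12.1, 3.3),
  avg_trip_length_weekend = c(0.4, 3.0, 5.3, 0.9, 0.7, NA)
)

# Hypothetical helper: count values in [0, upper] km, ignoring missing values
count_short <- function(x, upper = 1) {
  sum(x >= 0 & x <= upper, na.rm = TRUE)
}

# Compare the number of short-trip individuals on weekdays and weekends
short_counts <- c(
  weekday = count_short(nts_demo$avg_trip_length_weekday),
  weekend = count_short(nts_demo$avg_trip_length_weekend)
)
short_counts
```

With the real data, `nts_demo` would be replaced by `NTS_data` after reading the CSV. The counts need not match a histogram bar exactly, because the bar edges depend on the binning settings used by `geom_histogram()`.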

s4/index.qmd

Lines changed: 12 additions & 0 deletions
@@ -99,6 +99,7 @@ We will start with a simple map of the world. Load the `world` object from the `
 ::: {.panel-tabset}
 ## R
 ```{r}
+#| eval: false
 #| echo: true
 #| output: false
 world = spData::world
@@ -120,6 +121,7 @@ Use some basic R functions to explore the `world` object. e.g. `class(world)`, `
 ::: {.panel-tabset}
 ## R
 ```{r}
+#| eval: false
 #| warning: false
 plot(world)
 ```
@@ -143,6 +145,7 @@ Note that this makes a map of each column in the data frame. Try some other plot
 ::: {.panel-tabset}
 ## R
 ```{r}
+#| eval: false
 plot(world[3:6])
 plot(world["pop"])
 ```
@@ -167,6 +170,7 @@ Load the `nz` and `nz_height` datasets from the `spData` package.
 ::: {.panel-tabset}
 ## R
 ```{r}
+#| eval: false
 #| echo: true
 #| output: false
 nz = spData::nz
@@ -185,6 +189,7 @@ We can use `tidyverse` functions like `filter` and `select` on `sf` objects in t
 ::: {.panel-tabset}
 ## R
 ```{r}
+#| eval: false
 #| echo: true
 #| output: false
 canterbury = nz |> filter(Name == "Canterbury")
@@ -241,6 +246,7 @@ In this section we will look at basic transport data in the R package **stplanr*
 Load the `stplanr` package as follows:

 ```{r}
+#| eval: false
 #| echo: true
 #| output: false
 library(stplanr)
@@ -255,6 +261,7 @@ First we will load some sample data:
 ::: {.panel-tabset}
 ## R
 ```{r}
+#| eval: false
 #| echo: true
 od_data = stplanr::od_data_sample
 zone = stplanr::cents_sf
@@ -274,6 +281,7 @@ Now we will rename one of the columns from `foot` to `walk`
 ::: {.panel-tabset}
 ## R
 ```{r}
+#| eval: false
 #| echo: true
 od_data = od_data |>
   rename(walk = foot)
@@ -292,6 +300,7 @@ Next we will made a new dataset `od_data_walk` by taking `od_data` and piping it

 ## R
 ```{r}
+#| eval: false
 #| echo: true
 od_data_walk = od_data |>
   filter(walk > 0) |>
@@ -313,6 +322,7 @@ We can use the generic `plot` function to view the relationships between variabl
 ::: {.panel-tabset}
 ## R
 ```{r}
+#| eval: false
 plot(od_data_walk)
 ```

@@ -328,6 +338,7 @@ R has built in modelling functions such as `lm` lets make a simple model to pred
 ::: {.panel-tabset}
 ## R
 ```{r}
+#| eval: false
 #| echo: true
 model1 = lm(proportion_walk ~ proportion_drive, data = od_data_walk)
 od_data_walk$proportion_walk_predicted = model1$fitted.values
@@ -349,6 +360,7 @@ We can use the `ggplot2` package to graph our model predictions.
 ::: {.panel-tabset}
 ## R
 ```{r}
+#| eval: false
 ggplot(od_data_walk) +
   geom_point(aes(proportion_drive, proportion_walk)) +
   geom_line(aes(proportion_drive, proportion_walk_predicted))
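
Note on the s4/index.qmd modelling chunks: since these chunks are now set to `eval: false`, the `lm()` pattern they use can be tried on its own with a small self-contained example. This is a rough sketch on invented data. The data frame `od_demo`, the object `model_demo` and all values are hypothetical; only the column names and the model formula come from the diff.

```r
# Hypothetical data standing in for od_data_walk; the values are invented
od_demo <- data.frame(
  proportion_drive = c(0.2, 0.4, 0.5, 0.7, 0.8),
  proportion_walk  = c(0.5, 0.35, 0.3, 0.15, 0.1)
)

# Fit a simple linear model with the same formula as the s4 chunk
model_demo <- lm(proportion_walk ~ proportion_drive, data = od_demo)

# Store the fitted values alongside the observations
od_demo$proportion_walk_predicted <- model_demo$fitted.values

# Inspect the intercept and slope (the slope is negative for these invented values)
coef(model_demo)
```

The fitted values have one entry per row of the input data, which is why they can be assigned directly to a new column.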
