PhenologyShiny/DataManagement_files/DataManagement.Rmd at main · wleuenberger/PhenologyShiny · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
---
title: "DataManagement"
author: "Wendy Leuenberger"
date: "`r Sys.Date()`"
output:
  html_document:
    keep_md: true
---

## Note:
# 2023 data was copied directly into the previous year's complete data frame (i.e., "CleanedPhenologyData2017to2022.csv)
# Because we've successfully merged all previous years, moving forward, the new year's data can be copied in at the bottom of the previously complete data frame. E.g., 2024 data can be copied into the bottom of "CleanedPhenologyData2017to2024.csv"

```{r setup, include=FALSE}
rm(list=ls()) #remove all previous variables
knitr::opts_chunk$set(echo = TRUE)
```

```{r load packages, include = FALSE}
library(magrittr)
library(tidyverse)
library(readxl)
library(lubridate)
```

```{r figure settings, include = FALSE}
tbw = theme_bw(base_size = 18)
th = theme(panel.grid = element_blank())
```

## Read in data

Note: This R file should be located in the same working directory as the data. Otherwise reading data from other places in the google drive is challenging and I couldn't get it to work. Might be tricky since the drive is "shared with me" not a shared drive or my drive.

```{r, warning = FALSE, message = FALSE}
Data2017 <- read_csv('test_form5_2019_12_19_14_56_01_542496.csv')
Data2018 <- read_csv(
  'Form_BS162_2018_v2_2019_12_19_14_56_54_642030.csv')
Data2019 <- read_csv(
  'Form_BS162_2019_2019_12_19_14_57_47_271712.csv')
Data2021 <- read_xlsx(
  '2021Fall_TreePhenologyData_All_wPhotoLinks.xlsx')
Data2022 <- read_xlsx('TreePhenologyData2022Shiny.xlsx')
# remove empty rows
Data2022 <- na.omit(Data2022)
# remove whitespace from SPECIES
Data2022$SPECIES <- trimws(Data2022$SPECIES, which = "right")

```

## Merge data

### Merge 2017-2019

```{r}
Data171819 <- Data2017 %>%
  bind_rows(Data2018) %>%
  bind_rows(Data2019)
```

### Species columns

```{r, message = FALSE}
# Remove \n from the SPECIES column
Data2021 %<>%
  mutate(SPECIES = str_remove(SPECIES, '\n'))
Data2022 %<>%
  mutate(SPECIES = str_remove(SPECIES, '\n'))

# Take a look at the three species columns
Data2021$SPECIES %>% table
Data2022$SPECIES %>% table
Data171819$species %>% table

# Create dataframe with both styles of species names, 2022 species names are identical to 2021
SpeciesLookUp <- tibble(
  SPECIES = unique(Data2021$SPECIES) %>% sort,
  species = c(unique(Data171819$species), 'OSVI') %>% sort)

# Add the species names to both spreadsheets
Data171819 %<>%
  left_join(SpeciesLookUp)
Data2021 %<>%
  left_join(SpeciesLookUp)
Data2022 %<>%
  left_join(SpeciesLookUp)

```

### Leaf color columns

Will probably want to round to the nearest 5% since most people did so.

```{r, message = FALSE}
Data171819$color %>% table
# ggplot(Data171819, aes(x = color)) +
#   geom_bar() + tbw + th
Data2021$`LEAF COLOR` %>% table
# Note: bars barely show up due to 0.01 width
# ggplot(Data2021, aes(x = `LEAF COLOR`)) +
#   geom_bar() + tbw + th
Data2022$`LEAF COLOR` %>% table
```

### Leaf fall columns

I'm not sure how students found that a negative percent of leaves were falling. We can remove those three rows since we have replication and don't know what to do with them. There is also a lot of precision, 0.0001 recorded in 2021. We'll need to round those values.

```{r}
Data171819$fall %>% table
Data2021$`LEAF FALL` %>% table
Data2022$`LEAF FALL` %>% table
```

### Accession number columns

140 trees are in both datasets. 36 trees are only in one dataset.

```{r}
# Most individual/ACCESSION values are in ALL dataframes
# Note: Should later compare all 3 dataframes together
# F:13 T:140
unique(Data171819$individual) %in%
  unique(Data2021$ACCESSION) %>%
  table
# F:23 T:140
unique(Data2021$ACCESSION) %in%
  unique(Data171819$individual) %>%
  table
# F:19 T:134
unique(Data171819$individual) %in%
  unique(Data2022$ACCESSION) %>%
  table
# F:34 T:134
unique(Data2022$ACCESSION) %in%
  unique(Data171819$individual) %>%
  table
# F:7 T:156
unique(Data2021$ACCESSION) %in%
  unique(Data2022$ACCESSION) %>%
  table
# F:12 T:156
unique(Data2022$ACCESSION) %in%
  unique(Data2021$ACCESSION) %>%
  table

"
The below will be edited to output the intersection and what does not intersect
between the 3dfs. This code is a placeholder.
varname <- function(a, b, c){
  df1[,a]; df2[,a]=a[2]; df3[,z]=a[3]
  intersection
  !intersection

}
t <- list(c(), c(), c())
  lapply(FUN = ,varname)
"

# Look at the values that are NOT in dataframes
# 171819
unique(Data171819$individual)[!unique(Data171819$individual) %in%
  unique(Data2021$ACCESSION)]

unique(Data171819$individual)[!unique(Data171819$individual) %in%
  unique(Data2022$ACCESSION)]

# 2021
unique(Data2021$ACCESSION)[!unique(Data2021$ACCESSION) %in%
  unique(Data171819$individual)]

unique(Data2021$ACCESSION)[!unique(Data2021$ACCESSION) %in%
  unique(Data2022$ACCESSION)]

# 2022
unique(Data2022$ACCESSION)[!unique(Data2021$ACCESSION) %in%
  unique(Data171819$individual)]

unique(Data2022$ACCESSION)[!unique(Data2022$ACCESSION) %in%
  unique(Data2021$ACCESSION)]

```

### Photo links

```{r}
# no photo addresses provided for 2022
Data171819$photo %>% head
Data2021$`PICTURE ADDRESS` %>% head


```

### Rename columns as needed

```{r}
# Rename columns
Data171819 %<>%
  rename(DATE = `_submission_time`)

Data2021 %<>%
  rename(color = `LEAF COLOR`,
         fall = `LEAF FALL`,
         individual = ACCESSION,
         photo = `PICTURE ADDRESS`)

Data2022 %<>%
  rename(color = `LEAF COLOR`,
         fall = `LEAF FALL`,
         individual = ACCESSION)
```

### Add lat, long, altitude to 2021 data

Tried to do this, but it resulted in almost 500,000 rows of data due to rows of data with multiple lat/long/alt datasets. The problem is not just the altitude data, as removing that column doesn't help much.

```{r, message = FALSE}

Data2021 %>%
  left_join(Data171819[,c('individual', '_tree_gps_latitude',
                         '_tree_gps_longitude',
                         '_tree_gps_altitude')]) %>% dim

Data2021 %>%
  left_join(Data171819[,c('individual', '_tree_gps_latitude',
                         '_tree_gps_longitude')]) %>% dim


```

### Merge select columns

```{r}
# Currently including group_num and student_id in case we want to use them as random effects. We don't have them for all years though
AllData <- Data171819[,c('SPECIES', 'species', 'individual',
                         'color', 'fall', 'DATE',
                         "_tree_gps_latitude",
                         '_tree_gps_longitude',
                         '_tree_gps_altitude',
                         'group_num', 'student_id',
                         'photo')] %>%
  bind_rows(Data2021[,c('SPECIES', 'species', 'individual',
                        'color', 'fall', 'DATE',
                        # "_tree_gps_latitude",
                        # '_tree_gps_longitude',
                        # '_tree_gps_altitude',
                        'photo')])
AllData <- AllData[,c('SPECIES', 'species', 'individual',
                         'color', 'fall', 'DATE',
                         "_tree_gps_latitude",
                         '_tree_gps_longitude',
                         '_tree_gps_altitude',
                         'group_num', 'student_id',
                         'photo')]%>%
  bind_rows(Data2022[,c('SPECIES', 'species', 'individual',
                        'color', 'fall', 'DATE')])

# Rename lat/long/altitude for ease
AllData %<>% rename(Latitude = '_tree_gps_latitude',
                    Longitude = '_tree_gps_longitude',
                    Altitude = '_tree_gps_altitude')

```

## Remove questionable data

```{r}
# Negative % leaf fall
AllData %>%
  filter(color < 0 |
         fall < 0)
# Remove negatives
AllData %<>%
  filter(color >= 0,
         fall >= 0)
```

## Round leaf color and leaf fall data

```{r}
AllData %<>%
  mutate(ColorR1 = round(color, digits = 0),
         ColorR5 = round(color / 5, digits = 0) * 5,
         ColorR10 = round(color, digits = -1),
         FallR1 = round(fall, digits = 0),
         FallR5 = round(fall / 5, digits = 0) * 5,
         FallR10 = round(fall, digits = -1))
AllData %>%
  select(color, ColorR1, ColorR5, ColorR10,
         fall, FallR1, FallR5, FallR10) %>%
  unique %>%
  head(20)
```

### Visualize

People seemed to naturally round to 5% in many cases, and somewhat to 10% as well. May want to use these smoothed values, though might not be necessary.

```{r}
ForPlot <- AllData %>%
  pivot_longer(cols = c(ColorR1, ColorR5, ColorR10,
                        FallR1, FallR5, FallR10),
               names_to = 'Metric', values_to = 'Values')
ggplot(ForPlot %>% filter(Metric %in% c('ColorR1', 'FallR1')),
       aes(x = Values)) +
  geom_bar() +
  tbw + th +
  facet_wrap(~ Metric)

ggplot(ForPlot %>% filter(Metric %in% c('ColorR5', 'FallR5')),
       aes(x = Values)) +
  geom_bar() +
  tbw + th +
  facet_wrap(~ Metric)

ggplot(ForPlot %>% filter(Metric %in% c('ColorR10', 'FallR10')),
       aes(x = Values)) +
  geom_bar() +
  tbw + th +
  facet_wrap(~ Metric)

```

## Date information

Extract date (YMD). Columns for year, month, day, week

```{r}
AllData %<>%
  mutate(Year = year(DATE),
         Month = month(DATE),
         Day = day(DATE),
         Week = week(DATE))

AllData %>%
  select(DATE, Year, Month, Day, Week) %>%
  unique %>%
  tail
```

## Unique observer

```{r}
# We have group_num and student_id for some sets of data. Not sure it'll be useful since we don't have them for everything. If needed though, here's a unique group_num/student_id/year variable
AllData %<>%
  mutate(UniqueGroup = paste(Year, group_num, sep = "_"),
         UniqueStudent = paste(Year, group_num, student_id,
                               sep = "_"))
```

## Export cleaned data

```{r}
# Commented out since it's already run.
# Uncomment if needed to re-export.
# write.csv(AllData, file = 'CleanedPhenologyData2017to2022.csv',
#           row.names = FALSE)

```