Commit 29814fe

redo document structure
1 parent 01756c0 commit 29814fe

1 file changed: 42 additions, 30 deletions

source/dataset_bias_cv.Rmd

Lines changed: 42 additions & 30 deletions
@@ -41,7 +41,7 @@ source(here::here("source", "R", "grouped_lm.R"))
 Investigate bias in indicators caused by the over/under representation of certain datasets in the occurrence cube.
 
 # Load data
-
+## Download occurrence cube
 We download the occurrence cube.
 This cube is four dimensional, where the fourth dimension is the dataset.
 
@@ -100,6 +100,7 @@ download_occ_cube(
 
 > GBIF.org (05 August 2025) GBIF Occurrence Download https://doi.org/10.15468/dl.48vfzy
 
+## Read and preprocess cube
 We read in the data cube and add dataset names.
 
 ```{r}
@@ -129,6 +130,9 @@ n_distinct <- birdcubeflanders_dataset %>%
 nrow(birdcubeflanders_dataset) == n_distinct
 ```
 
+## Overview of component datasets
+### Number and proportion of observations per dataset
+
 In total, there are `r n_distinct(birdcubeflanders_dataset$datasetname)` component datasets. We look at the number and proportion of observations per dataset in the cube.
 
 ```{r}
@@ -217,6 +221,8 @@ ggsave(file.path(out_path, "component_datasets_year.png"),
 width = 8, height = 6, dpi = 300)
 ```
 
+### Number and proportion of species per dataset
+
 We look at the number and proportion of species per dataset in the cube.
 
 ```{r}
@@ -246,6 +252,7 @@ birdcubeflanders_dataset %>%
 coord_flip()
 ```
 
+### Combined visualisation of observations and species
 We see a large number of observations in a small number of component datasets. Some datasets are specialised in specific species, others are general.
 
 ```{r}
@@ -284,8 +291,7 @@ ggsave(file.path(out_path, "component_datasets.png"),
 ```
 
 # Comparing species prevalence
-## Load and process data
-
+## Load and process ABV reference data
 We load the ABV data (structured monitoring data).
 We categorise the species according to rarity:
 
@@ -333,7 +339,6 @@ birdcube_filtered <- birdcube_dataset_filtered %>%
 ```
 
 ## Indicator calculation
-
 We calculate species prevalence.
 For each dataset we calculate the proportion of occupied grid cells by each species.
 
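To make the prevalence definition in the hunk above concrete, here is a minimal base R sketch on a toy cube. The data frame `cube`, its values, and the choice of denominator (all distinct grid cells in the cube) are assumptions for illustration only; the chunk that follows in the Rmd is not shown in this diff and may differ.

```r
# Toy occurrence cube (hypothetical): one row per record
cube <- data.frame(
  species  = c("A", "A", "A", "B", "B", "A"),
  mgrscode = c("c1", "c2", "c2", "c1", "c3", "c3"),
  dataset  = c("d1", "d1", "d1", "d1", "d2", "d2")
)

# Keep one row per dataset x species x grid cell so repeated records count once
occupied <- unique(cube[, c("dataset", "species", "mgrscode")])

# Number of occupied grid cells per dataset and species
n_occ <- aggregate(mgrscode ~ dataset + species, data = occupied, FUN = length)
names(n_occ)[names(n_occ) == "mgrscode"] <- "n_cells"

# Assumed denominator: all distinct grid cells in the cube
n_total <- length(unique(cube$mgrscode))
n_occ$prevalence <- n_occ$n_cells / n_total

n_occ
```

Deduplicating before counting matters here: species A has two records in cell `c2` of dataset `d1`, but the cell should count as occupied only once.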
@@ -393,6 +398,7 @@ ggsave(file.path(out_path, "prevalence.png"),
 width = 8, height = 6, dpi = 300)
 ```
 
+## Cross-validation of prevalence estimates
 We calculate error measures for the indicator based on leave-one-dataset-out cross-validation.
 We use a constant total of grid cells (`r length(unique(birdcube_dataset_filtered$mgrscode))`) such that this is independent from the datasets left out.
 
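The leave-one-dataset-out procedure described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the Rmd's implementation: the toy `cube`, the helper `prevalence()`, and the definition of the error (fold estimate minus full-cube estimate, summarised as RMSE per species) are all hypothetical. The one element taken from the text is the constant denominator `n_total`, computed once from the full cube.

```r
# Toy occurrence cube (hypothetical)
cube <- data.frame(
  species  = c("A", "A", "A", "B", "B", "A"),
  mgrscode = c("c1", "c2", "c2", "c1", "c3", "c3"),
  dataset  = c("d1", "d1", "d3", "d1", "d2", "d2")
)

# Constant total of grid cells, independent of which dataset is left out
n_total <- length(unique(cube$mgrscode))

# Hypothetical helper: proportion of occupied grid cells per species
prevalence <- function(df, n_total) {
  occ <- unique(df[, c("species", "mgrscode")])
  tapply(occ$mgrscode, occ$species, length) / n_total
}

full <- prevalence(cube, n_total)

# Leave each dataset out in turn and recompute prevalence
cv <- lapply(unique(cube$dataset), function(d) {
  left <- prevalence(cube[cube$dataset != d, ], n_total)
  # Species that vanish from the reduced cube get prevalence 0
  est <- setNames(rep(0, length(full)), names(full))
  est[names(left)] <- left
  data.frame(dataset_left_out = d, species = names(full),
             full = as.numeric(full), cv = as.numeric(est))
})
cv <- do.call(rbind, cv)

# Assumed error measure: RMSE of fold estimates against the full estimate
rmse <- tapply((cv$cv - cv$full)^2, cv$species, function(x) sqrt(mean(x)))
rmse
```

Keeping `n_total` fixed means a fold can only lower a species' prevalence by removing occupied cells, never by shrinking the denominator, which keeps the folds comparable.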
@@ -748,16 +754,16 @@ ggsave(file.path(out_path, "prevalence_panels.png"),
 width = 12, height = 10, dpi = 300)
 ```
 
-## Trends in error: RMSE
-
+## Trends in error measures
 We look at trends in CV error measures related to:
 
 1 differences in rarity
 2. number of datasets
 3. effective number of datasets
 4. dataset evenness
 
-### Differences in rarity
+### RMSE trends
+#### Differences in rarity
 
 ```{r}
 p_rmse_rarity <- prevalence_cv %>%
@@ -774,7 +780,7 @@ p_rmse_rarity <- prevalence_cv %>%
 p_rmse_rarity
 ```
 
-### Number of datasets
+#### Number of datasets
 
 We cannot compute the evenness for species only found in a single dataset.
 
@@ -824,7 +830,7 @@ grouped_lm(
 )
 ```
 
-### Effective number of datasets
+#### Effective number of datasets
 
 The effective number of datasets takes into account the proportion of observations per dataset.
 It is calculated per species $j$ as the exponent of the Shannon Entropy:
@@ -863,7 +869,7 @@ grouped_lm(
 )
 ```
 
-### Dataset evenness
+#### Dataset evenness
 
 Dataset evenness is a measure that captures how occurrences of a species are distributed across multiple datasets (0 is highly uneven, 1 is completely even).
 Pielou’s Evenness index $J$ is calculated as the normalised Shannon Entropy:
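The two dataset-diversity measures named in these hunks (effective number of datasets as the exponential of the Shannon entropy, and Pielou's evenness as the normalised entropy) can be illustrated with a short base R sketch. The function `dataset_diversity()` and the counts in `n_obs` are hypothetical, and normalising by the log of the number of datasets that actually contain the species is an assumption; the formulas in the Rmd itself fall outside the diff context.

```r
# Hypothetical observation counts of one species across four component datasets
n_obs <- c(d1 = 70, d2 = 20, d3 = 10, d4 = 0)

dataset_diversity <- function(n) {
  p <- n[n > 0] / sum(n)   # proportion of observations per dataset, empty ones dropped
  h <- -sum(p * log(p))    # Shannon entropy (natural log)
  s <- length(p)           # number of datasets containing the species
  list(
    entropy   = h,
    effective = exp(h),    # effective number of datasets
    # Pielou's J = H / ln(S); undefined when the species occurs in one dataset
    evenness  = if (s > 1) h / log(s) else NA_real_
  )
}

dataset_diversity(n_obs)
```

With perfectly even counts the effective number equals the actual number of datasets and the evenness is 1. For a species found in a single dataset, `log(1)` is zero, so the evenness is returned as `NA`, in line with the remark that evenness cannot be computed for such species.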
@@ -942,7 +948,7 @@ plot(m)
 summary(m)
 ```
 
-## Trends in error: MRE
+### MRE trends
 
 We look at trends in CV error measures related to:
 
@@ -951,7 +957,7 @@ We look at trends in CV error measures related to:
 3. effective number of datasets
 4. dataset evenness
 
-### Differences in rarity
+#### Differences in rarity
 
 ```{r}
 p_mre_rarity <- prevalence_cv %>%
@@ -968,7 +974,7 @@ p_mre_rarity <- prevalence_cv %>%
 p_mre_rarity
 ```
 
-### Number of datasets
+#### Number of datasets
 
 ```{r}
 trend_dataset_mre <- birdcube_dataset_filtered %>%
@@ -1016,7 +1022,7 @@ grouped_lm(
 )
 ```
 
-### Effective number of datasets
+#### Effective number of datasets
 
 The effective number of datasets takes into account the proportion of observations per dataset.
 It is calculated per species $j$ as the exponent of the Shannon Entropy:
@@ -1055,7 +1061,7 @@ grouped_lm(
 )
 ```
 
-### Dataset evenness
+#### Dataset evenness
 
 Dataset evenness is a measure that captures how occurrences of a species are distributed across multiple datasets (0 is highly uneven, 1 is completely even).
 Pielou’s Evenness index $J$ is calculated as the normalised Shannon Entropy:
@@ -1161,11 +1167,11 @@ p_error_trends <- plot_grid(
 
 ggsave(file.path(out_path, "error_trends.png"),
 p_error_trends,
-width = 10, height = 10, dpi = 300)
+width = 12, height = 10, dpi = 300)
 ```
 
-
-## Error of prevalence estimates relative to ABV reference values
+# Error of prevalence estimates relative to ABV reference values
+## Absolute and relative improvement definitions
 Calculate the CV error compared to ABV prevalence (= "true" prevalence).
 
 Let
@@ -1254,7 +1260,7 @@ improvement_df <- prevalence_cv %>%
 )
 ```
 
-### Overall effect of leave-one-dataset-out cross-validation
+## Overall effect of leave-one-dataset-out cross-validation
 
 ```{r}
 tab_overall <- improvement_df %>%
@@ -1285,7 +1291,7 @@ ggplot(improvement_df, aes(x = improvement)) +
 theme_bw(base_size = 12)
 ```
 
-### Differences in improvement across rarity classes
+## Differences in improvement across rarity classes
 
 ```{r}
 tab_rarity <- improvement_df %>%
@@ -1331,8 +1337,8 @@ ggplot(improvement_df, aes(rarity, rel_improvement)) +
 theme_bw(base_size = 12)
 ```
 
-### Aggregated species-level sensitivity patterns
-#### Distribution of species-level median improvements
+## Aggregated species-level sensitivity patterns
+### Distribution of species-level median improvements
 At the species level, median improvement scores were centred close to zero, indicating that for most species the omission of individual datasets had limited influence on prevalence estimates. A smaller number of species exhibited consistently positive or negative median improvements, suggesting higher sensitivity to data composition. Overall, cross-validation tends to move prevalence estimates closer to the true value.
 
 ```{r}
@@ -1345,18 +1351,24 @@ species_summary <- improvement_df %>%
 .groups = "drop"
 )
 
-ggplot(species_summary, aes(x = median_improvement)) +
+p_species_improvement <- ggplot(species_summary, aes(x = median_improvement)) +
 geom_histogram(bins = 30, fill = "cornflowerblue") +
 geom_vline(xintercept = 0, linetype = 2) +
 labs(
 x = "Median improvement per species",
 y = "Number of species"
 ) +
 theme_bw(base_size = 12)
+p_species_improvement
 ```
 
-#### Species-level sensitivity by rarity class
+```{r, echo=FALSE}
+ggsave(file.path(out_path, "species_improvement.png"),
+p_species_improvement,
+width = 8, height = 6, dpi = 300)
+```
 
+### Species-level sensitivity by rarity class
 Species-level sensitivity differed between rarity classes. Rare species showed a wider spread of median improvements and larger relative changes compared to common species, indicating greater dependence on individual datasets.
 
 ```{r}
@@ -1383,7 +1395,7 @@ ggplot(species_summary, aes(rarity, median_rel_improvement)) +
 
 There is apparently one very common species that is also highly influenced by one dataset.
 
-#### Identifying outlying species (diagnostic, not reported)
+### Identifying outlying species (diagnostic, not reported)
 One species showed unusually large relative deteriorations when datasets were omitted. This species can be treated as a diagnostic case which serves to identify potential data or modelling issues.
 It concerns *Streptopelia decaocto* (EN: Eurasian collared dove, NL: Turkse tortel). <!-- spell-check: ignore -->
 
@@ -1408,7 +1420,7 @@ improvement_df %>%
 
 Without the *waarnemingen.be* dataset, the estimate is much lower.
 
-### Influence of individual component datasets
+## Influence of individual component datasets
 The *waarnemingen.be* dataset(s) show the largest influences (in both directions). For rare species, we see improvements, for common, we see deterioration.
 
 ```{r}
@@ -1509,7 +1521,7 @@ improvement_df %>%
 legend.position = c(0.83, 0.28))
 ```
 
-### Species-level robustness of prevalence estimates
+## Species-level robustness of prevalence estimates
 
 We summarised dataset-removal sensitivity at the species level by collapsing relative improvement scores into a single robustness metric.
 For each species, we define robustness as:
@@ -1537,7 +1549,7 @@ This metric is bounded between 0 (low robustness) and 1 (high robustness).
 
 Using the median ensures robustness against outlying datasets and prevents single influential components from dominating the score.
 
-#### Robustness by species
+### Robustness by species
 Species-level robustness scores were generally high, with most species exhibiting values close to one, indicating limited sensitivity to the omission of individual datasets. A smaller subset of species showed lower robustness scores, reflecting stronger dependence on specific data components.
 
 ```{r}
@@ -1564,7 +1576,7 @@ ggplot(species_summary, aes(x = robustness)) +
 theme_bw(base_size = 12)
 ```
 
-#### Robustness by rarity class
+### Robustness by rarity class
 Robustness scores do not differ that much between rarity classes.
 
 ```{r}
@@ -1577,7 +1589,7 @@ ggplot(species_summary, aes(rarity, robustness)) +
 theme_bw(base_size = 12)
 ```
 
-### Conclusions and discussion
+## Conclusions and discussion
 
 Using leave-one-dataset-out cross-validation, we assessed the sensitivity of prevalence estimates derived from the bird data cube to the composition of the underlying datasets, using ABV prevalence as a reference benchmark. Overall, omission of individual component datasets more often reduced than increased the deviation from the reference prevalence, although the magnitude of these improvements was typically small. This indicates that the prevalence indicator is generally robust to changes in dataset composition.
 