Commit ebb8049

committed
finalise analysis
1 parent 4b3a270 commit ebb8049

1 file changed

+105
-58
lines changed

source/dataset_bias_cv.Rmd

Lines changed: 105 additions & 58 deletions
@@ -858,33 +858,8 @@ grouped_lm(
858858
)
859859
```
860860

861-
## Datasets
862-
863-
```{r}
864-
dataset_richness <- birdcubeflanders_dataset %>%
865-
summarise(n_obs = sum(n),
866-
n_spec = n_distinct(species),
867-
.by = "datasetkey")
868-
869-
test <- prevalence_cv %>%
870-
left_join(dataset_richness, by = join_by("datasetkey_out" == "datasetkey"))
871-
```
872-
873-
```{r}
874-
ggplot(test, aes(x = as.factor(n_obs), y = error)) +
875-
geom_boxplot()
876-
```
877-
878-
```{r}
879-
ggplot(test, aes(x = as.factor(n_spec), y = perc_error)) +
880-
geom_boxplot()
881-
```
882-
883-
## Error compared to ABV
884-
885-
Calculate the CV error compared to ABV prevalence.
886-
A smaller error means the prevalence comes closer to the "true" prevalence (ABV).
887-
MRE, MSE and RMSE per dataset left out.
861+
## Error of prevalence estimates relative to ABV reference values
862+
Calculate the cross-validation (CV) error relative to the ABV prevalence, which serves as the reference ("true") prevalence.
888863

889864
Let
890865

@@ -1036,8 +1011,7 @@ ggplot(improvement_df, aes(rarity, rel_improvement)) +
10361011
theme_bw(base_size = 12)
10371012
```
10381013

1039-
### Species-level sensitivity (aggregated)
1040-
1014+
### Aggregated species-level sensitivity patterns
10411015
#### Distribution of species-level median improvements
10421016
At the species level, median improvement scores were centred close to zero, indicating that for most species the omission of individual datasets had limited influence on prevalence estimates. A smaller number of species exhibited consistently positive or negative median improvements, suggesting higher sensitivity to data composition. Overall, cross-validation tends to move prevalence estimates closer to the true value.
10431017

@@ -1089,7 +1063,7 @@ ggplot(species_summary, aes(rarity, median_rel_improvement)) +
10891063

10901064
One very common species appears to be strongly influenced by a single dataset.
10911065

1092-
#### Identifying *outlying* species (diagnostic, not reported)
1066+
#### Identifying outlying species (diagnostic, not reported)
10931067
One species showed unusually large relative deteriorations when datasets were omitted. This species can be treated as a diagnostic case which serves to identify potential data or modelling issues.
10941068
It concerns *Streptopelia decaocto* (EN: Eurasian collared dove, NL: Turkse tortel).
10951069

@@ -1113,57 +1087,130 @@ improvement_df %>%
11131087
Without the *waarnemingen.be* dataset, the estimate is much lower.
11141088

11151089
### Influence of individual component datasets
1116-
1117-
The influence of component datasets varied substantially. A small number of datasets consistently yielded positive improvements when omitted, indicating a disproportionate influence on prevalence estimates. Other datasets had neutral or stabilising effects across most species.
1090+
The *waarnemingen.be* datasets show the largest influence, in both directions: for rare species we see improvements, for common species we see deterioration.
11181091

11191092
```{r}
1120-
tab_dataset <- improvement_df %>%
1121-
group_by(datasetkey_out) %>%
1122-
summarise(
1123-
mean_improvement = mean(improvement),
1124-
median_improvement = median(improvement),
1125-
prop_improved = mean(improved),
1126-
.groups = "drop"
1127-
) %>%
1128-
arrange(desc(mean_improvement))
1093+
datasets_df <- birdcubeflanders_dataset %>%
1094+
group_by(datasetname, datasetkey) %>%
1095+
summarise(n_obs = sum(n),
1096+
n_spec = n_distinct(species)) %>%
1097+
ungroup() %>%
1098+
mutate(datasetname = reorder(datasetname, n_obs)) %>%
1099+
rename(datasetkey_out = datasetkey)
1100+
```
11291101

1130-
tab_dataset
1102+
```{r}
1103+
improvement_df %>%
1104+
left_join(datasets_df, by = join_by(datasetkey_out)) %>%
1105+
select("datasetname", "species", "rarity", "improvement") %>%
1106+
ggplot(aes(x = species, y = datasetname, fill = improvement)) +
1107+
geom_tile() +
1108+
scale_fill_gradient2(midpoint = 0) +
1109+
scale_x_discrete(label = function(x) stringr::str_trunc(x, 20)) +
1110+
scale_y_discrete(label = function(x) stringr::str_trunc(x, 40)) +
1111+
facet_wrap(~rarity, scales = "free_x") +
1112+
labs(
1113+
x = "",
1114+
y = "Dataset omitted",
1115+
fill = "Improvement"
1116+
) +
1117+
theme_bw(base_size = 12) +
1118+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
11311119
```
11321120

1121+
We now exclude the *waarnemingen.be* datasets from the plot. Even then, the largest remaining datasets show the strongest impacts, often positive.
11331122

11341123
```{r}
11351124
improvement_df %>%
1136-
mutate(
1137-
datasetkey_out = forcats::fct_reorder(datasetkey_out, improvement, median)
1138-
) %>%
1139-
ggplot(aes(datasetkey_out, species, fill = improvement)) +
1125+
left_join(datasets_df, by = join_by(datasetkey_out)) %>%
1126+
select("datasetname", "species", "rarity", "improvement") %>%
1127+
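# Exclude the waarnemingen.be datasets (names starting with "Waarnemingen.be");
# the dot is escaped so that it matches a literal "." only
1128+
filter(!grepl("^Waarnemingen\\.be", datasetname)) %>%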
1129+
ggplot(aes(x = species, y = datasetname, fill = improvement)) +
11401130
geom_tile() +
11411131
scale_fill_gradient2(midpoint = 0) +
1132+
scale_x_discrete(label = function(x) stringr::str_trunc(x, 20)) +
1133+
scale_y_discrete(label = function(x) stringr::str_trunc(x, 40)) +
1134+
facet_wrap(~rarity, scales = "free_x") +
11421135
labs(
1143-
x = "Dataset omitted",
1144-
y = "Species",
1136+
x = "",
1137+
y = "Dataset omitted",
11451138
fill = "Improvement"
11461139
) +
11471140
theme_bw(base_size = 12) +
1148-
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
1141+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
11491142
```
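The name-based exclusion above relies on a regular expression anchored at the start of the dataset name. As a minimal sketch on made-up names (the strings below are hypothetical, not actual dataset names from the cube), the snippet compares an unescaped dot, which matches any character, with an escaped literal dot. It uses base R only.

```r
# Hypothetical dataset names, for illustration only.
toy_names <- c(
  "Waarnemingen.be - hypothetical bird records",
  "Waarnemingenxbe lookalike",
  "Some structured monitoring scheme"
)

# Unescaped dot: "." matches any single character, so the lookalike is dropped too.
keep_unescaped <- !grepl("^Waarnemingen.be", toy_names)

# Escaped dot: only names that start with the literal "Waarnemingen.be" are dropped.
keep_escaped <- !grepl("^Waarnemingen\\.be", toy_names)

toy_names[keep_escaped]
```

With real dataset names the two patterns will usually select the same rows; the escaped form simply states the intent more precisely.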
11501143

1151-
### Synthesis using a mixed-effects model
1144+
## Species-level robustness of prevalence estimates
1145+
1146+
We summarised dataset-removal sensitivity at the species level by collapsing relative improvement scores into a single robustness metric.
1147+
For each species, we define robustness as:
1148+
1149+
$$
1150+
\text{Robustness}_s = 1 - \min\left(1,\; \left|\widetilde{RI}_s\right|\right)
1151+
$$
1152+
1153+
where $\widetilde{RI}_s$ is the median relative improvement across all leave-one-dataset-out cross-validation runs for species $s$.
1154+
1155+
This metric is bounded between 0 (low robustness) and 1 (high robustness).
1156+
1157+
* **Robustness ≈ 1**
1158+
Prevalence estimates are largely insensitive to the omission of individual datasets.
1159+
1160+
* **Robustness ≈ 0.5**
1161+
Dataset removal typically changes the error magnitude by about 50% (that is, $\left|\widetilde{RI}_s\right| \approx 0.5$).
1162+
1163+
* **Robustness ≈ 0**
1164+
Prevalence estimates are highly dependent on individual datasets.
1165+
1166+
Using the median makes the score less sensitive to outlying datasets and prevents a single influential component from dominating it.
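
To illustrate how the score behaves, the following sketch applies the definition to made-up relative improvement values for three hypothetical species (`sp_a`, `sp_b` and `sp_c`; the labels and numbers are assumptions for illustration, not results from this analysis). It uses base R only, so it runs independently of the objects created elsewhere in this document.

```r
# Toy relative improvement values per cross-validation run (hypothetical).
toy_ri <- data.frame(
  species = rep(c("sp_a", "sp_b", "sp_c"), each = 5),
  rel_improvement = c(
    0.02, -0.01, 0.00, 0.03, 0.90,    # sp_a: stable, one outlying run
    0.40, 0.50, 0.60, 0.50, 0.45,     # sp_b: moderate sensitivity
    -1.50, -1.20, -2.00, 0.10, -1.30  # sp_c: strong dependence on datasets
  )
)

# Median relative improvement per species (one row per species).
toy_summary <- aggregate(rel_improvement ~ species, data = toy_ri, FUN = median)

# Robustness = 1 - min(1, |median RI|); pmin() applies the cap element-wise.
toy_summary$robustness <- 1 - pmin(1, abs(toy_summary$rel_improvement))

toy_summary
```

In this toy case the scores are 0.98, 0.5 and 0 for `sp_a`, `sp_b` and `sp_c`. The single outlying run of 0.90 for `sp_a` does not affect its score, because only the median enters the formula, and the cap at 1 maps any median relative improvement beyond ±1 to a robustness of 0.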
1167+
1168+
### Robustness by species
1169+
Species-level robustness scores were generally high, with most species exhibiting values close to one, indicating limited sensitivity to the omission of individual datasets. A smaller subset of species showed lower robustness scores, reflecting stronger dependence on specific data components.
11521170

11531171
```{r}
1154-
library(lme4)
1172+
species_summary <- improvement_df %>%
1173+
group_by(species, rarity) %>%
1174+
summarise(
1175+
median_improvement = median(improvement),
1176+
median_rel_improvement = median(rel_improvement, na.rm = TRUE),
1177+
prop_improved = mean(improved),
1178+
.groups = "drop"
1179+
) %>%
1180+
mutate(
1181+
robustness = 1 - pmin(1, abs(median_rel_improvement))
1182+
)
1183+
```
11551184

1156-
m <- lmer(
1157-
improvement ~ rarity + (1 | species) + (1 | datasetkey_out),
1158-
data = improvement_df
1159-
)
1185+
```{r}
1186+
ggplot(species_summary, aes(x = robustness)) +
1187+
geom_histogram(bins = 30) +
1188+
labs(
1189+
x = "Species robustness score",
1190+
y = "Number of species"
1191+
) +
1192+
theme_bw(base_size = 12)
1193+
```
1194+
1195+
### Robustness by rarity class
1196+
Robustness scores differ little between rarity classes.
11601197

1161-
summary(m)
1198+
```{r}
1199+
ggplot(species_summary, aes(rarity, robustness)) +
1200+
geom_boxplot(outlier.alpha = 0.4) +
1201+
labs(
1202+
x = "Rarity class",
1203+
y = "Species robustness score"
1204+
) +
1205+
theme_bw(base_size = 12)
11621206
```
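
A numeric summary can complement the boxplot when comparing rarity classes. The sketch below is one possible way to tabulate the median and interquartile range of the robustness score per class. It runs on a made-up stand-in for `species_summary` (the class labels `rare` and `common` and all values are assumptions for illustration) and uses base R only.

```r
# Hypothetical stand-in for `species_summary`; values are made up.
toy_species <- data.frame(
  rarity = factor(rep(c("rare", "common"), each = 4),
                  levels = c("rare", "common")),
  robustness = c(0.90, 0.80, 1.00, 0.70, 0.95, 0.85, 0.90, 1.00)
)

# One row per rarity class: number of species, median and IQR of robustness.
rarity_table <- data.frame(
  rarity = levels(toy_species$rarity),
  n_species = as.vector(table(toy_species$rarity)),
  median_robustness = as.vector(
    tapply(toy_species$robustness, toy_species$rarity, median)
  ),
  iqr_robustness = as.vector(
    tapply(toy_species$robustness, toy_species$rarity, IQR)
  )
)

rarity_table
```

Applied to the real `species_summary`, similar medians with overlapping interquartile ranges would support the visual impression that robustness differs little between classes.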
11631207

1208+
### Conclusions and discussion
11641209

1165-
**Results text (draft)**
1210+
Using leave-one-dataset-out cross-validation, we assessed the sensitivity of prevalence estimates derived from the bird data cube to the composition of the underlying datasets, using ABV prevalence as a reference benchmark. Overall, omission of individual component datasets more often reduced than increased the deviation from the reference prevalence, although the magnitude of these improvements was typically small. This indicates that the prevalence indicator is generally robust to changes in dataset composition.
11661211

1167-
> Mixed-effects modelling confirmed systematic differences in sensitivity across rarity classes, while accounting for species- and dataset-specific variability. Random effects indicated that both species identity and dataset identity contribute to variability in improvement scores, highlighting heterogeneous data influence within the monitoring network.
1212+
Sensitivity patterns differed systematically across rarity classes. Rare species showed larger relative improvements and greater variability in error reduction than common species, reflecting their stronger dependence on individual datasets. In contrast, prevalence estimates for common species were more stable, but occasionally deteriorated substantially when influential datasets were removed. These differences largely arise from the mathematical properties of the indicator and the contrasting prevalence distributions between structured and opportunistic data sources, rather than from data quality issues alone. In particular, high-prevalence species have limited scope for improvement through dataset removal, whereas rare species can show appreciable gains.
11681213

1214+
Aggregating results at the species level confirmed that most species exhibit limited sensitivity to dataset removal, while a small number act as diagnostic cases with pronounced dependence on specific datasets. Analysis at the dataset level showed that large datasets exert the strongest influence on prevalence estimates. The *waarnemingen.be* datasets, in particular, had a substantial impact, improving estimates for rare species while worsening them for common species. This dual effect highlights the central role of large opportunistic datasets in shaping prevalence indicators.
11691215

1216+
To summarise sensitivity in a compact and interpretable way, we introduced a species-level robustness metric based on the median relative improvement in error across cross-validation runs. Most species exhibited high robustness scores, indicating stable prevalence estimates, while lower scores identified species for which estimates are more dependent on data composition. This metric provides a practical tool for summarising robustness in species-rich indicator systems and for identifying cases that may warrant closer scrutiny.
