Skip to content

Commit 9afc883

Browse files
Update PR #484 preview
1 parent c3af9bf commit 9afc883

File tree

121 files changed

+4153
-6790
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

121 files changed

+4153
-6790
lines changed

preview/pr484/404.html

Lines changed: 3 additions & 11 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

preview/pr484/CODE_OF_CONDUCT.html

Lines changed: 29 additions & 76 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

preview/pr484/CONTRIBUTING.html

Lines changed: 26 additions & 70 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

preview/pr484/MULTI_VERSION_DOCS.html

Lines changed: 60 additions & 140 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

preview/pr484/articles/enteric_fever_example.html

Lines changed: 296 additions & 29 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

preview/pr484/articles/enteric_fever_example.md

Lines changed: 296 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -290,11 +290,12 @@ est1 <- est_seroincidence(
290290
)
291291

292292
summary(est1)
293-
#> # A tibble: 1 × 10
294-
#> est.start incidence.rate SE CI.lwr CI.upr coverage log.lik iterations
295-
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
296-
#> 1 0.1 0.128 0.00682 0.115 0.142 0.95 -2376. 4
297-
#> # ℹ 2 more variables: antigen.isos <chr>, nlm.convergence.code <ord>
293+
#> # A tibble: 1 × 11
294+
#> est.start incidence.rate SE CI.lwr CI.upr se_type coverage log.lik
295+
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
296+
#> 1 0.1 0.128 0.00682 0.115 0.142 standard 0.95 -2376.
297+
#> # ℹ 3 more variables: iterations <int>, antigen.isos <chr>,
298+
#> # nlm.convergence.code <ord>
298299
```
299300

300301
### Stratified Seroincidence
@@ -336,7 +337,7 @@ summary(est_country_age)
336337
#> b) Strata : Country, ageCat
337338
#>
338339
#> Seroincidence estimates:
339-
#> # A tibble: 9 × 14
340+
#> # A tibble: 9 × 15
340341
#> Stratum Country ageCat n est.start incidence.rate SE CI.lwr CI.upr
341342
#> <chr> <chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
342343
#> 1 Stratum 1 Banglad… <5 101 0.1 0.400 0.0395 0.330 0.485
@@ -348,8 +349,8 @@ summary(est_country_age)
348349
#> 7 Stratum 7 Pakistan <5 126 0.1 0.106 0.0136 0.0823 0.136
349350
#> 8 Stratum 8 Pakistan 5-15 261 0.1 0.115 0.00845 0.0991 0.132
350351
#> 9 Stratum 9 Pakistan 16+ 107 0.1 0.190 0.0204 0.154 0.235
351-
#> # ℹ 5 more variables: coverage <dbl>, log.lik <dbl>, iterations <int>,
352-
#> # antigen.isos <chr>, nlm.convergence.code <ord>
352+
#> # ℹ 6 more variables: se_type <chr>, coverage <dbl>, log.lik <dbl>,
353+
#> # iterations <int>, antigen.isos <chr>, nlm.convergence.code <ord>
353354
```
354355

355356
Note that we get a warning about uneven observations between antigen
@@ -478,18 +479,296 @@ Z-statistic - P-value - 95% confidence interval for the difference
478479
We can use this to identify which differences are statistically
479480
significant (typically using p \< 0.05 as a threshold).
480481

482+
## Accounting for Clustered Sampling Designs
483+
484+
In many survey designs, observations are clustered (e.g., multiple
485+
individuals from the same household, geographic area, or school). When
486+
observations within clusters are more similar to each other than to
487+
observations from different clusters, standard errors that ignore
488+
clustering will be too small, leading to overconfident inference.
489+
490+
The `serocalculator` package supports cluster-robust standard error
491+
estimation through the `cluster_var` parameter.
492+
493+
### Example: Household Clustering
494+
495+
The SEES data includes cluster identifiers (`cluster` for geographic
496+
clusters, `catchment` for health facility catchment areas, and
497+
`Country`). These variables are nested: clusters are nested within
498+
catchments, which are nested within countries. We can use these to
499+
account for within-cluster correlation:
500+
501+
``` r
502+
# Account for geographic clustering
503+
# Clusters in the SEES data represent geographic areas
504+
# Use Pakistan data for this example
505+
est_with_clustering <- est_seroincidence(
506+
pop_data = xs_data |> filter(Country == "Pakistan"),
507+
sr_params = curves,
508+
noise_params = noise |> filter(Country == "Pakistan"),
509+
antigen_isos = c("HlyE_IgG", "HlyE_IgA"),
510+
cluster_var = "cluster" # Geographic cluster identifier
511+
)
512+
513+
# The summary will show cluster-robust standard errors
514+
summary(est_with_clustering)
515+
#> # A tibble: 1 × 11
516+
#> est.start incidence.rate SE CI.lwr CI.upr se_type coverage log.lik
517+
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
518+
#> 1 0.1 0.128 0.0104 0.109 0.150 cluster-robust 0.95 -2376.
519+
#> # ℹ 3 more variables: iterations <int>, antigen.isos <chr>,
520+
#> # nlm.convergence.code <ord>
521+
522+
# Compare with non-robust (standard) analysis
523+
est_no_clustering <- est_seroincidence(
524+
pop_data = xs_data |> filter(Country == "Pakistan"),
525+
sr_params = curves,
526+
noise_params = noise |> filter(Country == "Pakistan"),
527+
antigen_isos = c("HlyE_IgG", "HlyE_IgA")
528+
# No cluster_var specified - uses standard errors
529+
)
530+
531+
summary(est_no_clustering)
532+
#> # A tibble: 1 × 11
533+
#> est.start incidence.rate SE CI.lwr CI.upr se_type coverage log.lik
534+
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
535+
#> 1 0.1 0.128 0.00682 0.115 0.142 standard 0.95 -2376.
536+
#> # ℹ 3 more variables: iterations <int>, antigen.isos <chr>,
537+
#> # nlm.convergence.code <ord>
538+
```
539+
540+
Notice that the point estimates (incidence rates) are identical between
541+
the two analyses, but the standard errors are larger with cluster-robust
542+
estimation. This reflects the reduced effective sample size due to
543+
within-cluster correlation.
544+
545+
The mathematical details of cluster-robust standard errors are explained
546+
in the [Cluster-robust standard errors
547+
section](https://ucd-serg.github.io/serocalculator/methodology.html#cluster-robust-standard-errors-for-clustered-sampling-designs)
548+
of the methodology article.
549+
550+
### Clustering with Stratified Analysis
551+
552+
Clustering can also be combined with stratified analysis using
553+
[`est_seroincidence_by()`](https://ucd-serg.github.io/serocalculator/reference/est_seroincidence_by.md):
554+
555+
``` r
556+
# Stratified analysis by catchment with cluster adjustment
557+
# Use Pakistan data for this example
558+
est_catchment_clustered <- est_seroincidence_by(
559+
pop_data = xs_data |> filter(Country == "Pakistan"),
560+
strata = "catchment",
561+
sr_params = curves,
562+
noise_params = noise |> filter(Country == "Pakistan"),
563+
antigen_isos = c("HlyE_IgG", "HlyE_IgA"),
564+
cluster_var = "cluster" # Account for clustering within each stratum
565+
)
566+
#> Warning: `curve_params` is missing all strata variables and will be used unstratified.
567+
#> ℹ To avoid this warning, specify the desired set of stratifying variables in
568+
#> the `curve_strata_varnames` and `noise_strata_varnames` arguments to
569+
#> `est_seroincidence_by()`.
570+
#> Warning: `noise_params` is missing all strata variables and will be used unstratified.
571+
#> ℹ To avoid this warning, specify the desired set of stratifying variables in
572+
#> the `curve_strata_varnames` and `noise_strata_varnames` arguments to
573+
#> `est_seroincidence_by()`.
574+
575+
summary(est_catchment_clustered)
576+
#> Seroincidence estimated given the following setup:
577+
#> a) Antigen isotypes : HlyE_IgG, HlyE_IgA
578+
#> b) Strata : catchment
579+
#>
580+
#> Seroincidence estimates:
581+
#> # A tibble: 2 × 14
582+
#> Stratum catchment n est.start incidence.rate SE CI.lwr CI.upr se_type
583+
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
584+
#> 1 Stratu… aku 294 0.1 0.106 0.00983 0.0880 0.127 cluste…
585+
#> 2 Stratu… kgh 200 0.1 0.167 0.00905 0.151 0.186 cluste…
586+
#> # ℹ 5 more variables: coverage <dbl>, log.lik <dbl>, iterations <int>,
587+
#> # antigen.isos <chr>, nlm.convergence.code <ord>
588+
589+
# Compare with stratified analysis without clustering
590+
est_catchment_no_clustering <- est_seroincidence_by(
591+
pop_data = xs_data |> filter(Country == "Pakistan"),
592+
strata = "catchment",
593+
sr_params = curves,
594+
noise_params = noise |> filter(Country == "Pakistan"),
595+
antigen_isos = c("HlyE_IgG", "HlyE_IgA")
596+
# No cluster_var - standard errors
597+
)
598+
#> Warning: `curve_params` is missing all strata variables and will be used unstratified.
599+
#> ℹ To avoid this warning, specify the desired set of stratifying variables in
600+
#> the `curve_strata_varnames` and `noise_strata_varnames` arguments to
601+
#> `est_seroincidence_by()`.
602+
#> `noise_params` is missing all strata variables and will be used unstratified.
603+
#> ℹ To avoid this warning, specify the desired set of stratifying variables in
604+
#> the `curve_strata_varnames` and `noise_strata_varnames` arguments to
605+
#> `est_seroincidence_by()`.
606+
607+
summary(est_catchment_no_clustering)
608+
#> Seroincidence estimated given the following setup:
609+
#> a) Antigen isotypes : HlyE_IgG, HlyE_IgA
610+
#> b) Strata : catchment
611+
#>
612+
#> Seroincidence estimates:
613+
#> # A tibble: 2 × 14
614+
#> Stratum catchment n est.start incidence.rate SE CI.lwr CI.upr se_type
615+
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
616+
#> 1 Stratu… aku 294 0.1 0.106 0.00767 0.0916 0.122 standa…
617+
#> 2 Stratu… kgh 200 0.1 0.167 0.0133 0.143 0.196 standa…
618+
#> # ℹ 5 more variables: coverage <dbl>, log.lik <dbl>, iterations <int>,
619+
#> # antigen.isos <chr>, nlm.convergence.code <ord>
620+
```
621+
622+
Within each stratum (catchment), the cluster-robust standard errors are
623+
larger than the standard errors, appropriately accounting for geographic
624+
clustering within each catchment area.
625+
626+
### Understanding the Impact
627+
628+
When cluster-robust standard errors are used:
629+
630+
- **Point estimates** (incidence rates) remain unchanged
631+
- **Standard errors** typically increase by 5-15% to reflect
632+
within-cluster correlation
633+
- The [`summary()`](https://rdrr.io/r/base/summary.html) output includes
634+
a `se_type` column indicating “cluster-robust” vs “standard”
635+
- Confidence intervals appropriately widen to account for reduced
636+
effective sample size
637+
638+
### Cluster-Robust Country Comparisons
639+
640+
For our main findings comparing all three countries, we should use
641+
cluster-robust standard errors to properly account for the geographic
642+
clustering in the SEES study:
643+
644+
``` r
645+
# Estimate seroincidence for Bangladesh with cluster adjustment
646+
est_bangladesh_clustered <- est_seroincidence(
647+
pop_data = xs_data |> filter(Country == "Bangladesh"),
648+
sr_params = curves,
649+
noise_params = noise |> filter(Country == "Bangladesh"),
650+
antigen_isos = c("HlyE_IgG", "HlyE_IgA"),
651+
cluster_var = "cluster"
652+
)
653+
654+
# Estimate seroincidence for Nepal with cluster adjustment
655+
est_nepal_clustered <- est_seroincidence(
656+
pop_data = xs_data |> filter(Country == "Nepal"),
657+
sr_params = curves,
658+
noise_params = noise |> filter(Country == "Nepal"),
659+
antigen_isos = c("HlyE_IgG", "HlyE_IgA"),
660+
cluster_var = "cluster"
661+
)
662+
663+
# Estimate seroincidence for Pakistan with cluster adjustment
664+
est_pakistan_clustered <- est_seroincidence(
665+
pop_data = xs_data |> filter(Country == "Pakistan"),
666+
sr_params = curves,
667+
noise_params = noise |> filter(Country == "Pakistan"),
668+
antigen_isos = c("HlyE_IgG", "HlyE_IgA"),
669+
cluster_var = "cluster"
670+
)
671+
672+
# Pairwise comparisons with cluster-robust SEs
673+
comparison_bangla_nepal <- compare_seroincidence(
674+
est_bangladesh_clustered,
675+
est_nepal_clustered
676+
)
677+
678+
comparison_bangla_pakistan <- compare_seroincidence(
679+
est_bangladesh_clustered,
680+
est_pakistan_clustered
681+
)
682+
683+
comparison_nepal_pakistan <- compare_seroincidence(
684+
est_nepal_clustered,
685+
est_pakistan_clustered
686+
)
687+
688+
# Display all comparisons
689+
print("Bangladesh vs Nepal:")
690+
#> [1] "Bangladesh vs Nepal:"
691+
print(comparison_bangla_nepal)
692+
#>
693+
#> Two-sample z-test for difference in seroincidence rates
694+
#>
695+
#> data: seroincidence estimates
696+
#> z = 14.99, p-value < 2.2e-16
697+
#> alternative hypothesis: true difference in incidence rates is not equal to 0
698+
#> 95 percent confidence interval:
699+
#> 0.3496836 0.4548783
700+
#> sample estimates:
701+
#> incidence rate 1 incidence rate 2 difference
702+
#> 0.45050113 0.04822018 0.40228095
703+
704+
print("Bangladesh vs Pakistan:")
705+
#> [1] "Bangladesh vs Pakistan:"
706+
print(comparison_bangla_pakistan)
707+
#>
708+
#> Two-sample z-test for difference in seroincidence rates
709+
#>
710+
#> data: seroincidence estimates
711+
#> z = 11.372, p-value < 2.2e-16
712+
#> alternative hypothesis: true difference in incidence rates is not equal to 0
713+
#> 95 percent confidence interval:
714+
#> 0.2670811 0.3783145
715+
#> sample estimates:
716+
#> incidence rate 1 incidence rate 2 difference
717+
#> 0.4505011 0.1278034 0.3226978
718+
719+
print("Nepal vs Pakistan:")
720+
#> [1] "Nepal vs Pakistan:"
721+
print(comparison_nepal_pakistan)
722+
#>
723+
#> Two-sample z-test for difference in seroincidence rates
724+
#>
725+
#> data: seroincidence estimates
726+
#> z = -6.978, p-value = 2.994e-12
727+
#> alternative hypothesis: true difference in incidence rates is not equal to 0
728+
#> 95 percent confidence interval:
729+
#> -0.10193630 -0.05723007
730+
#> sample estimates:
731+
#> incidence rate 1 incidence rate 2 difference
732+
#> 0.04822018 0.12780336 -0.07958318
733+
```
734+
735+
The cluster-robust comparisons provide more accurate inference by
736+
accounting for within-cluster correlation in the study design.
737+
481738
## Conclusions
482739

483-
We estimate that Bangladesh has the highest enteric fever seroconversion
484-
rates across all age groups, with the highest rates observed among 5- to
485-
15-year-olds (477 per 1000 person-years). In this age group, the
486-
seroconversion rate in Bangladesh is 14 times higher than in Nepal,
487-
where the rate is 35 per 1000 person-years. These findings highlight
488-
substantial geographic variation in enteric fever transmission,
489-
emphasizing the need for targeted prevention strategies.
740+
Using cluster-robust standard errors to account for geographic
741+
clustering in the SEES study, we observe significant variation in
742+
enteric fever seroconversion rates across the three countries.
743+
Bangladesh has the highest overall seroconversion rate at 450.5 per 1000
744+
person-years (95% CI: 401.6-505.4), followed by Pakistan at 127.8 per
745+
1000 person-years (95% CI: 109-149.8), and Nepal at 48.2 per 1000
746+
person-years (95% CI: 39.8-58.5). Pairwise comparisons show Bangladesh
747+
has significantly higher rates than both Nepal (p \< 0.001) and Pakistan
748+
(p \< 0.001), while the difference between Nepal and Pakistan is also
749+
significant (p \< 0.001).
750+
751+
Across age groups, the highest rates are observed among 5- to
752+
15-year-olds in Bangladesh (477 per 1000 person-years), which is 14
753+
times higher than Nepal in the same age group (35 per 1000
754+
person-years). These findings highlight substantial geographic variation
755+
in enteric fever transmission, emphasizing the need for targeted
756+
prevention strategies tailored to local epidemiology.
757+
758+
The cluster-robust approach properly accounts for within-cluster
759+
correlation arising from the geographic sampling design. While point
760+
estimates remain unchanged, the cluster-robust standard errors and
761+
confidence intervals provide more accurate uncertainty quantification,
762+
ensuring valid statistical inference. This adjustment is particularly
763+
important for survey designs where observations within clusters (e.g.,
764+
households, schools, or geographic areas) are correlated, as failing to
765+
account for clustering can lead to underestimated standard errors and
766+
overly narrow confidence intervals.
767+
490768
**serocalculator** offers an efficient and reproducible approach to
491-
estimating seroconversion rates, enabling data-driven insights for
492-
disease surveillance and public health decision-making.
769+
estimating seroconversion rates with proper accounting for complex
770+
survey designs, enabling data-driven insights for disease surveillance
771+
and public health decision-making.
493772

494773
## Acknowledgments
495774

0 commit comments

Comments
 (0)