@@ -290,11 +290,12 @@ est1 <- est_seroincidence(
290290)
291291
292292summary(est1 )
293- # > # A tibble: 1 × 10
294- # > est.start incidence.rate SE CI.lwr CI.upr coverage log.lik iterations
295- # > <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
296- # > 1 0.1 0.128 0.00682 0.115 0.142 0.95 -2376. 4
297- # > # ℹ 2 more variables: antigen.isos <chr>, nlm.convergence.code <ord>
293+ # > # A tibble: 1 × 11
294+ # > est.start incidence.rate SE CI.lwr CI.upr se_type coverage log.lik
295+ # > <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
296+ # > 1 0.1 0.128 0.00682 0.115 0.142 standard 0.95 -2376.
297+ # > # ℹ 3 more variables: iterations <int>, antigen.isos <chr>,
298+ # > # nlm.convergence.code <ord>
298299```
299300
300301### Stratified Seroincidence
@@ -336,7 +337,7 @@ summary(est_country_age)
336337# > b) Strata : Country, ageCat
337338# >
338339# > Seroincidence estimates:
339- # > # A tibble: 9 × 14
340+ # > # A tibble: 9 × 15
340341# > Stratum Country ageCat n est.start incidence.rate SE CI.lwr CI.upr
341342# > <chr> <chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
342343# > 1 Stratum 1 Banglad… <5 101 0.1 0.400 0.0395 0.330 0.485
@@ -348,8 +349,8 @@ summary(est_country_age)
348349# > 7 Stratum 7 Pakistan <5 126 0.1 0.106 0.0136 0.0823 0.136
349350# > 8 Stratum 8 Pakistan 5-15 261 0.1 0.115 0.00845 0.0991 0.132
350351# > 9 Stratum 9 Pakistan 16+ 107 0.1 0.190 0.0204 0.154 0.235
351- # > # ℹ 5 more variables: coverage <dbl >, log.lik <dbl>, iterations <int >,
352- # > # antigen.isos <chr>, nlm.convergence.code <ord>
352+ # > # ℹ 6 more variables: se_type <chr >, coverage <dbl>, log.lik <dbl >,
353+ # > # iterations <int>, antigen.isos <chr>, nlm.convergence.code <ord>
353354```
354355
355356Note that we get a warning about uneven observations between antigen
@@ -478,18 +479,296 @@ Z-statistic - P-value - 95% confidence interval for the difference
478479We can use this to identify which differences are statistically
479480significant (typically using p \< 0.05 as a threshold).
480481
482+ ## Accounting for Clustered Sampling Designs
483+
484+ In many survey designs, observations are clustered (e.g., multiple
485+ individuals from the same household, geographic area, or school). When
486+ observations within clusters are more similar to each other than to
487+ observations from different clusters, standard errors that ignore
488+ clustering will be too small, leading to overconfident inference.
489+
490+ The ` serocalculator ` package supports cluster-robust standard error
491+ estimation through the ` cluster_var ` parameter.
492+
493+ ### Example: Household Clustering
494+
495+ The SEES data includes cluster identifiers (` cluster ` for geographic
496+ clusters, ` catchment ` for health facility catchment areas, and
497+ ` Country ` ). These variables are nested: clusters are nested within
498+ catchments, which are nested within countries. We can use these to
499+ account for within-cluster correlation:
500+
501+ ``` r
502+ # Account for geographic clustering
503+ # Clusters in the SEES data represent geographic areas
504+ # Use Pakistan data for this example
505+ est_with_clustering <- est_seroincidence(
506+ pop_data = xs_data | > filter(Country == " Pakistan" ),
507+ sr_params = curves ,
508+ noise_params = noise | > filter(Country == " Pakistan" ),
509+ antigen_isos = c(" HlyE_IgG" , " HlyE_IgA" ),
510+ cluster_var = " cluster" # Geographic cluster identifier
511+ )
512+
513+ # The summary will show cluster-robust standard errors
514+ summary(est_with_clustering )
515+ # > # A tibble: 1 × 11
516+ # > est.start incidence.rate SE CI.lwr CI.upr se_type coverage log.lik
517+ # > <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
518+ # > 1 0.1 0.128 0.0104 0.109 0.150 cluster-robust 0.95 -2376.
519+ # > # ℹ 3 more variables: iterations <int>, antigen.isos <chr>,
520+ # > # nlm.convergence.code <ord>
521+
522+ # Compare with non-robust (standard) analysis
523+ est_no_clustering <- est_seroincidence(
524+ pop_data = xs_data | > filter(Country == " Pakistan" ),
525+ sr_params = curves ,
526+ noise_params = noise | > filter(Country == " Pakistan" ),
527+ antigen_isos = c(" HlyE_IgG" , " HlyE_IgA" )
528+ # No cluster_var specified - uses standard errors
529+ )
530+
531+ summary(est_no_clustering )
532+ # > # A tibble: 1 × 11
533+ # > est.start incidence.rate SE CI.lwr CI.upr se_type coverage log.lik
534+ # > <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
535+ # > 1 0.1 0.128 0.00682 0.115 0.142 standard 0.95 -2376.
536+ # > # ℹ 3 more variables: iterations <int>, antigen.isos <chr>,
537+ # > # nlm.convergence.code <ord>
538+ ```
539+
540+ Notice that the point estimates (incidence rates) are identical between
541+ the two analyses, but the standard errors are larger with cluster-robust
542+ estimation. This reflects the reduced effective sample size due to
543+ within-cluster correlation.
544+
545+ The mathematical details of cluster-robust standard errors are explained
546+ in the [ Cluster-robust standard errors
547+ section] ( https://ucd-serg.github.io/serocalculator/methodology.html#cluster-robust-standard-errors-for-clustered-sampling-designs )
548+ of the methodology article.
549+
550+ ### Clustering with Stratified Analysis
551+
552+ Clustering can also be combined with stratified analysis using
553+ [ ` est_seroincidence_by() ` ] ( https://ucd-serg.github.io/serocalculator/reference/est_seroincidence_by.md ) :
554+
555+ ``` r
556+ # Stratified analysis by catchment with cluster adjustment
557+ # Use Pakistan data for this example
558+ est_catchment_clustered <- est_seroincidence_by(
559+ pop_data = xs_data | > filter(Country == " Pakistan" ),
560+ strata = " catchment" ,
561+ sr_params = curves ,
562+ noise_params = noise | > filter(Country == " Pakistan" ),
563+ antigen_isos = c(" HlyE_IgG" , " HlyE_IgA" ),
564+ cluster_var = " cluster" # Account for clustering within each stratum
565+ )
566+ # > Warning: `curve_params` is missing all strata variables and will be used unstratified.
567+ # > ℹ To avoid this warning, specify the desired set of stratifying variables in
568+ # > the `curve_strata_varnames` and `noise_strata_varnames` arguments to
569+ # > `est_seroincidence_by()`.
570+ # > Warning: `noise_params` is missing all strata variables and will be used unstratified.
571+ # > ℹ To avoid this warning, specify the desired set of stratifying variables in
572+ # > the `curve_strata_varnames` and `noise_strata_varnames` arguments to
573+ # > `est_seroincidence_by()`.
574+
575+ summary(est_catchment_clustered )
576+ # > Seroincidence estimated given the following setup:
577+ # > a) Antigen isotypes : HlyE_IgG, HlyE_IgA
578+ # > b) Strata : catchment
579+ # >
580+ # > Seroincidence estimates:
581+ # > # A tibble: 2 × 14
582+ # > Stratum catchment n est.start incidence.rate SE CI.lwr CI.upr se_type
583+ # > <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
584+ # > 1 Stratu… aku 294 0.1 0.106 0.00983 0.0880 0.127 cluste…
585+ # > 2 Stratu… kgh 200 0.1 0.167 0.00905 0.151 0.186 cluste…
586+ # > # ℹ 5 more variables: coverage <dbl>, log.lik <dbl>, iterations <int>,
587+ # > # antigen.isos <chr>, nlm.convergence.code <ord>
588+
589+ # Compare with stratified analysis without clustering
590+ est_catchment_no_clustering <- est_seroincidence_by(
591+ pop_data = xs_data | > filter(Country == " Pakistan" ),
592+ strata = " catchment" ,
593+ sr_params = curves ,
594+ noise_params = noise | > filter(Country == " Pakistan" ),
595+ antigen_isos = c(" HlyE_IgG" , " HlyE_IgA" )
596+ # No cluster_var - standard errors
597+ )
598+ # > Warning: `curve_params` is missing all strata variables and will be used unstratified.
599+ # > ℹ To avoid this warning, specify the desired set of stratifying variables in
600+ # > the `curve_strata_varnames` and `noise_strata_varnames` arguments to
601+ # > `est_seroincidence_by()`.
602+ # > `noise_params` is missing all strata variables and will be used unstratified.
603+ # > ℹ To avoid this warning, specify the desired set of stratifying variables in
604+ # > the `curve_strata_varnames` and `noise_strata_varnames` arguments to
605+ # > `est_seroincidence_by()`.
606+
607+ summary(est_catchment_no_clustering )
608+ # > Seroincidence estimated given the following setup:
609+ # > a) Antigen isotypes : HlyE_IgG, HlyE_IgA
610+ # > b) Strata : catchment
611+ # >
612+ # > Seroincidence estimates:
613+ # > # A tibble: 2 × 14
614+ # > Stratum catchment n est.start incidence.rate SE CI.lwr CI.upr se_type
615+ # > <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
616+ # > 1 Stratu… aku 294 0.1 0.106 0.00767 0.0916 0.122 standa…
617+ # > 2 Stratu… kgh 200 0.1 0.167 0.0133 0.143 0.196 standa…
618+ # > # ℹ 5 more variables: coverage <dbl>, log.lik <dbl>, iterations <int>,
619+ # > # antigen.isos <chr>, nlm.convergence.code <ord>
620+ ```
621+
622+ Within each stratum (catchment), the cluster-robust standard errors are
623+ larger than the standard errors, appropriately accounting for geographic
624+ clustering within each catchment area.
625+
626+ ### Understanding the Impact
627+
628+ When cluster-robust standard errors are used:
629+
630+ - ** Point estimates** (incidence rates) remain unchanged
631+ - ** Standard errors** typically increase by 5-15% to reflect
632+ within-cluster correlation
633+ - The [ ` summary() ` ] ( https://rdrr.io/r/base/summary.html ) output includes
634+ a ` se_type ` column indicating “cluster-robust” vs “standard”
635+ - Confidence intervals appropriately widen to account for reduced
636+ effective sample size
637+
638+ ### Cluster-Robust Country Comparisons
639+
640+ For our main findings comparing all three countries, we should use
641+ cluster-robust standard errors to properly account for the geographic
642+ clustering in the SEES study:
643+
644+ ``` r
645+ # Estimate seroincidence for Bangladesh with cluster adjustment
646+ est_bangladesh_clustered <- est_seroincidence(
647+ pop_data = xs_data | > filter(Country == " Bangladesh" ),
648+ sr_params = curves ,
649+ noise_params = noise | > filter(Country == " Bangladesh" ),
650+ antigen_isos = c(" HlyE_IgG" , " HlyE_IgA" ),
651+ cluster_var = " cluster"
652+ )
653+
654+ # Estimate seroincidence for Nepal with cluster adjustment
655+ est_nepal_clustered <- est_seroincidence(
656+ pop_data = xs_data | > filter(Country == " Nepal" ),
657+ sr_params = curves ,
658+ noise_params = noise | > filter(Country == " Nepal" ),
659+ antigen_isos = c(" HlyE_IgG" , " HlyE_IgA" ),
660+ cluster_var = " cluster"
661+ )
662+
663+ # Estimate seroincidence for Pakistan with cluster adjustment
664+ est_pakistan_clustered <- est_seroincidence(
665+ pop_data = xs_data | > filter(Country == " Pakistan" ),
666+ sr_params = curves ,
667+ noise_params = noise | > filter(Country == " Pakistan" ),
668+ antigen_isos = c(" HlyE_IgG" , " HlyE_IgA" ),
669+ cluster_var = " cluster"
670+ )
671+
672+ # Pairwise comparisons with cluster-robust SEs
673+ comparison_bangla_nepal <- compare_seroincidence(
674+ est_bangladesh_clustered ,
675+ est_nepal_clustered
676+ )
677+
678+ comparison_bangla_pakistan <- compare_seroincidence(
679+ est_bangladesh_clustered ,
680+ est_pakistan_clustered
681+ )
682+
683+ comparison_nepal_pakistan <- compare_seroincidence(
684+ est_nepal_clustered ,
685+ est_pakistan_clustered
686+ )
687+
688+ # Display all comparisons
689+ print(" Bangladesh vs Nepal:" )
690+ # > [1] "Bangladesh vs Nepal:"
691+ print(comparison_bangla_nepal )
692+ # >
693+ # > Two-sample z-test for difference in seroincidence rates
694+ # >
695+ # > data: seroincidence estimates
696+ # > z = 14.99, p-value < 2.2e-16
697+ # > alternative hypothesis: true difference in incidence rates is not equal to 0
698+ # > 95 percent confidence interval:
699+ # > 0.3496836 0.4548783
700+ # > sample estimates:
701+ # > incidence rate 1 incidence rate 2 difference
702+ # > 0.45050113 0.04822018 0.40228095
703+
704+ print(" Bangladesh vs Pakistan:" )
705+ # > [1] "Bangladesh vs Pakistan:"
706+ print(comparison_bangla_pakistan )
707+ # >
708+ # > Two-sample z-test for difference in seroincidence rates
709+ # >
710+ # > data: seroincidence estimates
711+ # > z = 11.372, p-value < 2.2e-16
712+ # > alternative hypothesis: true difference in incidence rates is not equal to 0
713+ # > 95 percent confidence interval:
714+ # > 0.2670811 0.3783145
715+ # > sample estimates:
716+ # > incidence rate 1 incidence rate 2 difference
717+ # > 0.4505011 0.1278034 0.3226978
718+
719+ print(" Nepal vs Pakistan:" )
720+ # > [1] "Nepal vs Pakistan:"
721+ print(comparison_nepal_pakistan )
722+ # >
723+ # > Two-sample z-test for difference in seroincidence rates
724+ # >
725+ # > data: seroincidence estimates
726+ # > z = -6.978, p-value = 2.994e-12
727+ # > alternative hypothesis: true difference in incidence rates is not equal to 0
728+ # > 95 percent confidence interval:
729+ # > -0.10193630 -0.05723007
730+ # > sample estimates:
731+ # > incidence rate 1 incidence rate 2 difference
732+ # > 0.04822018 0.12780336 -0.07958318
733+ ```
734+
735+ The cluster-robust comparisons provide more accurate inference by
736+ accounting for within-cluster correlation in the study design.
737+
481738## Conclusions
482739
483- We estimate that Bangladesh has the highest enteric fever seroconversion
484- rates across all age groups, with the highest rates observed among 5- to
485- 15-year-olds (477 per 1000 person-years). In this age group, the
486- seroconversion rate in Bangladesh is 14 times higher than in Nepal,
487- where the rate is 35 per 1000 person-years. These findings highlight
488- substantial geographic variation in enteric fever transmission,
489- emphasizing the need for targeted prevention strategies.
740+ Using cluster-robust standard errors to account for geographic
741+ clustering in the SEES study, we observe significant variation in
742+ enteric fever seroconversion rates across the three countries.
743+ Bangladesh has the highest overall seroconversion rate at 450.5 per 1000
744+ person-years (95% CI: 401.6-505.4), followed by Pakistan at 127.8 per
745+ 1000 person-years (95% CI: 109-149.8), and Nepal at 48.2 per 1000
746+ person-years (95% CI: 39.8-58.5). Pairwise comparisons show Bangladesh
747+ has significantly higher rates than both Nepal (p \< 0.001) and Pakistan
748+ (p \< 0.001), while the difference between Nepal and Pakistan is also
749+ significant (p \< 0.001).
750+
751+ Across age groups, the highest rates are observed among 5- to
752+ 15-year-olds in Bangladesh (477 per 1000 person-years), which is 14
753+ times higher than Nepal in the same age group (35 per 1000
754+ person-years). These findings highlight substantial geographic variation
755+ in enteric fever transmission, emphasizing the need for targeted
756+ prevention strategies tailored to local epidemiology.
757+
758+ The cluster-robust approach properly accounts for within-cluster
759+ correlation arising from the geographic sampling design. While point
760+ estimates remain unchanged, the cluster-robust standard errors and
761+ confidence intervals provide more accurate uncertainty quantification,
762+ ensuring valid statistical inference. This adjustment is particularly
763+ important for survey designs where observations within clusters (e.g.,
764+ households, schools, or geographic areas) are correlated, as failing to
765+ account for clustering can lead to underestimated standard errors and
766+ overly narrow confidence intervals.
767+
490768** serocalculator** offers an efficient and reproducible approach to
491- estimating seroconversion rates, enabling data-driven insights for
492- disease surveillance and public health decision-making.
769+ estimating seroconversion rates with proper accounting for complex
770+ survey designs, enabling data-driven insights for disease surveillance
771+ and public health decision-making.
493772
494773## Acknowledgments
495774
0 commit comments