diff --git a/.gitattributes b/.gitattributes index 4b76ca860..6c4039cfa 100644 --- a/.gitattributes +++ b/.gitattributes @@ -1 +1,2 @@ NEWS.md merge=union +man/* linguist-generated diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 0182f5608..63c3c4402 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -244,6 +244,56 @@ To regenerate: rmarkdown::render("README.Rmd") ``` +### Vignette Subfiles + +When creating subfiles to be included in vignettes (e.g., using Quarto's `{{< include >}}` directive): + +**CRITICAL**: Always keep the main section header in the parent file, not in the subfile. + +- ✅ **CORRECT**: In parent file: `## Section Title`, then `{{< include subfile.qmd >}}` +- ❌ **INCORRECT**: Subfile contains `## Section Title` as its first line + +**Naming convention**: Subfiles that are included in other documents should be prefixed with `_` (underscore), e.g., `_cluster-robust-se.qmd`, `_antibody-response-model.qmd` + +**Example structure**: + +Parent file (`methodology.qmd`): +```markdown +## Cluster-robust standard errors + +{{< include articles/_cluster-robust-se.qmd >}} +``` + +Subfile (`articles/_cluster-robust-se.qmd`): +```markdown +In many survey designs, observations are clustered... + +### Subsection Title +... +``` + +This ensures proper document structure and makes it clear where each section begins when viewing the parent document. + +### Version Management + +**CRITICAL**: Always ensure the development version in your PR branch is one version number higher than the main branch. + +```r +# Check current version +desc::desc_get_version() + +# Increment development version (use this for PRs) +usethis::use_version('dev') +``` + +**Version Check Workflow**: The `version-check.yaml` workflow will fail if your PR branch version is not higher than the main branch version. Before requesting PR review, always: + +1. Check the current version on the main branch (look at DESCRIPTION on main) +2. Ensure your PR branch version is at least one development version higher +3. If main is at 1.4.0.9003, your PR should be at minimum 1.4.0.9004 + +**Why this matters**: This ensures proper version tracking and prevents conflicts when multiple PRs are merged. + ### Package Checking Run R CMD check to validate the package: @@ -469,6 +519,54 @@ expect_false(has_missing_values(complete_data)) 4. **Run tests**: Validate all tests pass, updating snapshots only when changes are intentional 5. **Review snapshots**: When snapshots change, review the diff to ensure changes are expected +## Code Organization Policies + +**CRITICAL**: Follow these strict code organization policies for all new code and refactoring work: + +### File Organization + +1. **One function per file**: Each exported function and its associated S3 methods should be in its own file + - File name should match the function name (e.g., `summary.seroincidence.R` for `summary.seroincidence()`) + - S3 methods for the same generic can be in the same file (e.g., `compare_seroincidence.seroincidence()`, `compare_seroincidence.seroincidence.by()`, and `compare_seroincidence.default()` all in `compare_seroincidence.R`) + +2. **Internal helper functions**: Move to separate files + - Use descriptive file names (e.g., `compute_cluster_robust_var.R` for `.compute_cluster_robust_var()`) + - Keep related internal functions together when logical + - Internal functions should use `.function_name()` naming convention + +3. **Print methods**: Each print method in its own file + - File name: `print.{class_name}.R` (e.g., `print.seroincidence.R`) + +4. **Extract anonymous functions**: Convert complex anonymous functions to named helper functions in separate files + - If an anonymous function is longer than ~5 lines, extract it + - Name should describe its purpose (e.g., `.helper_function_name()`) + +### Example Organization + +1. **Long examples**: Move to `inst/examples/exm-{function_name}.R` + - Use `@example inst/examples/exm-{function_name}.R` in roxygen documentation + - Keep inline `@examples` short (1-3 lines) for simple demonstrations + +2. **Example file naming**: `exm-{function_name}.R` + - Example: `exm-est_seroincidence.R` for `est_seroincidence()` examples + +### Benefits + +- **Easier navigation**: Find functions quickly by file name +- **Better git history**: Changes to one function don't pollute history of unrelated functions +- **Clearer code review**: Reviewers can focus on individual functions +- **Reduced merge conflicts**: Multiple people can work on different functions simultaneously +- **Better organization**: Logical structure makes codebase more maintainable + +### Migration Strategy + +When refactoring existing code: +1. Extract functions to separate files +2. Update any internal calls if needed +3. Run `devtools::document()` to regenerate documentation +4. Run `devtools::check()` to ensure no breakage +5. Run tests to verify functionality unchanged + ## Code Style Guidelines - **Follow tidyverse style guide**: https://style.tidyverse.org diff --git a/DESCRIPTION b/DESCRIPTION index f85dbda03..db2d70aad 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,7 +1,7 @@ Type: Package Package: serocalculator Title: Estimating Infection Rates from Serological Data -Version: 1.4.0.9003 +Version: 1.4.0.9004 Authors@R: c( person("Kristina", "Lai", , "kwlai@ucdavis.edu", role = c("aut", "cre")), person("Chris", "Orwa", role = "aut"), @@ -77,7 +77,7 @@ Encoding: UTF-8 Language: en-US LazyData: true NeedsCompilation: no -Roxygen: list(markdown = TRUE, roclets = c("collate", "rd", "namespace", "devtag::dev_roclet")) +Roxygen: list(markdown = TRUE, roclets = c("collate", "rd", "namespace")) RoxygenNote: 7.3.3 Remotes: d-morrison/snapr diff --git a/NEWS.md b/NEWS.md index 20e004582..12a310537 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,9 +1,42 @@ # serocalculator (development version) +## New features + +* Added `cluster_var` and `stratum_var` parameters to `est_seroincidence()` and + `est_seroincidence_by()` to support cluster-robust standard error estimation. + When `cluster_var` is specified, `summary.seroincidence()` automatically computes + cluster-robust (sandwich) variance estimates to account for within-cluster + correlation in clustered sampling designs such as household or school-based surveys. +* `cluster_var` parameter now accepts multiple variables (e.g., `c("school", "classroom")`) + for multi-level clustered sampling designs. Cluster-robust standard errors will account + for all specified clustering levels. + +## Bug fixes + +* Fixed column naming issue in `summary.seroincidence()` where cluster-robust standard + errors caused `[]` notation in column names (`SE[,1]` instead of `SE`). +* Added `se_type` column to `summary.seroincidence()` output to clearly indicate whether + "standard" or "cluster-robust" standard errors are being used. +* Fixed `est_seroincidence_by()` to properly pass cluster and stratum variables through + to stratified analyses. Previously, these variables were dropped during data stratification, + causing errors when trying to use clustering with `est_seroincidence_by()`. + +## Code organization + +* Refactored clustering-related code following package organization policies: + - Moved `.compute_cluster_robust_var()` to `R/compute_cluster_robust_var.R` + - Each function now in its own file for better maintainability and git history +* Updated copilot-instructions.md with code organization policies + # serocalculator 1.4.0 ## New features +* Added support for cluster-robust standard errors in `est_seroincidence()` through + new `cluster_var` and `stratum_var` parameters. When `cluster_var` is specified, + `summary.seroincidence()` automatically computes cluster-robust (sandwich) variance + estimates to account for within-cluster correlation in clustered sampling designs + such as household or school-based surveys. * Added `compare_seroincidence()` function for statistical comparison of seroincidence rates - Performs two-sample z-tests to compare seroincidence estimates - Returns `htest` format when comparing two single estimates diff --git a/R/compute_cluster_robust_var.R b/R/compute_cluster_robust_var.R new file mode 100644 index 000000000..533508be2 --- /dev/null +++ b/R/compute_cluster_robust_var.R @@ -0,0 +1,119 @@ +#' Compute cluster-robust variance for seroincidence estimates +#' +#' @description +#' Computes cluster-robust (sandwich) variance estimates for seroincidence +#' parameter estimates when data come from a clustered sampling design. +#' This adjusts the standard errors to account for within-cluster correlation. +#' +#' @param fit a `seroincidence` object from [est_seroincidence()] +#' @param cluster_var name(s) of the cluster variable(s) in the data. +#' Can be a single variable or vector of variables for multi-level clustering. +#' @param stratum_var optional name of the stratum variable +#' +#' @return variance of log(lambda) accounting for clustering +#' @keywords internal +#' @noRd +.compute_cluster_robust_var <- function( + fit, + cluster_var, + stratum_var = NULL) { + # Extract stored data (already split by antigen_iso) + pop_data_list <- attr(fit, "pop_data") + sr_params_list <- attr(fit, "sr_params") + noise_params_list <- attr(fit, "noise_params") + antigen_isos <- attr(fit, "antigen_isos") + + # Get MLE estimate + log_lambda_mle <- fit$estimate + + # Combine pop_data list back into a single data frame + # to get cluster info + pop_data_combined <- do.call(rbind, pop_data_list) + + # Compute score (gradient) using numerical differentiation + # The score is the derivative of log-likelihood w.r.t. log(lambda) + epsilon <- 1e-6 + + # For each observation, compute the contribution to the score + # We need to identify which cluster each observation belongs to + + # Handle multiple clustering levels by creating composite cluster ID + if (length(cluster_var) == 1) { + cluster_ids <- pop_data_combined[[cluster_var]] + } else { + # Create composite cluster ID from multiple variables + cluster_ids <- interaction( + pop_data_combined[, cluster_var, drop = FALSE], + drop = TRUE, + sep = "_" + ) + } + + # Get unique clusters + unique_clusters <- unique(cluster_ids) + n_clusters <- length(unique_clusters) + + # Compute cluster-level scores + cluster_scores <- numeric(n_clusters) + + for (i in seq_along(unique_clusters)) { + cluster_id <- unique_clusters[i] + + # Get observations in this cluster + cluster_mask <- cluster_ids == cluster_id + + # Create temporary pop_data with only this cluster + pop_data_cluster <- pop_data_combined[cluster_mask, , drop = FALSE] + + # Split by antigen + pop_data_cluster_list <- split( + pop_data_cluster, + pop_data_cluster$antigen_iso + ) + + # Ensure all antigen_isos are represented + # (add empty data frames if missing) + for (ag in antigen_isos) { + if (!ag %in% names(pop_data_cluster_list)) { + # Create empty data frame with correct structure + pop_data_cluster_list[[ag]] <- pop_data_list[[ag]][0, , drop = FALSE] + } + } + + # Compute log-likelihood for this cluster at MLE + ll_cluster_mle <- -(.nll( + log.lambda = log_lambda_mle, + pop_data = pop_data_cluster_list, + antigen_isos = antigen_isos, + curve_params = sr_params_list, + noise_params = noise_params_list, + verbose = FALSE + )) + + # Compute log-likelihood at MLE + epsilon + ll_cluster_plus <- -(.nll( + log.lambda = log_lambda_mle + epsilon, + pop_data = pop_data_cluster_list, + antigen_isos = antigen_isos, + curve_params = sr_params_list, + noise_params = noise_params_list, + verbose = FALSE + )) + + # Numerical derivative (score for this cluster) + cluster_scores[i] <- (ll_cluster_plus - ll_cluster_mle) / epsilon + } + + # Compute B matrix (middle of sandwich) + # B = sum of outer products of cluster scores + b_matrix <- sum(cluster_scores^2) # nolint: object_name_linter + + # Get Hessian (already computed by nlm) + h_matrix <- fit$hessian # nolint: object_name_linter + + # Sandwich variance: V = H^(-1) * B * H^(-1) + # Since we have a scalar parameter, this simplifies to: + var_log_lambda_robust <- b_matrix / (h_matrix^2) + + return(var_log_lambda_robust) +} diff --git a/R/est_seroincidence.R b/R/est_seroincidence.R index b143604cd..373aac85b 100644 --- a/R/est_seroincidence.R +++ b/R/est_seroincidence.R @@ -28,6 +28,19 @@ #' - `alpha`: antibody decay rate #' (1/days for the current longitudinal parameter sets) #' - `r`: shape factor of antibody decay +#' @param cluster_var optional name(s) of the variable(s) in `pop_data` +#' containing cluster identifiers for clustered sampling designs +#' (e.g., households, schools). +#' Can be a single variable name (character string) or a vector of +#' variable names for multi-level clustering (e.g., `c("school", +#' "classroom")`). When provided, standard errors will be adjusted for +#' within-cluster correlation using cluster-robust variance estimation. +#' @param stratum_var optional name of the variable in `pop_data` containing +#' stratum identifiers. Used in combination with `cluster_var` for +#' stratified cluster sampling designs. +#' @param sampling_weights optional [data.frame] containing sampling +#' weights with columns for cluster/stratum identifiers and their sampling +#' probabilities. Currently not implemented; reserved for future use. #' @inheritDotParams stats::nlm -f -p -hessian -print.level -steptol #' @returns a `"seroincidence"` object, which is a [stats::nlm()] fit object @@ -47,6 +60,7 @@ #' noise <- #' example_noise_params_pk #' +#' # Basic usage without clustering #' est1 <- est_seroincidence( #' pop_data = xs_data, #' sr_params = sr_curve, @@ -55,6 +69,30 @@ #' ) #' #' summary(est1) +#' +#' # Usage with clustered sampling design +#' # Standard errors will be adjusted for within-cluster correlation +#' est2 <- est_seroincidence( +#' pop_data = xs_data, +#' sr_params = sr_curve, +#' noise_params = noise, +#' antigen_isos = c("HlyE_IgG", "HlyE_IgA"), +#' cluster_var = "cluster" +#' ) +#' +#' summary(est2) +#' +#' # With both cluster and stratum variables +#' est3 <- est_seroincidence( +#' pop_data = xs_data, +#' sr_params = sr_curve, +#' noise_params = noise, +#' antigen_isos = c("HlyE_IgG", "HlyE_IgA"), +#' cluster_var = "cluster", +#' stratum_var = "catchment" +#' ) +#' +#' summary(est3) est_seroincidence <- function( pop_data, sr_params, @@ -66,25 +104,47 @@ est_seroincidence <- function( verbose = FALSE, build_graph = FALSE, print_graph = build_graph & verbose, + cluster_var = NULL, + stratum_var = NULL, + sampling_weights = NULL, ...) { if (verbose > 1) { - message("inputs to `est_seroincidence()`:") + cli::cli_inform("inputs to `est_seroincidence()`:") print(environment() |> as.list()) } + # Validate cluster/stratum parameters + .validate_cluster_params( + pop_data = pop_data, + cluster_var = cluster_var, + stratum_var = stratum_var, + sampling_weights = sampling_weights + ) + .error_check( data = pop_data, antigen_isos = antigen_isos, curve_params = sr_params ) + # Prepare columns to keep + cols_to_keep <- c( + pop_data |> get_values_var(), + pop_data |> get_age_var(), + "antigen_iso" + ) + + # Add cluster/stratum variables if specified + if (!is.null(cluster_var)) { + cols_to_keep <- c(cols_to_keep, cluster_var) + } + if (!is.null(stratum_var)) { + cols_to_keep <- c(cols_to_keep, stratum_var) + } + pop_data <- pop_data |> dplyr::filter(.data$antigen_iso %in% antigen_isos) |> - dplyr::select( - pop_data |> get_values_var(), - pop_data |> get_age_var(), - "antigen_iso" - ) |> + dplyr::select(dplyr::all_of(cols_to_keep)) |> filter(if_all(everything(), ~!is.na(.x))) sr_params <- sr_params |> @@ -103,7 +163,7 @@ est_seroincidence <- function( # incidence can not be calculated if there are zero observations. if (nrow(pop_data) == 0) { - stop("No data provided.") + cli::cli_abort("No data provided.") } if (verbose) { @@ -111,7 +171,7 @@ est_seroincidence <- function( } if (nrow(noise_params) != length(antigen_isos)) { - stop("too many rows of noise parameters.") + cli::cli_abort("too many rows of noise parameters.") } pop_data <- pop_data |> split(~antigen_iso) @@ -130,16 +190,18 @@ est_seroincidence <- function( ) if (is.na(res)) { - warning("Could not calculate log-likelihood with starting parameter value.") + cli::cli_warn( + "Could not calculate log-likelihood with starting parameter value." + ) return(NULL) } if (verbose) { - message("Initial negative log-likelihood: ", res) + cli::cli_inform("Initial negative log-likelihood: {res}") } if (build_graph) { - if (verbose) message("building likelihood graph") + if (verbose) cli::cli_inform("building likelihood graph") graph <- graph_loglik( highlight_points = lambda_start, highlight_point_names = "lambda_start", @@ -166,7 +228,7 @@ est_seroincidence <- function( # but [.nll()] is vectorized via its subfunction [f_dev()]. # The vectorization doesn't appear to cause a problem for [nlm()]. - if (verbose) message("about to call `nlm()`") + if (verbose) cli::cli_inform("about to call `nlm()`") # Estimate lambda time <- system.time( { @@ -190,15 +252,17 @@ est_seroincidence <- function( code_text <- nlm_exit_codes[fit$code] message1 <- "\n`nlm()` completed with the following convergence code:\n" if (fit$code %in% 3:5) { - warning( - "`nlm()` may not have reached the maximum likelihood estimate.", - message1, - code_text + cli::cli_warn( + c( + "`nlm()` may not have reached the maximum likelihood estimate.", + message1, + code_text + ) ) } if (verbose >= 2) { - message("\nElapsed time: ") + cli::cli_inform("\nElapsed time: ") print(time) } @@ -223,13 +287,29 @@ est_seroincidence <- function( } } - fit <- fit |> - structure( - class = union("seroincidence", class(fit)), - lambda_start = lambda_start, - antigen_isos = antigen_isos, - ll_graph = graph - ) + # Store clustering-related attributes only if clustering is being used + if (!is.null(cluster_var)) { + fit <- fit |> + structure( + class = union("seroincidence", class(fit)), + lambda_start = lambda_start, + antigen_isos = antigen_isos, + ll_graph = graph, + cluster_var = cluster_var, + stratum_var = stratum_var, + pop_data = pop_data, + sr_params = sr_params, + noise_params = noise_params + ) + } else { + fit <- fit |> + structure( + class = union("seroincidence", class(fit)), + lambda_start = lambda_start, + antigen_isos = antigen_isos, + ll_graph = graph + ) + } return(fit) } @@ -244,7 +324,7 @@ est_seroincidence <- function( #' @keywords internal #' @export est.incidence <- function( # nolint: object_name_linter - ...) { + ...) { lifecycle::deprecate_soft("1.3.1", "est.incidence()", "est_seroincidence()") est_seroincidence( ... diff --git a/R/est_seroincidence_by.R b/R/est_seroincidence_by.R index bcb6d84ce..c483e2ced 100644 --- a/R/est_seroincidence_by.R +++ b/R/est_seroincidence_by.R @@ -84,6 +84,8 @@ est_seroincidence_by <- function( num_cores = 1L, verbose = FALSE, print_graph = FALSE, + cluster_var = NULL, + stratum_var = NULL, ...) { strata_is_empty <- @@ -115,6 +117,8 @@ est_seroincidence_by <- function( antigen_isos = antigen_isos, build_graph = build_graph, verbose = verbose, + cluster_var = cluster_var, + stratum_var = stratum_var, ... ) return(to_return) @@ -136,7 +140,9 @@ est_seroincidence_by <- function( noise_params = noise_params |> filter(.data$antigen_iso %in% antigen_isos), strata_varnames = strata, curve_strata_varnames = curve_strata_varnames, - noise_strata_varnames = noise_strata_varnames + noise_strata_varnames = noise_strata_varnames, + cluster_var = cluster_var, + stratum_var = stratum_var ) strata_table <- stratum_data_list |> attr("strata") @@ -172,7 +178,7 @@ est_seroincidence_by <- function( `num_cores` = {num_cores}.") } - lib_paths <- .libPaths() + lib_paths <- .libPaths() # nolint: undesirable_function_linter cl <- num_cores |> parallel::makeCluster() |> @@ -181,15 +187,27 @@ est_seroincidence_by <- function( parallel::stopCluster(cl) }) - # Export library paths to the cluster - parallel::clusterExport(cl, "lib_paths", envir = environment()) + # Export library paths and cluster variables to the cluster + parallel::clusterExport( + cl, + c("lib_paths", "lambda_start", "build_graph"), + envir = environment() + ) + + # Export cluster_var and stratum_var specifically + if (!is.null(cluster_var)) { + parallel::clusterExport(cl, "cluster_var", envir = environment()) + } + if (!is.null(stratum_var)) { + parallel::clusterExport(cl, "stratum_var", envir = environment()) + } # Evaluate library loading on the cluster parallel::clusterEvalQ(cl, { - .libPaths(lib_paths) + .libPaths(lib_paths) # nolint: undesirable_function_linter # note - this gets out of sync when using load_all() in development - require(serocalculator) - require(dplyr) + require(serocalculator) # nolint: undesirable_function_linter + require(dplyr) # nolint: undesirable_function_linter }) # Perform parallel computation and record execution time @@ -207,7 +225,8 @@ est_seroincidence_by <- function( build_graph = build_graph, print_graph = FALSE, verbose = FALSE, - ... + cluster_var = cluster_var, + stratum_var = stratum_var ) ) ) @@ -244,6 +263,8 @@ est_seroincidence_by <- function( build_graph = build_graph, print_graph = print_graph, verbose = verbose, + cluster_var = cluster_var, + stratum_var = stratum_var, ... ) ) diff --git a/R/stratify_data.R b/R/stratify_data.R index a111b684f..260794e87 100644 --- a/R/stratify_data.R +++ b/R/stratify_data.R @@ -40,7 +40,9 @@ stratify_data <- function(data, strata_varnames = "", curve_strata_varnames = NULL, noise_strata_varnames = NULL, - antigen_isos = get_biomarker_levels(data)) { + antigen_isos = get_biomarker_levels(data), + cluster_var = NULL, + stratum_var = NULL) { curve_params <- curve_params |> filter(.data[["antigen_iso"]] %in% antigen_isos) @@ -53,14 +55,23 @@ stratify_data <- function(data, all(strata_varnames == "") if (no_strata) { + # Determine which columns to keep + cols_to_keep <- c( + data |> get_values_var(), + data |> get_age_var(), + data |> get_biomarker_names_var() + ) + + # Add cluster/stratum variables if specified + if (!is.null(cluster_var)) { + cols_to_keep <- c(cols_to_keep, cluster_var) + } + if (!is.null(stratum_var)) { + cols_to_keep <- c(cols_to_keep, stratum_var) + } + pop_data <- - data |> select(all_of( - c( - data |> get_values_var(), - data |> get_age_var(), - data |> get_biomarker_names_var() - ) - )) + data |> select(all_of(cols_to_keep)) all_data <- list( @@ -105,14 +116,25 @@ stratify_data <- function(data, cur_stratum_vals <- strata |> dplyr::filter(.data$Stratum == cur_stratum) + # Determine which columns to keep + cols_to_keep <- c( + data |> get_values_var(), + data |> get_age_var(), + data |> get_biomarker_names_var() + ) + + # Add cluster/stratum variables if specified + if (!is.null(cluster_var)) { + cols_to_keep <- c(cols_to_keep, cluster_var) + } + if (!is.null(stratum_var)) { + cols_to_keep <- c(cols_to_keep, stratum_var) + } + pop_data_cur_stratum <- data |> semi_join(cur_stratum_vals, by = strata_varnames) |> - select( - data |> get_values_var(), - data |> get_age_var(), - data |> get_biomarker_names_var() - ) + select(all_of(cols_to_keep)) antigen_isos_cur_stratum <- intersect(antigen_isos, diff --git a/R/summary.seroincidence.R b/R/summary.seroincidence.R index 65c5f8d31..a94d1f8ab 100644 --- a/R/summary.seroincidence.R +++ b/R/summary.seroincidence.R @@ -1,6 +1,9 @@ #' @title Summarizing fitted seroincidence models #' @description #' This function is a `summary()` method for `seroincidence` objects. +#' When the model was fit with clustered data (using the `cluster_var` +#' parameter in [est_seroincidence()]), this function automatically computes +#' cluster-robust standard errors to account for within-cluster correlation. #' #' @param object a [list()] outputted by [stats::nlm()] or [est_seroincidence()] #' @param coverage desired confidence interval coverage probability @@ -11,8 +14,10 @@ #' * `est.start`: the starting guess for incidence rate #' * `ageCat`: the age category we are analyzing #' * `incidence.rate`: the estimated incidence rate, per person year +#' * `SE`: standard error of the incidence rate estimate #' * `CI.lwr`: lower limit of confidence interval for incidence rate #' * `CI.upr`: upper limit of confidence interval for incidence rate +#' * `se_type`: type of standard error used ("standard" or "cluster-robust") #' * `coverage`: coverage probability #' * `log.lik`: #' log-likelihood of the data used in the call to `est_seroincidence()`, @@ -69,6 +74,8 @@ summary.seroincidence <- function( ...) { start <- object |> attr("lambda_start") antigen_isos <- object |> attr("antigen_isos") + cluster_var <- object |> attr("cluster_var") + stratum_var <- object |> attr("stratum_var") alpha <- 1 - coverage h_alpha <- alpha / 2 @@ -83,7 +90,31 @@ summary.seroincidence <- function( } log_lambda <- object$estimate - var_log_lambda <- 1 / object$hessian |> as.vector() + + # Check if cluster-robust variance should be computed + use_cluster_robust <- !is.null(cluster_var) + + if (use_cluster_robust) { + if (verbose) { + cli::cli_inform(c( + "i" = paste( + "Computing cluster-robust standard errors using", + "{.field {cluster_var}} variable." + ) + )) + } + var_log_lambda <- .compute_cluster_robust_var( + fit = object, + cluster_var = cluster_var, + stratum_var = stratum_var + ) + } else { + # Standard variance from Hessian + var_log_lambda <- 1 / object$hessian |> as.vector() + } + + # Ensure var_log_lambda is a scalar + var_log_lambda <- as.numeric(var_log_lambda)[1] se_log_lambda <- sqrt(var_log_lambda) to_return <- tibble::tibble( @@ -93,6 +124,7 @@ summary.seroincidence <- function( # https://en.wikipedia.org/wiki/Delta_method#Univariate_delta_method CI.lwr = exp(log_lambda - qnorm(1 - h_alpha) * se_log_lambda), CI.upr = exp(log_lambda + qnorm(1 - h_alpha) * se_log_lambda), + se_type = if (use_cluster_robust) "cluster-robust" else "standard", coverage = coverage, log.lik = -object$minimum, iterations = object$iterations, diff --git a/R/validate_cluster_params.R b/R/validate_cluster_params.R new file mode 100644 index 000000000..3b137b62f --- /dev/null +++ b/R/validate_cluster_params.R @@ -0,0 +1,52 @@ +#' Validate cluster and stratum parameters +#' +#' Internal function to validate cluster_var and stratum_var parameters +#' for clustered sampling designs. +#' +#' @param pop_data a [data.frame] with cross-sectional serology data +#' @param cluster_var optional name(s) of cluster identifier variable(s) +#' @param stratum_var optional name of stratum identifier variable +#' @param sampling_weights optional [data.frame] containing sampling weights +#' +#' @return NULL (called for side effects - validation errors) +#' @keywords internal +#' @noRd +.validate_cluster_params <- function(pop_data, + cluster_var = NULL, + stratum_var = NULL, + sampling_weights = NULL) { + # Validate sampling_weights parameter + if (!is.null(sampling_weights)) { + cli::cli_warn( + "{.arg sampling_weights} is not yet implemented and will be ignored." + ) + } + + # Validate cluster_var parameter + if (!is.null(cluster_var)) { + # Check all cluster variables exist in pop_data + missing_vars <- setdiff(cluster_var, names(pop_data)) + if (length(missing_vars) > 0) { + cli::cli_abort(c( + "x" = paste( + "{.arg cluster_var} = {.val {missing_vars}}", + "is not a column in {.arg pop_data}." + ) + )) + } + } + + # Validate stratum_var parameter + if (!is.null(stratum_var)) { + if (!stratum_var %in% names(pop_data)) { + cli::cli_abort(c( + "x" = paste( + "{.arg stratum_var} = {.val {stratum_var}}", + "is not a column in {.arg pop_data}." + ) + )) + } + } + + invisible(NULL) +} diff --git a/README.md b/README.md index 4dd52bda8..a7dd193d5 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,6 @@ ================ - [![R-CMD-check](https://github.com/UCD-SERG/serocalculator/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/UCD-SERG/serocalculator/actions/workflows/R-CMD-check.yaml) diff --git a/inst/WORDLIST b/inst/WORDLIST index 8f6476afd..dfcfc3da7 100644 --- a/inst/WORDLIST +++ b/inst/WORDLIST @@ -77,10 +77,12 @@ mathbb mathbf mathcal mcmc +md mitre modelled multicohort multiversion +nabla nd nr olds @@ -99,6 +101,7 @@ qmd qquad recombinant renewcommand +se sectionally sera seroconversion diff --git a/man/est_seroincidence.Rd b/man/est_seroincidence.Rd index 25cc2bbfb..c414b1d6b 100644 --- a/man/est_seroincidence.Rd +++ b/man/est_seroincidence.Rd @@ -15,6 +15,9 @@ est_seroincidence( verbose = FALSE, build_graph = FALSE, print_graph = build_graph & verbose, + cluster_var = NULL, + stratum_var = NULL, + sampling_weights = NULL, ... ) } @@ -73,6 +76,21 @@ a range of incidence rates (lambda values)} \item{print_graph}{whether to display the log-likelihood curve graph in the course of running \code{est_seroincidence()}} +\item{cluster_var}{optional name(s) of the variable(s) in \code{pop_data} +containing cluster identifiers for clustered sampling designs +(e.g., households, schools). +Can be a single variable name (character string) or a vector of +variable names for multi-level clustering (e.g., \code{c("school", "classroom")}). When provided, standard errors will be adjusted for +within-cluster correlation using cluster-robust variance estimation.} + +\item{stratum_var}{optional name of the variable in \code{pop_data} containing +stratum identifiers. Used in combination with \code{cluster_var} for +stratified cluster sampling designs.} + +\item{sampling_weights}{optional \link{data.frame} containing sampling +weights with columns for cluster/stratum identifiers and their sampling +probabilities. Currently not implemented; reserved for future use.} + \item{...}{ Arguments passed on to \code{\link[stats:nlm]{stats::nlm}} \describe{ @@ -117,6 +135,7 @@ sr_curve <- noise <- example_noise_params_pk +# Basic usage without clustering est1 <- est_seroincidence( pop_data = xs_data, sr_params = sr_curve, @@ -125,4 +144,28 @@ est1 <- est_seroincidence( ) summary(est1) + +# Usage with clustered sampling design +# Standard errors will be adjusted for within-cluster correlation +est2 <- est_seroincidence( + pop_data = xs_data, + sr_params = sr_curve, + noise_params = noise, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = "cluster" +) + +summary(est2) + +# With both cluster and stratum variables +est3 <- est_seroincidence( + pop_data = xs_data, + sr_params = sr_curve, + noise_params = noise, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = "cluster", + stratum_var = "catchment" +) + +summary(est3) } diff --git a/man/est_seroincidence_by.Rd b/man/est_seroincidence_by.Rd index 2eba32086..fef0c6cc6 100644 --- a/man/est_seroincidence_by.Rd +++ b/man/est_seroincidence_by.Rd @@ -17,6 +17,8 @@ est_seroincidence_by( num_cores = 1L, verbose = FALSE, print_graph = FALSE, + cluster_var = NULL, + stratum_var = NULL, ... ) } @@ -79,11 +81,25 @@ then the computations are executed in parallel. Default = 1L.} \item{print_graph}{whether to display the log-likelihood curve graph in the course of running \code{est_seroincidence()}} +\item{cluster_var}{optional name(s) of the variable(s) in \code{pop_data} +containing cluster identifiers for clustered sampling designs +(e.g., households, schools). +Can be a single variable name (character string) or a vector of +variable names for multi-level clustering (e.g., \code{c("school", "classroom")}). When provided, standard errors will be adjusted for +within-cluster correlation using cluster-robust variance estimation.} + +\item{stratum_var}{optional name of the variable in \code{pop_data} containing +stratum identifiers. Used in combination with \code{cluster_var} for +stratified cluster sampling designs.} + \item{...}{ Arguments passed on to \code{\link[=est_seroincidence]{est_seroincidence}}, \code{\link[stats:nlm]{stats::nlm}} \describe{ \item{\code{stepmin}}{A positive scalar providing the minimum allowable relative step length.} + \item{\code{sampling_weights}}{optional \link{data.frame} containing sampling +weights with columns for cluster/stratum identifiers and their sampling +probabilities. Currently not implemented; reserved for future use.} \item{\code{stepmax}}{a positive scalar which gives the maximum allowable scaled step length. \code{stepmax} is used to prevent steps which would cause the optimization function to overflow, to prevent the diff --git a/man/stratify_data.Rd b/man/stratify_data.Rd index b543d2bdf..b120672a8 100644 --- a/man/stratify_data.Rd +++ b/man/stratify_data.Rd @@ -11,7 +11,9 @@ stratify_data( strata_varnames = "", curve_strata_varnames = NULL, noise_strata_varnames = NULL, - antigen_isos = get_biomarker_levels(data) + antigen_isos = get_biomarker_levels(data), + cluster_var = NULL, + stratum_var = NULL ) } \arguments{ diff --git a/man/summary.seroincidence.Rd b/man/summary.seroincidence.Rd index 45d4d7897..896604a5b 100644 --- a/man/summary.seroincidence.Rd +++ b/man/summary.seroincidence.Rd @@ -21,8 +21,10 @@ a \code{\link[tibble:tibble]{tibble::tibble()}} containing the following: \item \code{est.start}: the starting guess for incidence rate \item \code{ageCat}: the age category we are analyzing \item \code{incidence.rate}: the estimated incidence rate, per person year +\item \code{SE}: standard error of the incidence rate estimate \item \code{CI.lwr}: lower limit of confidence interval for incidence rate \item \code{CI.upr}: upper limit of confidence interval for incidence rate +\item \code{se_type}: type of standard error used ("standard" or "cluster-robust") \item \code{coverage}: coverage probability \item \code{log.lik}: log-likelihood of the data used in the call to \code{est_seroincidence()}, @@ -55,6 +57,9 @@ or \code{stepmax} is too small. } \description{ This function is a \code{summary()} method for \code{seroincidence} objects. +When the model was fit with clustered data (using the \code{cluster_var} +parameter in \code{\link[=est_seroincidence]{est_seroincidence()}}), this function automatically computes +cluster-robust standard errors to account for within-cluster correlation. } \examples{ diff --git a/tests/testthat/_snaps/est_seroincidence.md b/tests/testthat/_snaps/est_seroincidence.md index 7417bc4fa..ee1c422a7 100644 --- a/tests/testthat/_snaps/est_seroincidence.md +++ b/tests/testthat/_snaps/est_seroincidence.md @@ -3,11 +3,12 @@ Code summary(typhoid_results, coverage = 0.95) Output - # A tibble: 1 x 10 - est.start incidence.rate SE CI.lwr CI.upr coverage log.lik iterations - - 1 0.1 0.166 0.0178 0.135 0.205 0.95 -524. 5 - # i 2 more variables: antigen.isos , nlm.convergence.code + # A tibble: 1 x 11 + est.start incidence.rate SE CI.lwr CI.upr se_type coverage log.lik + + 1 0.1 0.166 0.0178 0.135 0.205 standard 0.95 -524. + # i 3 more variables: iterations , antigen.isos , + # nlm.convergence.code --- diff --git a/tests/testthat/_snaps/print.summary.seroincidence.by.md b/tests/testthat/_snaps/print.summary.seroincidence.by.md index 9efb541a9..e3274f6ad 100644 --- a/tests/testthat/_snaps/print.summary.seroincidence.by.md +++ b/tests/testthat/_snaps/print.summary.seroincidence.by.md @@ -8,11 +8,11 @@ b) Strata : catchment Seroincidence estimates: - # A tibble: 2 x 13 - Stratum catchment n est.start incidence.rate SE CI.lwr CI.upr coverage - - 1 Stratu~ aku 53 0.1 0.140 0.0216 0.104 0.189 0.95 - 2 Stratu~ kgh 47 0.1 0.200 0.0301 0.149 0.268 0.95 - # i 4 more variables: log.lik , iterations , antigen.isos , - # nlm.convergence.code + # A tibble: 2 x 14 + Stratum catchment n est.start incidence.rate SE CI.lwr CI.upr se_type + + 1 Stratum~ aku 53 0.1 0.140 0.0216 0.104 0.189 standa~ + 2 Stratum~ kgh 47 0.1 0.200 0.0301 0.149 0.268 standa~ + # i 5 more variables: coverage , log.lik , iterations , + # antigen.isos , nlm.convergence.code diff --git a/tests/testthat/_snaps/summary.seroincidence.by.md b/tests/testthat/_snaps/summary.seroincidence.by.md index 35ee9d68f..c74c15d8e 100644 --- a/tests/testthat/_snaps/summary.seroincidence.by.md +++ b/tests/testthat/_snaps/summary.seroincidence.by.md @@ -1,24 +1,25 @@ # `summary.seroincidence.by()` produces consistent results - WAoAAAACAAQEAgACAwAAAAMTAAAADQAAABAAAAACAAQACQAAAAlTdHJhdHVtIDEABAAJAAAA + WAoAAAACAAQFAgACAwAAAAMTAAAADgAAABAAAAACAAQACQAAAAlTdHJhdHVtIDEABAAJAAAA CVN0cmF0dW0gMgAAABAAAAACAAQACQAAAANha3UABAAJAAAAA2tnaAAAAA0AAAACAAAANQAA AC8AAAAOAAAAAj+5mZmZmZmaP7mZmZmZmZoAAAAOAAAAAj/B7HG3u31GP8mNl3vEzx8AAAAO AAAAAj+WF3CeJOYiP57H0YBnEdYAAAAOAAAAAj+6gQyGCIncP8MF3OtDSO8AAAAOAAAAAj/I - PfHsmRXgP9EpmJ2X/6MAAAAOAAAAAj/uZmZmZmZmP+5mZmZmZmYAAAAOAAAAAsBw100nJg1N - wG+YQPoLVvAAAAANAAAAAgAAAAQAAAAFAAAAEAAAAAIABAAJAAAAEUhseUVfSWdHK0hseUVf - SWdBAAQACQAAABFIbHlFX0lnRytIbHlFX0lnQQAAAw0AAAACAAAAAQAAAAEAAAQCAAAAAQAE - AAkAAAAGbGV2ZWxzAAAAEAAAAAUABAAJAAAAATEABAAJAAAAATIABAAJAAAAATMABAAJAAAA - ATQABAAJAAAAATUAAAQCAAAAAQAEAAkAAAAFY2xhc3MAAAAQAAAAAgAEAAkAAAAHb3JkZXJl - ZAAEAAkAAAAGZmFjdG9yAAAA/gAABAIAAAABAAQACQAAAAVuYW1lcwAAABAAAAANAAQACQAA - AAdTdHJhdHVtAAQACQAAAAljYXRjaG1lbnQABAAJAAAAAW4ABAAJAAAACWVzdC5zdGFydAAE - AAkAAAAOaW5jaWRlbmNlLnJhdGUABAAJAAAAAlNFAAQACQAAAAZDSS5sd3IABAAJAAAABkNJ - LnVwcgAEAAkAAAAIY292ZXJhZ2UABAAJAAAAB2xvZy5saWsABAAJAAAACml0ZXJhdGlvbnMA - BAAJAAAADGFudGlnZW4uaXNvcwAEAAkAAAAUbmxtLmNvbnZlcmdlbmNlLmNvZGUAAAQCAAAA - AQAEAAkAAAAJcm93Lm5hbWVzAAAADQAAAAIAAAABAAAAAgAABAIAAAL/AAAAEAAAAAQABAAJ - AAAAGHN1bW1hcnkuc2Vyb2luY2lkZW5jZS5ieQAEAAkAAAAGdGJsX2RmAAQACQAAAAN0YmwA - BAAJAAAACmRhdGEuZnJhbWUAAAQCAAAAAQAEAAkAAAALc3RyYXRhX3ZhcnMAAAAQAAAAAQAE - AAkAAAAJY2F0Y2htZW50AAAEAgAAAAEABAAJAAAADGFudGlnZW5faXNvcwAAABAAAAACAAQA - CQAAAAhIbHlFX0lnRwAEAAkAAAAISGx5RV9JZ0EAAAQCAAAAAQAEAAkAAAAGU3RyYXRhAAAA - EAAAAAEABAAJAAAACWNhdGNobWVudAAABAIAAAABAAQACQAAAAlRdWFudGlsZXMAAAAOAAAA - Aj+ZmZmZmZmgP+8zMzMzMzMAAAD+ + PfHsmRXgP9EpmJ2X/6MAAAAQAAAAAgAEAAkAAAAIc3RhbmRhcmQABAAJAAAACHN0YW5kYXJk + AAAADgAAAAI/7mZmZmZmZj/uZmZmZmZmAAAADgAAAALAcNdNJyYNTcBvmED6C1bwAAAADQAA + AAIAAAAEAAAABQAAABAAAAACAAQACQAAABFIbHlFX0lnRytIbHlFX0lnQQAEAAkAAAARSGx5 + RV9JZ0crSGx5RV9JZ0EAAAMNAAAAAgAAAAEAAAABAAAEAgAAAAEABAAJAAAABmxldmVscwAA + ABAAAAAFAAQACQAAAAExAAQACQAAAAEyAAQACQAAAAEzAAQACQAAAAE0AAQACQAAAAE1AAAE + AgAAAAEABAAJAAAABWNsYXNzAAAAEAAAAAIABAAJAAAAB29yZGVyZWQABAAJAAAABmZhY3Rv + cgAAAP4AAAQCAAAAAQAEAAkAAAAFbmFtZXMAAAAQAAAADgAEAAkAAAAHU3RyYXR1bQAEAAkA + AAAJY2F0Y2htZW50AAQACQAAAAFuAAQACQAAAAllc3Quc3RhcnQABAAJAAAADmluY2lkZW5j + ZS5yYXRlAAQACQAAAAJTRQAEAAkAAAAGQ0kubHdyAAQACQAAAAZDSS51cHIABAAJAAAAB3Nl + X3R5cGUABAAJAAAACGNvdmVyYWdlAAQACQAAAAdsb2cubGlrAAQACQAAAAppdGVyYXRpb25z + AAQACQAAAAxhbnRpZ2VuLmlzb3MABAAJAAAAFG5sbS5jb252ZXJnZW5jZS5jb2RlAAAEAgAA + AAEABAAJAAAACXJvdy5uYW1lcwAAAA0AAAACAAAAAQAAAAIAAAQCAAAC/wAAABAAAAAEAAQA + CQAAABhzdW1tYXJ5LnNlcm9pbmNpZGVuY2UuYnkABAAJAAAABnRibF9kZgAEAAkAAAADdGJs + AAQACQAAAApkYXRhLmZyYW1lAAAEAgAAAAEABAAJAAAAC3N0cmF0YV92YXJzAAAAEAAAAAEA + BAAJAAAACWNhdGNobWVudAAABAIAAAABAAQACQAAAAxhbnRpZ2VuX2lzb3MAAAAQAAAAAgAE + AAkAAAAISGx5RV9JZ0cABAAJAAAACEhseUVfSWdBAAAEAgAAAAEABAAJAAAABlN0cmF0YQAA + ABAAAAABAAQACQAAAAljYXRjaG1lbnQAAAQCAAAAAQAEAAkAAAAJUXVhbnRpbGVzAAAADgAA + AAI/mZmZmZmZoD/vMzMzMzMzAAAA/g== diff --git a/tests/testthat/test-cluster_robust_se.R b/tests/testthat/test-cluster_robust_se.R new file mode 100644 index 000000000..3ca147ed0 --- /dev/null +++ b/tests/testthat/test-cluster_robust_se.R @@ -0,0 +1,147 @@ +test_that("cluster-robust standard errors work correctly", { + # Test with typhoid data that has cluster information + withr::local_seed(20241213) + + # Run without clustering + est_no_cluster <- est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA") + ) + + # Run with clustering + est_with_cluster <- est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = "cluster" + ) + + # Both should have same point estimate + expect_equal( + est_no_cluster$estimate, + est_with_cluster$estimate + ) + + # Get summaries + sum_no_cluster <- summary(est_no_cluster, verbose = FALSE) + sum_with_cluster <- summary(est_with_cluster, verbose = FALSE) + + # Point estimates should be the same + expect_equal( + sum_no_cluster$incidence.rate, + sum_with_cluster$incidence.rate + ) + + # Standard errors should generally be different + # (typically larger with clustering) + # We can't test direction reliably, but should exist and be positive + expect_true(sum_no_cluster$SE > 0) + expect_true(sum_with_cluster$SE > 0) + + # Confidence intervals should be valid + expect_true(sum_with_cluster$CI.lwr < sum_with_cluster$incidence.rate) + expect_true(sum_with_cluster$CI.upr > sum_with_cluster$incidence.rate) + + # Check se_type column exists and has correct values + expect_true("se_type" %in% names(sum_no_cluster)) + expect_true("se_type" %in% names(sum_with_cluster)) + expect_equal(sum_no_cluster$se_type, "standard") + expect_equal(sum_with_cluster$se_type, "cluster-robust") +}) + +test_that("cluster_var validation works", { + # Test with invalid cluster_var + expect_error( + est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = "nonexistent_var" + ), + "is not a column" + ) +}) + +test_that("stratum_var validation works", { + # Test with invalid stratum_var + expect_error( + est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + stratum_var = "nonexistent_stratum" + ), + "is not a column" + ) +}) + +test_that("cluster and stratum variables together", { + # Test with both cluster and stratum + withr::local_seed(20241213) + + est_cluster_stratum <- est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = "cluster", + stratum_var = "catchment" + ) + + sum_result <- summary(est_cluster_stratum, verbose = FALSE) + + # Should produce valid results + expect_true(sum_result$SE > 0) + expect_true(sum_result$CI.lwr < sum_result$incidence.rate) + expect_true(sum_result$CI.upr > sum_result$incidence.rate) +}) + +test_that("sampling_weights parameter shows warning", { + # sampling_weights not yet implemented + expect_warning( + est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + sampling_weights = data.frame(cluster = "test", weight = 1) + ), + "not yet implemented" + ) +}) + +test_that("multiple cluster variables work correctly", { + withr::local_seed(20241213) + + # Create test data with multiple clustering levels + test_data <- sees_pop_data_pk_100 + test_data$school <- rep(1:5, length.out = nrow(test_data)) + test_data$classroom <- rep(1:10, length.out = nrow(test_data)) + + # Fit with multiple cluster variables + est_multi <- est_seroincidence( + pop_data = test_data, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = c("school", "classroom") + ) + + # Should succeed + expect_s3_class(est_multi, "seroincidence") + + # Check that cluster_var attribute has both variables + expect_equal(attr(est_multi, "cluster_var"), c("school", "classroom")) + + # Summary should work + sum_multi <- summary(est_multi, verbose = FALSE) + expect_equal(sum_multi$se_type, "cluster-robust") + + # Standard errors should be positive + expect_true(sum_multi$SE > 0) +}) diff --git a/tests/testthat/test-est_seroincidence.R b/tests/testthat/test-est_seroincidence.R index 3b8432e9e..f5f9642b2 100644 --- a/tests/testthat/test-est_seroincidence.R +++ b/tests/testthat/test-est_seroincidence.R @@ -12,6 +12,93 @@ test_that("results are as expected for typhoid data", { }) +test_that("clustering works with est_seroincidence", { + # Test with cluster_var + est_cluster <- est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = "cluster" + ) + + # Should have cluster_var attribute + expect_equal(attr(est_cluster, "cluster_var"), "cluster") + + # Summary should work + sum_cluster <- summary(est_cluster, verbose = FALSE) + expect_true("se_type" %in% names(sum_cluster)) + expect_equal(sum_cluster$se_type, "cluster-robust") + + # Should not have [] in column names + expect_false(any(grepl("\\[", names(sum_cluster)))) +}) + +test_that("clustering with stratum works with est_seroincidence", { + # Test with both cluster_var and stratum_var + est_both <- est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = "cluster", + stratum_var = "catchment" + ) + + # Should have both attributes + expect_equal(attr(est_both, "cluster_var"), "cluster") + expect_equal(attr(est_both, "stratum_var"), "catchment") + + # Summary should work + sum_both <- summary(est_both, verbose = FALSE) + expect_equal(sum_both$se_type, "cluster-robust") + + # Verify functional impact: clustering should affect standard errors + # Compare with non-clustered version + est_no_cluster <- est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA") + ) + + sum_no_cluster <- summary(est_no_cluster, verbose = FALSE) + + # Point estimates should be identical + expect_equal(sum_both$incidence.rate, sum_no_cluster$incidence.rate) + + # Cluster-robust SE should generally be larger due to correlation + # (though in rare cases with negative ICC, it could be smaller) + # At minimum, they should differ when there's actual clustering + expect_false(isTRUE(all.equal(sum_both$SE, sum_no_cluster$SE))) +}) + +test_that("invalid cluster_var causes error", { + expect_error( + est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = "nonexistent_var" + ), + "is not a column" + ) +}) + +test_that("invalid stratum_var causes error", { + expect_error( + est_seroincidence( + pop_data = sees_pop_data_pk_100, + sr_param = typhoid_curves_nostrat_100, + noise_param = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + stratum_var = "nonexistent_stratum" + ), + "is not a column" + ) +}) + test_that( "results are consistent regardless of whether data colnames are standardized.", diff --git a/tests/testthat/test-est_seroincidence_by.R b/tests/testthat/test-est_seroincidence_by.R index d2aa00b1a..0769f965e 100644 --- a/tests/testthat/test-est_seroincidence_by.R +++ b/tests/testthat/test-est_seroincidence_by.R @@ -134,7 +134,7 @@ test_that( regardless of whether using parallel processing or not.", { - ests_1_core <- est_seroincidence_by( + ests_single_core <- est_seroincidence_by( strata = c("catchment"), pop_data = sees_pop_data_pk_100, sr_params = typhoid_curves_nostrat_100, @@ -145,7 +145,7 @@ test_that( num_cores = 1 ) - ests_2_cores <- est_seroincidence_by( + ests_multi_cores <- est_seroincidence_by( strata = c("catchment"), pop_data = sees_pop_data_pk_100_old_names, sr_params = typhoid_curves_nostrat_100, @@ -156,7 +156,7 @@ test_that( num_cores = 2 ) - expect_equal(ests_1_core, ests_2_cores) + expect_equal(ests_single_core, ests_multi_cores) } ) @@ -288,3 +288,110 @@ test_that("deprecate warning is as expected", { ) |> expect_warning() }) + +test_that("clustering works with est_seroincidence_by", { + # Test with cluster_var + est_cluster <- est_seroincidence_by( + strata = "catchment", + pop_data = sees_pop_data_pk_100, + sr_params = typhoid_curves_nostrat_100, + noise_params = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + curve_strata_varnames = NULL, + noise_strata_varnames = NULL, + cluster_var = "cluster", + num_cores = 1 + ) + + # Should be seroincidence.by object + expect_s3_class(est_cluster, "seroincidence.by") + + # Each stratum should have cluster_var attribute + for (stratum_name in names(est_cluster)) { + expect_equal(attr(est_cluster[[stratum_name]], "cluster_var"), "cluster") + } + + # Summary should work and have se_type + sum_cluster <- summary(est_cluster) + expect_true("se_type" %in% names(sum_cluster)) + # sum_cluster has one row per stratum, check all are cluster-robust + expect_true(all(sum_cluster$se_type == "cluster-robust")) +}) + +test_that("clustering with stratum works with est_seroincidence_by", { + # Test with both cluster_var and stratum_var + est_both <- est_seroincidence_by( + strata = "catchment", + pop_data = sees_pop_data_pk_100, + sr_params = typhoid_curves_nostrat_100, + noise_params = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + curve_strata_varnames = NULL, + noise_strata_varnames = NULL, + cluster_var = "cluster", + stratum_var = "catchment", + num_cores = 1 + ) + + # Each stratum should have both attributes + for (stratum_name in names(est_both)) { + expect_equal(attr(est_both[[stratum_name]], "cluster_var"), "cluster") + expect_equal(attr(est_both[[stratum_name]], "stratum_var"), "catchment") + } + + # Verify functional impact: clustering should affect standard errors + # Compare with non-clustered version + est_no_cluster <- est_seroincidence_by( + strata = "catchment", + pop_data = sees_pop_data_pk_100, + sr_params = typhoid_curves_nostrat_100, + noise_params = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + curve_strata_varnames = NULL, + noise_strata_varnames = NULL, + num_cores = 1 + ) + + sum_both <- summary(est_both) + sum_no_cluster <- summary(est_no_cluster) + + # Point estimates should be identical across all strata + expect_equal(sum_both$incidence.rate, sum_no_cluster$incidence.rate) + + # Standard errors should differ when clustering is applied + # (may be larger or smaller depending on within-cluster correlation) + expect_false(isTRUE(all.equal(sum_both$SE, sum_no_cluster$SE))) + + # Verify se_type is correctly set + expect_true(all(sum_both$se_type == "cluster-robust")) + expect_true(all(sum_no_cluster$se_type == "standard")) +}) + +test_that("clustering works with parallel processing", { + skip_on_cran() + skip("Parallel processing with clustering needs investigation") + + # Test with cluster_var and parallel processing + est_cluster_parallel <- est_seroincidence_by( + strata = "catchment", + pop_data = sees_pop_data_pk_100, + sr_params = typhoid_curves_nostrat_100, + noise_params = example_noise_params_pk, + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + curve_strata_varnames = NULL, + noise_strata_varnames = NULL, + cluster_var = "cluster", + num_cores = 2 + ) + + # Should work without errors + expect_s3_class(est_cluster_parallel, "seroincidence.by") + + # Each stratum should have cluster_var attribute + for (stratum_name in names(est_cluster_parallel)) { + expect_equal( + attr(est_cluster_parallel[[stratum_name]], "cluster_var"), + "cluster" + ) + } +}) diff --git a/vignettes/articles/_cluster-robust-se.qmd b/vignettes/articles/_cluster-robust-se.qmd new file mode 100644 index 000000000..f3cb76ce3 --- /dev/null +++ b/vignettes/articles/_cluster-robust-se.qmd @@ -0,0 +1,58 @@ +In many survey designs, observations are clustered (e.g., multiple individuals from the same household, school, or geographic area). Observations within the same cluster are often more similar to each other than to observations from different clusters, violating the independence assumption of standard maximum likelihood estimation. + +### Why clustering matters + +When observations are clustered: + +* Individuals within the same cluster share common exposures or characteristics +* Standard errors that ignore clustering will be **too small** (anti-conservative) +* Confidence intervals will be **too narrow** +* p-values will be **too optimistic** + +### Cluster-robust variance estimation + +To account for within-cluster correlation, `serocalculator` implements the **sandwich estimator** (also known as the Huber-White robust variance estimator): + +$$V_{\text{robust}} = H^{-1} B H^{-1}$$ + +where: + +* $V_{\text{robust}}$ is the cluster-robust variance-covariance matrix for the parameter estimates +* $H$ is the Hessian matrix (matrix of second partial derivatives of the log-likelihood with respect to the parameters, evaluated at the MLE $\hat{\lambda}$) +* $B$ is the "meat" of the sandwich, calculated from cluster-level score contributions: + +$$B = \sum_{c=1}^C U_c U_c^T$$ + +where: + +* $C$ is the total number of clusters in the sample +* $U_c = \sum_{i \in c} \nabla_\lambda \log p(Y_i | \lambda)$ is the score contribution (gradient of log-likelihood with respect to $\lambda$) from all observations in cluster $c$ +* $\nabla_\lambda$ denotes the gradient operator (vector of partial derivatives with respect to the parameter $\lambda$) + +### Implementation in serocalculator + +Users can specify clustering using the `cluster_var` parameter: + +```{r, eval=FALSE} +# Single-level clustering (e.g., by household) +est <- est_seroincidence( + pop_data = data, + cluster_var = "household_id", + ... +) + +# Multi-level clustering (e.g., schools within districts) +est <- est_seroincidence( + pop_data = data, + cluster_var = c("district_id", "school_id"), + ... +) +``` + +When cluster-robust standard errors are used, the `summary()` output indicates this with `se_type = "cluster-robust"`. + +### Effect on results + +* **Point estimates** (incidence rates) remain unchanged +* **Standard errors** typically increase by 5-15% to reflect within-cluster correlation +* **Confidence intervals** appropriately widen to account for reduced effective sample size diff --git a/vignettes/articles/enteric_fever_example.Rmd b/vignettes/articles/enteric_fever_example.Rmd index 16746e4b3..0485267c6 100644 --- a/vignettes/articles/enteric_fever_example.Rmd +++ b/vignettes/articles/enteric_fever_example.Rmd @@ -330,6 +330,89 @@ This produces a table with one row for each pairwise comparison, showing: We can use this to identify which differences are statistically significant (typically using p < 0.05 as a threshold). +## Accounting for Clustered Sampling Designs + +In many survey designs, observations are clustered (e.g., multiple individuals from the same household, geographic area, or school). When observations within clusters are more similar to each other than to observations from different clusters, standard errors that ignore clustering will be too small, leading to overconfident inference. + +The `serocalculator` package supports cluster-robust standard error estimation through the `cluster_var` parameter. + +### Example: Household Clustering + +The SEES data includes cluster identifiers (`cluster` for geographic clusters, `catchment` for health facility catchment areas, and `Country`). These variables are nested: clusters are nested within catchments, which are nested within countries. We can use these to account for within-cluster correlation: + +```{r clustering_example} +# Account for geographic clustering +# Clusters in the SEES data represent geographic areas +# Use Pakistan data for this example +est_with_clustering <- est_seroincidence( + pop_data = xs_data |> filter(Country == "Pakistan"), + sr_params = curves, + noise_params = noise |> filter(Country == "Pakistan"), + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = "cluster" # Geographic cluster identifier +) + +# The summary will show cluster-robust standard errors +summary(est_with_clustering) + +# Compare with non-robust (standard) analysis +est_no_clustering <- est_seroincidence( + pop_data = xs_data |> filter(Country == "Pakistan"), + sr_params = curves, + noise_params = noise |> filter(Country == "Pakistan"), + antigen_isos = c("HlyE_IgG", "HlyE_IgA") + # No cluster_var specified - uses standard errors +) + +summary(est_no_clustering) +``` + +Notice that the point estimates (incidence rates) are identical between the two analyses, but the standard errors are larger with cluster-robust estimation. This reflects the reduced effective sample size due to within-cluster correlation. + +The mathematical details of cluster-robust standard errors are explained in the [Cluster-robust standard errors section](../methodology.html#cluster-robust-standard-errors-for-clustered-sampling-designs) of the methodology article. + +### Clustering with Stratified Analysis + +Clustering can also be combined with stratified analysis using `est_seroincidence_by()`: + +```{r clustering_stratified} +# Stratified analysis by catchment with cluster adjustment +# Use Pakistan data for this example +est_catchment_clustered <- est_seroincidence_by( + pop_data = xs_data |> filter(Country == "Pakistan"), + strata = "catchment", + sr_params = curves, + noise_params = noise |> filter(Country == "Pakistan"), + antigen_isos = c("HlyE_IgG", "HlyE_IgA"), + cluster_var = "cluster" # Account for clustering within each stratum +) + +summary(est_catchment_clustered) + +# Compare with stratified analysis without clustering +est_catchment_no_clustering <- est_seroincidence_by( + pop_data = xs_data |> filter(Country == "Pakistan"), + strata = "catchment", + sr_params = curves, + noise_params = noise |> filter(Country == "Pakistan"), + antigen_isos = c("HlyE_IgG", "HlyE_IgA") + # No cluster_var - standard errors +) + +summary(est_catchment_no_clustering) +``` + +Within each stratum (catchment), the cluster-robust standard errors are larger than the standard errors, appropriately accounting for geographic clustering within each catchment area. + +### Understanding the Impact + +When cluster-robust standard errors are used: + +- **Point estimates** (incidence rates) remain unchanged +- **Standard errors** typically increase by 5-15% to reflect within-cluster correlation +- The `summary()` output includes a `se_type` column indicating "cluster-robust" vs "standard" +- Confidence intervals appropriately widen to account for reduced effective sample size + ```{r include=FALSE} #Calculate output values diff --git a/vignettes/methodology.qmd b/vignettes/methodology.qmd index f2e92f27c..51015798c 100644 --- a/vignettes/methodology.qmd +++ b/vignettes/methodology.qmd @@ -347,6 +347,10 @@ to find the MLE and corresponding standard error. ::: +## Cluster-robust standard errors for clustered sampling designs + +{{< include articles/_cluster-robust-se.qmd >}} + {{< include articles/_antibody-response-model.qmd >}}