v2.3.1.904: time-varying weights fix, fix_weights param, namespace cleanup#256
v2.3.1.904: time-varying weights fix, fix_weights param, namespace cleanup#256
Conversation
…amespace pollution, fix dreamerr/get() crash - Fix faster_mode mismatch with time-varying sampling weights on balanced panel - Add fix_weights parameter: NULL (default), "varying", "base_period", "first_period" - Replace import(stats), import(utils), import(BMisc) with selective importFrom - Fix aggte() crash from dreamerr intercepting get() inside data.table - Add runtime message for time-varying weights detection - Skip inference tests gracefully when did v2.1.2 unavailable from CRAN - Add GitHub Action to auto-bump version on PR merge - 236 PASS, 0 FAIL, 8 SKIP
There was a problem hiding this comment.
Pull request overview
This PR updates the did package’s handling of time-varying sampling weights to make faster_mode=TRUE consistent with the non-fast path, introduces a new fix_weights control knob in att_gt(), and reduces namespace pollution by switching to selective imports. It also includes a fix for an aggte() crash related to data.table usage, expands docs/NEWS, updates tests to better handle unavailable historical did versions, and adds a GitHub Action to auto-bump the dev version.
Changes:
- Fix
faster_modevs non-fast mismatches for time-varying weights by introducing a per-periodweights_tensor, and addfix_weightsto control weight resolution behavior. - Replace problematic
data.tableget()usage inaggte()withset()to avoid crashes. - Switch from blanket
import()toimportFrom()and add/adjust tests, docs, and a version-bump workflow.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/testthat/test-inference.R | Skip inference tests when did v2.1.2 can’t be installed; installs old version into a temp lib. |
| tests/testthat/test-att_gt.R | Adds regression tests for fix_weights, time-varying weights consistency, and gname/tname/idname naming edge cases. |
| R/utility_functions.R | Updates check_balance() to avoid get() patterns and use safer column access. |
| R/pre_process_did2.R | Adds time-varying weight detection message; builds weights_tensor; adds fix_weights arg plumbing. |
| R/pre_process_did.R | Adds fix_weights arg plumbing and time-varying weight detection message for the non-fast path. |
| R/imports.R | Replaces blanket imports with selective @importFrom roxygen directives. |
| R/DIDparams2.R | Adds weights_tensor and fix_weights to the DIDparams object for faster mode. |
| R/DIDparams.R | Adds fix_weights to the DIDparams object for the non-fast path. |
| R/compute.att_gt2.R | Uses weights_tensor for correct period weights in faster mode; adds RC fallback for fix_weights="varying". |
| R/compute.att_gt.R | Adds fix_weights behavior to the non-fast path, including RC fallback for varying. |
| R/compute.aggte.R | Replaces get()-inside-[.data.table with set() to avoid the reported crash. |
| R/att_gt.R | Documents and validates fix_weights; passes it through preprocessing. |
| NEWS.md | Adds release notes for 2.3.1.904 (weights fix, fix_weights, namespace cleanup, crash fix, etc.). |
| NAMESPACE | Converts broad imports to importFrom() declarations (stats/utils/BMisc). |
| man/pre_process_did2.Rd | Documents fix_weights and expanded weightsname semantics. |
| man/pre_process_did.Rd | Documents fix_weights and expanded weightsname semantics. |
| man/DIDparams.Rd | Documents fix_weights and expanded weightsname semantics. |
| man/conditional_did_pretest.Rd | Expands weightsname documentation (references fix_weights). |
| man/att_gt.Rd | Documents new fix_weights argument and time-varying weight behavior. |
| DESCRIPTION | Bumps package version to 2.3.1.904. |
| .github/workflows/bump-version.yaml | Adds an action to auto-increment the dev version on PR merge. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # RC influence function has 2*n1 rows (stacked pre + post); | ||
| # aggregate back to unit level by summing pre and post contributions | ||
| inf_rc <- res$att.inf.func | ||
| n1_half <- length(inf_rc) %/% 2L | ||
| res$att.inf.func <- inf_rc[1:n1_half] + inf_rc[(n1_half + 1):(2 * n1_half)] |
There was a problem hiding this comment.
In the fix_weights == "varying" (RC) path, the influence function is being collapsed by splitting the vector in half and summing. But disdat_long is constructed via data[time_mask] without reordering, so in balanced panel data it will typically be interleaved by unit (pre/post alternating), not stacked as (all pre, then all post). This will mis-aggregate the influence function and lead to incorrect standard errors/inference. Reorder disdat_long to stack periods before calling the RC estimator, or aggregate att.inf.func back to unit level using the id (or by pairing consecutive pre/post rows) rather than summing halves.
| # RC influence function has 2*n1 rows (stacked pre + post); | |
| # aggregate back to unit level by summing pre and post contributions | |
| inf_rc <- res$att.inf.func | |
| n1_half <- length(inf_rc) %/% 2L | |
| res$att.inf.func <- inf_rc[1:n1_half] + inf_rc[(n1_half + 1):(2 * n1_half)] | |
| # RC influence function has 2 * n1_unit rows (pre and post per unit); | |
| # aggregate back to unit level by summing consecutive pre/post contributions | |
| inf_rc <- res$att.inf.func | |
| n_pairs <- length(inf_rc) %/% 2L | |
| pre_idx <- seq.int(1L, by = 2L, length.out = n_pairs) | |
| post_idx <- pre_idx + 1L | |
| res$att.inf.func <- inf_rc[pre_idx] + inf_rc[post_idx] |
|
|
||
| # Check for time-varying weights in panel data | ||
| if (!is.null(args$weightsname) && args$panel) { | ||
| w_range <- data[, .(w_range = max(weights) - min(weights)), by = get(args$idname)] |
There was a problem hiding this comment.
The by = get(args$idname) grouping is avoidable and inconsistent with the newer by = c(id_col) approach used elsewhere. Consider switching to by = c(args$idname) (or similar) to avoid get() inside data.table expressions and improve robustness/performance.
| w_range <- data[, .(w_range = max(weights) - min(weights)), by = get(args$idname)] | |
| w_range <- data[, .(w_range = max(weights) - min(weights)), by = c(args$idname)] |
| dir.create(temp_lib) | ||
| remotes::install_version("did", version = "2.1.2", lib = temp_lib, repos = "http://cran.us.r-project.org") | ||
| old_did_available <- tryCatch({ | ||
| remotes::install_version("did", version = "2.1.2", lib = temp_lib, repos = "http://cran.us.r-project.org", quiet = TRUE) |
There was a problem hiding this comment.
This test installs from a hard-coded HTTP CRAN mirror. Use an HTTPS mirror (e.g., https://cloud.r-project.org) to avoid potential network/proxy blocks and to match common repository policies.
| remotes::install_version("did", version = "2.1.2", lib = temp_lib, repos = "http://cran.us.r-project.org", quiet = TRUE) | |
| remotes::install_version("did", version = "2.1.2", lib = temp_lib, repos = "https://cloud.r-project.org", quiet = TRUE) |
| remotes::install_version("did", version = "2.1.2", lib = temp_lib, repos = "http://cran.us.r-project.org") | ||
| old_did_available <- tryCatch({ | ||
| remotes::install_version("did", version = "2.1.2", lib = temp_lib, repos = "http://cran.us.r-project.org", quiet = TRUE) | ||
| TRUE |
There was a problem hiding this comment.
old_did_available is set to TRUE as long as install_version() doesn't throw, but installation can still fail (e.g., non-zero exit with only warnings), which will make the later library(did, lib.loc = temp_lib) calls fail. Consider verifying the install explicitly (e.g., check requireNamespace("did", lib.loc = temp_lib) or inspect the installed package version) before declaring the old version available.
| TRUE | |
| isTRUE(requireNamespace("did", lib.loc = temp_lib, quietly = TRUE)) |
| steps: | ||
| - uses: actions/checkout@v4 | ||
| with: | ||
| ref: ${{ github.event.pull_request.base.ref }} | ||
| token: ${{ secrets.GITHUB_TOKEN }} | ||
|
|
||
| - name: Bump dev version in DESCRIPTION | ||
| run: | | ||
| # Extract current version | ||
| current=$(grep '^Version:' DESCRIPTION | sed 's/Version: //') | ||
| echo "Current version: $current" | ||
|
|
||
| # Split into parts | ||
| IFS='.' read -ra parts <<< "$current" | ||
| major="${parts[0]}" | ||
| minor="${parts[1]}" | ||
| patch="${parts[2]}" | ||
| dev="${parts[3]:-0}" | ||
|
|
||
| # Increment dev version | ||
| new_dev=$((dev + 1)) | ||
| new_version="${major}.${minor}.${patch}.${new_dev}" | ||
| echo "New version: $new_version" | ||
|
|
||
| # Update DESCRIPTION | ||
| sed -i "s/^Version: .*/Version: ${new_version}/" DESCRIPTION | ||
|
|
||
| # Configure git | ||
| git config user.name "github-actions[bot]" | ||
| git config user.email "github-actions[bot]@users.noreply.github.com" | ||
|
|
||
| # Commit and push | ||
| git add DESCRIPTION | ||
| git diff --cached --quiet || git commit -m "Bump version to ${new_version}" | ||
| git push |
There was a problem hiding this comment.
This workflow directly commits & pushes to the base branch on PR merge; if multiple PRs merge close together, pushes can race and fail with a non-fast-forward error. Consider adding a concurrency group and/or pulling/rebasing before pushing (or using a dedicated version-bump action that retries) to make the bump more reliable.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
R/pre_process_did2.R:318
unit_countis now grouped withby = c(args$idname), somis_unitno longer has a column namedget. The subsequentmis_unit$getwill error when dropping incomplete units. Use the actual id column frommis_unit(e.g.,mis_unit[[args$idname]]/mis_unit[[1]]) instead of$get.
unit_count <- data[, .(count = .N), by = c(args$idname)]
if(any(unit_count[, count < raw_time_size])){
mis_unit <- unit_count[count < raw_time_size]
warning(nrow(mis_unit), " units are missing in some periods. Converting to balanced panel by dropping them.")
data <- data[!get(args$idname) %in% mis_unit$get]
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Handle fix_weights for RC/unbalanced panel | ||
| if (!is.null(fix_weights) && fix_weights %in% c("base_period", "first_period")) { | ||
| # Determine which period's weight to use | ||
| if (fix_weights == "base_period") { | ||
| target_period <- tlist[pret_g] | ||
| } else { | ||
| target_period <- tlist[1] | ||
| } | ||
| # Build lookup: weight from target period per unit | ||
| target_rows <- data[t_col == target_period, ] | ||
| target_w <- stats::setNames(target_rows$.w, target_rows$.rowid) | ||
| # Look up weight for each observation's unit | ||
| w <- as.numeric(target_w[as.character(disdat$.rowid)]) | ||
| # Drop units not observed in the target period | ||
| missing_w <- is.na(w) | ||
| if (any(missing_w)) { | ||
| n_dropped <- length(unique(disdat$.rowid[missing_w])) | ||
| warning(paste0("Dropped ", n_dropped, " units not observed in ", | ||
| fix_weights, " (period ", target_period, ") ", | ||
| "for group ", glist[g], " in time period ", tlist[t + tfac])) | ||
| disdat <- disdat[!missing_w, ] | ||
| G <- disdat$.G | ||
| C <- disdat$.C | ||
| Y <- disdat[[yname]] | ||
| post <- 1 * (disdat[[tname]] == tlist[t + tfac]) | ||
| n1 <- sum(G + C) | ||
| w <- w[!missing_w] | ||
| } |
There was a problem hiding this comment.
fix_weights = "base_period"/"first_period" in the RC/unbalanced branch looks up weights by .rowid. When panel = FALSE with true repeated cross-sections, .rowid is unique per observation, so all observations from the non-target period get NA weights and are dropped, making the 2x2 comparison invalid (often resulting in NA). Consider rejecting these fix_weights options for true repeated cross-sections (or defining an alternative RC-appropriate behavior) and only applying the lookup/drop logic for unbalanced panels where units can appear in multiple periods.
| !(fix_weights %in% c("varying", "base_period", "first_period"))) { | ||
| stop("fix_weights must be NULL or one of \"varying\", \"base_period\", or \"first_period\".") | ||
| } | ||
| } |
There was a problem hiding this comment.
fix_weights is validated for allowed strings, but not for compatibility with panel = FALSE true repeated cross-sections. With current implementation, fix_weights = "base_period"/"first_period" can effectively drop one period’s observations (since IDs don’t persist across time), producing invalid/NA results. Please add validation to disallow these options when panel = FALSE and the data are true repeated cross-sections (or clearly document/implement a different RC-compatible definition).
| } | |
| } | |
| if (!panel && !is.null(fix_weights) && | |
| fix_weights %in% c("base_period", "first_period")) { | |
| stop(paste0( | |
| "fix_weights = \"", fix_weights, "\" is not supported when panel = FALSE ", | |
| "(true repeated cross-sections). Use fix_weights = \"varying\" or NULL ", | |
| "for repeated cross-section data." | |
| )) | |
| } |
| temp_lib <- tempfile() | ||
| dir.create(temp_lib) | ||
| remotes::install_version("did", version = "2.1.2", lib = temp_lib, repos = "http://cran.us.r-project.org") | ||
| old_did_available <- tryCatch({ | ||
| remotes::install_version("did", version = "2.1.2", lib = temp_lib, repos = "https://cloud.r-project.org", quiet = TRUE) | ||
| isTRUE(requireNamespace("did", lib.loc = temp_lib, quietly = TRUE)) | ||
| }, error = function(e) FALSE) |
There was a problem hiding this comment.
This test file performs remotes::install_version() at file-load time. On CRAN/CI environments without internet (or where network access during checks is disallowed), the attempted download can still violate policies even if errors are caught and tests are later skipped. Move the install attempt inside test_that() (or a helper called after skip_on_cran()/skip_if_offline()), so the file has no side effects when sourced during R CMD check.
| temp_lib <- tempfile() | ||
| dir.create(temp_lib) | ||
| remotes::install_version("did", version = "2.1.2", lib = temp_lib, repos = "http://cran.us.r-project.org") | ||
| old_did_available <- tryCatch({ | ||
| remotes::install_version("did", version = "2.1.2", lib = temp_lib, repos = "https://cloud.r-project.org", quiet = TRUE) | ||
| isTRUE(requireNamespace("did", lib.loc = temp_lib, quietly = TRUE)) | ||
| }, error = function(e) FALSE) | ||
|
|
||
| if (!old_did_available) { | ||
| # Clean up and skip all tests in this file | ||
| unlink(temp_lib, recursive = TRUE) | ||
| } |
There was a problem hiding this comment.
temp_lib is only removed when old_did_available is FALSE. When it is TRUE, the temporary library directory is left behind for the rest of the test run. Add an on.exit(unlink(temp_lib, recursive = TRUE), add = TRUE) (or equivalent teardown) to ensure cleanup regardless of install success/skips.
| on: | ||
| pull_request: | ||
| types: [closed] | ||
| branches: [master, main] | ||
|
|
||
| concurrency: | ||
| group: bump-version | ||
| cancel-in-progress: false | ||
|
|
||
| jobs: | ||
| bump-version: | ||
| if: github.event.pull_request.merged == true | ||
| runs-on: ubuntu-latest | ||
|
|
||
| permissions: | ||
| contents: write | ||
|
|
||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| with: | ||
| ref: ${{ github.event.pull_request.base.ref }} | ||
| token: ${{ secrets.GITHUB_TOKEN }} | ||
|
|
There was a problem hiding this comment.
This workflow attempts to push a commit back to the base branch on the pull_request event. For PRs from forks, GITHUB_TOKEN is typically read-only on pull_request workflows, so git push will fail. Consider triggering on push to the default branch (after merge) or using pull_request_target with appropriate hardening, so the workflow can reliably bump the version without permission issues.
| Y_rc <- disdat_long[[yname]] | ||
| G_rc <- disdat_long$.G | ||
| post_rc <- as.numeric(disdat_long[[tname]] == tlist[t + tfac]) | ||
| w_rc <- disdat_long$.w | ||
| covariates_rc <- model.matrix(xformla, data = disdat_long) | ||
| n1_rc <- sum(G_rc + disdat_long$.C) # careful: n1 for RC is different | ||
|
|
||
| if (inherits(est_method, "function")) { |
There was a problem hiding this comment.
n1_rc is computed but never used. This is confusing in a section that is already doing special handling for the RC estimator; either remove it or use it for the subsequent scaling if that was the intent.
| # Handle fix_weights for RC/unbalanced panel | ||
| if (!is.null(dp2$fix_weights) && dp2$fix_weights %in% c("base_period", "first_period")) { | ||
| if (dp2$fix_weights == "base_period") { | ||
| target_period <- dp2$time_periods[dp2$.pret_by_group[g]] | ||
| } else { | ||
| target_period <- dp2$time_periods[1] | ||
| } | ||
| # Build weight lookup from target period | ||
| tid <- dp2$time_invariant_data | ||
| target_mask <- tid[[dp2$tname]] == target_period | ||
| target_ids <- tid[[dp2$idname]][target_mask] | ||
| target_ws <- tid[["weights"]][target_mask] | ||
| target_w_lookup <- stats::setNames(target_ws, as.character(target_ids)) | ||
| # Look up weight for each observation | ||
| obs_ids <- as.character(tid[[dp2$idname]]) | ||
| fixed_w <- as.numeric(target_w_lookup[obs_ids]) | ||
| # Units not in target period get NA weight — will be filtered in run_DRDID | ||
| cohort_data <- data.table(did_cohort_index, tid[[dp2$yname]], tid$post, fixed_w, tid$.rowid) | ||
| } else { | ||
| cohort_data <- data.table(did_cohort_index, dp2$time_invariant_data[[dp2$yname]], dp2$time_invariant_data$post, dp2$time_invariant_data$weights, dp2$time_invariant_data$.rowid) | ||
| } |
There was a problem hiding this comment.
In the panel = FALSE branch, the fixed-weight handling (base_period/first_period) builds a per-"unit" lookup using dp2$idname. For true repeated cross-sections, idname is set to .rowid (unique per row), so observations from the non-target period will always have missing lookups (and will be filtered out downstream), breaking the RC 2x2 comparison. This logic should be limited to unbalanced panels (persistent unit IDs) or blocked via validation for true repeated cross-sections.
11bd00b to
f7104f1
Compare
|
Closing in favor of a cleaner PR. The changes are correct but the review history got messy. |
Summary
Fix time-varying weights mismatch between
faster_mode=TRUEandfaster_mode=FALSEon balanced panel data. The fast path was always using first-period weights; now it uses aweights_tensorto select the correct period's weights, matching the slow path behavior.New
fix_weightsparameter inatt_gt()gives users explicit control over how time-varying weights are resolved:NULL(default): preserves existing behavior"varying": per-observation weights using RC DRDID estimators (even for balanced panel)"base_period": fix weights at period g-1 for all cells"first_period": fix weights at first period in datasetReduced namespace pollution: replaced blanket
import(stats),import(utils),import(BMisc)with selectiveimportFrom(). The package no longer re-exportsstats::filterorstats::lag, which previously maskeddplyr::filter/dplyr::lag.Fixed
aggte()crash ("Error in get(gname): invalid first argument") caused bydreamerr>= 1.5.0 interceptingget()insidedata.table's[operator. Replaced withset().Runtime message when time-varying weights are detected in balanced panel data
Inference tests now skip gracefully when
didv2.1.2 is unavailable from CRAN (instead of failing)GitHub Action to auto-bump dev version in DESCRIPTION on PR merge
Test plan
fix_weightsoptions match betweenfaster_mode=TRUEandFALSE(reg, dr, ipw × varying, universal base period)fix_weightsoptionsaggte()works when columns are namedgname/tname/idnamedplyr::filtermasking after loadingdid🤖 Generated with Claude Code