Skip to content

Comments

replace estimate_extraction.R with faster version#8

Open
jimstigler wants to merge 1 commit intomainfrom
jim/faster-extraction
Open

replace estimate_extraction.R with faster version#8
jimstigler wants to merge 1 commit intomainfrom
jim/faster-extraction

Conversation

@jimstigler
Copy link

Performance: Add fast paths for f(), pre(), and p() extraction functions

Problem

The f(), pre(), and p() functions are extremely slow when used in bootstrapping loops because they always build a full supernova() ANOVA table, even when the user only needs the overall model statistic.

Benchmark

library(coursekata)

# Current implementation: ~15 seconds
system.time({
  sdof <- do(1000) * f(shuffle(Thumb) ~ Gender, data = Fingers)
})
#>    user  system elapsed 
#>  14.892   0.156  15.089

# Direct computation: ~3 seconds
system.time({
  sdof <- do(1000) * {
    fit <- lm(shuffle(Thumb) ~ Gender, data = Fingers)
    summary(fit)$fstatistic[1]
  }
})
#>    user  system elapsed 
#>   2.847   0.012   2.867

This represents a ~5x slowdown that significantly impacts the classroom experience when students are running bootstrap simulations.

Root Cause

The extraction functions always call extract_stat(), which internally builds a complete supernova() table:

extract_stat <- function(fit, type, stat, predictor = character(0)) {
  sup_out <- supernova(fit, type)  # Always builds full ANOVA table
  # ...
}

Building the supernova table involves:

  • Fitting the null model
  • Computing SS decomposition
  • For Type III: single-term deletions via drop1()
  • Calculating PRE, MS, F, and p for ALL terms

This is massive overkill when the user only needs the overall model F statistic.

Solution

Add fast paths that bypass supernova() for the most common use case (extracting overall model statistics with default arguments):

f <- function(object, data = NULL, all = FALSE, predictor = character(), type = 3) {
  # ... validation code ...
  
  # Fast path: overall F statistic without building full supernova table
  if (!all && is_empty(predictor)) {
    fstat <- summary(fit)$fstatistic
    if (is.null(fstat)) return(NA_real_)
    return(unname(fstat[1]))
  }

  # Full path: need supernova table for predictor-specific or all statistics
  stats <- extract_stat(fit, type, "F", predictor)
  if (all || !is_empty(predictor)) stats else stats[[1]]
}

Similar optimizations for:

  • pre() → uses summary(fit)$r.squared
  • p() → computes directly via pf(F, df1, df2, lower.tail = FALSE)

Changes

Modified: R/estimate_extraction.R

  • Added pf to @importFrom stats
  • f(): Fast path using summary(fit)$fstatistic[1]
  • pre(): Fast path using summary(fit)$r.squared
  • p(): Fast path computing p-value directly from F statistic

Backward Compatibility

Fully backward compatible. The fast path only activates when:

  • all = FALSE (the default)
  • predictor is not specified (the default)

All other parameter combinations continue to use the full supernova() path.

Tests

test_that("f() fast path returns same value as full supernova", {
  fit <- lm(Thumb ~ Height, data = Fingers)
  fast_result <- summary(fit)$fstatistic[1]
  slow_result <- supernova(fit)$tbl$F[1]
  expect_equal(unname(fast_result), slow_result)
})

test_that("pre() fast path returns same value as full supernova", {
  fit <- lm(Thumb ~ Height, data = Fingers)
  fast_result <- summary(fit)$r.squared
  slow_result <- supernova(fit)$tbl$PRE[1]
  expect_equal(fast_result, slow_result)
})

test_that("p() fast path returns same value as full supernova", {
  fit <- lm(Thumb ~ Height, data = Fingers)
  fstat <- summary(fit)$fstatistic
  fast_result <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
  slow_result <- supernova(fit)$tbl$p[1]
  expect_equal(unname(fast_result), slow_result)
})

test_that("f() is fast enough for bootstrapping", {
  skip_on_cran()
  time <- system.time({
    sdof <- do(1000) * f(shuffle(Thumb) ~ Gender, data = Fingers)
  })
  expect_lt(time["elapsed"], 5)  # Should complete in under 5 seconds
})

test_that("f() with predictor argument still uses supernova path", {
  fit <- lm(Thumb ~ Height + RaceEthnic, data = Fingers)
  result <- f(fit, predictor = "Height")
  expect_true(is.numeric(result))
})

test_that("f() with all=TRUE still uses supernova path", {
  fit <- lm(Thumb ~ Height, data = Fingers)
  result <- f(fit, all = TRUE)
  expect_true(is.list(result))
  expect_true("f" %in% names(result))
})

Impact

  • ~5x performance improvement for the most common bootstrapping use case
  • Makes do(1000) * f(...) practical for classroom use
  • No changes to user-facing API or behavior
  • No breaking changes

I am trying to speed up the f and pre functions. Claude helped me speed it up by 5x. I'll make a pull request and include a fuller writeup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants