replace estimate_extraction.R with faster version by jimstigler · Pull Request #8 · coursekata/coursekata-r

jimstigler · 2025-12-14T18:48:06Z

Performance: Add fast paths for f(), pre(), and p() extraction functions

Problem

The f(), pre(), and p() functions are extremely slow when used in bootstrapping loops because they always build a full supernova() ANOVA table, even when the user only needs the overall model statistic.

Benchmark

library(coursekata)

# Current implementation: ~15 seconds
system.time({
  sdof <- do(1000) * f(shuffle(Thumb) ~ Gender, data = Fingers)
})
#>    user  system elapsed 
#>  14.892   0.156  15.089

# Direct computation: ~3 seconds
system.time({
  sdof <- do(1000) * {
    fit <- lm(shuffle(Thumb) ~ Gender, data = Fingers)
    summary(fit)$fstatistic[1]
  }
})
#>    user  system elapsed 
#>   2.847   0.012   2.867

This represents a ~5x slowdown that significantly impacts the classroom experience when students are running bootstrap simulations.

Root Cause

The extraction functions always call extract_stat(), which internally builds a complete supernova() table:

extract_stat <- function(fit, type, stat, predictor = character(0)) {
  sup_out <- supernova(fit, type)  # Always builds full ANOVA table
  # ...
}

Building the supernova table involves:

Fitting the null model
Computing SS decomposition
For Type III: single-term deletions via drop1()
Calculating PRE, MS, F, and p for ALL terms

This is massive overkill when the user only needs the overall model F statistic.

Solution

Add fast paths that bypass supernova() for the most common use case (extracting overall model statistics with default arguments):

f <- function(object, data = NULL, all = FALSE, predictor = character(), type = 3) {
  # ... validation code ...
  
  # Fast path: overall F statistic without building full supernova table
  if (!all && is_empty(predictor)) {
    fstat <- summary(fit)$fstatistic
    if (is.null(fstat)) return(NA_real_)
    return(unname(fstat[1]))
  }

  # Full path: need supernova table for predictor-specific or all statistics
  stats <- extract_stat(fit, type, "F", predictor)
  if (all || !is_empty(predictor)) stats else stats[[1]]
}

Similar optimizations for:

pre() → uses summary(fit)$r.squared
p() → computes directly via pf(F, df1, df2, lower.tail = FALSE)

Changes

Modified: R/estimate_extraction.R

Added pf to @importFrom stats
f(): Fast path using summary(fit)$fstatistic[1]
pre(): Fast path using summary(fit)$r.squared
p(): Fast path computing p-value directly from F statistic

Backward Compatibility

✅ Fully backward compatible. The fast path only activates when:

all = FALSE (the default)
predictor is not specified (the default)

All other parameter combinations continue to use the full supernova() path.

Tests

test_that("f() fast path returns same value as full supernova", {
  fit <- lm(Thumb ~ Height, data = Fingers)
  fast_result <- summary(fit)$fstatistic[1]
  slow_result <- supernova(fit)$tbl$F[1]
  expect_equal(unname(fast_result), slow_result)
})

test_that("pre() fast path returns same value as full supernova", {
  fit <- lm(Thumb ~ Height, data = Fingers)
  fast_result <- summary(fit)$r.squared
  slow_result <- supernova(fit)$tbl$PRE[1]
  expect_equal(fast_result, slow_result)
})

test_that("p() fast path returns same value as full supernova", {
  fit <- lm(Thumb ~ Height, data = Fingers)
  fstat <- summary(fit)$fstatistic
  fast_result <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
  slow_result <- supernova(fit)$tbl$p[1]
  expect_equal(unname(fast_result), slow_result)
})

test_that("f() is fast enough for bootstrapping", {
  skip_on_cran()
  time <- system.time({
    sdof <- do(1000) * f(shuffle(Thumb) ~ Gender, data = Fingers)
  })
  expect_lt(time["elapsed"], 5)  # Should complete in under 5 seconds
})

test_that("f() with predictor argument still uses supernova path", {
  fit <- lm(Thumb ~ Height + RaceEthnic, data = Fingers)
  result <- f(fit, predictor = "Height")
  expect_true(is.numeric(result))
})

test_that("f() with all=TRUE still uses supernova path", {
  fit <- lm(Thumb ~ Height, data = Fingers)
  result <- f(fit, all = TRUE)
  expect_true(is.list(result))
  expect_true("f" %in% names(result))
})

Impact

~5x performance improvement for the most common bootstrapping use case
Makes do(1000) * f(...) practical for classroom use
No changes to user-facing API or behavior
No breaking changes

I am trying to speed up the f and pre functions. Claude helped me speed it up by 5x. I'll make a pull request and include a fuller writeup.

replace estimate_extraction.R with faster version

a38674e

I am trying to speed up the f and pre functions. Claude helped me speed it up by 5x. I'll make a pull request and include a fuller writeup.

jimstigler assigned adamblake Dec 14, 2025

jimstigler requested a review from adamblake December 14, 2025 18:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

replace estimate_extraction.R with faster version#8

replace estimate_extraction.R with faster version#8
jimstigler wants to merge 1 commit intomainfrom
jim/faster-extraction

jimstigler commented Dec 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

jimstigler commented Dec 14, 2025

Performance: Add fast paths for f(), pre(), and p() extraction functions

Problem

Benchmark

Root Cause

Solution

Changes

Backward Compatibility

Tests

Impact

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants