Skip to content

Improve create_age_groups #135

@Nic-Chr

Description

@Nic-Chr

Hi, I think we can improve the speed of create_age_groups quite a bit and also remove the dependency on 'utils' package.

If we avoid cut() which is inefficient in creating factors as it goes through unnecessary unique() + match() steps internally.
We already have our cleaned age breaks which are unique and sorted, meaning we can avoid using cut() and directly use .bincode().
.bincode() is basically a low-level factor constructor and also what cut() uses as well.
To get a character vector, all that's needed is to subset our age breaks onto our bin codes.

On the topic of cut() inefficiency, there is a stack thread I opened a while ago: https://stackoverflow.com/questions/76867914/can-cut-be-improved

Proposed function and benchmark:

create_age_groups <- function(x, from = 0, to = 90, by = 5, as_factor = FALSE){
  
  if (!is.numeric(x)) {
    cli::cli_abort("{.arg x} must be a {.cls numeric} vector, not a {.cls {class(x)}} vector.")
  }
  
  breaks <- seq(from, to, by)
  breaks <- sort(unique(breaks))
  n_breaks <- length(breaks)
  n <- max(n_breaks - 1L, 0L)
  
  bands <- paste0(
    breaks[seq_len(n)], "-", 
    breaks[seq.int(to = n_breaks, length.out = n)] - 1L
  )
  
  rightmost_band <- paste0(breaks[n_breaks], "+")
  bands[n_breaks] <- rightmost_band
  
  codes <- .bincode(x, breaks = c(breaks, Inf), right = FALSE)
  
  if (as_factor) {
    out <- codes
    levels(out) <- bands
    class(out) <- c("ordered", "factor")
  }
  else {
    out <- bands[codes]
  }
  out
}
library(bench)

x <- 20
create_age_groups(x)
#> [1] "20-24"
phsmethods::create_age_groups(x)
#> [1] "20-24"

mark(create_age_groups(x), 
     phsmethods::create_age_groups(x))
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x)               98.6µs   103µs     9202.    2.41KB    10.4 
#> 2 phsmethods::create_age_groups(x)  155.5µs   168µs     5697.    3.84KB     8.30
mark(create_age_groups(x, as_factor = TRUE), 
     phsmethods::create_age_groups(x, as_factor = TRUE))
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x, as_factor = TR… 101µs  106µs     8998.    2.71KB    12.7 
#> 2 phsmethods::create_age_groups(x, as… 153µs  162µs     5868.    3.84KB     8.30

x <- sample(0:100, 10^7, T)
mark(create_age_groups(x), 
     phsmethods::create_age_groups(x))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x)                414ms   416ms      2.40     191MB     2.40
#> 2 phsmethods::create_age_groups(x)    701ms   701ms      1.43     420MB     4.28
mark(create_age_groups(x, as_factor = TRUE), 
     phsmethods::create_age_groups(x, as_factor = TRUE))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x, as_factor = TR… 361ms  361ms      2.77     153MB     2.77
#> 2 phsmethods::create_age_groups(x, as… 654ms  654ms      1.53     343MB     1.53

Created on 2024-09-19 with reprex v2.0.2

This obviously relates to issues #93 and #54, which I think are also worthwhile but as subsequent step.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions