Hi, I think we can improve the speed of create_age_groups quite a bit and also remove the dependency on 'utils' package.
If we avoid cut() which is inefficient in creating factors as it goes through unnecessary unique() + match() steps internally.
We already have our cleaned age breaks which are unique and sorted, meaning we can avoid using cut() and directly use .bincode().
.bincode() is basically a low-level factor constructor and also what cut() uses as well.
To get a character vector, all that's needed is to subset our age breaks onto our bin codes.
On the topic of cut() inefficiency, there is a stack thread I opened a while ago: https://stackoverflow.com/questions/76867914/can-cut-be-improved
Proposed function and benchmark:
create_age_groups <- function(x, from = 0, to = 90, by = 5, as_factor = FALSE){
if (!is.numeric(x)) {
cli::cli_abort("{.arg x} must be a {.cls numeric} vector, not a {.cls {class(x)}} vector.")
}
breaks <- seq(from, to, by)
breaks <- sort(unique(breaks))
n_breaks <- length(breaks)
n <- max(n_breaks - 1L, 0L)
bands <- paste0(
breaks[seq_len(n)], "-",
breaks[seq.int(to = n_breaks, length.out = n)] - 1L
)
rightmost_band <- paste0(breaks[n_breaks], "+")
bands[n_breaks] <- rightmost_band
codes <- .bincode(x, breaks = c(breaks, Inf), right = FALSE)
if (as_factor) {
out <- codes
levels(out) <- bands
class(out) <- c("ordered", "factor")
}
else {
out <- bands[codes]
}
out
}
library(bench)
x <- 20
create_age_groups(x)
#> [1] "20-24"
phsmethods::create_age_groups(x)
#> [1] "20-24"
mark(create_age_groups(x),
phsmethods::create_age_groups(x))
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 create_age_groups(x) 98.6µs 103µs 9202. 2.41KB 10.4
#> 2 phsmethods::create_age_groups(x) 155.5µs 168µs 5697. 3.84KB 8.30
mark(create_age_groups(x, as_factor = TRUE),
phsmethods::create_age_groups(x, as_factor = TRUE))
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 create_age_groups(x, as_factor = TR… 101µs 106µs 8998. 2.71KB 12.7
#> 2 phsmethods::create_age_groups(x, as… 153µs 162µs 5868. 3.84KB 8.30
x <- sample(0:100, 10^7, T)
mark(create_age_groups(x),
phsmethods::create_age_groups(x))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 create_age_groups(x) 414ms 416ms 2.40 191MB 2.40
#> 2 phsmethods::create_age_groups(x) 701ms 701ms 1.43 420MB 4.28
mark(create_age_groups(x, as_factor = TRUE),
phsmethods::create_age_groups(x, as_factor = TRUE))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 create_age_groups(x, as_factor = TR… 361ms 361ms 2.77 153MB 2.77
#> 2 phsmethods::create_age_groups(x, as… 654ms 654ms 1.53 343MB 1.53
Created on 2024-09-19 with reprex v2.0.2
This obviously relates to issues #93 and #54, which I think are also worthwhile but as subsequent step.
Hi, I think we can improve the speed of
create_age_groupsquite a bit and also remove the dependency on 'utils' package.If we avoid
cut()which is inefficient in creating factors as it goes through unnecessaryunique()+match()steps internally.We already have our cleaned age breaks which are unique and sorted, meaning we can avoid using
cut()and directly use.bincode()..bincode()is basically a low-level factor constructor and also whatcut()uses as well.To get a character vector, all that's needed is to subset our age breaks onto our bin codes.
On the topic of
cut()inefficiency, there is a stack thread I opened a while ago: https://stackoverflow.com/questions/76867914/can-cut-be-improvedProposed function and benchmark:
Created on 2024-09-19 with reprex v2.0.2
This obviously relates to issues #93 and #54, which I think are also worthwhile but as subsequent step.