Commit 647b6db

allow duckdb to choose memory limit, drop memuse dependency (#167)
* drop memuse dependency, allow duckdb to detect available RAM by default
1 parent 1116635 commit 647b6db

12 files changed: +196 -124 lines changed
DESCRIPTION

Lines changed: 0 additions & 1 deletion
@@ -41,7 +41,6 @@ Imports:
     lifecycle,
     lubridate,
     memoise,
-    memuse,
     openssl,
     parallelly,
     paws.storage (>= 0.4.0),

NEWS.md

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,8 @@

 * Metadata fetched by `spod_available_data()` has extra columns such as data `type`, `zones` and `period`; see help `?spod_available_data()` for details.

+* Memory allocation is now delegated to the `DuckDB` engine, which by default uses 80% of available RAM. Beware that in some HPC environments it may detect more memory than is actually available to your job, so set the limit manually to 80% of the RAM available to your job with the `max_mem_gb` argument of the `spod_get()`, `spod_convert()` and `spod_connect()` functions. Delegating allocation will also improve performance in some cases, as DuckDB is more efficient than R at memory allocation (PR [#167](https://github.com/rOpenSpain/spanishoddata/pull/167)).
+
 ## Bug fixes

 * More reliable, but still multi-threaded, data file downloads using base R `utils::download.file()` instead of `curl::multi_download()`, which failed on some connections (issue [#127](https://github.com/rOpenSpain/spanishoddata/issues/127)), so the `curl` dependency is no longer required. PR [#165](https://github.com/rOpenSpain/spanishoddata/pull/165).

R/connect.R

Lines changed: 2 additions & 2 deletions
@@ -39,15 +39,15 @@ spod_connect <- function(
   data_path,
   target_table_name = NULL,
   quiet = FALSE,
-  max_mem_gb = max(4, spod_available_ram() - 4),
+  max_mem_gb = NULL,
   max_n_cpu = max(1, parallelly::availableCores() - 1),
   temp_path = spod_get_temp_dir()
 ) {
   # Validate inputs
   checkmate::assert_access(data_path, access = 'r')
   checkmate::assert_character(target_table_name, null.ok = TRUE)
   checkmate::assert_flag(quiet)
-  checkmate::assert_number(max_mem_gb, lower = 1)
+  checkmate::assert_number(max_mem_gb, lower = 1, null.ok = TRUE)
   checkmate::assert_integerish(max_n_cpu, lower = 1)
   checkmate::assert_directory_exists(temp_path, access = "rw")
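
To show what the relaxed validation accepts, here is a minimal sketch using `checkmate::test_number()`, the non-throwing counterpart of the `assert_number()` call in the hunk above. The wrapper name `is_valid_mem_limit()` is hypothetical, and the sketch assumes the `checkmate` package is installed.

```r
# Hypothetical wrapper around the same check as in the diff, but returning
# TRUE/FALSE instead of raising an error. Requires the checkmate package.
is_valid_mem_limit <- function(x) {
  checkmate::test_number(x, lower = 1, null.ok = TRUE)
}

is_valid_mem_limit(NULL)   # TRUE: NULL is now allowed (DuckDB decides)
is_valid_mem_limit(8)      # TRUE: a single number >= 1
is_valid_mem_limit(0.5)    # FALSE: below the lower bound of 1
is_valid_mem_limit("8")    # FALSE: a string, not a number
```

Note that the check accepts any single number of at least 1, so non-integer values such as `2.5` also pass.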

R/convert.R

Lines changed: 173 additions & 77 deletions
Large diffs are not rendered by default.

R/duckdb-helpers.R

Lines changed: 9 additions & 7 deletions
@@ -628,21 +628,23 @@ spod_sql_where_dates <- function(dates) {

 #' Set maximum memory and number of threads for a `DuckDB` connection
 #' @param con A `duckdb` connection
-#' @param max_mem_gb The maximum memory to use in GB. A conservative default is 3 GB, which should be enough for resaving the data to `DuckDB` from a folder of CSV.gz files while being small enough to fit in memory of most even old computers. For data analysis using the already converted data (in `DuckDB` or Parquet format) or with the raw CSV.gz data, it is recommended to increase it according to available resources.
+#' @param max_mem_gb `integer` value of the maximum operating memory to use in GB. `NULL` by default, which delegates the choice to the `DuckDB` engine, which usually sets it to 80% of available memory. Caution: in HPC use, the amount of memory available to your job may be determined incorrectly by the `DuckDB` engine, so it is recommended to set this parameter explicitly according to your job's memory limits.
 #' @param max_n_cpu The maximum number of threads to use. Defaults to the number of available cores minus 1.
 #' @return A `duckdb` connection.
 #' @keywords internal
 spod_duckdb_limit_resources <- function(
   con,
-  max_mem_gb = max(4, spod_available_ram() - 4),
+  max_mem_gb = NULL,
   max_n_cpu = max(1, parallelly::availableCores() - 1)
 ) {
-  DBI::dbExecute(
-    con,
-    dplyr::sql(
-      glue::glue("SET max_memory='{max_mem_gb}GB';")
+  if (!is.null(max_mem_gb)) {
+    DBI::dbExecute(
+      con,
+      dplyr::sql(
+        glue::glue("SET max_memory='{max_mem_gb}GB';")
+      )
     )
-  )
+  }

   DBI::dbExecute(
     con,
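
The conditional logic in this hunk can be tried in isolation with the sketch below. `build_max_memory_sql()` is a hypothetical helper, not a function from the package: it returns the `SET max_memory` statement only when a limit is supplied, using base R `sprintf()` in place of the `glue` call in the diff. The optional part at the end assumes the `DBI` and `duckdb` packages are installed and is skipped otherwise.

```r
# Hypothetical helper mirroring the `if (!is.null(max_mem_gb))` branch:
# build the statement only when the caller supplies a limit.
build_max_memory_sql <- function(max_mem_gb = NULL) {
  if (is.null(max_mem_gb)) {
    return(NULL)  # nothing to execute; DuckDB keeps its own default
  }
  sprintf("SET max_memory='%sGB';", format(max_mem_gb, scientific = FALSE))
}

print(build_max_memory_sql())   # NULL
print(build_max_memory_sql(8))  # "SET max_memory='8GB';"

# Optional: apply the statement to an in-memory DuckDB database and read
# the resulting setting back (runs only if DBI and duckdb are available).
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("duckdb", quietly = TRUE)) {
  con <- DBI::dbConnect(duckdb::duckdb())
  stmt <- build_max_memory_sql(2)
  if (!is.null(stmt)) {
    DBI::dbExecute(con, stmt)
  }
  print(DBI::dbGetQuery(con, "SELECT current_setting('max_memory') AS max_memory;"))
  DBI::dbDisconnect(con, shutdown = TRUE)
}
```

Skipping the statement entirely, rather than issuing it with some computed value, is what lets the engine's default apply when `max_mem_gb` is `NULL`.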

R/get.R

Lines changed: 2 additions & 2 deletions
@@ -63,7 +63,7 @@ spod_get <- function(
   dates = NULL,
   data_dir = spod_get_data_dir(),
   quiet = FALSE,
-  max_mem_gb = max(4, spod_available_ram() - 4),
+  max_mem_gb = NULL,
   max_n_cpu = max(1, parallelly::availableCores() - 1),
   max_download_size_gb = 1,
   duckdb_target = ":memory:",
@@ -100,7 +100,7 @@ spod_get <- function(
     )
   )
   checkmate::assert_flag(quiet)
-  checkmate::assert_number(max_mem_gb, lower = 1)
+  checkmate::assert_number(max_mem_gb, lower = 1, null.ok = TRUE)
   checkmate::assert_integerish(max_n_cpu, lower = 1)
   checkmate::assert_number(max_download_size_gb, lower = 0.1)
   checkmate::assert_string(duckdb_target)

R/internal-utils.R

Lines changed: 0 additions & 12 deletions
@@ -394,18 +394,6 @@ spod_match_data_type <- function(
   return(NULL)
 }

-#' Get available RAM
-#' @keywords internal
-#' @return A `numeric` amount of available RAM in GB.
-spod_available_ram <- function() {
-  return(
-    as.numeric(unclass(memuse::Sys.meminfo())[1][['totalram']]) /
-      1024 /
-      1024 /
-      1024
-  )
-}
-
 #' Remove duplicate values in a semicolon-separated string
 #'
 #' @description

man/spod_available_ram.Rd

Lines changed: 0 additions & 15 deletions
This file was deleted.

man/spod_connect.Rd

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

man/spod_convert.Rd

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.
