Merged

52 commits
6ea2b56
use abs path instead of real path to avoid issues with network drives
e-kotov May 16, 2025
f302f24
abs instead of real path in get data dir
e-kotov May 16, 2025
da696ac
get v1 available data from s3 and fetch file sizes
e-kotov May 16, 2025
da79487
v2 available data from S3 with file sizes and etags
e-kotov May 17, 2025
e40d6bc
reformat spod_download with air
e-kotov May 17, 2025
99cee6d
check_local_files based on the file size check
e-kotov May 17, 2025
c9fe3fe
s3 is default and is cached on disk
e-kotov May 17, 2025
14bac0e
minor fixes to scoping
e-kotov May 17, 2025
9afcd72
disable quick get live test
e-kotov May 17, 2025
dce2147
remove non ascii chars, scope tail to utils
e-kotov May 17, 2025
6461eeb
spod_download defaults to s3 metadata and checks local file sizes
e-kotov May 17, 2025
11e1bc8
update docs for download and available_data
e-kotov May 17, 2025
bdccdf9
xml as failsafe for S3 and memoised xml load
e-kotov May 17, 2025
6fb127c
depend on paws instead of aws.s3 pkg, fix minor bugs in available dat…
e-kotov May 18, 2025
933ae6c
depend on paws.storage explicitly to prevent too many deps of paws pa…
e-kotov May 18, 2025
fdb506d
disable region and url style env vars for S3, fixes in metadata fetch…
e-kotov May 18, 2025
436f83d
air format internal utils
e-kotov May 18, 2025
66b614b
custom multi file downloader because curl multi down fails a lot on s…
e-kotov May 18, 2025
9ce412d
drop curl dependency, only use base r downloads and custom function
e-kotov May 18, 2025
0f0d5dd
store v1 data metadata esp. true remote file sizes with the package
e-kotov May 18, 2025
61ce5c5
quick fix for requested_files
e-kotov May 18, 2025
3f65500
make avail data s3 internal
e-kotov May 18, 2025
4d4b126
down speed and eta in progress bar
e-kotov May 18, 2025
74fd2fc
fix spod get zones v1 error
e-kotov May 19, 2025
3aa7bda
fix progress ETA display
e-kotov May 19, 2025
4fb8894
show file names for files without data_ymd column
e-kotov May 19, 2025
748e03d
file checker function
e-kotov May 19, 2025
4b3e6e5
parallel downloader with base R download.file with libcurl backend
e-kotov May 19, 2025
7a7b3d1
fixes in download data batches
e-kotov May 19, 2025
0f82114
add timeout
e-kotov May 19, 2025
3e68b93
proper batching with retries
e-kotov May 19, 2025
ee622d7
add internal dev helper to store etags for downloaded v1 files
e-kotov May 19, 2025
6072d65
update news
e-kotov May 19, 2025
1d4b31a
spod_store_etags returns tbl
e-kotov May 19, 2025
ab3c8c6
Update man/spod_download.Rd
e-kotov May 19, 2025
146682b
fix typo
e-kotov May 20, 2025
8d55176
add true etags to v1 data
e-kotov May 21, 2025
ef15db4
Merge branch 'main' into s3-metadata
e-kotov May 21, 2025
b6c778c
fix for check all cached files
e-kotov May 21, 2025
9dd8ae4
internal spod check parallel for testing
e-kotov May 21, 2025
f135a7d
switch to mirai multisession
e-kotov May 21, 2025
237f4a0
add parallel as an option to exported check files function
e-kotov May 21, 2025
a7f079e
fixes n_threads arg name in file check
e-kotov May 21, 2025
7b9f61f
delete old v1 data size cache
e-kotov May 21, 2025
be64225
improve messaging of the check function
e-kotov May 21, 2025
7f5f87c
fix docs for spod_store_etags
e-kotov May 22, 2025
77cb0f3
add file classification columns in available data
e-kotov Jun 10, 2025
f1fd5b1
improve error message for download larger than limit
e-kotov Jun 13, 2025
de0cfe2
add .data in spod_store_etags to prevent r cmd check notes
e-kotov Jun 13, 2025
c47ad3a
update docs for spod_store_etags
e-kotov Jun 13, 2025
44b24d2
add warning to spod_check_files
e-kotov Jun 13, 2025
dfdc1ca
docs cleanup
e-kotov Jun 13, 2025
3 changes: 2 additions & 1 deletion DESCRIPTION
@@ -29,7 +29,6 @@ Depends:
R (>= 4.1.0)
Imports:
checkmate,
-    curl (>= 5.0.0),
DBI,
digest,
dplyr,
@@ -45,6 +44,7 @@ Imports:
memuse,
openssl,
parallelly,
+    paws.storage (>= 0.4.0),
purrr,
readr,
rlang,
@@ -58,6 +58,7 @@ Suggests:
flowmapper (>= 0.1.2),
furrr,
future,
+    future.mirai,
hexSticker,
mapSpain,
quarto,
1 change: 1 addition & 0 deletions NAMESPACE
@@ -1,6 +1,7 @@
# Generated by roxygen2: do not edit by hand

export(spod_available_data)
+ export(spod_check_files)
export(spod_cite)
export(spod_codebook)
export(spod_connect)
14 changes: 11 additions & 3 deletions NEWS.md
@@ -2,13 +2,21 @@

## New features

- * `spod_quick_get_zones()` is a new function to quickly get municipality geometries to match with the data retrieved with `spod_quick_get_od()` [#163](https://github.com/rOpenSpain/spanishoddata/pull/163). Requests to get geometries are cached in memory of the current R session with the `memoise` package.
+ * `spod_quick_get_zones()` is a new function to quickly get municipality geometries to match with the data retrieved with `spod_quick_get_od()` [#163](https://github.com/rOpenSpain/spanishoddata/pull/163). Requests to get geometries are cached in memory of the current R session with the `memoise` package. This function is experimental, just like `spod_quick_get_od()`, as the API of the Spanish Ministry of Transport may change in the future. It is only intended for quick analysis for educational or other demonstration purposes, as it downloads very little data compared to the regular `spod_get_od()`, `spod_download()` and `spod_convert()` functions.

+ * `spod_check_files()` is a new function that checks the consistency of downloaded files against Amazon S3 checksums (PR [#165](https://github.com/rOpenSpain/spanishoddata/pull/165)). ETags for v1 data are stored with the package; for v2 data they are fetched from Amazon S3. This function is experimental.

## Improvements

+ * Metadata is now fetched from the Amazon S3 storage of the original data files, which allows validation of downloaded files ([#126](https://github.com/rOpenSpain/spanishoddata/issues/126)) by both size and checksum. PR [#165](https://github.com/rOpenSpain/spanishoddata/pull/165).

## Bug fixes

- * `spod_quick_get_od()` is working again. We fixed it to work with the updated API of the Spanish Ministry of Transport (PR [#163](https://github.com/rOpenSpain/spanishoddata/pull/163), issue [#162](https://github.com/rOpenSpain/spanishoddata/issues/162)). It will remain experimental, as the API may change in the future.
+ * More reliable, yet still multi-threaded, data file downloads using base R `utils::download.file()` instead of `curl::multi_download()`, which failed on some connections ([#127](https://github.com/rOpenSpain/spanishoddata/issues/127)), so the `curl` dependency is no longer required. PR [#165](https://github.com/rOpenSpain/spanishoddata/pull/165).

+ * `spod_quick_get_od()` is working again. We fixed it to work with the updated API of the Spanish Ministry of Transport (PR [#163](https://github.com/rOpenSpain/spanishoddata/pull/163), issue [#162](https://github.com/rOpenSpain/spanishoddata/issues/162)). This function will remain experimental, just like `spod_quick_get_zones()`, as the API of the Spanish Ministry of Transport may change in the future. It is only intended for quick analysis for educational or other demonstration purposes, as it downloads very little data compared to the regular `spod_get_od()`, `spod_download()` and `spod_convert()` functions.

- * `spod_convert()` can now accept `overwrite = 'update'` with `save_format = 'parquet'` ([#161](https://github.com/rOpenSpain/spanishoddata/pull/161)) previously it failed because of the incorrect check that asserted only `TRUE` or `FALSE` ([#160](https://github.com/rOpenSpain/spanishoddata/issues/160))
+ * `spod_convert()` now accepts `overwrite = 'update'` with `save_format = 'parquet'` ([#161](https://github.com/rOpenSpain/spanishoddata/pull/161)); previously it failed because of an incorrect check that allowed only `TRUE` or `FALSE` ([#160](https://github.com/rOpenSpain/spanishoddata/issues/160))

# spanishoddata 0.1.1

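The NEWS entries above describe validating downloaded files by both size and checksum. The size part of such a check reduces to a comparison like the one below; this is a self-contained sketch with a hypothetical name, not the package's actual `check_local_files` implementation:

```r
# Sketch: a file passes the size check only if it exists on disk and its
# size in bytes exactly matches the size reported by the S3 metadata.
check_sizes <- function(paths, expected_size_bytes) {
  actual <- file.size(paths) # returns NA for missing files
  !is.na(actual) & actual == expected_size_bytes
}
```

Missing files come back as `FALSE` rather than `NA`, so the result can be used directly to pick files for re-download.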
187 changes: 187 additions & 0 deletions R/available-data-s3.R
@@ -0,0 +1,187 @@
#' Get available data list from Amazon S3 storage
#'
#' @description
#'
#' Get a table with links to available data files for the specified data version from Amazon S3 storage.
#'
#' @inheritParams spod_available_data
#' @inheritParams global_quiet_param
#' @return A tibble with links, release dates of files in the data, dates of data coverage, local paths to files, and the download status.
#'
#' @keywords internal
spod_available_data_s3 <- function(
ver = c(1, 2),
force = FALSE,
quiet = FALSE,
data_dir = spod_get_data_dir()
) {
ver <- as.character(ver)
ver <- match.arg(ver)
metadata_folder <- glue::glue("{data_dir}/{spod_subfolder_metadata_cache()}")

# if forcing, evict the in-session cache now
if (isTRUE(force)) {
memoise::forget(spod_available_data_s3_memoised)
}

# shortcut: if we already have it memoised, return immediately
if (!force && memoise::has_cache(spod_available_data_s3_memoised)(ver)) {
if (!quiet) message("Using memory-cached available data from S3")
return(spod_available_data_s3_memoised(ver))
}

# no in-session cache, check the on-disk RDS cache
pattern <- glue::glue("metadata_s3_v{ver}_\\d{{4}}-\\d{{2}}-\\d{{2}}\\.rds$")
rds_files <- fs::dir_ls(
path = metadata_folder,
type = "file",
regexp = pattern
) |>
sort()

latest_file <- utils::tail(rds_files, 1)
latest_date <- if (length(latest_file) == 1) {
stringr::str_extract(basename(latest_file), "\\d{4}-\\d{2}-\\d{2}") |>
as.Date()
} else {
NA
}

needs_update <- isTRUE(force) ||
length(rds_files) == 0 ||
(!is.na(latest_date) && latest_date < Sys.Date())

if (!needs_update) {
if (!quiet) message("Using existing disk cache: ", latest_file)
return(readRDS(latest_file))
}

# if forcing, also clear old disk files
if (isTRUE(force) && length(rds_files) > 0) {
fs::file_delete(rds_files)
}

# fetch via the memoised function (this will re-hit S3 if we forgot it)
if (!quiet) message("Fetching latest metadata from Amazon S3 (v", ver, ")...")
dat <- spod_available_data_s3_memoised(ver)

# write a new RDS stamped with today's date
file_date <- format(Sys.Date(), "%Y-%m-%d")
out_path <- file.path(
metadata_folder,
glue::glue("metadata_s3_v{ver}_{file_date}.rds")
)
saveRDS(dat, out_path)
if (!quiet) message("Cached new data to: ", out_path)

dat
}
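The function above layers an in-memory memoised cache over date-stamped RDS files on disk. Stripped of the S3 specifics, the disk layer follows a pattern like this; a generic sketch with hypothetical names, not the package's code:

```r
# Sketch: reuse today's cached RDS if it exists; otherwise call the
# (expensive) fetch function once and stamp the result with today's date.
fetch_with_disk_cache <- function(fetch, cache_dir, prefix = "metadata") {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(
    cache_dir,
    paste0(prefix, "_", format(Sys.Date(), "%Y-%m-%d"), ".rds")
  )
  if (file.exists(path)) {
    return(readRDS(path))
  }
  dat <- fetch()
  saveRDS(dat, path)
  dat
}
```

Because the file name carries the date, a cache written yesterday no longer matches today's path and is silently superseded, which is the same freshness rule `spod_available_data_s3()` applies with `latest_date < Sys.Date()`.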


spod_available_data_s3_function <- function(
ver = c(1, 2)
) {
ver <- as.character(ver)
ver <- match.arg(ver)

bucket <- paste0("mitma-movilidad-v", ver)

# original_aws_region <- Sys.getenv("AWS_DEFAULT_REGION")
# original_aws_url_style <- Sys.getenv("AWS_S3_URL_STYLE")
# on.exit({
# Sys.setenv(
# AWS_DEFAULT_REGION = original_aws_region,
# AWS_S3_URL_STYLE = original_aws_url_style
# )
# })
# Sys.setenv(
# AWS_DEFAULT_REGION = "eu-west-1",
# AWS_S3_URL_STYLE = "virtual"
# )

if (ver == 1) {
url_prefix <- "https://opendata-movilidad.mitma.es/"
} else {
url_prefix <- "https://movilidad-opendata.mitma.es/"
}

s3 <- paws.storage::s3(
config = list(
credentials = list(
anonymous = TRUE
)
)
)

all_objects <- list_objects_v2_all(s3, bucket)

# all_objects <- aws.s3::get_bucket_df(
# bucket = bucket,
# prefix = "", # root of bucket
# max = Inf # fetch beyond the default 1000
# )

all_objects <- all_objects |>
dplyr::as_tibble() |>
dplyr::mutate(
target_url = paste0(url_prefix, .data$Key),
pub_ts = as.POSIXct(
.data$LastModified,
format = "%Y-%m-%dT%H:%M:%OSZ",
tz = "UTC"
),
file_size_bytes = as.numeric(.data$Size),
etag = gsub('\\"', '', .data$ETag)
) |>
dplyr::select(
.data$target_url,
Collaborator:

Why not

Suggested change:
-      .data$target_url,
+      target_url,

out of interest?

Collaborator:

This use case seems different: in our case target_url is the hardcoded column name, not a text string, right?

for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}

Member Author (e-kotov, Jun 13, 2025):

I recall R CMD check complaining about undefined variables in functions and pointing to those tidy-eval column names, unless they are prefixed with .data from rlang.

@Robinlovelace, for example, I forgot to fix the function below:

spod_store_etags <- function() {
  available_data <- spod_available_data(1, check_local_files = TRUE)
  available_data <- available_data |>
    dplyr::filter(downloaded == TRUE)
  local_etags <- available_data$local_path |>
    purrr::map_chr(~ spod_compute_s3_etag(.x), .progress = TRUE)
  available_data <- available_data |>
    dplyr::mutate(local_etag = local_etags) |>
    dplyr::as_tibble()
  return(available_data)
}

I get:

❯ checking R code for possible problems ... NOTE
  spod_store_etags: no visible binding for global variable ‘downloaded’
  Undefined global functions or variables:
    downloaded

Then if I prefix downloaded with bangs (is that what they're called?)/exclamation marks:

dplyr::filter(!!downloaded == TRUE)

Same NOTE:

❯ checking R code for possible problems ... NOTE
  spod_store_etags: no visible binding for global variable ‘downloaded’
  Undefined global functions or variables:
    downloaded

Then replacing the problematic line with:

dplyr::filter(.data$downloaded == TRUE)

No notes anymore 🤷‍♂️

Collaborator:

Yeah, that's a fair reason for using the .data syntax.

I think messages like this

  spod_store_etags: no visible binding for global variable ‘downloaded’

can be resolved with utils::globalVariables("downloaded") somewhere in the package code, as outlined here https://forum.posit.co/t/how-to-solve-no-visible-binding-for-global-variable-note/28887/2 but in the very next post the .data syntax is recommended. Was just trying to understand the reasoning.
.data$pub_ts,
.data$file_size_bytes,
.data$etag
)

return(all_objects)
}
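The review thread above turns on how the `.data` pronoun silences the R CMD check NOTE. A minimal, self-contained illustration, assuming `dplyr` is installed; the data frame is made up:

```r
# With the .data pronoun (re-exported by dplyr from rlang), column names
# are resolved inside the data frame, so R CMD check has no undefined
# global variable to complain about.
files <- data.frame(
  local_path = c("a.csv.gz", "b.csv.gz"),
  downloaded = c(TRUE, FALSE)
)
done <- dplyr::filter(files, .data$downloaded == TRUE)
```

Writing `dplyr::filter(files, downloaded == TRUE)` behaves identically at runtime but triggers the "no visible binding for global variable" NOTE discussed in the thread.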

spod_available_data_s3_memoised <- memoise::memoise(
spod_available_data_s3_function
)

list_objects_v2_all <- function(s3, bucket, prefix = "", max_keys = 10000) {
pages <- paws.storage::paginate(
s3$list_objects_v2(
Bucket = bucket,
Prefix = prefix,
MaxKeys = max_keys
),
PageSize = max_keys
)

all_objects <- unlist(
lapply(pages, `[[`, "Contents"),
recursive = FALSE
)

metadata <- dplyr::tibble(
Key = vapply(all_objects, `[[`, character(1), "Key"),
LastModified = as.POSIXct(
vapply(all_objects, `[[`, numeric(1), "LastModified"),
origin = "1970-01-01",
tz = "UTC"
),
Size = vapply(all_objects, `[[`, numeric(1), "Size"),
ETag = vapply(all_objects, `[[`, character(1), "ETag")
)

# generate presigned S3 download URLs (commented out, kept for reference)
# urls <- metadata$Key |>
# purrr::map(
# ~ s3$generate_presigned_url(
# client_method = "get_object",
# params = list(Bucket = "mitma-movilidad-v1", Key = .x)
# ),
# .progress = TRUE
# )

return(metadata)
}
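The ETags returned by `list_objects_v2_all()` can be compared with local files: for objects uploaded in a single part, the S3 ETag is simply the hex MD5 of the file contents, while multipart uploads produce ETags of the form `<md5>-<n>` computed over the per-part hashes. A sketch of the single-part case with a hypothetical helper; the package's own `spod_compute_s3_etag()` is referenced elsewhere but not shown here:

```r
# Sketch: verify local files against S3 ETags. Only single-part ETags
# (a plain 32-character MD5) are checked; multipart ETags ("<md5>-<n>")
# are skipped and reported as NA.
check_etags <- function(paths, etags) {
  multipart <- grepl("-", etags, fixed = TRUE)
  local_md5 <- unname(tools::md5sum(paths))
  ifelse(multipart, NA, local_md5 == etags)
}
```

Reporting `NA` for multipart ETags keeps the result honest: those files are neither confirmed nor rejected by a plain MD5 comparison.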