Commit b808fbb

simonelbaz and pitrou authored
GH-42173: [R][C++] Writing partitioned dataset on S3 fails if ListBucket is not allowed for the user (#47599)
### Rationale for this change

This PR lets the user choose not to create directories in the bucket before writing a dataset. When the `create_directory` option is set to FALSE, R arrow performs no verification; the S3 storage itself checks whether the directory exists and whether the user has the right to modify it. This way, no `ListBucket` or `HeadBucket` call is needed to complete the write operation.

```
df |> arrow::write_dataset(
  minio$path(paste0("smartsla-bucket/rarrow/")),
  partitioning = "qualitative",
  create_directory = FALSE,
  format = "parquet"
)
```

### What changes are included in this PR?

`create_directory` is now available to the user in the `write_dataset` function. Before this PR, this option was always set to TRUE (the default).

### Are these changes tested?

Yes

### Are there any user-facing changes?

No, the default value for `create_directory` is still TRUE.

* GitHub Issue: #42173

Lead-authored-by: Simon ELBAZ <[email protected]>
Co-authored-by: Simon Elbaz <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
1 parent 52704cb commit b808fbb

5 files changed: +21 -9 lines changed

r/R/arrowExports.R

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

r/R/dataset-write.R

Lines changed: 6 additions & 1 deletion
@@ -67,6 +67,10 @@
 #' group and when this number of rows is exceeded, it is split and the next set
 #' of rows is written to the next group. This value must be set such that it is
 #' greater than `min_rows_per_group`. Default is 1024 * 1024.
+#' @param create_directory whether to create the directories written into.
+#' Requires appropriate permissions on the storage backend. If set to FALSE,
+#' directories are assumed to be already present if writing on a classic
+#' hierarchical filesystem. Default is TRUE
 #' @param ... additional format-specific arguments. For available Parquet
 #' options, see [write_parquet()]. The available Feather options are:
 #' - `use_legacy_format` logical: write data formatted so that Arrow libraries
@@ -132,6 +136,7 @@ write_dataset <- function(dataset,
                           max_rows_per_file = 0L,
                           min_rows_per_group = 0L,
                           max_rows_per_group = bitwShiftL(1, 20),
+                          create_directory = TRUE,
                           ...) {
   format <- match.arg(format)
   if (format %in% c("feather", "ipc")) {
@@ -224,7 +229,7 @@ write_dataset <- function(dataset,
     partitioning, basename_template,
     existing_data_behavior, max_partitions,
     max_open_files, max_rows_per_file,
-    min_rows_per_group, max_rows_per_group
+    min_rows_per_group, max_rows_per_group, create_directory
   )
 }
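
As a rough illustration of the `create_directory` documentation added in this file, the sketch below writes an unpartitioned dataset to a local directory that already exists, passing `create_directory = FALSE`. It assumes an arrow R build that includes this commit; the temporary path and the use of `mtcars` are illustrative choices and are not part of the PR.

```r
library(arrow)

# Hypothetical target directory for this sketch. It is created up front
# because, with create_directory = FALSE, the directory is assumed to
# already exist on a hierarchical (local) filesystem.
out_dir <- file.path(tempdir(), "create-directory-demo")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Write without asking arrow to create any directory.
write_dataset(
  mtcars,
  out_dir,
  format = "parquet",
  create_directory = FALSE
)

# Inspect what was written.
list.files(out_dir)
```

With `partitioning` set, each partition subdirectory would presumably also need to exist beforehand on a local filesystem, which is why the option is mainly of interest for object stores such as S3, as in the commit message.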

r/man/write_dataset.Rd

Lines changed: 5 additions & 0 deletions
Some generated files are not rendered by default.

r/src/arrowExports.cpp

Lines changed: 6 additions & 5 deletions
Some generated files are not rendered by default.

r/src/compute-exec.cpp

Lines changed: 2 additions & 1 deletion
@@ -318,7 +318,7 @@ void ExecPlan_Write(const std::shared_ptr<acero::ExecPlan>& plan,
                     arrow::dataset::ExistingDataBehavior existing_data_behavior,
                     int max_partitions, uint32_t max_open_files,
                     uint64_t max_rows_per_file, uint64_t min_rows_per_group,
-                    uint64_t max_rows_per_group) {
+                    uint64_t max_rows_per_group, bool create_directory) {
   arrow::dataset::internal::Initialize();
 
   // TODO(ARROW-16200): expose FileSystemDatasetWriteOptions in R
@@ -335,6 +335,7 @@ void ExecPlan_Write(const std::shared_ptr<acero::ExecPlan>& plan,
   opts.max_rows_per_file = max_rows_per_file;
   opts.min_rows_per_group = min_rows_per_group;
   opts.max_rows_per_group = max_rows_per_group;
+  opts.create_dir = create_directory;
 
   ds::WriteNodeOptions options(std::move(opts));
   options.custom_schema = std::move(schema);
