Status: Open
Labels: enhancement (New feature or request)
Description
Uploading a dataset to JuliaHub is actually a multi-step process where you "open an upload" to get S3 credentials, then talk directly to S3, and finally you "close the upload". We currently hide away that complexity in:
Lines 482 to 606 in 27e5c72
```julia
@_authuser function upload_dataset(
    dsref::_DatasetRefTuple,
    local_path::AbstractString;
    # Operation type
    create::Bool=true,
    update::Bool=false,
    replace::Bool=false,
    # Dataset metadata
    description::Union{AbstractString, Missing}=missing,
    tags=missing,
    visibility::Union{AbstractString, Missing}=missing,
    license::Union{AbstractString, Tuple{Symbol, <:AbstractString}, Missing}=missing,
    groups=missing,
    # Authentication
    auth::Authentication=__auth__(),
)
    username, dataset_name = dsref
    _assert_current_user(username, auth; op="upload_new_dataset")
    if !create && !update
        throw(ArgumentError("'create' and 'update' can not both be false"))
    end
    if update && replace
        throw(ArgumentError("'update' and 'replace' can not both be true"))
    end
    tags = _validate_iterable_argument(String, tags; argument="tags")
    groups = _validate_iterable_argument(String, groups; argument="groups")
    # We determine the dataset dtype from the local path.
    # This may throw an ArgumentError.
    dtype = _dataset_dtype(local_path)
    # We need to declare `r` here, because we want to reuse the variable name
    local r::_RESTResponse
    # If `create`, then we first try to create the dataset. If the dataset name
    # is already taken, then we should get a 409 back.
    local newly_created_dataset::Bool = false
    if create
        # Note: we do not set tags or description here (even though we could), but we
        # will do that in an update_dataset() call later.
        r = _new_dataset(dataset_name, dtype; auth)
        if r.status == 409
            # 409 Conflict indicates that a dataset with this name already exists.
            if !update && !replace
                # If neither update nor replace is set, and the dataset exists, then
                # we must throw an invalid request error.
                throw(
                    InvalidRequestError(
                        "Dataset '$dataset_name' for user '$username' already exists, but update=false and replace=false.",
                    ),
                )
            elseif replace
                # In replace mode we will delete the existing dataset and
                # create a new one.
                delete_dataset((username, dataset_name); auth)
                r_recreated::_RESTResponse = _new_dataset(dataset_name, dtype; auth)
                if r_recreated.status == 200
                    newly_created_dataset = true
                else
                    _throw_invalidresponse(r_recreated)
                end
            end
            # There is one more case -- `update && !replace` -- but in this case
            # we just move on to uploading a new version.
        elseif r.status == 200
            # The only other valid response is 200, when we create the dataset
            newly_created_dataset = true
        else
            # For any non-200/409 responses we throw a backend error.
            _throw_invalidresponse(r)
        end
    end
    # If `!create`, the only option allowed is `update` (`replace` is excluded).
    #
    # Acquire an upload for the dataset. By this point, the dataset with this name
    # should definitely exist, although race conditions are always a possibility.
    r = _open_dataset_version(dataset_name; auth)
    if (r.status == 404) && !create
        # A non-existent dataset when create=false indicates a user error.
        throw(
            InvalidRequestError(
                "Dataset '$dataset_name' for '$username' does not exist and create=false."
            ),
        )
    elseif r.status != 200
        # Any other non-200 response indicates a backend failure.
        _throw_invalidresponse(r)
    end
    upload_config, _ = _parse_response_json(r, Dict)
    # Verify that the dtype of the remote dataset is what we expect it to be.
    if upload_config["dataset_type"] != dtype
        if newly_created_dataset
            # If we just created the dataset, then there has been some strange error if dtypes
            # do not match.
            throw(JuliaHubError("Dataset types do not match."))
        else
            # Otherwise, it's a user error (i.e. they are trying to update a dataset with the
            # wrong dtype).
            throw(
                InvalidRequestError(
                    "Local data type ($dtype) does not match existing dataset dtype $(upload_config["dataset_type"])",
                ),
            )
        end
    end
    # Upload the actual data
    try
        _upload_dataset(upload_config, local_path)
    catch e
        throw(JuliaHubError("Data upload failed", e, catch_backtrace()))
    end
    # Finalize the upload
    try
        # _close_dataset_version will also throw on non-200 responses
        _close_dataset_version(dataset_name, upload_config; local_path, auth)
    catch e
        throw(JuliaHubError("Finalizing upload failed", e, catch_backtrace()))
    end
    # Finally, update the dataset metadata with the new metadata fields.
    if !all(ismissing.((description, tags, visibility, license, groups)))
        update_dataset(
            (username, dataset_name); auth,
            description, tags, visibility, license, groups
        )
    end
    # If everything was successful, we'll return an updated DataSet object.
    return dataset((username, dataset_name); auth)
end
```
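Stripped of the create/update/replace logic and error handling, the open → upload → close flow hidden inside `upload_dataset` reduces to roughly this sequence of internal calls (a simplified sketch using the helper names from the snippet above):

```julia
# 1. Open an upload for a new dataset version; the response carries the
#    S3 location and temporary credentials.
r = _open_dataset_version(dataset_name; auth)
upload_config, _ = _parse_response_json(r, Dict)
# 2. Talk directly to S3 to transfer the data.
_upload_dataset(upload_config, local_path)
# 3. Close the upload to finalize the new version.
_close_dataset_version(dataset_name, upload_config; local_path, auth)
```

It is essentially steps (1) and (3) that a lower-level API would need to expose, leaving step (2) to the user.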
Sometimes users may want to control the upload step themselves, so we should expose a slightly lower-level API that returns an object containing the information and credentials for the active dataset upload. Users can then upload the data themselves, and finally they just need to close the upload.
The use cases I see for this:
- Users wanting to upload things to S3 themselves using another tool, and they just want the credentials (e.g. they want to invoke `rclone` by hand for one reason or another).
- Tools that want to have more control over how data gets written to the S3 bucket (e.g. to avoid writing all the files as temporary files).