Commit d182f1c

feat: add .versions property to Dataset
1 parent a2f678e commit d182f1c

13 files changed: +254 -37 lines changed
CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,12 @@
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## Unreleased
+
+### Added
+
+* Information about dataset versions can now be accessed via the `.versions` property of a `Dataset` object. (#2)
+
 ## Version v0.1.0 - 2023-06-20
 
 Initial package release.

docs/Manifest.toml

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
 
 julia_version = "1.9.1"
 manifest_format = "2.0"
-project_hash = "069906b5302a0260a868f88e6f63d03397df4071"
+project_hash = "c65c5d89b95b6f812384895c79ee446e88c10682"
 
 [[deps.ANSIColoredPrinters]]
 git-tree-sha1 = "574baf8110975760d391c710b6341da1afa48d8c"
@@ -136,7 +136,7 @@ version = "0.21.4"
 deps = ["Base64", "Dates", "Glob", "HTTP", "JSON", "Mocking", "Pkg", "PkgAuthentication", "Printf", "Rclone_jll", "SHA", "TOML", "Tar", "TimeZones", "URIs", "UUIDs"]
 path = ".."
 uuid = "bc7fa6ce-b75e-4d60-89ad-56c957190b6e"
-version = "0.0.2-DEV"
+version = "0.1.0"
 
 [[deps.LazyArtifacts]]
 deps = ["Artifacts", "Pkg"]

docs/Project.toml

Lines changed: 1 addition & 0 deletions
@@ -4,4 +4,5 @@ DocumenterMermaid = "a078cd44-4d9c-4618-b545-3ab9d77f9177"
 JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
 JuliaHub = "bc7fa6ce-b75e-4d60-89ad-56c957190b6e"
 Mocking = "78c3b35d-d492-501b-9361-3d52fe80e533"
+TimeZones = "f269a46b-ccf7-5d73-abea-4c690281aa53"
 URIs = "5c2747f8-b7ea-4ff2-ba2e-563bfd36b1d4"

docs/make.jl

Lines changed: 4 additions & 0 deletions
@@ -1,6 +1,10 @@
 using JuliaHub
 using Documenter, DocumenterMermaid
 
+# Timestamp printing is dependent on the timezone, so we force a specific (non-UTC)
+# timezone to make sure that the doctests don't fail because of timezone differences.
+ENV["TZ"] = "America/New_York"
+
 DocMeta.setdocmeta!(
     JuliaHub, :DocTestSetup,
     quote

docs/src/guides/datasets.md

Lines changed: 40 additions & 3 deletions
@@ -30,12 +30,13 @@ julia> JuliaHub.datasets()
 
 If you know the name of the dataset, you can also directly access it with the [`dataset`](@ref) function, and you can access the dataset metadata via the properties of the [`Dataset`](@ref) object.
 
-```jldoctest
+```jldoctest example-dataset
 julia> ds = JuliaHub.dataset("example-dataset")
 Dataset: example-dataset (Blob)
  owner: username
  description: An example dataset
- size: 57 bytes
+ versions: 2
+ size: 388 bytes
  tags: tag1, tag2
 
 julia> ds.owner
@@ -45,7 +46,7 @@ julia> ds.description
 "An example dataset"
 
 julia> ds.size
-57
+388
 ```
 
 If you want to work with a dataset that you do not own but that is shared with you in JuliaHub, you can pass `shared=true` to [`datasets`](@ref), or specify the username.
@@ -65,6 +66,7 @@ julia> JuliaHub.dataset(("anotheruser", "publicdataset"))
 Dataset: publicdataset (Blob)
  owner: anotheruser
  description: An example dataset
+ versions: 1
  size: 57 bytes
  tags: tag1, tag2
 ```
@@ -79,6 +81,41 @@ Elapsed time: 2.1s
 "/home/username/my-project/mydata"
 ```
 
+As datasets can have multiple versions, the [`.versions` property of `Dataset`](@ref Dataset) can be used to see information about the individual versions (represented with [`DatasetVersion`](@ref) objects).
+When downloading, you can also specify the version you wish to download (with the default being the newest version).
+
+```jldoctest example-dataset; filter = r"\"/.+/mydata\""
+julia> ds.versions
+2-element Vector{JuliaHub.DatasetVersion}:
+ JuliaHub.DatasetVersion(dataset = ("username", "example-dataset"), version = 1)
+ JuliaHub.DatasetVersion(dataset = ("username", "example-dataset"), version = 2)
+
+julia> ds.versions[1]
+DatasetVersion: example-dataset @ v1
+ owner: username
+ timestamp: 2022-10-13T01:39:42.963-04:00
+ size: 57 bytes
+
+julia> JuliaHub.download_dataset("example-dataset", "mydata", version=ds.versions[1].id)
+Transferred: 86.767 KiB / 86.767 KiB, 100%, 0 B/s, ETA -
+Transferred: 1 / 1, 100%
+Elapsed time: 2.1s
+"/home/username/my-project/mydata"
+
+```
+
+The dataset versions are sorted with the oldest first.
+To explicitly access the newest version, you can use the `last` function on the `.versions` property.
+
+```jldoctest example-dataset
+julia> last(ds.versions)
+DatasetVersion: example-dataset @ v2
+ owner: username
+ timestamp: 2022-10-14T01:39:43.237-04:00
+ size: 331 bytes
+
+```
+
 !!! tip "Tip: DataSets.jl"
 
     In JuliaHub jobs and Cloud IDEs you can also use the [DataSets.jl](https://github.com/JuliaComputing/DataSets.jl) package to access and work with datasets.

docs/src/reference/datasets.md

Lines changed: 2 additions & 1 deletion
@@ -41,14 +41,15 @@ The versions are indexed with a linear list of integers starting from `1`.
 ## Reference
 
 ```@docs
+JuliaHub.Dataset
+JuliaHub.DatasetVersion
 JuliaHub.datasets
 JuliaHub.DatasetReference
 JuliaHub.dataset
 JuliaHub.download_dataset
 JuliaHub.upload_dataset
 JuliaHub.update_dataset
 JuliaHub.delete_dataset
-JuliaHub.Dataset
 ```
 
 ## Index

src/datasets.jl

Lines changed: 83 additions & 17 deletions
@@ -13,6 +13,63 @@ Base.@kwdef struct _DatasetStorage
     prefix::String
 end
 
+"""
+    struct DatasetVersion
+
+Represents one version of a dataset.
+
+Objects have the following properties:
+
+- `.id`: unique dataset version identifier (used e.g. in [`download_dataset`](@ref) to
+  identify the dataset version).
+- `.size :: Int`: size of the dataset version in bytes
+- `.timestamp :: ZonedDateTime`: dataset version timestamp
+
+```
+julia> JuliaHub.datasets()
+```
+
+See also: [`Dataset`](@ref), [`datasets`](@ref), [`dataset`](@ref).
+
+$(_DOCS_no_constructors_admonition)
+"""
+struct DatasetVersion
+    _dsref::Tuple{String, String}
+    id::Int
+    size::Int
+    timestamp::TimeZones.ZonedDateTime
+    _blobstore_path::String
+
+    function DatasetVersion(json::Dict; owner::AbstractString, name::AbstractString)
+        msg = "Unable to parse dataset version info for ($owner, $name)"
+        version = _get_json(json, "version", Int; msg)
+        size = _get_json(json, "size", Int; msg)
+        timestamp = _parse_tz(_get_json(json, "date", String; msg); msg)
+        blobstore_path = _get_json(json, "blobstore_path", String; msg)
+        new((owner, name), version, size, timestamp, blobstore_path)
+    end
+end
+
+function Base.show(io::IO, dsv::DatasetVersion)
+    owner, name = dsv._dsref
+    print(
+        io,
+        "JuliaHub.DatasetVersion(dataset = (\"",
+        owner,
+        "\", \"",
+        name,
+        "\"), version = $(dsv.id))",
+    )
+end
+function Base.show(io::IO, ::MIME"text/plain", dsv::DatasetVersion)
+    owner, name = dsv._dsref
+    printstyled(io, "DatasetVersion:"; bold=true)
+    print(io, " ", name, " @ v", dsv.id)
+    print(io, "\n owner: ", owner)
+    print(io, "\n timestamp: ", dsv.timestamp)
+    print(io, "\n size: ", dsv.size, " bytes")
+end
+
 """
     struct Dataset
 
@@ -23,6 +80,8 @@ public API:
 - `owner :: String`: username of the dataset owner
 - `name :: String`: dataset name
 - `dtype :: String`: generally either `Blob` or `BlobTree`, but additional values may be added in the future
+- `versions :: Vector{DatasetVersion}`: an ordered list of [`DatasetVersion`](@ref) objects, one for
+  each dataset version, sorted from oldest to newest (i.e. you can use `last` to get the newest version).
 - `size :: Int`: total size of the whole dataset (including all the dataset versions) in bytes
 - Fields to access user-provided dataset metadata:
   - `description :: String`: dataset description
@@ -44,38 +103,38 @@ Base.@kwdef struct Dataset
     uuid::UUIDs.UUID
     dtype::String
     size::Int64
+    versions::Vector{DatasetVersion}
     # User-set metadata
     description::String
     tags::Vector{String}
     # Additional metadata, but not part of public API
     _last_modified::Union{Nothing, TimeZones.ZonedDateTime}
     _downloadURL::String
-    _version::Union{Nothing, String}
-    _versions::Vector
     _storage::_DatasetStorage
     # Should not be used in code, but stores the full server
     # response for developer convenience.
     _json::Dict
 end
 
 function Dataset(d::Dict)
+    owner = d["owner"]["username"]
+    name = d["name"]
+    versions_json = _get_json_or(d, "versions", Vector, [])
+    versions = sort([DatasetVersion(json; owner, name) for json in versions_json]; by=dsv -> dsv.id)
     Dataset(;
         uuid=UUIDs.UUID(d["id"]),
-        name=d["name"],
-        owner=d["owner"]["username"],
+        name, owner, versions,
         dtype=d["type"],
         description=d["description"],
         size=d["size"],
         tags=d["tags"],
-        _versions=d["versions"],
         _downloadURL=d["downloadURL"],
         _last_modified=_nothing_or(d["lastModified"]) do last_modified
             datetime_utc = Dates.DateTime(
                 last_modified, Dates.dateformat"YYYY-mm-ddTHH:MM:SS.ss"
             )
             _utc2localtz(datetime_utc)
         end,
-        _version=isnothing(d["version"]) ? nothing : d["version"],
         _storage=_DatasetStorage(;
             credentials_url=d["credentials_url"],
             region=d["storage"]["bucket_region"],
@@ -94,19 +153,19 @@ function Base.show(io::IO, ::MIME"text/plain", d::Dataset)
     print(io, " ", d.name, " (", d.dtype, ")")
     print(io, "\n owner: ", d.owner)
     print(io, "\n description: ", d.description)
+    print(io, "\n versions: ", length(d.versions))
     print(io, "\n size: ", d.size, " bytes")
     isempty(d.tags) || print(io, "\n tags: ", join(d.tags, ", "))
 end
 
 function Base.:(==)(d1::Dataset, d2::Dataset)
     d1.name == d2.name &&
         d1.description == d2.description &&
-        d1._versions == d2._versions &&
+        d1.versions == d2.versions &&
         d1._downloadURL == d2._downloadURL &&
         d1.size == d2.size &&
         d1.tags == d2.tags &&
         d1._last_modified == d2._last_modified &&
-        d1._version == d2._version &&
         d1.dtype == d2.dtype &&
         d1.uuid == d2.uuid
 end
@@ -297,14 +356,16 @@ julia> dataset = JuliaHub.dataset("example-dataset")
 Dataset: example-dataset (Blob)
  owner: username
  description: An example dataset
- size: 57 bytes
+ versions: 2
+ size: 388 bytes
  tags: tag1, tag2
 
 julia> JuliaHub.dataset(dataset)
 Dataset: example-dataset (Blob)
  owner: username
  description: An example dataset
- size: 57 bytes
+ versions: 2
+ size: 388 bytes
  tags: tag1, tag2
 ```
 
@@ -703,8 +764,10 @@ function _parse_dataset_version(version::AbstractString)
 end
 
 function _find_dataset_version(dataset::Dataset, version::Integer)
-    for version_dict in dataset._versions
-        version_dict["version"] == version && return version_dict
+    # Starting from the latest version, assuming that it's more common to
+    # try to find newer versions.
+    for dsv in Iterators.reverse(dataset.versions)
+        dsv.id == version && return dsv
     end
     return nothing
 end
@@ -722,7 +785,9 @@ If the dataset is a `Blob`, then the created `local_path` will be a file, and if
 the `local_path` will be a directory.
 
 By default, it downloads the latest version, but an older version can be downloaded by specifying
-the `version` keyword argument.
+the `version` keyword argument. Caution: you should never assume that the index of the `.versions` property
+of [`Dataset`](@ref) matches the version number -- always explicitly use the `.id` property of the
+[`DatasetVersion`](@ref) object.
 
 The function also prints download progress to standard output. This can be disabled by setting `quiet=true`.
 Any error output from the download is still printed.
@@ -766,12 +831,14 @@ function download_dataset(
         ),
     )
 
+    isempty(dataset.versions) &&
+        throw(InvalidRequestError("Dataset '$(dataset.name)' does not have any versions"))
     if isnothing(version)
-        version = _parse_dataset_version(dataset._version)
+        version = last(dataset.versions).id
     end
     version_info = _find_dataset_version(dataset, version)
     isnothing(version_info) &&
-        throw(InvalidRequestError("Dataset '$(dataset.name)' does not have version '$version'"))
+        throw(InvalidRequestError("Dataset '$(dataset.name)' does not have version 'v$version'"))
 
     credentials = Mocking.@mock _get_dataset_credentials(auth, dataset)
     credentials["vendor"] == "aws" ||
@@ -780,8 +847,7 @@ function download_dataset(
 
     bucket = dataset._storage.bucket
     prefix = dataset._storage.prefix
-    version_path = version_info["blobstore_path"]
-    remote_uri = "juliahub_remote:$bucket/$prefix/$(dataset.uuid)/$version_path"
+    remote_uri = "juliahub_remote:$bucket/$prefix/$(dataset.uuid)/$(version_info._blobstore_path)"
     if dataset.dtype == "Blob"
         remote_uri *= "/data"
     end
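
The caution in the `download_dataset` docstring (look versions up by `.id`, never by position in `.versions`) can be illustrated with a minimal, self-contained sketch. `SketchVersion` and `find_version` are hypothetical stand-ins written for this note, not part of JuliaHub.jl, and the gap in the version numbers is an assumed scenario:

```julia
# Hypothetical stand-in for JuliaHub.DatasetVersion, reduced to the one field used here.
struct SketchVersion
    id::Int
end

# Search newest-first and return the matching version, or `nothing` if it is absent.
function find_version(versions::Vector{SketchVersion}, id::Integer)
    for v in Iterators.reverse(versions)
        v.id == id && return v
    end
    return nothing
end

# Assumed scenario: version 2 is missing, so ids and vector indices no longer line up.
versions = [SketchVersion(1), SketchVersion(3), SketchVersion(4)]
found = find_version(versions, 3)            # the entry whose id is 3
missing_version = find_version(versions, 2)  # no such id in the list
```

In this scenario `versions[3]` holds version 4, so indexing by version number would silently select the wrong entry, whereas the id-based lookup returns either the intended version or `nothing`.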

src/utils.jl

Lines changed: 27 additions & 0 deletions
@@ -372,3 +372,30 @@ function _throw_or_nothing(
     isnothing(nothrow_extra_logic_f) || nothrow_extra_logic_f(msg)
     return nothing
 end
+
+# Parses a timezoned timestamp string into a local timezone object
+const _VALID_TZ_DATEFORMATS = [
+    Dates.dateformat"yyyy-mm-ddTHH:MM:SS.ssszzz",
+    Dates.dateformat"yyyy-mm-ddTHH:MM:SS.sszzz",
+    Dates.dateformat"yyyy-mm-ddTHH:MM:SS.szzz",
+    Dates.dateformat"yyyy-mm-ddTHH:MM:SSzzz",
+]
+function _parse_tz(timestamp_str::AbstractString; msg::Union{AbstractString, Nothing}=nothing)
+    timestamp = nothing
+    for dateformat in _VALID_TZ_DATEFORMATS
+        timestamp = try
+            TimeZones.ZonedDateTime(timestamp_str, dateformat)
+        catch e
+            isa(e, ArgumentError) && continue
+            rethrow(e)
+        end
+    end
+    if isnothing(timestamp)
+        errmsg = "Unable to parse timestamp '$timestamp_str'"
+        if !isnothing(msg)
+            errmsg = string(msg, '\n', errmsg)
+        end
+        throw(JuliaHubError(errmsg))
+    end
+    return TimeZones.astimezone(timestamp, TimeZones.localzone())
+end
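
The loop in `_parse_tz` tries each accepted format in turn. A minimal sketch of the same fallback pattern, using only the `Dates` standard library (and therefore no time zones), might look as follows. `SKETCH_FORMATS` and `parse_any` are hypothetical names, the two formats are chosen purely for illustration, and, unlike the loop in the diff, this version returns as soon as one format matches:

```julia
using Dates

# Candidate formats, tried in order (illustrative choices, not the package's list).
const SKETCH_FORMATS = [
    dateformat"yyyy-mm-ddTHH:MM:SS",
    dateformat"dd.mm.yyyy HH:MM",
]

# Return the first successful parse; throw if no format matches.
function parse_any(str::AbstractString)
    for fmt in SKETCH_FORMATS
        # `tryparse` returns `nothing` instead of throwing when the format does not match.
        parsed = tryparse(DateTime, str, fmt)
        isnothing(parsed) || return parsed
    end
    throw(ArgumentError("Unable to parse timestamp '$str'"))
end

iso = parse_any("2022-10-13T01:39:42")
dotted = parse_any("13.10.2022 01:39")
```

Using `tryparse` avoids the `try`/`catch` block; whether that is preferable in the package itself depends on whether `TimeZones.ZonedDateTime` offers an equivalent non-throwing parser, which is not assumed here.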

test/datasets-live.jl

Lines changed: 1 addition & 1 deletion
@@ -135,7 +135,7 @@ try
         dataset = JuliaHub.dataset(datasetname; auth)
         @test dataset.name == datasetname
         @test dataset.dtype == "Blob"
-        @test length(dataset._versions) == 2
+        @test length(dataset.versions) == 2
     end
 
     # Updating metadata
