97 changes: 91 additions & 6 deletions docs/src/UserGuide/read.md
@@ -2,7 +2,17 @@

This section describes how to read files, URLs, and directories into YAXArrays and datasets.

- ## Read Zarr
+ ## open_dataset

`open_dataset` is the usual entry point for reading any supported format. See its docstring for more information.

````@docs
open_dataset
````

Now, let's explore different examples.

### Read Zarr

Open a Zarr store as a `Dataset`:
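A minimal sketch of what this looks like, assuming the public CMIP6 Zarr store used in the published YAXArrays docs (any readable Zarr store works):

````julia
using YAXArrays, Zarr

# illustrative public CMIP6 store holding a `tas` (near-surface air temperature) array
store = "gs://cmip6/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp585/r1i1p1f1/3hr/tas/gn/v20190710/"
ds = open_dataset(zopen(store, consolidated=true))
````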

@@ -23,7 +33,7 @@
Individual arrays can be accessed using subsetting:
ds.tas
````

- ## Read NetCDF
+ ### Read NetCDF

Open a NetCDF file as a `Dataset`:
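A minimal sketch, assuming Unidata's sample `tos_O1_2001-2002.nc` file (the URL is an assumption; the `tos` variable matches the one used below):

````julia
using YAXArrays, NetCDF
using Downloads: download

path = download("https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc", "example.nc")
ds = open_dataset(path)
````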

@@ -55,7 +65,7 @@
end

This code ensures that the data is accessed by only one thread at a time, i.e., it becomes effectively single-threaded but thread-safe.
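A minimal sketch of that pattern, assuming the `example.nc` file and `tos` variable from above; the lock and loop bounds are illustrative:

````julia
using YAXArrays, NetCDF

ds = open_dataset("example.nc")
ds_lock = ReentrantLock()

Threads.@threads for i in 1:12
    lock(ds_lock) do
        # only one thread reads from the underlying NetCDF file at a time
        slice = ds["tos"][:, :, i]
    end
end
````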

- ## Read GDAL (GeoTIFF, GeoJSON)
+ ### Read GDAL (GeoTIFF, GeoJSON)

All GDAL-compatible files can be read as a `YAXArrays.Dataset` after loading [ArchGDAL](https://yeesian.com/ArchGDAL.jl/latest/):

@@ -68,11 +78,11 @@
path = download("https://github.com/yeesian/ArchGDALDatasets/raw/307f8f0e584a39a
ds = open_dataset(path)
````
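For completeness, a hedged sketch of the whole workflow (the raster path is hypothetical; any GDAL-readable file works):

````julia
using YAXArrays
using ArchGDAL  # loading ArchGDAL makes the GDAL driver available to open_dataset

ds = open_dataset("world.tif")  # hypothetical path to a local GeoTIFF
````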

- ## Load data into memory
+ ### Load data into memory

For datasets or variables that fit into RAM, you might want to load them completely into memory. This can be done with the `readcubedata` function. As an example, let's use the NetCDF workflow; the same applies to the other formats.

- ### readcubedata
+ #### readcubedata

:::tabs

@@ -99,4 +109,79 @@
ds_loaded["tos"] # Load the variable of interest; the loaded status is shown for

:::

Note how the loading status changes from `loaded lazily` to `loaded in memory`.
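As a concrete sketch (assuming the NetCDF dataset `ds` opened above):

````julia
tos_loaded = readcubedata(ds["tos"])  # load a single variable into memory
ds_loaded = readcubedata(ds)          # or load the whole dataset at once
ds_loaded["tos"]                      # now reported as loaded in memory
````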

## open_mfdataset

There are situations when we would like to open and concatenate a list of dataset paths along a certain dimension. For example, to concatenate a list of `NetCDF` files along a new `time` dimension, one can use:

::: details creation of NetCDF files

````@example open_list_netcdf
using YAXArrays, NetCDF, Dates
using YAXArrays: YAXArrays as YAX

dates_1 = [Date(2020, 1, 1) + Dates.Day(i) for i in 1:3]
dates_2 = [Date(2020, 1, 4) + Dates.Day(i) for i in 1:3]

a1 = YAXArray((lon(1:5), lat(1:7)), rand(5, 7))
a2 = YAXArray((lon(1:5), lat(1:7)), rand(5, 7))

a3 = YAXArray((lon(1:5), lat(1:7), YAX.time(dates_1)), rand(5, 7, 3))
a4 = YAXArray((lon(1:5), lat(1:7), YAX.time(dates_2)), rand(5, 7, 3))

savecube(a1, "a1.nc")
savecube(a2, "a2.nc")
savecube(a3, "a3.nc")
savecube(a4, "a4.nc")
````
:::

### along a new dimension

````@example open_list_netcdf
using YAXArrays, NetCDF, Dates
using YAXArrays: YAXArrays as YAX
import DimensionalData as DD

files = ["a1.nc", "a2.nc"]

dates_read = [Date(2024, 1, 1) + Dates.Day(i) for i in 1:2]
ds = open_mfdataset(DD.DimArray(files, YAX.time(dates_read)))
````

We can even open files along a new `Time` dimension when the files already have a `time` dimension:

````@example open_list_netcdf
files = ["a3.nc", "a4.nc"]
ds = open_mfdataset(DD.DimArray(files, YAX.Time(dates_read)))
````

Note that opening along a new dimension name without specifying values also works; however, it defaults to `1:length(files)` for the dimension values.

````@example open_list_netcdf
files = ["a1.nc", "a2.nc"]
ds = open_mfdataset(DD.DimArray(files, YAX.time))
````

### along an existing dimension

Another use case is when we want to open files along an existing dimension. In this case, `open_mfdataset` will concatenate the paths along the specified dimension

````@example open_list_netcdf
using YAXArrays, NetCDF, Dates
using YAXArrays: YAXArrays as YAX
import DimensionalData as DD

files = ["a3.nc", "a4.nc"]

ds = open_mfdataset(DD.DimArray(files, YAX.time()))
````

where the contents of the `time` dimension are the merged values from both files

````@ansi open_list_netcdf
ds["time"]
````

This gives us a wide range of options for combining files into a single dataset.
19 changes: 17 additions & 2 deletions src/DatasetAPI/Datasets.jl
@@ -348,7 +348,11 @@
open_mfdataset(g::Vector{<:AbstractString}; kwargs...) =
merge_datasets(map(i -> open_dataset(i; kwargs...), g))

function merge_new_axis(alldatasets, firstcube,var,mergedim)
-    newdim = DD.rebuild(mergedim,1:length(alldatasets))
+    newdim = if !(typeof(DD.lookup(mergedim)) <: DD.NoLookup)
+        DD.rebuild(mergedim, DD.val(mergedim))
+    else
+        DD.rebuild(mergedim, 1:length(alldatasets))
+    end
alldiskarrays = map(ds->ds.cubes[var].data,alldatasets).data
newda = diskstack(alldiskarrays)
newdims = (DD.dims(firstcube)...,newdim)
@@ -407,10 +411,21 @@
end


"""
-    open_dataset(g; driver=:all)
+    open_dataset(g; skip_keys=(), driver=:all)

Open the dataset at `g` with the given `driver`.
The default driver searches all available drivers and tries to detect a usable driver from the filename extension.

### Keyword arguments

- `skip_keys`: keys (variables) to skip when opening, passed as symbols, e.g., `skip_keys = (:a, :b)`
- `driver`: defaults to `:all`; common options are `:netcdf` or `:zarr`.

Example:

````julia
ds = open_dataset(f, driver=:zarr, skip_keys = (:c,))
````
"""
function open_dataset(g; skip_keys=(), driver = :all)
str_skipkeys = string.(skip_keys)