Commit 4229def

Expose support for multiple files to DataHandler
This commit surfaces the new capability introduced recently in NCFileReaders to read multiple files as if they were one. In this commit, I only add support for reading one variable split across multiple files, but extending to composing variables should be straightforward (when such a capability is required).
1 parent 9c9d998 commit 4229def

File tree

6 files changed

+288
-63
lines changed

NEWS.md

Lines changed: 33 additions & 1 deletion
@@ -4,13 +4,45 @@ ClimaUtilities.jl Release Notes
44
main
55
------
66

7-
#### Support reading time data across multiple files. PR [#127](https://github.com/CliMA/ClimaUtilities.jl/pull/127)
7+
v0.1.20
8+
------
9+
10+
#### Support reading time data across multiple files. PRs [#127](https://github.com/CliMA/ClimaUtilities.jl/pull/127), [#132](https://github.com/CliMA/ClimaUtilities.jl/pull/132)
811

912
`NCFileReader`s can now read multiple files at the same time. The files have to
1013
contain temporal data for the given variable and they are aggregated along the
1114
time dimension. To use this feature, just pass a vector of file paths to the
1215
constructor.
1316

17+
This capability is also available to `DataHandler`s and `TimeVaryingInput`s. To
18+
use this feature, just pass the list of files that contain your variable of
19+
interest, for example
20+
```julia
21+
timevaryinginput = TimeVaryingInputs.TimeVaryingInput(["era5_1980.nc", "era5_1981.nc"],
22+
"u",
23+
target_space,
24+
start_date = Dates.DateTime(1980, 1, 1),
25+
regridder_type = :InterpolationsRegridder)
26+
```
27+
You can also compose variables
28+
```julia
29+
timevaryinginput = TimeVaryingInputs.TimeVaryingInput(["era5_1980.nc", "era5_1981.nc", "era5_1982.nc"],
30+
["u", "v"],
31+
target_space,
32+
start_date = Dates.DateTime(1980, 1, 1),
33+
regridder_type = :InterpolationsRegridder,
34+
compose_function = (x, y) -> sqrt(x^2 + y^2))
35+
```
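As a rough illustration of how a `compose_function` acts on data (the `DataHandler` docstring in this commit states that the function is broadcast over the data read from file), here is a minimal sketch with small made-up arrays standing in for the `u` and `v` fields. The arrays and their values are assumptions for illustration only; no files are read.

```julia
# Made-up 2×2 arrays standing in for data read from file (illustrative values).
u = [3.0 0.0; 6.0 1.0]
v = [4.0 2.0; 8.0 0.0]

# Same function as in the example above: takes one scalar per variable,
# in the same order as `varnames`, and returns a scalar.
compose_function = (x, y) -> sqrt(x^2 + y^2)

# Broadcasting applies the function element by element.
speed = compose_function.(u, v)
```

Because the function receives scalars, it can be written without any knowledge of the shape of the underlying data.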
36+
37+
When you compose variables, pay attention that `TimeVaryingInput` implements
38+
some heuristics to disambiguate the case where the passed list of files is split
39+
along the time or the variable dimension. You can always pass a list of lists to
40+
be explicit in your intentions. Read the
41+
[documentation](https://clima.github.io/ClimaUtilities.jl/dev/datahandling.html#Heuristics-to-do-what-you-mean)
42+
to learn more about this.
43+
44+
This capability is only available for the `InterpolationsRegridder`.
45+
1446
v0.1.19
1547
------
1648

Project.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
11
name = "ClimaUtilities"
22
uuid = "b3f4f4ca-9299-4f7f-bd9b-81e1242a7513"
33
authors = ["Gabriele Bozzola <[email protected]>", "Julia Sloan <[email protected]>"]
4-
version = "0.1.19"
4+
version = "0.1.20"
55

66
[deps]
77
Artifacts = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"

docs/src/datahandling.md

Lines changed: 54 additions & 4 deletions
@@ -9,17 +9,18 @@ This is no trivial task. Among the challenges:
99
- IO can be very expensive,
1010
- CPU/GPU communication can be a bottleneck.
1111

12-
The `DataHandling` takes the divide and conquer approach: the various core tasks
12+
The `DataHandling` takes the divide-and-conquer approach: the various core tasks
1313
and features are split into other independent modules (chiefly
1414
[`FileReaders`](@ref), and [`Regridders`](@ref)). Such modules can be developed,
1515
tested, and extended independently (as long as they maintain a consistent
1616
interface). For instance, if need arises, the `DataHandler` can be used (almost)
1717
directly to process files with a different format from NetCDF.
1818

1919
The key struct in `DataHandling` is the `DataHandler`. The `DataHandler`
20-
contains one or more `FileReader`(s), a `Regridder`, and other metadata necessary to perform
21-
its operations (e.g., target `ClimaCore.Space`). The `DataHandler` can be used
22-
for static or temporal data, and exposes the following key functions:
20+
contains one or more `FileReader`(s), a `Regridder`, and other metadata
21+
necessary to perform its operations (e.g., target `ClimaCore.Space`). The
22+
`DataHandler` can be used for static or temporal data, and exposes the following
23+
key functions:
2324
- `regridded_snapshot(time)`: to obtain the regridded field at the given `time`.
2425
`time` has to be available in the data.
2526
- `available_times` (`available_dates`): to list all the `times` (`dates`) over
@@ -66,6 +67,50 @@ are composed.
6667
Composing multiple input variables is currently only supported with the
6768
`InterpolationsRegridder`, not with `TempestRegridder`.
6869

70+
Sometimes, the time development of a variable is split across multiple NetCDF
71+
files. `DataHandler` knows how to combine them and treat multiple files as if
72+
they were a single one. To use this feature, just pass the list of NetCDF files
73+
(while the files don't have to be sorted, it is good practice to pass them sorted
74+
in ascending order by time).
75+
76+
#### Heuristics to do-what-you-mean
77+
78+
`DataHandler` tries to interpret the files provided and identify if they are
79+
split across variables or along the time dimension. The heuristics implemented are
80+
the following:
81+
- When just one file is passed, it is assumed that it contains everything.
82+
- When multiple files are passed, `DataHandler` will assume that the files are
83+
split along variables if the number of files is the same as the number of
84+
variables, otherwise, it will assume that each file contains all the variables
85+
for a portion of the total time.
86+
- When the above assumption is incorrect, you can pass a list of lists of files
87+
that fully specifies variables and times.
88+
89+
For example,
90+
```julia
91+
data_handler = DataHandling.DataHandler(
92+
["era1980.nc", "era1981.nc"],
93+
["lai_hv", "lai_lv"],
94+
target_space;
95+
compose_function = (x, y) -> x + y,
96+
)
97+
```
98+
99+
In this case, `DataHandler` will incorrectly assume that `lai_hv` is contained
100+
in `era1980.nc`, and `lai_lv` is in `era1981.nc`. Instead, construct the
101+
`data_handler` by passing a list of lists
102+
```julia
103+
files = ["era1980.nc", "era1981.nc"]
104+
data_handler = DataHandling.DataHandler(
105+
[files, files],
106+
["lai_hv", "lai_lv"],
107+
target_space;
108+
compose_function = (x, y) -> x + y,
109+
)
110+
```
111+
where each element of the list is the collection of files that contain the time
112+
evolution of that variable.
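
The heuristics described in this section can be sketched as a small standalone function. This is only an illustration: `normalize_file_paths` is a hypothetical name that does not exist in ClimaUtilities, and the sketch assumes that a single string is first wrapped in a one-element list. It returns the list-of-lists form, where element `i` is the list of files for variable `i`.

```julia
# Hypothetical helper (not part of ClimaUtilities) mirroring the heuristics above.
function normalize_file_paths(file_paths, varnames)
    # Assumption: a bare string is treated as a one-element list.
    file_paths isa AbstractString && (file_paths = [file_paths])
    varnames isa AbstractString && (varnames = [varnames])

    # Already a list of lists: the caller was explicit, nothing to guess.
    first(file_paths) isa AbstractString || return file_paths

    if length(file_paths) == length(varnames)
        # Same number of files and variables: assume one file per variable.
        return [[f] for f in file_paths]
    else
        # Otherwise: assume every file is a temporal chunk with all the variables.
        return [copy(file_paths) for _ in varnames]
    end
end

# The ambiguous case from the example above: two yearly files, two variables.
guessed = normalize_file_paths(["era1980.nc", "era1981.nc"], ["lai_hv", "lai_lv"])

# Being explicit with a list of lists avoids the wrong guess.
files = ["era1980.nc", "era1981.nc"]
explicit = normalize_file_paths([files, files], ["lai_hv", "lai_lv"])
```

In this sketch, `guessed` pairs each file with a single variable, which is the incorrect interpretation discussed above, while `explicit` passes through unchanged.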
113+
69114
## Example: Linear interpolation of a single data variable
70115

71116
As an example, let us implement a simple linear interpolation for a variable `u`
@@ -108,6 +153,11 @@ function linear_interpolation(data_handler, time)
108153
end
109154
```
110155

156+
If, for example, the data was split across multiple files named `era5_1980.nc`,
157+
`era5_1981.nc`, ... (e.g., each file containing one year), we could directly
158+
pass the list to the constructor for `DataHandler` (instead of just passing one
159+
file path), which knows how to combine them.
160+
111161
### Example appendix: Using multiple input data variables
112162

113163
Suppose that the input NetCDF file `era5_example.nc` contains two variables `u`

docs/src/inputs.md

Lines changed: 24 additions & 1 deletion
@@ -76,7 +76,7 @@ compose_function = (x, y) -> x + y
7676
# Define pre-processing function to convert units of input
7777
unit_conversion_func = (data) -> 1000 * data
7878

79-
data_handler = TimeVaryingInputs.TimeVaryingInput("era5_example.nc",
79+
timevaryinginput = TimeVaryingInputs.TimeVaryingInput("era5_example.nc",
8080
["u", "v"],
8181
target_space,
8282
start_date = Dates.DateTime(2000, 1, 1),
@@ -88,6 +88,29 @@ data_handler = TimeVaryingInputs.TimeVaryingInput("era5_example.nc",
8888
The same arguments (excluding `start_date`) could be passed to a
8989
`SpaceVaryingInput` to compose multiple input variables with that type.
9090

91+
#### Example: Data split across multiple NetCDF files
92+
93+
Often, large datasets come chunked, meaning that the data is split across
94+
multiple files with each file containing only a subset of the time interval.
95+
`TimeVaryingInput`s know how to combine data across multiple files as if it were
96+
provided in a single file. To use this feature, just pass the list of file
97+
paths. While it is not required for the files to be in order, it is good
98+
practice to pass them in ascending order by time.
99+
100+
For example:
101+
```julia
102+
timevaryinginput = TimeVaryingInputs.TimeVaryingInput(["era5_1980.nc", "era5_1981.nc"],
103+
"u",
104+
target_space,
105+
start_date = Dates.DateTime(1980, 1, 1),
106+
regridder_type = :InterpolationsRegridder
107+
)
108+
```
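If the file names embed a zero-padded year, as in the illustrative `era5_YYYY.nc` names used here, a plain lexicographic sort is one simple way to obtain the ascending-by-time order. This is a sketch under that naming assumption; other naming schemes may need a custom sort key.

```julia
# Illustrative file names, deliberately out of order.
file_names = ["era5_1981.nc", "era5_1980.nc", "era5_1982.nc"]

# With zero-padded years, lexicographic order coincides with time order.
sorted_files = sort(file_names)
```

The resulting `sorted_files` vector can then be passed as the first argument of the constructor.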
109+
110+
This capability is only available for the `InterpolationsRegridder`.
111+
112+
Read more about this feature in the page about [`DataHandler`](@ref).
113+
91114
### Extrapolation boundary conditions
92115

93116
`TimeVaryingInput`s can have multiple boundary conditions for extrapolation. By

ext/DataHandlingExt.jl

Lines changed: 108 additions & 39 deletions
@@ -93,9 +93,40 @@ struct DataHandler{
9393
preallocated_read_data::PR
9494
end
9595

96+
97+
"""
98+
_check_file_paths_varnames(file_paths, varnames, regridder_type, compose_function)
99+
100+
Check consistency of `file_paths`, `varnames`, `regridder_type`, and `compose_function` for
101+
our current `DataHandler`.
102+
"""
103+
function _check_file_paths_varnames(
104+
file_paths,
105+
varnames,
106+
regridder_type,
107+
compose_function,
108+
)
109+
# Verify that the number of file paths and variable names are consistent
110+
if length(varnames) == 1
111+
# Multiple files are not supported by TempestRegridder
112+
(length(file_paths) > 1 && regridder_type == :TempestRegridder) &&
113+
error("TempestRegridder does not support multiple input files")
114+
else
115+
# We have multiple variables
116+
# This is not supported by TempestRegridder
117+
regridder_type == :TempestRegridder &&
118+
error("TempestRegridder does not support multiple input variables")
119+
120+
# We need a compose_function when passed multiple variables
121+
compose_function == identity && error(
122+
"`compose_function` must be specified when using multiple input variables",
123+
)
124+
end
125+
end
126+
96127
"""
97-
DataHandler(file_paths::Union{AbstractString, AbstractArray{<:AbstractString}},
98-
varnames::Union{AbstractString, AbstractArray{<:AbstractString}},
128+
DataHandler(file_paths,
129+
varnames,
99130
target_space::ClimaCore.Spaces.AbstractSpace;
100131
start_date::Dates.DateTime = Dates.DateTime(1979, 1, 1),
101132
regridder_type = nothing,
@@ -104,8 +135,14 @@ end
104135
file_reader_kwargs = ())
105136
106137
Create a `DataHandler` to read `varnames` from `file_paths` and remap them to `target_space`.
107-
`file_paths` may contain either one path for all variables or one path for each variable.
108-
In the latter case, the entries of `file_paths` and `varnames` are expected to match based on position.
138+
139+
This function supports reading across multiple files and composing variables that are in
140+
different files.
141+
142+
143+
`file_paths` may contain either one path for all variables or one path for each variable. In
144+
the latter case, the entries of `file_paths` and `varnames` are expected to match based on
145+
position.
109146
110147
The DataHandler maintains an LRU cache of Fields that were previously computed.
111148
@@ -114,7 +151,21 @@ Creating this object results in the file being accessed (to preallocate some mem
114151
Positional arguments
115152
=====================
116153
117-
- `file_paths`: Paths of the NetCDF file(s) that contain the input data.
154+
- `file_paths`: Paths of the NetCDF file(s) that contain the input data. `file_paths` should
155+
be as "do-what-I-mean" as possible, meaning that it should behave as you expect.
156+
157+
To be specific, there are three options for `file_paths`:
158+
- It is a string that points to a single NetCDF file.
159+
- It is a list that points to multiple NetCDF files. In this case, we support two modes:
160+
1. if `varnames` is a vector with the same number of entries as `file_paths`, we assume that
161+
each file contains a different variable.
162+
2. otherwise, we assume that each file contains all the variables and is a temporal chunk.
163+
- It is a list of lists of paths to NetCDF files, where the inner list identifies temporal
164+
chunks of a given variable, and the outer list identifies different variables
165+
(supporting the mode where different variables live in different files and their time
166+
development is split across multiple files). In other words, `file_paths[i]` is the list
167+
of files that define the temporal evolution of `varnames[i]`
168+
118169
- `varnames`: Names of the datasets in the NetCDF that have to be read and processed.
119170
- `target_space`: Space where the simulation is run, where the data has to be regridded to.
120171
@@ -138,13 +189,16 @@ everything more type stable.)
138189
It can be a NamedTuple, or a Dictionary that maps Symbols to values.
139190
- `file_reader_kwargs`: Additional keywords to be passed to the constructor of the file reader.
140191
It can be a NamedTuple, or a Dictionary that maps Symbols to values.
141-
- `compose_function`: Function to combine multiple input variables into a single data variable.
142-
The default, to be used in the case of one input variable, is the identity.
143-
Note that the order of `varnames` must match the argument order of `compose_function`.
192+
- `compose_function`: Function to combine multiple input variables into a single data
193+
variable. The default, to be used in the case of one input variable,
194+
is the identity. The compose function has to take N arguments, where
195+
N is the number of variables in `varnames`, and return a scalar.
196+
The order of the arguments in `compose_function` has to match the order
197+
of `varnames`. This function will be broadcast over the data read from file.
144198
"""
145199
function DataHandling.DataHandler(
146-
file_paths::Union{AbstractString, AbstractArray{<:AbstractString}},
147-
varnames::Union{AbstractString, AbstractArray{<:AbstractString}},
200+
file_paths,
201+
varnames,
148202
target_space::ClimaCore.Spaces.AbstractSpace;
149203
start_date::Union{Dates.DateTime, Dates.Date} = Dates.DateTime(1979, 1, 1),
150204
regridder_type = nothing,
@@ -182,38 +236,51 @@ function DataHandling.DataHandler(
182236
varnames = [varnames]
183237
end
184238

185-
# Verify that the number of file paths and variable names match
186-
(length(file_paths) > 1 && length(file_paths) != length(varnames)) && error(
187-
"Number of file paths ($(length(file_paths))) and variable names ($(length(varnames))) do not match.",
188-
)
189-
190-
# Verify that `compose_function` is specified when using multiple input variables
191-
(length(varnames) > 1 && compose_function == identity) && error(
192-
"`compose_function` must be specified when using multiple input variables",
193-
)
194-
195-
# Verify that `compose_function` is identity when using a single input variable
196-
(length(varnames) == 1 && compose_function != identity) && error(
197-
"`compose_function` must be identity when using a single input variable",
198-
)
199-
200-
# TempestRegridder does not support multiple input variables
201-
(length(varnames) > 1 && regridder_type == :TempestRegridder) &&
202-
error("TempestRegridder does not support multiple input variables")
203-
204239
# Determine which regridder to use if not already specified
205240
regridder_type =
206241
isnothing(regridder_type) ? Regridders.default_regridder_type() :
207242
regridder_type
208243

209-
# Construct a file reader, which deals with ingesting data and is possibly buffered/cached, for each input file
210-
file_readers = Dict{String, AbstractFileReader}()
211-
all_vars_in_same_file = length(file_paths) == 1
212-
for (i, varname) in enumerate(varnames)
213-
file_path = all_vars_in_same_file ? first(file_paths) : file_paths[i]
214-
file_readers[varname] =
215-
NCFileReader(file_path, varname; file_reader_kwargs...)
244+
_check_file_paths_varnames(
245+
file_paths,
246+
varnames,
247+
regridder_type,
248+
compose_function,
249+
)
250+
251+
# We have to deal with the case where we have 1 FileReader (with possibly multiple files),
252+
# or with N FileReaders (for when variables are split across files, and with possibly
253+
# multiple files). To accommodate all these cases, we cast everything into the format
254+
# where we have a list of lists, where the outer list is along variable names, and the
255+
# inner list is along times. This is the most general input we expect from this
256+
# constructor.
257+
258+
is_file_paths_list_of_lists = !(first(file_paths) isa AbstractString)
259+
260+
if !is_file_paths_list_of_lists
261+
# If file_paths is not already a list of lists, we have two options:
262+
# 1. file_paths identifies the temporal development of the variables
263+
# 2. file_paths identifies different variables
264+
265+
# We use as heuristic that when the number of files provided is the same as the
266+
# number of variables, that means that the files include different variables
267+
if length(file_paths) == length(varnames)
268+
# One file per variable
269+
file_paths = [[f] for f in file_paths]
270+
else
271+
# Every file per every variable
272+
file_paths = [copy(file_paths) for _ in varnames]
273+
end
216274
end
275+
# Now, we have a list of lists, where file_paths[i] is the list of files that define the
276+
# temporal evolution of varnames[i]
277+
278+
# Construct the file readers, which deal with ingesting data and are possibly
279+
# buffered/cached, for each variable
280+
file_readers = Dict(
281+
varname => NCFileReader(paths, varname; file_reader_kwargs...) for
282+
(paths, varname) in zip(file_paths, varnames)
283+
)
217284

218285
# Verify that the spatial dimensions are the same for each variable
219286
@assert length(
@@ -248,9 +315,11 @@ function DataHandling.DataHandler(
248315
regridder_kwargs = merge((; regrid_dir), regridder_kwargs)
249316
end
250317

251-
# Note: using one arbitrary element of `varnames` and of `file_paths` assumes
252-
# that all input variables will use the same regridding
253-
regridder_args = (target_space, first(varnames), first(file_paths))
318+
# Note: using one arbitrary element of `varnames` and of `file_paths`
319+
# assumes that all input variables will use the same regridding (there
320+
# are two firsts in file_paths because we now have a list of lists)
321+
regridder_args =
322+
(target_space, first(varnames), first(first(file_paths)))
254323
elseif regridder_type == :InterpolationsRegridder
255324
regridder_args = (target_space,)
256325
end
