Description
What happened?
As the title states, if you set cfgrib.open_datasets() to not generate index files (indexpath='') for heterogeneous GRIB files, there is a severe performance penalty. The larger the dataset and the more hypercubes it contains, the worse the performance becomes.
After briefly looking into the code, it appears that when indexpath is set to the empty string, the index is regenerated for every hypercube cfgrib finds in the GRIB file. Depending on the size and complexity of the GRIB, this can substantially slow down loading the datasets. In my case, a file that took 15 seconds to load when letting cfgrib generate an index took almost 800 seconds to load when indexpath was set to an empty string.
The workaround I'm using to avoid index files is to use Python's tempfile module to create a named temporary file and set indexpath to that path. Temporary files are local, writable, and natively cleaned up by the OS.
Unfortunately, I'm not allowed to share the GRIB files.
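For reference, the workaround can be packaged as a small helper that builds the backend_kwargs dict. This is a minimal sketch, not part of cfgrib; the helper name temp_index_kwargs is hypothetical. It reserves a path in the system temp directory and lets cfgrib write its index there, so no .idx file lingers next to the GRIB:

```python
import os
import tempfile


def temp_index_kwargs(grib_path):
    # Hypothetical helper: reserve a path in the system temp directory for
    # cfgrib's index file. The temporary file itself is deleted on close;
    # cfgrib recreates the index at that path, and OS temp cleanup
    # eventually removes it, as noted above.
    with tempfile.NamedTemporaryFile(
        prefix=os.path.basename(grib_path) + ".", suffix=".idx", mode="w"
    ) as tmp:
        name = tmp.name
    return {"indexpath": name}
```

Usage would then be cfgrib.open_datasets(grib_file, backend_kwargs=temp_index_kwargs(grib_file)).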
What are the steps to reproduce the bug?
import os
import tempfile
import time

import cfgrib

grib_file = "some_big_heterogeneous_grib.grb"

# Workaround: point the index at a named temporary file so no .idx files hang around
dt = time.perf_counter()
kwargs = {'indexpath': tempfile.NamedTemporaryFile(prefix=os.path.basename(grib_file), suffix='.idx', mode='w').name}
datasets = cfgrib.open_datasets(grib_file, backend_kwargs=kwargs)
print("cfgrib index generation " + str(time.perf_counter() - dt) + "s")

# Slow path: disable index files entirely
dt = time.perf_counter()
kwargs = {'indexpath': ''}
datasets = cfgrib.open_datasets(grib_file, backend_kwargs=kwargs)
print("cfgrib no index generation " + str(time.perf_counter() - dt) + "s")
Version
0.9.15
Platform (OS and architecture)
Ubuntu Linux 20.04
Relevant log output
Accompanying data
No response
Organisation
No response