
Calling cfgrib.open_datasets() with indexpath='' can have severe performance impacts with heterogeneous GRIBS #420

@Xyrus-WS

What happened?

As the title states, if you tell cfgrib.open_datasets() not to generate index files (indexpath='') for heterogeneous GRIB files, there is a severe performance penalty. The larger the dataset and the more hypercubes there are, the worse the performance becomes.

After briefly looking into the code, it looks like when indexpath is set to the empty string, the index is regenerated for every hypercube cfgrib finds in the GRIB. Depending on the size and complexity of the GRIB, this can cause a substantial slowdown when loading the datasets. In my case, a file that took 15 seconds to load when cfgrib was allowed to generate an index took almost 800 seconds to load with indexpath set to an empty string.
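To make the scaling argument concrete, here is a toy cost model. It is not cfgrib code. It assumes that building an index means reading every message in the file once, and that without a reusable index this happens once per hypercube. The function name and the numbers are hypothetical.

```python
def count_message_reads(n_messages, n_hypercubes, reuse_index):
    """Toy cost model (not cfgrib code): one index build reads every message once."""
    # Assumption: with a reusable index the file is scanned a single time;
    # without one, the scan is repeated for each hypercube.
    index_builds = 1 if reuse_index else n_hypercubes
    return index_builds * n_messages


# Hypothetical file: 10,000 messages split across 50 hypercubes.
print(count_message_reads(10_000, 50, reuse_index=True))
print(count_message_reads(10_000, 50, reuse_index=False))
```

Under these assumptions the work grows with the number of hypercubes, which would match the observation that more heterogeneous files suffer more.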

The workaround I'm using to avoid index files is to use Python's tempfile module to create a named temporary file and set indexpath to its name. Temporary files are local, writable, and are cleaned up natively by the OS.
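A variation on this workaround, as a minimal sketch: put the index in a temporary directory so that cleanup happens deterministically when the block exits instead of relying on the OS. The helper names temporary_indexpath and open_with_temporary_index are hypothetical, and the sketch assumes the datasets are fully used (or loaded into memory) before the directory is removed.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_indexpath(grib_file):
    """Yield an index path inside a throwaway directory.

    The directory, and any index file written into it, is removed on exit.
    """
    with tempfile.TemporaryDirectory(prefix="cfgrib-idx-") as tmpdir:
        yield os.path.join(tmpdir, os.path.basename(grib_file) + ".idx")


def open_with_temporary_index(grib_file, process):
    """Open grib_file with a temporary index and pass the datasets to `process`."""
    # Imported lazily so temporary_indexpath works without cfgrib installed.
    import cfgrib

    with temporary_indexpath(grib_file) as indexpath:
        datasets = cfgrib.open_datasets(grib_file, backend_kwargs={"indexpath": indexpath})
        # Do all work here, while the index file still exists.
        return process(datasets)
```

For example, open_with_temporary_index("some_big_heterogeneous_grib.grb", len) would return the number of datasets found, leaving no index file behind.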

Unfortunately, I'm not allowed to share the GRIB files.

What are the steps to reproduce the bug?

import os
import tempfile
import time

import cfgrib

grib_file = "some_big_heterogeneous_grib.grb"

# What I use to avoid index files hanging around: point indexpath at a temporary file name
dt = time.perf_counter()
kwargs = {'indexpath': tempfile.NamedTemporaryFile(prefix=os.path.basename(grib_file), suffix='.idx', mode='w').name}
datasets = cfgrib.open_datasets(grib_file, backend_kwargs=kwargs)
print("cfgrib index generation " + str(time.perf_counter() - dt) + "s")

# Same file with index files disabled
dt = time.perf_counter()
kwargs = {'indexpath': ''}
datasets = cfgrib.open_datasets(grib_file, backend_kwargs=kwargs)
print("cfgrib no index generation " + str(time.perf_counter() - dt) + "s")

Version

0.9.15

Platform (OS and architecture)

Ubuntu Linux 20.04

