The search hash is an obstacle to efficiency #422

wmay · 2025-04-04T20:20:36Z

wmay
Apr 4, 2025

I've been trying some modifications of Herbie, and the hash has turned out to be an obstacle that prevents some common-sense improvements. Are you open to changing it?

The problem is that the hash for the search string requires an index file. And you need that hash to create the file paths. So you can't even check for a file locally without downloading an unnecessary index file to calculate the hash.

For some uses that can waste a huge amount of time. It also causes downstream problems. For example, I'd like to create a version of Herbie with lazy loading, so files aren't downloaded in the init method. That turns out to be pretty complicated, in part because of the hash.

As a very simple alternative, Herbie could directly hash the search string, without using any index file. As far as I know this would only lead to different behavior when someone uses multiple search strings that have exactly the same results. That seems like a marginal case that Herbie could safely ignore, with the benefit that it would save a bunch of time in many other cases.
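To make the proposal concrete, here is a minimal sketch of hashing the search string directly. The function name `subset_path`, the `subset_<label>__<file>` naming pattern, and the 4-byte digest size are assumptions for illustration, not Herbie's actual code.

```python
import hashlib
from pathlib import Path


def subset_path(save_dir: Path, grib_name: str, search: str) -> Path:
    """Build a local subset path from the search string alone (no index file needed)."""
    # Short, stable digest of the raw search string; digest_size=4 is an arbitrary choice.
    label = hashlib.blake2b(search.encode("utf-8"), digest_size=4).hexdigest()
    return save_dir / f"subset_{label}__{grib_name}"


# Hypothetical directory and file name, used only to show the call.
path = subset_path(Path("data/hrrr/20250404"), "hrrr.t00z.wrfsfcf00.grib2", ":TMP:2 m")
print(path.name)
print(path.exists())  # a purely local check, no network access involved
```

Because the path depends only on the search string and the file name, the local check can run before anything is downloaded.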

This would also go at least partway toward solving #169.

blaylockbk · 2025-04-05T05:25:23Z

blaylockbk
Apr 5, 2025
Maintainer

Hi @wmay, thanks for your comment.

> The problem is that the hash for the search string requires an index file. And you need that hash to create the file paths. So you can't even check for a file locally without downloading an unnecessary index file to calculate the hash.

Totally agree. In fact, the original hashing implementation was based on the search string. I changed to the current implementation because, as you point out, different search strings can give you the same result, and I wanted to avoid downloading duplicate data as much as possible.

You have a good point, though: we could put the responsibility on the user to be consistent. (I'm not consistent, which is why I went with reading the inventory file.)
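The trade-off between the two approaches can be sketched as follows. The index lines are an illustrative sample in the wgrib2 style, and `result_hash` and `string_hash` are hypothetical helpers, not Herbie's implementation; the sketch assumes the result-based hash is derived from the matched message numbers.

```python
import hashlib
import re

# A few wgrib2-style index lines (illustrative sample, not a real file).
INDEX_LINES = [
    "1:0:d=2025040400:REFC:entire atmosphere:anl:",
    "2:375441:d=2025040400:TMP:2 m above ground:anl:",
    "3:1290102:d=2025040400:UGRD:10 m above ground:anl:",
]


def result_hash(search: str, index_lines: list[str]) -> str:
    """Hash the message numbers matched by `search`, not the search text itself."""
    matched = [line.split(":", 1)[0] for line in index_lines if re.search(search, line)]
    return hashlib.blake2b("-".join(matched).encode(), digest_size=4).hexdigest()


def string_hash(search: str) -> str:
    """Hash the raw search text; needs no index file at all."""
    return hashlib.blake2b(search.encode(), digest_size=4).hexdigest()


a, b = ":TMP:2 m", ":TMP:2 m above ground"
print(result_hash(a, INDEX_LINES) == result_hash(b, INDEX_LINES))  # True
print(string_hash(a) == string_hash(b))
```

Both searches match only message 2, so the result-based labels agree, while the string-based labels come from different inputs and would name two separate files for the same data.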

Somewhat related: some users have requested a feature to download local copies of the index file and read those if they exist rather than looking on the internet. This makes a lot of sense because they are small files. If a local copy of the inventory always existed for previously downloaded GRIB files, then Herbie could hash on that rather than looking online again. This doesn't really help with your interest in lazy loading without downloading, unless you fetched the index files as part of the lazy load.

I've also had the idea of making a "build-your-own-grib" feature where you define the dates, lead times, and variables you want; Herbie would then read the necessary index files, download each subset, and pack all your data into one or more files.

I'm still thinking out loud a bit, but I think figuring this out is worthwhile.

wmay · 2025-04-05T06:05:38Z

wmay
Apr 5, 2025
Author

Hm. To the extent that people download duplicate data with different searches, even the current code doesn't seem ideal: if 5 of 6 variables are the same but 1 is different, you'll get a whole new file regardless. In that sense the hashing and the deduplicating seem to be in tension with each other.

If the code relies on that original index file, it definitely makes sense to save it, IMO: we have to download it anyway, and it's small. Then, when the index file is missing, we can infer that the data file doesn't exist without needing a hash.
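A minimal sketch of that inference, assuming a hypothetical convention where the saved index sits next to the GRIB file as `<name>.idx` (the helper names are made up for illustration):

```python
import tempfile
from pathlib import Path


def local_index_path(grib_path: Path) -> Path:
    # Hypothetical convention: the saved index lives beside the GRIB file as <name>.idx
    return grib_path.with_name(grib_path.name + ".idx")


def may_have_local_data(grib_path: Path) -> bool:
    """No saved index means nothing was downloaded, so no hash needs to be computed."""
    return local_index_path(grib_path).exists()


with tempfile.TemporaryDirectory() as tmp:
    grib = Path(tmp) / "hrrr.t00z.wrfsfcf00.grib2"
    before = may_have_local_data(grib)
    # Simulate an earlier download that saved the index file.
    local_index_path(grib).write_text("1:0:d=2025040400:REFC:entire atmosphere:anl:\n")
    after = may_have_local_data(grib)

print(before, after)  # False True
```

Only when the index is present would the more expensive step of working out which subset file to look for be needed.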

wmay · 2025-04-07T20:34:45Z

wmay
Apr 7, 2025
Author

I'd be happy to send a bunch of pull requests along these lines. Just want to make sure you're in agreement since these are significant changes.

From this discussion specifically, I see two potential changes (which are completely independent of each other):

1. Save the remote index files. We could give them extensions to reflect the source, like .idx.aws, .idx.nomads, etc. to distinguish them from the local files.
2. Avoid duplication. This could be done very simply by taking the hash out of the file name, and checking to see if the data already exists in the file before downloading.

As I'm modifying Herbie to improve it for my own use, I'm wondering whether these mods make the most sense as another package, a fork, or what. Mostly it seems like it makes more sense for them to be incorporated back into Herbie, if it fits with what you're trying to do.

2 replies

blaylockbk Apr 9, 2025
Maintainer

Yes, this would be a significant change. Mods added in a fork could make their way back into Herbie. It might also make sense to make a new package as a Herbie extension; maybe this new package would use Herbie as a dependency just to get the remote paths for the grib and index files. It's up to you.

> Save the remote index files...

I would prefer this be an "opt-in" feature, preferably with the argument Herbie(..., download_index=True). The current behavior to read the remote index file hasn't been an issue for most users and use cases.
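One way the opt-in behavior could look, as a sketch only: `read_index`, its parameters, and the injected `fetch` callable are hypothetical and stand in for whatever Herbie would actually do; `download_index` mirrors the argument name suggested above.

```python
import tempfile
from pathlib import Path
from typing import Callable


def read_index(
    remote_url: str,
    local_path: Path,
    fetch: Callable[[str], str],
    download_index: bool = False,
) -> str:
    """Return index text, preferring a saved local copy when the user opts in."""
    if download_index and local_path.exists():
        return local_path.read_text()
    text = fetch(remote_url)  # default behavior: always read the remote index
    if download_index:
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_text(text)
    return text


calls = []


def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP request so the sketch runs offline.
    calls.append(url)
    return "1:0:d=2025040400:REFC:entire atmosphere:anl:\n"


with tempfile.TemporaryDirectory() as tmp:
    idx = Path(tmp) / "hrrr.t00z.wrfsfcf00.grib2.idx"
    url = "https://example.com/hrrr.t00z.wrfsfcf00.grib2.idx"
    first = read_index(url, idx, fake_fetch, download_index=True)
    second = read_index(url, idx, fake_fetch, download_index=True)

print(len(calls))  # 1
```

With the flag left at its default, every call goes to the remote index, which matches the current behavior described above.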

> ...We could give them extensions to reflect the source, like .idx.aws, .idx.nomads, etc. to distinguish them from the local files.

Herbie assumes the idx files are the same between platforms, so it's probably not necessary to append the source to the file name.

> Avoid duplication. This could be done very simply by taking the hash out of the file name, and checking to see if the data already exists in the file before downloading.

It would be nice if all subsets were appended to a `subset_*.grib2.idx` file, but it's not clear to me how to both check what data is in an existing subset file and check whether any fields matching the user's search string don't need to be redownloaded. This would require wgrib2 to list the contents of the local subset, but wgrib2 is not easily installed on Windows. While eccodes can list the contents, its output doesn't match the wgrib2-style index files. I don't know if it's worth the trouble to handle converting between eccodes-style and wgrib2-style index listings.
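For illustration, here is a sketch of the comparison step, under the assumption that a wgrib2-style listing of the local subset is already available (for example, saved at download time), which would sidestep running wgrib2 locally. The sample lines and helper names are hypothetical, and the sketch assumes no field text contains a colon.

```python
import re

# Remote index (illustrative sample) and the listing of an existing local subset.
REMOTE_INDEX = [
    "1:0:d=2025040400:REFC:entire atmosphere:anl:",
    "2:375441:d=2025040400:TMP:2 m above ground:anl:",
    "3:1290102:d=2025040400:UGRD:10 m above ground:anl:",
]
# In a subset file the message numbers and byte offsets differ from the remote file.
LOCAL_SUBSET_INDEX = ["1:0:d=2025040400:TMP:2 m above ground:anl:"]


def field_key(index_line: str) -> tuple[str, ...]:
    """Identify a field by (date, variable, level, forecast), ignoring number and offset."""
    return tuple(index_line.split(":")[2:6])


def fields_to_download(search: str, remote_lines: list[str], local_lines: list[str]) -> list[str]:
    """Remote lines that match `search` and are not already in the local subset."""
    have = {field_key(line) for line in local_lines}
    return [
        line
        for line in remote_lines
        if re.search(search, line) and field_key(line) not in have
    ]


missing = fields_to_download(":(TMP|UGRD):", REMOTE_INDEX, LOCAL_SUBSET_INDEX)
print(missing)  # ['3:1290102:d=2025040400:UGRD:10 m above ground:anl:']
```

Comparing on field descriptors rather than message numbers matters because a subset file renumbers its messages.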

blaylockbk Apr 9, 2025
Maintainer

...but before you start, I'd recommend looking into these other two projects to see whether they can do what you're trying to do.

wmay · 2025-04-09T19:16:01Z

wmay
Apr 9, 2025
Author

OK I see, this is a complicated set of constraints.

> maybe this new package would use Herbie as a dependency just to get the remote paths for the grib and index files

This comes full circle: it's awkward because the Herbie init method downloads files before it can calculate the paths. So I end up modifying a bunch of the Herbie internals to work around that, which isn't the best long-term solution but is fine for now.

Those two suggestions look good, thanks for pointing those out.

I'm not a Windows user, so I'm actually kind of surprised by the wgrib2 issue. wgrib2 is on conda, and isn't the whole point of conda to make installation easy on Windows? Oh, I just saw that there's no conda Windows installer for it, wow.
