-
|
Hi @wmay, thanks for your comment.
Totally agree. In fact, the original hashing implementation was based on the search string. I changed to the current implementation because, as you point out, different search strings can give you the same result, and I wanted to avoid downloading duplicate data as much as possible. You have a good point, though: we could put the responsibility on the user to be consistent. (I'm not consistent, which is why I went with reading the inventory file.)
Somewhat related: some users have requested a feature to download local copies of the index files and read those if they exist rather than looking online. This makes a lot of sense because they are small files. If a local copy of the inventory always existed for previously downloaded GRIB files, Herbie could hash on that rather than looking online again. This doesn't really help with your interest in lazy loading without downloading, unless you fetched the index files as part of the lazy load.
I've also had the idea of a "build-your-own-grib" feature where you define the dates, lead times, and variables you want; Herbie would then read the necessary index files, download each subset, and pack all your data into one or more files.
I'm still thinking out loud a bit, but I think figuring this out is worthwhile.
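To make the inventory-based hashing idea concrete, here is a minimal sketch, not Herbie's actual code. The inventory lines are made up, and `matched_messages` and `inventory_hash` are hypothetical helpers. It hashes the set of matched GRIB messages rather than the search string, so two differently written searches that select the same messages get the same label.

```python
import hashlib
import re

# Made-up inventory lines in a "number:byte:date:variable:level:forecast" layout.
INDEX_LINES = [
    "1:0:d=2023010100:TMP:2 m above ground:anl:",
    "2:5000:d=2023010100:UGRD:10 m above ground:anl:",
    "3:9000:d=2023010100:VGRD:10 m above ground:anl:",
]


def matched_messages(search: str, lines=INDEX_LINES) -> list[int]:
    """GRIB message numbers whose inventory line matches the regex `search`."""
    return [int(line.split(":", 1)[0]) for line in lines if re.search(search, line)]


def inventory_hash(search: str) -> str:
    """Hypothetical label built from the matched messages; needs the index lines."""
    key = ",".join(map(str, matched_messages(search)))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:6]


# Two different search strings that select the same two messages.
a, b = ":[UV]GRD:10 m", ":(UGRD|VGRD):10 m"
print(matched_messages(a), matched_messages(b))
print(inventory_hash(a) == inventory_hash(b))
```

The trade-off is visible in the signature: `inventory_hash` cannot run until the index lines are available, which is exactly the dependency discussed in this thread.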
-
|
Hm. To the extent that people download duplicate data with different searches, even the current code doesn't seem ideal: if 5 of 6 variables are the same but 1 is different, you'll get a whole new file regardless. In that sense the hashing and the deduplicating seem to be in tension with each other.
Definitely, if the code relies on the original index file, it makes sense to save it, IMO. We have to download it anyway, and it's small. And then, when the index file is missing, we can infer that the data file doesn't exist without needing a hash.
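A rough sketch of the "save the index file" idea, assuming a naming convention where the index sits next to the GRIB file with `.idx` appended. The function names and the `fetch_remote` callback are hypothetical, not part of Herbie.

```python
from pathlib import Path
from typing import Callable


def index_path_for(grib_path: Path) -> Path:
    """Assumed convention: the index is stored beside the GRIB file as '<name>.idx'."""
    return grib_path.with_name(grib_path.name + ".idx")


def read_index(grib_path: Path, fetch_remote: Callable[[], str]) -> str:
    """Return the inventory text, preferring a saved local copy over the network."""
    idx = index_path_for(grib_path)
    if idx.exists():
        return idx.read_text()
    text = fetch_remote()  # only reached when no local copy is saved
    idx.parent.mkdir(parents=True, exist_ok=True)
    idx.write_text(text)
    return text


def data_is_local(grib_path: Path) -> bool:
    """With no saved index, assume the data file was never downloaded."""
    return index_path_for(grib_path).exists() and grib_path.exists()
```

With this layout, a missing `.idx` file answers "do I already have this data?" without any network request or hash calculation.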
-
|
I'd be happy to send a bunch of pull requests along these lines. Just want to make sure you're in agreement since these are significant changes. From this discussion specifically, I see two potential changes (which are completely independent of each other) --
As I'm modifying Herbie to improve it for my own use, I'm wondering whether these mods make the most sense as another package, a fork, or something else. Mostly it seems to make more sense for them to be incorporated back into Herbie, if that fits with what you're trying to do.
-
|
OK, I see; this is a complicated set of constraints.
This comes full circle: it's awkward because the Herbie init method downloads files before it can calculate the paths. So I end up modifying a bunch of the Herbie internals to work around that, which isn't the best long-term solution but is fine for now. Those two suggestions look good, thanks for pointing those out. I'm not a Windows user, so I'm actually kind of surprised by the wgrib2 issue. wgrib2 is on conda, and isn't the whole point of conda to make installation easy on Windows? Oh, I just saw that there's no conda Windows installer for it, wow.
-
I've been trying some modifications of Herbie, and the hash turned out to be an obstacle that prevents some common-sense improvements. Are you open to changing it?
The problem is that the hash for the search string requires an index file. And you need that hash to create the file paths. So you can't even check for a file locally without downloading an unnecessary index file to calculate the hash.
For some uses that can waste a huge amount of time. But it also causes downstream problems -- for example, I'd like to create a version of Herbie with lazy loading, so files aren't downloaded in the init method. That turns out to be pretty complicated in part because of the hash.
As a very simple alternative, Herbie could directly hash the search string, without using any index file. As far as I know this would only lead to different behavior when someone uses multiple search strings that have exactly the same results. That seems like a marginal case that Herbie could safely ignore, with the benefit that it would save a bunch of time in many other cases.
This would also go at least partway toward solving #169.
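For illustration, hashing the search string directly could look something like the sketch below. The `subset_<hash>__<filename>` pattern, the example path, and both function names are assumptions made for the example, not Herbie's real naming or API. The point is only that the local path can be computed with no index file and no network access.

```python
import hashlib
from pathlib import Path


def search_hash(search: str, length: int = 6) -> str:
    """Short, deterministic label derived only from the search string."""
    return hashlib.sha1(search.encode("utf-8")).hexdigest()[:length]


def subset_path(full_file: Path, search: str) -> Path:
    """Hypothetical local path for a subset of `full_file`; needs no index file."""
    return full_file.with_name(f"subset_{search_hash(search)}__{full_file.name}")


# The path is known up front, so a local existence check can come first.
path = subset_path(Path("data/hrrr.t00z.wrfsfcf00.grib2"), ":TMP:2 m above")
print(path.name, path.exists())
```

Two search strings that happen to select the same messages would map to two different files here, which is the marginal case described above.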