Skip to content

Add support for loading datasets from Hugging Face using from_huggingface_oscur#49

Merged
simonprovost merged 5 commits intomainfrom
direct_hf_load
May 10, 2025
Merged

Add support for loading datasets from Hugging Face using from_huggingface_oscur#49
simonprovost merged 5 commits intomainfrom
direct_hf_load

Conversation

@soniacq
Copy link
Contributor

@soniacq soniacq commented May 2, 2025

WIP
Add support for loading datasets from Hugging Face using from_huggingface_oscur method in LoaderFactory


📚 Documentation preview 📚: https://UrbanMapper--49.org.readthedocs.build/en/49/

@soniacq soniacq requested a review from simonprovost May 2, 2025 15:37
Copy link
Member

@simonprovost simonprovost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So needed feature to Urban Mapper ! Thanksss @soniacq for this one 1️⃣

Overall, nice PR. Let's just turn it to be more of a wide scope being able to target any datasets from any HF hub repository.

More work will need to be done for any potential merge as per the comments below, but here are a couple I could not make throughout the code:

  • Re working the loader basic jupyter notebook so that: Following #45 it will need to show this current better integration or show both, so that users can do it themselves if they want, I guess it's not bad to show both ways (our integration, and a manual one as per #45). Your call @soniacq. <-------- A couple of lines of code to add to the Notebook.
  • Update the documentation guide for the loader to show a full section on this current nice integration of HF. See more in docs/user-guide/modules/1-loaders.md. <-------- One to three paragraphs at most and one title.
  • Update the API reference point to the Loader_Factory now that we have a new from_X method thanks to your PR. See more in docs/api/loaders.md. <------ One line.

Cheers and thanks once more 🎉

PS: Feel free to resolve the comments when they are done, it notifies my email so that I can come back as soon as possible to review again!

@simonprovost
Copy link
Member

simonprovost commented May 7, 2025

Hey @soniacq ! Thanks so so much for this one this is so cool 😎

Was just passing by seeing if you let me a comment for me the re review. So I'll take the following opportunity:

I can see the compilation does not go through. Easy peasy ! Now that the integration of HF will be mandatory in the Loader module – you will need to add datasets to the dependencies. To do so: uv add datasets should be enough :) Modifying therefore the pyproject.tml (git commit -m "core: add datasets from HF").

Let me know if you have any question (on discord is all good as always)

Cheers

@simonprovost
Copy link
Member

simonprovost commented May 8, 2025

@soniacq Myyyy mistake for missing to also say that when adding a new package; the requirements txt files also need update (on the long term this will vanish).

  1. Update requirements.txt (for people not using UV)
uv pip compile pyproject.toml -o requirements.txt    
  1. Update requirements-dev.txt (for ReadTheDoc to compile the doc)
 uv export --dev --no-hashes | awk '{print $1}' FS=' ;' > requirements-dev.txt

Note

The above (1) will vanish when PyPi will get Urban Mapper first release. (2) will never vanish as this is how Read The Doc works. Unless, this appears to be merged at some point: astral-sh/uv#10074

Next review I intend to look at the generated preview documentation (When it compiles) 👌

Cheers

@soniacq
Copy link
Contributor Author

soniacq commented May 8, 2025

@simonprovost do we really need dill>=0.3.9?.
There is a version conflict between the datasets package and the dill package. More info here.
According to the error message:

  • datasets versions from 2.14.5 up to <2.17.0 require dill>=0.3.0,<0.3.8.
  • datasets versions 2.17.0 and above require dill>=0.3.0,<0.3.9.

UrbanMapper explicitly asks for dill>=0.3.9, which is incompatible with all available datasets versions.
Can we downgrade dill to <0.3.8to solve this?

@simonprovost
Copy link
Member

@simonprovost do we really need dill>=0.3.9?. There is a version conflict between the datasets package and the dill package. More info here. According to the error message:

  • datasets versions from 2.14.5 up to <2.17.0 require dill>=0.3.0,<0.3.8.
  • datasets versions 2.17.0 and above require dill>=0.3.0,<0.3.9.

UrbanMapper explicitly asks for dill>=0.3.9, which is incompatible with all available datasets versions. Can we downgrade dill to <0.3.8to solve this?

@soniacq Go for it! I do not think this will cause any issue to the saved Urban Pipeline a alpha version ✅

Cheers

@simonprovost
Copy link
Member

simonprovost commented May 8, 2025

All right @soniacq, amazing PR thanks for this contribution, certainly is going to be helpful to more than one (I hope so!) 💪

Let me summarize what I've modified (as agreed upon on Discord):

  • Docstring of from_huggingface(.) is much more flourished, sort of guiding the users with what is it, where to look, taking the opportunity to promote OSCUR datasets.

See here: https://urbanmapper--49.org.readthedocs.build/en/49/api/loaders/#urban_mapper.modules.loader.LoaderFactory.from_huggingface

Feel free to modify, I am a bit tired and am not I guess an expert with doc either so if changes needed to be done feel free to do it ✅

  • When catching exceptions, we now show the best matches for a potential typos by simply looking through all the datasets in HF hub. Note that it could take time, around 2 minutes so I introduced a new quiet debug parameter in case and it is explained in the doctoring.

Next, and you did not know about that. I reworked the history entirely because it was becoming a spaghetti kind of history with multiple commits changing the very same file which in practice is not best to do given that it adds noisy commits. If we do that on 100 PRs then the history will not make sense anymore. I recommend this article for fixing up stuff next time but no rush: https://github.com/TheAssemblyArmada/Thyme/wiki/Using-Fixup-Commits. Meanwhile, we went from 10 commits to three therefore. To avoid having the "contribution" reset, feel free to reset HEAD~3 and redo the commits if you want, that's no problem even highly recommended, but I guess we do not need more than three commits here just so you know.

Steps:

git reset --soft HEAD~n # N being three I believe

Then commit again, to have a clean history, what I pushed is like the example of a clean history.

Lastly, I have successfully rebase with main (git checkout main && git pull && git checkout - && git rebase main) and lucky me there was zero conflicts 🎉 So now if you git pull you are up to date with main.

As a result, to merge the PR, the last thing we need to do is now updating the Loader Basic example notebook in reworking it all so that the from_huggingface(.) is introduced. In the User Guide you did it well but in the notebook we also need to do it. May you do this and push indeed here a new commit with the changes made to the notebook please ? 💪

Cheers

@simonprovost
Copy link
Member

@soniacq Just so you are aware. Do not forget to git checkout main && git pull && git checkout - && git rebase main the #50 been merged thanks to your various discoveries!

soniacq added 2 commits May 9, 2025 11:52
…LoaderFactory to ensure latitude and longitude columns are specified.
@simonprovost simonprovost linked an issue May 10, 2025 that may be closed by this pull request
12 tasks
@simonprovost simonprovost removed a link to an issue May 10, 2025
12 tasks
Copy link
Member

@simonprovost simonprovost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brillant! Love this PR so much, thanks @soniacq for this addition to UM 🎉

Merging now, we've covered enough and seems to be pretty reliable so let's gooo!

Cheers!!!!

@simonprovost simonprovost merged commit ab3cda2 into main May 10, 2025
8 checks passed
@soniacq soniacq deleted the direct_hf_load branch June 23, 2025 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants