
Huggingface#96

Merged
Innixma merged 2 commits into autogluon:main from geoalgo:huggingface
Feb 22, 2025

Conversation

@geoalgo
Collaborator

@geoalgo geoalgo commented Feb 14, 2025

Issue #, if available: #66

Description of changes:
(I would recommend quickly reading the issue before reviewing the PR.)
The PR uses HF as the storage layer for TabRepo.
The corresponding HF dataset is https://huggingface.co/datasets/Tabrepo/tabrepo

A few things:

  • The download speed is much faster, and the download API offers a couple of useful options (resuming downloads from partial files, etc.)
  • The options are not exactly the same; the previous logic can't be mapped exactly to HF, but the HF options are more standard
  • If we merge this PR, we should simplify BenchmarkPaths at a later point (we just need the list of datasets, not the full list of files). Ideally, I would prefer not to couple this PR with a large refactoring (it took me quite some time to transfer the files and set up the HF repo properly...). Adapting to just use a list of datasets is doable; supporting all download options would be challenging on my end.
  • We could also offer utils to download from HF as an alternative to the current code, which would avoid a refactoring; however, I think we should get rid of the S3 backend, as it is very slow to download all the datasets

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma
Collaborator

Innixma commented Feb 14, 2025

Thanks for the PR! Will try to review and test next week

@Innixma Innixma self-requested a review February 14, 2025 20:51
version: str = "2023_11_14",
force_download: bool = False,
local_files_only: bool = False,
datasets: list[str] | None = None,
Collaborator

Maybe we can also add folds: list[int] | None = None, as an option, in case the user wants to only download the first fold?

Collaborator Author

I wonder if there is a big benefit; if you want to reduce the size, you could just use a smaller dataset list, right?

Collaborator

That is true, but I wonder if it would be easy to add?

Collaborator Author

Yes, we could, but I do not see the benefit; perhaps it can be done in a second step if we see a strong need?

Comment on lines +8 to +20
def upload_hugging_face(version: str, repo_id: str, override_existing_files: bool = True):
    """
    Uploads tabrepo data to a Hugging Face repository.

    You should set your HF_TOKEN environment variable and ask for write access to tabrepo before using the script.

    Args:
        version (str): The version of the data to be uploaded; the folder data/results/{version}/ should
            be present and should contain baselines.parquet, configs.parquet and a model_predictions/ folder.
        repo_id (str): The ID of the Hugging Face repository.
        override_existing_files (bool): Whether to re-upload files if they are already found on Hugging Face.

    Returns:
        None
    """
Collaborator

Note: Currently there is an additional file that is available in s3 after my earlier code refactor:

task_metadata.csv

I'm assuming this is still being fetched from s3 in the PR.
No need to add it now, as I will probably move a few other files to being downloaded instead of being hard-coded into the TabRepo codebase, such as config_hyperparameters.

Just mentioning so we are aware.

Collaborator Author

@geoalgo geoalgo Feb 17, 2025

No, it is downloaded from HF (the following files are downloaded in addition to the predictions; see l87):

    allow_patterns += [
        "*baselines.parquet",
        "*configs.parquet",
        "*task_metadata.csv",
    ]

If we want to move other files to HF it should be pretty easy (and it has git-versioning which is also a good thing)

Collaborator

I don't see task_metadata.csv in the HuggingFace repo? https://huggingface.co/datasets/Tabrepo/tabrepo/tree/main/2023_11_14

Collaborator Author

Indeed, thanks for the catch; I added it.

print(str(e))


def download_from_huggingface(
Collaborator

You mentioned that some functionality is being dropped from the s3 logic. Which functionality is no longer available?

Collaborator Author

The exists modes ("ignore", "overwrite", "raise") can't be easily matched. Also, the logging will be a bit different.

Collaborator

The exists logic is rather important I feel. Which parts still work? What is the blocker? The logging I don't mind if it is different.

Collaborator Author

It needs to be reworked since the download logic is different; I would prefer not to couple this PR with that work if possible.

Collaborator

Sure, we can avoid coupling.
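For context, the dropped `exists` behavior is small enough to sketch. This is a hypothetical helper (`handle_existing` is not part of the PR), showing how the "ignore"/"overwrite"/"raise" modes could later be layered on top of a per-file download:

```python
from pathlib import Path


def handle_existing(path: Path, exists: str = "ignore") -> bool:
    """Return True when the file at `path` should be (re)downloaded.

    Mirrors the old S3 modes: "ignore" skips files already on disk,
    "overwrite" always re-downloads, and "raise" errors out if the
    file is already present. (Hypothetical helper, not the PR's code.)
    """
    if not path.exists():
        return True  # nothing on disk yet: always download
    if exists == "ignore":
        return False
    if exists == "overwrite":
        return True
    if exists == "raise":
        raise FileExistsError(f"{path} already exists")
    raise ValueError(f'unknown exists mode: "{exists}"')
```

The caller would invoke this per file before downloading, which is the rework mentioned above: HF's `snapshot_download` decides per-snapshot rather than per-file, so the modes have to be applied around it.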

"""
# https://huggingface.co/datasets/Tabrepo/tabrepo/tree/main/2023_11_14/model_predictions
api = HfApi()
local_dir = str(Path(__file__).parent.parent.parent / "data/results/")
Collaborator

We should import a path from a TabRepo util / constants file instead of calling Path(__file__) here. That way the code doesn't break if we happen to move this file to a new location.

Collaborator Author

Indeed, good catch; I had to make this modification already, will update it.
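The suggestion could look roughly like the following sketch; `results_path` and the module argument are illustrative, not the actual TabRepo constant:

```python
from pathlib import Path
from types import ModuleType


def results_path(pkg: ModuleType) -> Path:
    """Resolve data/results/ relative to the package's repository root,
    so callers no longer depend on their own __file__ location.
    (Hypothetical helper illustrating the suggestion above.)"""
    return Path(pkg.__path__[0]).parent / "data" / "results"
```

In practice this would be called as `results_path(tabrepo)`, or exposed directly as a module-level constant in a TabRepo paths/constants file so every script resolves the same directory.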

"""
# https://huggingface.co/datasets/Tabrepo/tabrepo/tree/main/2023_11_14/model_predictions
api = HfApi()
local_dir = str(Path(__file__).parent.parent.parent / "data/results/")
Collaborator

We should be able to specify a local_dir as an argument

Collaborator Author

Makes sense, will update to make this flexible.

Comment on lines +61 to +62
except Exception as e:
    print(str(e))
Collaborator

Why not raise an exception?

Collaborator Author

To allow the other datasets to upload. I will add a flag to control this.
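A minimal sketch of such a flag; `upload_all` and `raise_on_error` are hypothetical names, not the PR's code:

```python
def upload_all(datasets, upload_one, raise_on_error: bool = False) -> dict:
    """Upload every dataset, optionally continuing past failures.

    With raise_on_error=False (the behavior discussed above), a failing
    dataset is logged and the remaining datasets still upload; setting
    the flag restores fail-fast semantics.
    """
    errors = {}
    for name in datasets:
        try:
            upload_one(name)
        except Exception as e:  # deliberately broad: any per-dataset failure
            if raise_on_error:
                raise
            errors[name] = e  # record and keep going
            print(f"Failed to upload {name}: {e}")
    return errors
```

Returning the collected errors lets the caller decide afterwards whether a partial upload is acceptable, instead of losing that information in the logs.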

@@ -2,14 +2,8 @@


if __name__ == '__main__':
Collaborator

General comment:

I will probably lean towards keeping the s3 download logic in addition to the HF download logic, as it can be useful when dealing with private data that hasn't been approved for public release yet, as it is easier to put this on a private s3 bucket.

Collaborator Author

@geoalgo geoalgo Feb 17, 2025

Makes sense, but then wouldn't it make sense to move the s3 logic into a utils module, given that it is private?

Let me know if this solution works for you; I can also add a flag to configure whether to use s3 or HF.

Collaborator

The main requirement is that there should be a flag to specify easily between the two. For example, if we have internal teams who want to share their internal runs with us / store them for their own use, they probably won't have permission to create a HF repo, so they would need to use S3.

Collaborator Author

I can add a flag between the two, but would that be sufficient to merge the PR?

Collaborator

@Innixma Innixma Feb 18, 2025

I think a flag with the default still being s3 for now; we will probably switch to HF as the default later on.

Once that is added and the s3 logic is added back in, I'm happy to merge.

Collaborator Author

OK, thanks, I added back an option to use S3.

I would recommend switching to HF though, as downloading TabRepo on external machines is currently painful (it takes more than a day). I benchmarked the two on the 30 datasets and HF was ~3x faster (I believe the gap will widen with larger datasets, as HF has been optimized for this use case).

from tabrepo.utils import catchtime
from tabrepo import load_repository

context_name = "D244_F3_C1530_30"
# Time for with S3: 100.1188 secs
with catchtime("with S3"):
    repo = load_repository(context_name, cache=False, use_s3=True)

# delete the local files and then
# Time for without S3: 36.9548 secs

with catchtime("without S3"):
    repo = load_repository(context_name, cache=False, use_s3=False)

Collaborator

Definitely makes sense, I just want to first do a few things (like refactoring configs_hyperparamters) before switching the default.

api = HfApi()
local_dir = str(Path(__file__).parent.parent.parent / "data/results/")
if local_dir is None:
    local_dir = str(Path(tabrepo.__path__[0]).parent / "data/results/")
Collaborator

Maybe we make a path_data constant that we can import that is equivalent to this? Then it is easy to check which logic in TabRepo relates to a given output artifact directory. I think we had something similar for the paper scripts

Collaborator Author

Ok I did this proposed change.

Collaborator

Was the change pushed? I don't see it

Collaborator Author

Comment on lines +282 to +287
if verbose:
    print(f'Downloading files for {self.name} context... '
          f'(include_zs={include_zs}, exists="{exists}")')
if use_s3:
    print(f'Downloading files for {self.name} context... '
          f'(include_zs={include_zs}, exists="{exists}", dry_run={dry_run})')
Collaborator

duplicated logs?

Collaborator Author

Done.

if verbose:
    print(f'Downloading files for {self.name} context... '
          f'(include_zs={include_zs}, exists="{exists}")')
if use_s3:
Collaborator

Can we move this into a download_from_s3 method? That way it will look a good bit cleaner.

Collaborator Author

Done.
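The resulting backend dispatch could be sketched as follows; the class and method bodies are illustrative, not the actual TabRepo code:

```python
class Context:
    """Minimal sketch of the use_s3 flag dispatch discussed in this review
    (names and structure are hypothetical)."""

    def __init__(self, name: str):
        self.name = name
        self.calls = []  # records which backend ran, purely for illustration

    def download(self, use_s3: bool = False) -> None:
        """Route to exactly one storage backend based on the flag."""
        if use_s3:
            self._download_from_s3()
        else:
            self._download_from_huggingface()

    def _download_from_s3(self) -> None:
        self.calls.append("s3")  # real code would fetch from the S3 bucket here

    def _download_from_huggingface(self) -> None:
        self.calls.append("hf")  # real code would call snapshot_download here
```

Splitting the two backends into separate methods keeps each code path testable on its own and makes the later default switch to HF a one-line change.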

None
"""
commit_message = f"Upload tabrepo new version"
root = Path(__file__).parent.parent.parent / f"data/results/{version}/"
Collaborator

@Innixma Innixma Feb 21, 2025

use results_path instead, similar to download_from_huggingface?

Collaborator Author

Done.

Collaborator

@Innixma Innixma left a comment

LGTM, added some final comments. Once addressed happy to merge, thanks!

@Innixma Innixma merged commit 2016166 into autogluon:main Feb 22, 2025
1 check passed