
Huggingface#96

Merged
Innixma merged 2 commits into autogluon:main from geoalgo:huggingface
Feb 22, 2025

Conversation

@geoalgo
Collaborator

@geoalgo geoalgo commented Feb 14, 2025

Issue #, if available: #66

Description of changes:
(I would recommend quickly reading the issue before reviewing the PR.)
The PR uses HF as the storage layer for TabRepo.
The corresponding HF dataset is https://huggingface.co/datasets/Tabrepo/tabrepo

A few things:

  • The download speed is much faster, and the download API offers a couple of useful options (resuming downloads from partial files, etc.)
  • The options are not exactly the same; the previous logic can't be mapped exactly to HF, but the HF options are more standard
  • If we merge this PR, we should simplify BenchmarkPaths at a later point (we just need the list of datasets, not the full list of files). Ideally, I would prefer not to couple this PR with a large refactoring (it took me quite some time to transfer the files and set up the HF repo properly...). Adapting to just use a list of datasets is doable; supporting all download options would be challenging on my end.
  • We could also offer utils to download from HF as an alternative to the current code, which would avoid a refactoring; however, I think we should get rid of the S3 backend, as it is very slow to download all the datasets

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma
Collaborator

Innixma commented Feb 14, 2025

Thanks for the PR! Will try to review and test next week

@Innixma Innixma self-requested a review February 14, 2025 20:51
version: str = "2023_11_14",
force_download: bool = False,
local_files_only: bool = False,
datasets: list[str] | None = None,
Collaborator

Maybe we can also add folds: list[int] | None = None, as an option, in case the user wants to only download the first fold?

Collaborator Author

I wonder if there is a big benefit; if you want to reduce the size, you could just use a smaller dataset list, right?

Collaborator

That is true, but I wonder if it would be easy to add?

Collaborator Author

Yes, we could, but I do not see the benefit; perhaps it can be done in a second step if we see a strong need?

Comment on lines +8 to +20
def upload_hugging_face(version: str, repo_id: str, override_existing_files: bool = True):
    """
    Uploads tabrepo data to a Hugging Face repository.

    You should set your HF_TOKEN environment variable and ask for write access to tabrepo before using the script.

    Args:
        version (str): The version of the data to be uploaded; the folder data/results/{version}/ should
            be present and should contain baselines.parquet, configs.parquet and a model_predictions/ folder.
        repo_id (str): The ID of the Hugging Face repository.
        override_existing_files (bool): Whether to re-upload files if they are already found on Hugging Face.

    Returns:
        None
    """
Collaborator

Note: Currently there is an additional file that is available in s3 after my earlier code refactor:

task_metadata.csv

I'm assuming this is still being fetched from s3 in the PR.
No need to add it now, as I will probably move a few other files to being downloaded instead of being hard-coded into the TabRepo codebase, such as config_hyperparameters.

Just mentioning so we are aware.

Collaborator Author

@geoalgo geoalgo Feb 17, 2025

No, it is downloaded from HF (the following files are downloaded in addition to the predictions; see l87):

    allow_patterns += [
        "*baselines.parquet",
        "*configs.parquet",
        "*task_metadata.csv",
    ]

If we want to move other files to HF it should be pretty easy (and it has git-versioning which is also a good thing)

Collaborator

I don't see task_metadata.csv in the HuggingFace repo? https://huggingface.co/datasets/Tabrepo/tabrepo/tree/main/2023_11_14

Collaborator Author

Indeed, thanks for the catch; I added it.

print(str(e))


def download_from_huggingface(
Collaborator

You mentioned that some functionality is being dropped from the s3 logic. Which functionality is no longer available?

Collaborator Author

The exists modes ("ignore", "overwrite", "raise") can't be easily matched. Also, the logging will be a bit different.

Collaborator

The exists logic is rather important I feel. Which parts still work? What is the blocker? The logging I don't mind if it is different.

Collaborator Author

It needs to be reworked since the download logic is different; I would prefer not to couple this PR with that work if possible.

Collaborator

Sure, we can avoid coupling.
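For context, the dropped `exists` behavior is small enough to sketch. This is a hypothetical helper (`handle_existing` is not part of the PR), showing how the "ignore"/"overwrite"/"raise" modes could later be layered on top of a per-file download:

```python
from pathlib import Path


def handle_existing(path: Path, exists: str = "ignore") -> bool:
    """Return True when the file at `path` should be (re)downloaded.

    Mirrors the old S3 modes: "ignore" skips files already on disk,
    "overwrite" always re-downloads, and "raise" errors out if the
    file is already present. (Hypothetical helper, not the PR's code.)
    """
    if not path.exists():
        return True  # nothing on disk yet: always download
    if exists == "ignore":
        return False
    if exists == "overwrite":
        return True
    if exists == "raise":
        raise FileExistsError(f"{path} already exists")
    raise ValueError(f'unknown exists mode: "{exists}"')
```

The caller would invoke this per file before downloading, which is the rework mentioned above: HF's `snapshot_download` decides per-snapshot rather than per-file, so the modes have to be applied around it.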

"""
# https://huggingface.co/datasets/Tabrepo/tabrepo/tree/main/2023_11_14/model_predictions
api = HfApi()
local_dir = str(Path(__file__).parent.parent.parent / "data/results/")
Collaborator

We should import a path from a TabRepo util / constants file instead of calling Path(__file__) here. That way the code doesn't break if we happen to move this file to a new location.

Collaborator Author

Indeed, good catch; I had to make this modification already, will update it.
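The suggestion could look roughly like the following sketch; `results_path` and the module argument are illustrative, not the actual TabRepo constant:

```python
from pathlib import Path
from types import ModuleType


def results_path(pkg: ModuleType) -> Path:
    """Resolve data/results/ relative to the package's repository root,
    so callers no longer depend on their own __file__ location.
    (Hypothetical helper illustrating the suggestion above.)"""
    return Path(pkg.__path__[0]).parent / "data" / "results"
```

In practice this would be called as `results_path(tabrepo)`, or exposed directly as a module-level constant in a TabRepo paths/constants file so every script resolves the same directory.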

"""
# https://huggingface.co/datasets/Tabrepo/tabrepo/tree/main/2023_11_14/model_predictions
api = HfApi()
local_dir = str(Path(__file__).parent.parent.parent / "data/results/")
Collaborator

We should be able to specify a local_dir as an argument

Collaborator Author

Makes sense, will update to make this flexible.

Comment on lines +61 to +62
except Exception as e:
    print(str(e))
Collaborator

Why not raise an exception?

Collaborator Author

To allow the other datasets to upload. I will add a flag to control this.
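A minimal sketch of such a flag; `upload_all` and `raise_on_error` are hypothetical names, not the PR's code:

```python
def upload_all(datasets, upload_one, raise_on_error: bool = False) -> dict:
    """Upload every dataset, optionally continuing past failures.

    With raise_on_error=False (the behavior discussed above), a failing
    dataset is logged and the remaining datasets still upload; setting
    the flag restores fail-fast semantics.
    """
    errors = {}
    for name in datasets:
        try:
            upload_one(name)
        except Exception as e:  # deliberately broad: any per-dataset failure
            if raise_on_error:
                raise
            errors[name] = e  # record and keep going
            print(f"Failed to upload {name}: {e}")
    return errors
```

Returning the collected errors lets the caller decide afterwards whether a partial upload is acceptable, instead of losing that information in the logs.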

@@ -2,14 +2,8 @@


if __name__ == '__main__':
Collaborator

General comment:

I will probably lean towards keeping the s3 download logic in addition to the HF download logic, as it can be useful when dealing with private data that hasn't been approved for public release yet, as it is easier to put this on a private s3 bucket.

Collaborator Author

@geoalgo geoalgo Feb 17, 2025

Makes sense, but then wouldn't it make sense to move the s3 logic into a utils module, given that it is private?

Let me know if this solution works for you; I can also add a flag to configure whether to use s3 or HF.

Collaborator

The main requirement is that there should be a flag to specify easily between the two. For example, if we have internal teams who want to share their internal runs with us / store them for their own use, they probably won't have permission to create a HF repo, so they would need to use S3.

Collaborator Author

I can add a flag between the two, but would that be sufficient to merge the PR?

Collaborator

@Innixma Innixma Feb 18, 2025

I think a flag with the default still being s3 for now; we will probably switch to HF as the default later on.

Once that is added and the s3 logic is added back in, I'm happy to merge.

Collaborator Author

OK, thanks, I added back an option to use S3.

I would recommend switching to HF though, as downloading TabRepo on external machines is currently painful (it takes more than a day). I benchmarked the two on the 30 datasets and HF was ~3x faster (I believe the gap will widen with larger datasets, as HF has been optimized for this use case).

from tabrepo.utils import catchtime
from tabrepo import load_repository

context_name = "D244_F3_C1530_30"
# Time for with S3: 100.1188 secs
with catchtime("with S3"):
    repo = load_repository(context_name, cache=False, use_s3=True)

# delete the local files and then
# Time for without S3: 36.9548 secs

with catchtime("without S3"):
    repo = load_repository(context_name, cache=False, use_s3=False)

Collaborator

Definitely makes sense, I just want to first do a few things (like refactoring configs_hyperparamters) before switching the default.

api = HfApi()
local_dir = str(Path(__file__).parent.parent.parent / "data/results/")
if local_dir is None:
    local_dir = str(Path(tabrepo.__path__[0]).parent / "data/results/")
Collaborator

Maybe we make a path_data constant that we can import that is equivalent to this? Then it is easy to check which logic in TabRepo relates to a given output artifact directory. I think we had something similar for the paper scripts

Collaborator Author

Ok I did this proposed change.

Collaborator

Was the change pushed? I don't see it

Collaborator Author

Comment on lines +282 to +287
if verbose:
    print(f'Downloading files for {self.name} context... '
          f'(include_zs={include_zs}, exists="{exists}")')
if use_s3:
    print(f'Downloading files for {self.name} context... '
          f'(include_zs={include_zs}, exists="{exists}", dry_run={dry_run})')
Collaborator

duplicated logs?

Collaborator Author

Done.

if verbose:
    print(f'Downloading files for {self.name} context... '
          f'(include_zs={include_zs}, exists="{exists}")')
if use_s3:
Collaborator

Can we move this into a download_from_s3 method? That way it will look a good bit cleaner.

Collaborator Author

Done.
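The resulting backend dispatch could be sketched as follows; the class and method bodies are illustrative, not the actual TabRepo code:

```python
class Context:
    """Minimal sketch of the use_s3 flag dispatch discussed in this review
    (names and structure are hypothetical)."""

    def __init__(self, name: str):
        self.name = name
        self.calls = []  # records which backend ran, purely for illustration

    def download(self, use_s3: bool = False) -> None:
        """Route to exactly one storage backend based on the flag."""
        if use_s3:
            self._download_from_s3()
        else:
            self._download_from_huggingface()

    def _download_from_s3(self) -> None:
        self.calls.append("s3")  # real code would fetch from the S3 bucket here

    def _download_from_huggingface(self) -> None:
        self.calls.append("hf")  # real code would call snapshot_download here
```

Splitting the two backends into separate methods keeps each code path testable on its own and makes the later default switch to HF a one-line change.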

None
"""
commit_message = f"Upload tabrepo new version"
root = Path(__file__).parent.parent.parent / f"data/results/{version}/"
Collaborator

@Innixma Innixma Feb 21, 2025

use results_path instead, similar to download_from_huggingface?

Collaborator Author

Done.

Collaborator

@Innixma Innixma left a comment

LGTM, added some final comments. Once addressed happy to merge, thanks!

@Innixma Innixma merged commit 2016166 into autogluon:main Feb 22, 2025
1 check passed