Conversation
Thanks for the PR! Will try to review and test next week
    version: str = "2023_11_14",
    force_download: bool = False,
    local_files_only: bool = False,
    datasets: list[str] | None = None,
Maybe we can also add folds: list[int] | None = None, as an option, in case the user wants to only download the first fold?
I wonder if there is a big benefit; if you want to reduce the size, you could just use a smaller dataset list, right?
That is true, but I wonder if it would be easy to add?
Yes, we could, but I do not see the benefit; perhaps it can be done in a second step if we see a strong need?
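For what it's worth, a `folds` option could be a small pattern-building helper. This is a hypothetical sketch only: the `"<dataset>/<fold>/"` path layout and the helper name are assumptions, not the actual TabRepo layout.

```python
# Hypothetical sketch of the suggested `folds` option: build Hugging Face
# `allow_patterns` globs restricted to selected datasets/folds. The
# "<dataset>/<fold>/" layout is an assumption, not the actual repo layout.
def build_allow_patterns(datasets=None, folds=None):
    dataset_globs = datasets if datasets is not None else ["*"]
    fold_globs = [str(f) for f in folds] if folds is not None else ["*"]
    return [f"{d}/{f}/*" for d in dataset_globs for f in fold_globs]

# Only download fold 0 of the "adult" dataset:
print(build_allow_patterns(datasets=["adult"], folds=[0]))  # ['adult/0/*']
```

If the prediction files really are keyed by fold in their paths, the option would be cheap to support; otherwise the per-fold data would need to be split into separate files first.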
    def upload_hugging_face(version: str, repo_id: str, override_existing_files: bool = True):
        """
        Uploads tabrepo data to a Hugging Face repository.

        You should set your env variable HF_TOKEN and ask for write access to tabrepo before using the script.

        Args:
            version (str): The version of the data to be uploaded; the folder data/results/{version}/ should
                be present and should contain baselines.parquet, configs.parquet, and a model_predictions/ folder.
            repo_id (str): The ID of the Hugging Face repository.
            override_existing_files (bool): Whether to re-upload files if they are already found on Hugging Face.

        Returns:
            None
        """
Note: there is currently an additional file available in s3 after my earlier code refactor:
task_metadata.csv
I'm assuming this is still being fetched from s3 in this PR.
No need to add it now, as I will probably move a few other files to being downloaded instead of being hard-coded into the TabRepo codebase, such as config_hyperparameters.
Just mentioning it so we are aware.
No, it is downloaded from HF; the following files are being downloaded in addition to the predictions (see line 87):
allow_patterns += [
    "*baselines.parquet",
    "*configs.parquet",
    "*task_metadata.csv",
]
If we want to move other files to HF it should be pretty easy (and it has git versioning, which is also a good thing).
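For context on how these patterns behave: `huggingface_hub` filters repo file paths with shell-style globs, roughly like `fnmatch`. The repo paths below are illustrative stand-ins, not the exact file list.

```python
from fnmatch import fnmatch

# Illustrative sketch only: huggingface_hub matches allow_patterns against
# repo file paths with shell-style globs, roughly like this.
allow_patterns = ["*baselines.parquet", "*configs.parquet", "*task_metadata.csv"]

repo_files = [
    "2023_11_14/baselines.parquet",
    "2023_11_14/configs.parquet",
    "2023_11_14/task_metadata.csv",
    "2023_11_14/model_predictions/adult/metadata.json",  # hypothetical path
]

selected = [f for f in repo_files if any(fnmatch(f, p) for p in allow_patterns)]
print(selected)
```

The leading `*` is what lets the same pattern match across any version prefix in the repo.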
I don't see task_metadata.csv in the HuggingFace repo? https://huggingface.co/datasets/Tabrepo/tabrepo/tree/main/2023_11_14
Indeed, thanks for the catch; I added it.
    print(str(e))

    def download_from_huggingface(
You mentioned that some functionality is being dropped from the s3 logic. Which functionality is no longer available?
The exists modes ("ignore", "overwrite", "raise") can't be easily matched. Also, the logging will be a bit different.
The exists logic is rather important, I feel. Which parts still work? What is the blocker? The logging I don't mind being different.
It needs to be reworked; the download logic is different. I would prefer not to couple this PR with that rework if possible.
Sure, we can avoid coupling.
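For reference, the s3-side exists semantics being discussed boil down to one small decision per file. This is a sketch of those semantics, not the actual TabRepo implementation, and the helper name is an assumption.

```python
from pathlib import Path
import tempfile

# Sketch (not the actual TabRepo code) of the s3 `exists` semantics:
# decide whether a file should be (re)downloaded given a local copy.
def should_download(path: Path, exists: str = "raise") -> bool:
    if not path.exists():
        return True
    if exists == "ignore":
        return False  # keep the local copy, skip the download
    if exists == "overwrite":
        return True   # re-download and replace the local copy
    if exists == "raise":
        raise FileExistsError(f"{path} already exists")
    raise ValueError(f"unknown exists mode: {exists!r}")

with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "baselines.parquet"
    present.write_text("stub")
    print(should_download(present, exists="ignore"))     # False
    print(should_download(present, exists="overwrite"))  # True
    print(should_download(Path(tmp) / "missing.csv"))    # True
```

Layering a check like this on top of an HF `snapshot_download` call would mean deciding per-file before downloading, which is where the coupling with the HF bulk-download logic gets awkward.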
| """ | ||
| # https://huggingface.co/datasets/Tabrepo/tabrepo/tree/main/2023_11_14/model_predictions | ||
| api = HfApi() | ||
| local_dir = str(Path(__file__).parent.parent.parent / "data/results/") |
We should import a path from a TabRepo util / constants file instead of calling Path(__file__) here. That way the code doesn't break if we happen to move this file to a new location.
Indeed, good catch; I had to make this modification already, will update it.
| """ | ||
| # https://huggingface.co/datasets/Tabrepo/tabrepo/tree/main/2023_11_14/model_predictions | ||
| api = HfApi() | ||
| local_dir = str(Path(__file__).parent.parent.parent / "data/results/") |
We should be able to specify a local_dir as an argument
Makes sense, will update to make this flexible.
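The change could be as small as an optional parameter that falls back to the current default. A sketch (the helper name is an assumption; the default path mirrors the snippet above):

```python
from pathlib import Path

# Sketch of the suggested signature change: accept an optional local_dir
# and fall back to the repository-level default data directory.
def resolve_local_dir(local_dir=None) -> Path:
    if local_dir is None:
        # same default as the current hard-coded value
        local_dir = Path(__file__).resolve().parent.parent.parent / "data" / "results"
    return Path(local_dir)

print(resolve_local_dir("/tmp/tabrepo_data"))
```

Callers who don't care keep the old behavior; tests and internal users can point the download anywhere.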
    except Exception as e:
        print(str(e))
Why not raise an exception?
To allow the other datasets to still upload. I will add a flag to control this.
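A sketch of what such a flag could look like (the wrapper and its names are assumptions, not the PR's actual code): swallow a failed upload so the remaining datasets still go through, unless the caller asks for errors to propagate.

```python
# Sketch of the proposed flag: continue past a failed per-dataset upload
# unless raise_on_error is set. `upload_fn` stands in for the real
# Hugging Face upload call.
def upload_one(upload_fn, dataset: str, raise_on_error: bool = False) -> bool:
    try:
        upload_fn(dataset)
        return True
    except Exception as e:
        if raise_on_error:
            raise
        print(f"Upload of {dataset} failed, continuing: {e}")
        return False

print(upload_one(lambda d: None, "adult"))  # True
```

Returning a bool also lets the caller report which datasets failed at the end instead of only printing as it goes.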
    @@ -2,14 +2,8 @@

    if __name__ == '__main__':
General comment:
I will probably lean towards keeping the s3 download logic in addition to the HF download logic, as it can be useful when dealing with private data that hasn't been approved for public release yet, as it is easier to put this on a private s3 bucket.
Makes sense, but then wouldn't it make sense to move the s3 logic into a utils module, given that it is private?
Let me know if this solution works for you; I can also add a flag to configure whether to use s3 or HF.
The main requirement is that there should be a flag to specify easily between the two. For example, if we have internal teams who want to share their internal runs with us / store them for their own use, they probably won't have permission to create a HF repo, so they would need to use S3.
I can add a flag to switch between the two, but would that be sufficient to merge the PR?
I think a flag with the default still being s3 for now; we will probably switch to HF as the default later on.
Once that is added and the s3 logic is added back in, I'm happy to merge.
Ok thanks, I added back an option to use S3.
I would recommend switching to HF though, as downloading TabRepo on external machines is currently painful (it takes more than a day). I benchmarked the two on the 30 datasets and HF was ~3x faster (I believe it will be even faster with larger datasets, as it has been optimized for this use-case).
from tabrepo.utils import catchtime
from tabrepo import load_repository
context_name = "D244_F3_C1530_30"
# Time for with S3: 100.1188 secs
with catchtime("with S3"):
    repo = load_repository(context_name, cache=False, use_s3=True)
# delete the local files and then
# Time for without S3: 36.9548 secs
with catchtime("without S3"):
    repo = load_repository(context_name, cache=False, use_s3=False)
Definitely makes sense; I just want to first do a few things (like refactoring config_hyperparameters) before switching the default.
    api = HfApi()
    local_dir = str(Path(__file__).parent.parent.parent / "data/results/")
    if local_dir is None:
        local_dir = str(Path(tabrepo.__path__[0]).parent / "data/results/"),
Maybe we make a path_data constant that we can import that is equivalent to this? Then it is easy to check which logic in TabRepo relates to a given output artifact directory. I think we had something similar for the paper scripts
Ok, I made this proposed change.
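The proposed constant could live in a tiny shared module. A sketch (the module and constant names, e.g. `tabrepo/paths.py`, are assumptions):

```python
# Sketch of a shared constants module: resolve the data directory once,
# relative to this file, and import DATA_PATH everywhere else so moving
# callers around cannot break the path logic.
from pathlib import Path

TABREPO_ROOT = Path(__file__).resolve().parent.parent  # repo root
DATA_PATH = TABREPO_ROOT / "data" / "results"

print(DATA_PATH)
```

Anything that writes artifacts would then do `from tabrepo.paths import DATA_PATH`, making it greppable which logic touches the output directory.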
Was the change pushed? I don't see it
tabrepo/contexts/context.py (outdated)
    if verbose:
        print(f'Downloading files for {self.name} context... '
              f'(include_zs={include_zs}, exists="{exists}")')
    if use_s3:
        print(f'Downloading files for {self.name} context... '
              f'(include_zs={include_zs}, exists="{exists}", dry_run={dry_run})')
Can we move this into a download_from_s3 method? That way it will look a good bit cleaner.
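The refactor could keep `download` as a thin dispatcher. A sketch (the method names are assumptions; the real bodies would hold the actual s3/HF download logic):

```python
# Sketch of the suggested refactor: `download` stays a thin dispatcher,
# and the s3-specific body moves into its own method beside the HF one.
class Context:
    name = "demo"

    def download(self, include_zs=True, exists="ignore", use_s3=True, verbose=True):
        if verbose:
            print(f"Downloading files for {self.name} context... "
                  f"(include_zs={include_zs}, exists={exists!r})")
        if use_s3:
            return self._download_from_s3(include_zs=include_zs, exists=exists)
        return self._download_from_huggingface(include_zs=include_zs, exists=exists)

    def _download_from_s3(self, **kwargs):
        return "s3"   # placeholder for the real s3 download logic

    def _download_from_huggingface(self, **kwargs):
        return "hf"   # placeholder for the real HF download logic

print(Context().download(use_s3=False, verbose=False))  # hf
```

This keeps the shared logging in one place and makes each backend independently testable.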
    None
    """
    commit_message = f"Upload tabrepo new version"
    root = Path(__file__).parent.parent.parent / f"data/results/{version}/"
use results_path instead, similar to download_from_huggingface?
Innixma left a comment:
LGTM, added some final comments. Once addressed, happy to merge, thanks!
Issue #, if available: #66
Description of changes:
(I would recommend quickly reading the issue before reviewing the PR.)
The PR uses HF as the storage layer for TabRepo.
The corresponding HF dataset is https://huggingface.co/datasets/Tabrepo/tabrepo
A few things:
BenchmarkPaths (we just need to have the list of datasets, not the full list of files) could be adapted at a later point. Ideally, I would be keen to not couple this PR with a large refactoring (it took me quite some time to transfer the files and set up the HF repo properly...); adapting to just use a list of datasets will be doable, but supporting all download options would be challenging on my end.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.