Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
sayakpaul left a comment:
Should we add a test as well? No strong opinions, of course.
kernels/src/kernels/cli/upload.py (outdated)

```python
if file_count > 200:
    print(
        f"⚠️ Found {file_count} files to upload, which exceeds the 200 file limit for a single commit. Deleting old build files and re-uploading the whole build folder to avoid hitting file limits."
    )
    kernel_root_dir = build_dir.parent
    api.upload_large_folder(
        repo_id=repo_id,
        folder_path=kernel_root_dir,
        revision=branch,
        repo_type="model",
        allow_patterns=["build/torch*"],
    )
```
We should not just remove old build files, since it can break our version contract. Also, if we automatically switch to a different upload type, the behavior should be exactly the same.
Is it possible to add `delete_patterns` to `upload_large_folder`?
Unfortunately `upload_large_folder` does not support `delete_patterns`. I also misunderstood the upload functionality and added an incorrect comment: `upload_large_folder` doesn't seem to delete any files. It upserts the files that match the allow pattern.
I fully agree they should act exactly the same. However, don't we want to delete the older build within a version? My understanding is that the `upload_folder` path deletes old files matching patterns like

```
{'torch210-cxx11-cu130-x86_64-linux/**', '.cache/**', 'torch210-cxx11-cu126-x86_64-linux/**', 'torch29-cxx11-cu126-x86_64-linux/**', 'torch29-cxx11-cu128-x86_64-linux/**', 'torch210-cxx11-cu128-x86_64-linux/**', 'torch29-cxx11-cu130-x86_64-linux/**'}
```

so the old `.so` is removed and replaced with the new `.so` (since they will have unique suffixes). I think that to fulfill this logic with `upload_large_folder`, an explicit delete call would need to be made before or after the large folder upload.
Another possible option is to avoid `upload_large_folder` altogether: in the case of many files, we could either retry the upload command a couple of times (this works in practice) or iterate over the build variants and upload each in its own commit (though this may still suffer if a single build variant is too large).
What do you think is the best approach?
Yes, we need two different behaviors:

- If a variant `torch210-cxx11-cu130-x86_64-linux` is uploaded, we need to remove all stale files like you mention, e.g. the old `.so` file.
- At the same time, if we e.g. upload only `torch210-cxx11-cu130-x86_64-linux` and `torch210-cxx11-cu126-x86_64-linux`, we should not remove `torch29-cxx11-cu126-x86_64-linux`.

So the logic is: remove stale files of build variants that are uploaded, but do not remove any files of build variants that are not uploaded.
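As a sketch of that rule (function and argument names are invented here for illustration, not taken from the PR), deletions stay scoped to the variants actually present in the upload:

```python
def stale_files(existing_files, uploaded_variants, new_files):
    """Hypothetical helper for the rule above: delete stale files only inside
    build variants that are part of this upload, never touching other variants.

    existing_files: paths currently in the repo, e.g. "torch29-.../kernel.so"
    uploaded_variants: variant directory names included in this upload
    new_files: paths contained in the new upload
    """
    new_set = set(new_files)
    return sorted(
        f
        for f in existing_files
        # only consider files under a variant we are uploading...
        if any(f.startswith(v + "/") for v in uploaded_variants)
        # ...and delete them only if the new upload no longer contains them
        and f not in new_set
    )
```

This is just the selection logic; the actual deletion call (and whether it is batched with the upload) is the open question discussed below.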
> I think to fulfill this logic with the `upload_large_folder` there will need to be an explicit delete call made before or after the large folder upload.
But I think deletes should be atomic, or at least happen after the large upload, to avoid leaving an undownloadable kernel if the upload fails.
For now, for large folders doing one upload per build variant sounds like the best option? We guarantee the existing semantics and it might be enough for most cases?
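A hypothetical first step for "one upload per build variant" (names invented for illustration) would be to bucket the build files by their variant directory, so that each bucket can become its own commit:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_variant(files):
    """Bucket build files under their top-level variant directory so each
    bucket can be uploaded in its own commit. Illustrative sketch only."""
    groups = defaultdict(list)
    for f in files:
        # e.g. "torch210-cxx11-cu130-x86_64-linux/kernel.so" -> variant dir
        variant = PurePosixPath(f).parts[0]
        groups[variant].append(f)
    return dict(groups)
```

Each group could then be passed to the existing `upload_folder` path, preserving the current `delete_patterns` semantics per variant.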
> For now, for large folders doing one upload per build variant sounds like the best option? We guarantee the existing semantics and it might be enough for most cases?
Good point! I agree this should cut it for now. If anything more advanced arises, we can also discuss a bit with the Hub team to figure out what's best.
Thanks for the ping! General note: never hesitate to ping me or @hanouticelina on such topics, either on GitHub or directly on Slack, so we can either advise on a solution or adapt huggingface_hub if some inconsistencies can be resolved in it.
Regarding this specific case, the reason `upload_large_folder` does not have `delete_patterns` is that the method is meant for heavy uploads, with a resumable, multi-commit process. Its primary purpose is, for example, to upload very large datasets once, not necessarily as part of an automated process. In such cases, deleting files as part of the upload is not really important (the repo author can always clean up their files afterwards). For integration into a more automated workflow (typically this PR), we would need to change the API a bit, I think.
In your very specific case, I think the best is to use the low-level API `create_commit` with a list of `CommitOperationAdd` and `CommitOperationDelete`. Something like this should work:

```python
# build list of delete ops from previous build
delete_operations = [CommitOperationDelete(...) for file in list_repo_files(...) if file ...]

# build list of add ops
files_to_add = [...]  # to implement
add_operations = [CommitOperationAdd(...) for file in files_to_add]

# commit by batch of 200 ops (nb to tweak?)
for chunk_operations in chunk_iterator(delete_operations + add_operations, n=200):
    create_commit(..., operations=chunk_operations)
```

This is a bit more work but you have full flexibility and it works well. And you can adapt the commit message as you want. The good part is that you have exactly the same logic no matter the number of ops (i.e. no "if ... then use `upload_folder` else use `upload_large_folder`").
Let me know what you think!
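The `chunk_iterator` helper in the sketch above is not part of huggingface_hub's documented public API; assuming it should simply yield fixed-size batches, a minimal stdlib version could look like this:

```python
from itertools import islice


def chunk_iterator(iterable, n):
    """Yield successive lists of at most n items from iterable.
    Sketch matching how chunk_iterator is used above; not the
    huggingface_hub internal implementation."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk
```

With `n=200`, each yielded chunk maps to one `create_commit` call, keeping every commit under the operation limit.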
I really like this approach, thanks for the tip! I've updated to follow this pattern in the latest commits.
kernels/src/kernels/cli/upload.py (outdated)

```python
delete_patterns=list(delete_patterns),
commit_message="Build uploaded using `kernels`.",
allow_patterns=["torch*"],
file_count = sum(
```
I think this should be a separate function.
Agreed, I've moved this logic to a new function in the latest commit.
Good idea! I've added a test to check that we enter the large-upload path. *Note: I will update the test accordingly if we change the approach based on the question above.
sayakpaul left a comment:
Left some more comments. LMK if they're clear.
kernels/src/kernels/cli/upload.py (outdated)

```python
)
file_count = get_file_count_in_build(build_dir)

if file_count > 1_000:
```
(nit): 1_000 could be made a constant.
kernels/tests/test_kernel_upload.py (outdated)

```python
_get_hf_api().delete_repo(repo_id=REPO_ID)


def test_large_kernel_upload_uses_kernel_root_path(monkeypatch, tmp_path):
```
kernels/src/kernels/cli/upload.py (outdated)

```python
print(
    f"⚠️ Found {file_count} files to upload, which exceeds the 1,000 file limit for a single commit."
)
kernel_root_dir = build_dir.parent
```
We can call `delete_files()` here to maintain parity with the existing behaviour of `upload_kernels()`.
kernels/src/kernels/cli/upload.py (outdated)

```python
repo_type="model",
allow_patterns=["build/torch*"],
)
else:
```
See my comment above on avoiding two different upload code paths: #300 (comment)
Force-pushed from c7e74ee to f0ea3b7
@Wauplin thanks for the suggestions! I've updated in the latest commits; would you mind taking a quick look to make sure I'm using the API correctly? Thanks!
This PR adds a path that uses the `upload_large_folder` API when there are more than 200 files in the build output. This helps avoid timeouts when many files are in the build. Otherwise the normal `upload_folder` is preferred, since it has a bit more flexibility around `delete_patterns` and the `commit_message`.