Skip to content

Conversation

@Mchristos
Copy link
Contributor

@Mchristos Mchristos commented Nov 4, 2024

Adds GSClient option to include a timeout in seconds. When using google.cloud.storage, it's possible to manually increase the timeout from the default of 60 seconds, but with cloudpathlib it's not possible to tweak this. This makes, for example, uploading larger files with slow internet connection impossible to achieve using cloudpathlib

Closes #484


Contributor checklist:

  • I have read and understood CONTRIBUTING.md
  • Confirmed an issue exists for the PR, and the text Closes #issue appears in the PR summary (e.g., Closes #123).
  • Confirmed PR is rebased onto the latest base
  • Confirmed failure before change and success after change
  • Any generic new functionality is replicated across cloud providers if necessary
  • Tested manually against live server backend for at least one provider
  • Added tests for any new functionality
  • Linting passes locally
  • Tests pass locally
  • Updated HISTORY.md with the issue that is addressed and the PR you are submitting. If the top section is not `## UNRELEASED``, then you need to add a new section to the top of the document for your change.

Usage:

from cloudpathlib import CloudPath, GSClient
client = GSClient(timeout=.00001)
with CloudPath("gs://path/to/test_file.txt", client=client).open("wb") as f:
    f.write(os.urandom(10000000))
# Raises Timeout Error

@pjbull
Copy link
Member

pjbull commented Nov 30, 2024

Thanks @Mchristos! Code looks reasonable, but I was wondering if there are other kwargs that are either shared or that we should accept as upload_kwargs and download_kwargs to the client and use in the respective methods so that we can support all of the potential kwargs. Would you mind checking the SDK and seeing what makes sense?

@codecov
Copy link

codecov bot commented Nov 30, 2024

Codecov Report

Attention: Patch coverage is 44.44444% with 5 lines in your changes missing coverage. Please review.

Project coverage is 92.0%. Comparing base (7f1a5dc) to head (c23e45d).
Report is 4 commits behind head on 484-live-tests.

Files with missing lines Patch % Lines
cloudpathlib/gs/gsclient.py 44.4% 5 Missing ⚠️

❌ Your patch status has failed because the patch coverage (44.4%) is below the target coverage (80.0%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@               Coverage Diff                @@
##           484-live-tests    #485     +/-   ##
================================================
- Coverage            94.4%   92.0%   -2.5%     
================================================
  Files                  23      23             
  Lines                1776    1805     +29     
================================================
- Hits                 1678    1661     -17     
- Misses                 98     144     +46     
Files with missing lines Coverage Δ
cloudpathlib/gs/gsclient.py 87.2% <44.4%> (-4.3%) ⬇️

... and 5 files with indirect coverage changes

@Mchristos
Copy link
Contributor Author

Thanks @Mchristos! Code looks reasonable, but I was wondering if there are other kwargs that are either shared or that we should accept as upload_kwargs and download_kwargs to the client and use in the respective methods so that we can support all of the potential kwargs. Would you mind checking the SDK and seeing what makes sense?

Hi @pjbull, thanks a lot for reviewing my code! I'll include the 3 function signatures that we use below, and the common kwargs between them. I think only timeout and retry are relevant, so if it makes sense to you, I will add retry and be done with it.

The other option is that we put the onus on the user to specify a valid GCP kwarg, but then of course we may have the possible issue of mismatching kwargs between these functions (possibly even in a future gcsfs release).

Function signatures:

download_to_filename(filename, client=None, start=None, end=None, raw_download=False, if_etag_match=None, if_etag_not_match=None, if_generation_match=None, if_generation_not_match=None, if_metageneration_match=None, if_metageneration_not_match=None, timeout=60, checksum='md5', retry=<google.api_core.retry.retry_unary.Retry object>)

upload_from_filename(filename, content_type=None, num_retries=None, client=None, predefined_acl=None, if_generation_match=None, if_generation_not_match=None, if_metageneration_match=None, if_metageneration_not_match=None, timeout=60, checksum=None, retry=<google.cloud.storage.retry.ConditionalRetryPolicy object>)

copy_blob(blob, destination_bucket, new_name=None, client=None, preserve_acl=True, source_generation=None, if_generation_match=None, if_generation_not_match=None, if_metageneration_match=None, if_metageneration_not_match=None, if_source_generation_match=None, if_source_generation_not_match=None, if_source_metageneration_match=None, if_source_metageneration_not_match=None, timeout=60, retry=<google.cloud.storage.retry.ConditionalRetryPolicy object>)

Common kwargs between these are:

[
    "client",
    "if_generation_match",
    "if_generation_not_match",
    "if_metageneration_match",
    "if_metageneration_not_match",
    "timeout",
    "retry"
]

Of these, the client kwarg is redundant, and the kwargs around generation and metageneration matches are not relevant, since these are are related to copying between blobs, and I don't think we require support for that.

That leaves timeout and retry. It seems relatively involved to add a retry configuration, but provided the user is responsible fr forming it, I don't see why not:

image

@pjbull let me know if you agree and I will include the retry kwarg.

@pjbull
Copy link
Member

pjbull commented Jan 30, 2025

@Mchristos Sorry for the delay, getting this PR moving again. I think your plan sounds good for now. Do you still have time to add retry and we can merge this?

Could you also open an issue for supporting kwargs to these functions generically? That way we can track if there are other kwargs of interest and if we need a more flexible future solution with multiple kwargs.

@Mchristos
Copy link
Contributor Author

@pjbull Apologies for the delay from my end too. Please see the latest change, which adds optional Retry configuration to GSClient. Let me know if you're happy with it

@pjbull
Copy link
Member

pjbull commented Feb 21, 2025

@Mchristos Looks like a mypy issue. You can use make format to do the formatting and make lint to test if things are working. You'll want to make sure you have the latest black/mypy.

https://cloudpathlib.drivendata.org/stable/contributing/#linting-and-formatting

@Mchristos
Copy link
Contributor Author

@pjbull Thanks, I had some issues getting make lint to run correctly on my machine, but the mypy issue should be resolved now.

@Mchristos
Copy link
Contributor Author

@Mchristos Sorry for the delay, getting this PR moving again. I think your plan sounds good for now. Do you still have time to add retry and we can merge this?

Could you also open an issue for supporting kwargs to these functions generically? That way we can track if there are other kwargs of interest and if we need a more flexible future solution with multiple kwargs.

@pjbull I've added an issue for supporting generic kwargs: #503

@pjbull pjbull changed the base branch from master to 484-live-tests February 25, 2025 19:26
@pjbull
Copy link
Member

pjbull commented Feb 25, 2025

Merging into 484-live-tests to run live tests.

@pjbull pjbull merged commit f761238 into drivendataorg:484-live-tests Feb 25, 2025
21 of 24 checks passed
pjbull added a commit that referenced this pull request Feb 27, 2025
* #484 Add GSClient timeout support (#485)

* add timeout support

* no implicit optional

* add optional retry

* black

* try this

* type annotations

---------

Co-authored-by: Chris Marais <[email protected]>

* add changelog and tests

---------

Co-authored-by: chris <[email protected]>
Co-authored-by: Chris Marais <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Set a GSPath upload timeout

2 participants