Commit 7624ed3

Make s3.request_timeout configurable (apache#1568)
Similarly to apache#218, we see occasional timeout errors when writing data to S3-compatible object storage:

```
When uploading part for key 'drivestats/data/date_month=2014-08/00000-0-9c7baab5-af18-4558-ae10-1678aa90b6a5.parquet' in bucket 'drivestats-iceberg': AWS Error NETWORK_CONNECTION during UploadPart operation: curlCode: 28, Timeout was reached
```

[I don't believe the issue is specific to the fact that I'm using [Backblaze B2](https://www.backblaze.com/cloud-storage) rather than Amazon S3. I saw references to similar error messages with the latter as I was researching this issue.]

The issue happens when the underlying `PUT` operation takes longer than the request timeout, which is [set to a default of 3 seconds in the AWS C++ SDK](https://github.com/aws/aws-sdk-cpp/blob/c9eaae91b9eaa77f304a12cd4b15ec5af3e8a726/src/aws-cpp-sdk-core/source/client/ClientConfiguration.cpp#L184) used by Arrow via PyArrow.

The changes in this PR allow configuration of `s3.request-timeout` when working directly or indirectly with `pyiceberg.io.pyarrow.PyArrowFileIO`, just as apache#218 allowed configuration of `s3.connect-timeout`.

For example, when creating a catalog:

```python
catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.request-timeout": 5.0,
        "s3.connect-timeout": 20.0,
    }
)
```
1 parent 9850290 commit 7624ed3

4 files changed: +13 -0 lines changed

mkdocs/docs/configuration.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -119,6 +119,7 @@ For the FileIO there are several configuration options available:
 | s3.region | us-west-2 | Configure the default region used to initialize an `S3FileSystem`. `PyArrowFileIO` attempts to automatically resolve the region for each S3 bucket, falling back to this value if resolution fails. |
 | s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. |
 | s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
+| s3.request-timeout | 60.0 | Configure socket read timeouts on Windows and macOS, in seconds. |
 | s3.force-virtual-addressing | False | Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access. |
 
 <!-- markdown-link-check-enable-->
```

pyiceberg/io/__init__.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -61,6 +61,7 @@
 S3_REGION = "s3.region"
 S3_PROXY_URI = "s3.proxy-uri"
 S3_CONNECT_TIMEOUT = "s3.connect-timeout"
+S3_REQUEST_TIMEOUT = "s3.request-timeout"
 S3_SIGNER_URI = "s3.signer.uri"
 S3_SIGNER_ENDPOINT = "s3.signer.endpoint"
 S3_SIGNER_ENDPOINT_DEFAULT = "v1/aws/s3/sign"
```

pyiceberg/io/fsspec.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -65,6 +65,7 @@
     S3_ENDPOINT,
     S3_PROXY_URI,
     S3_REGION,
+    S3_REQUEST_TIMEOUT,
     S3_SECRET_ACCESS_KEY,
     S3_SESSION_TOKEN,
     S3_SIGNER_ENDPOINT,
@@ -150,6 +151,9 @@ def _s3(properties: Properties) -> AbstractFileSystem:
     if connect_timeout := properties.get(S3_CONNECT_TIMEOUT):
         config_kwargs["connect_timeout"] = float(connect_timeout)
 
+    if request_timeout := properties.get(S3_REQUEST_TIMEOUT):
+        config_kwargs["read_timeout"] = float(request_timeout)
+
     fs = S3FileSystem(client_kwargs=client_kwargs, config_kwargs=config_kwargs)
 
     for event_name, event_function in register_events.items():
```
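
On the fsspec path the property is forwarded under a different name: `s3.request-timeout` becomes botocore's `read_timeout`, while `s3.connect-timeout` keeps the name `connect_timeout`. The following is a minimal sketch of that mapping in isolation, assuming only what the hunk above shows. The helper name `build_config_kwargs` is hypothetical and does not exist in pyiceberg, and the constants are redefined locally so the snippet runs without any dependencies.

```python
from typing import Any, Dict

# Property keys as defined in pyiceberg/io/__init__.py (redefined here for the sketch).
S3_CONNECT_TIMEOUT = "s3.connect-timeout"
S3_REQUEST_TIMEOUT = "s3.request-timeout"


def build_config_kwargs(properties: Dict[str, Any]) -> Dict[str, float]:
    """Hypothetical helper mirroring the timeout handling shown for `_s3`."""
    config_kwargs: Dict[str, float] = {}

    if connect_timeout := properties.get(S3_CONNECT_TIMEOUT):
        config_kwargs["connect_timeout"] = float(connect_timeout)

    # The request timeout is passed on under botocore's name, `read_timeout`.
    if request_timeout := properties.get(S3_REQUEST_TIMEOUT):
        config_kwargs["read_timeout"] = float(request_timeout)

    return config_kwargs


print(build_config_kwargs({"s3.connect-timeout": 20.0, "s3.request-timeout": "5"}))
```

In the real function the resulting dictionary is what gets passed as `config_kwargs` to `S3FileSystem`, as the context line in the hunk shows.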

pyiceberg/io/pyarrow.py

Lines changed: 7 additions & 0 deletions
```diff
@@ -106,6 +106,7 @@
     S3_FORCE_VIRTUAL_ADDRESSING,
     S3_PROXY_URI,
     S3_REGION,
+    S3_REQUEST_TIMEOUT,
     S3_ROLE_ARN,
     S3_ROLE_SESSION_NAME,
     S3_SECRET_ACCESS_KEY,
@@ -396,6 +397,9 @@ def _initialize_oss_fs(self) -> FileSystem:
         if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
             client_kwargs["connect_timeout"] = float(connect_timeout)
 
+        if request_timeout := self.properties.get(S3_REQUEST_TIMEOUT):
+            client_kwargs["request_timeout"] = float(request_timeout)
+
         if role_arn := get_first_property_value(self.properties, S3_ROLE_ARN, AWS_ROLE_ARN):
             client_kwargs["role_arn"] = role_arn
 
@@ -440,6 +444,9 @@ def _initialize_s3_fs(self, netloc: Optional[str]) -> FileSystem:
         if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
             client_kwargs["connect_timeout"] = float(connect_timeout)
 
+        if request_timeout := self.properties.get(S3_REQUEST_TIMEOUT):
+            client_kwargs["request_timeout"] = float(request_timeout)
+
         if role_arn := get_first_property_value(self.properties, S3_ROLE_ARN, AWS_ROLE_ARN):
             client_kwargs["role_arn"] = role_arn
```
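
The branch added to both PyArrow initializers relies on a truthiness check followed by `float()`, which has a few consequences for the values a user can supply. The sketch below isolates that branch to show them. The helper name `request_timeout_kwargs` is hypothetical and not part of pyiceberg, and the constant is redefined locally so the snippet runs on its own.

```python
from typing import Any, Dict

# Property key as defined in pyiceberg/io/__init__.py (redefined here for the sketch).
S3_REQUEST_TIMEOUT = "s3.request-timeout"


def request_timeout_kwargs(properties: Dict[str, Any]) -> Dict[str, float]:
    """Hypothetical helper: the request-timeout branch from the diff, in isolation."""
    client_kwargs: Dict[str, float] = {}
    if request_timeout := properties.get(S3_REQUEST_TIMEOUT):
        client_kwargs["request_timeout"] = float(request_timeout)
    return client_kwargs


# String values (for example from a config file) are coerced to float.
print(request_timeout_kwargs({S3_REQUEST_TIMEOUT: "5.0"}))

# A missing or falsy value (None, "", 0) leaves the keyword out entirely.
print(request_timeout_kwargs({}))
print(request_timeout_kwargs({S3_REQUEST_TIMEOUT: 0}))

# Non-numeric input is not caught by this branch; float() raises ValueError.
try:
    request_timeout_kwargs({S3_REQUEST_TIMEOUT: "five"})
except ValueError as exc:
    print(f"rejected: {exc}")
```

When the keyword is left out, the filesystem is presumably constructed with whatever default the underlying SDK uses, which is the 3-second behavior described in the commit message.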
