
Commit b4eff2f
Add Parallel Large File Upload and Download in FilesAPI (#1075)
## What changes are proposed in this pull request?

**WHAT**

* Add a new interface `upload_from` to `databricks.sdk.mixins.FilesExt` to support uploading from a file in the local filesystem.
* Improve `databricks.sdk.mixins.FilesExt` upload throughput by uploading data in parallel by default.
* Add a new interface `download_to` to `databricks.sdk.mixins.FilesExt` to support downloading to a file in the local filesystem. This interface downloads the file in parallel to reduce the end-to-end latency of the download. The parallel downloading feature is temporarily unavailable on Windows.
* Improve `databricks.sdk.mixins.FilesExt.upload` to support uploading when presigned URLs are not enabled for the workspace by introducing a fallback to single-part upload.
* Add `use_parallel`, `parallelism`, and `part_size` fields to `databricks.sdk.mixins.FilesExt.upload`.

**WHY**

* `upload_from` and `download_to` are added for two purposes:
  * free users from opening the file themselves when uploading and downloading;
  * allow the client to perform parallel uploading and downloading to improve the end-to-end latency of the operations.
* The new fields were added so that users can easily fine-tune the performance of the upload operation. The configurations are set automatically to give good enough performance, but users can easily override them if they have specific requirements.
* The configurations for `databricks.sdk.mixins.FilesExt` were renamed with a `files_ext` prefix to keep them organized.

## How is this tested?

The functionality is covered by unit tests, plus manual tests with benchmarking scripts run on a local laptop and in notebooks in several real workspaces.
1 parent: 17383ec
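For illustration, here is a minimal usage sketch of the new surface area. The method and parameter names (`upload_from`, `download_to`, `use_parallel`, `parallelism`, `part_size`) come from this PR; the argument order, the `overwrite` flag, and the file paths are assumptions made for the example, not confirmed signatures:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # credentials resolved from the environment

# Tune the new knobs on `upload` (values are illustrative, not recommendations).
with open("./data.bin", "rb") as f:
    w.files.upload(
        "/Volumes/main/default/my_volume/data.bin",  # hypothetical destination
        f,
        overwrite=True,              # assumed flag, mirroring the existing upload API
        use_parallel=True,           # parallel multipart upload (the new default)
        parallelism=10,              # cf. files_ext_multipart_upload_default_parallelism
        part_size=50 * 1024 * 1024,  # 50 MiB parts
    )

# The new convenience methods avoid opening the file yourself. Argument order
# is assumed here (destination first in both cases); check the docstrings.
w.files.upload_from("/Volumes/main/default/my_volume/data.bin", "./data.bin")
w.files.download_to("./data_copy.bin", "/Volumes/main/default/my_volume/data.bin")
```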

File tree: 17 files changed (+3270 / −598 lines)

NEXT_CHANGELOG.md

Lines changed: 16 additions & 0 deletions
```diff
@@ -4,10 +4,26 @@

 ### New Features and Improvements

+* Add a new interface `upload_from` to `databricks.sdk.mixins.FilesExt` to support uploading from a file in the local filesystem.
+* Improve `databricks.sdk.mixins.FilesExt` upload throughput by uploading data in parallel by default.
+* Add a new interface `download_to` to `databricks.sdk.mixins.FilesExt` to support downloading to a file in the local filesystem. This interface also downloads the file in parallel by default. Parallel downloading is currently unavailable on Windows.
+* Improve `databricks.sdk.mixins.FilesExt.upload` to support uploading when presigned URLs are not enabled for the workspace by introducing a fallback to single-part upload.

 ### Bug Fixes

 ### Documentation

 ### Internal Changes

 ### API Changes
+
+* Add `upload_from()` and `download_to()` methods to `databricks.sdk.mixins.FilesExt`.
+* Add `use_parallel`, `parallelism`, and `part_size` fields to `databricks.sdk.mixins.FilesExt.upload`.
+* [Breaking] Rename `files_api_client_download_max_total_recovers` to `files_ext_client_download_max_total_recovers` in `databricks.sdk.Config`.
+* [Breaking] Rename `files_api_client_download_max_total_recovers_without_progressing` to `files_ext_client_download_max_total_recovers_without_progressing` in `databricks.sdk.Config`.
+* [Breaking] Rename `multipart_upload_min_stream_size` to `files_ext_multipart_upload_min_stream_size` in `databricks.sdk.Config`.
+* [Breaking] Rename `multipart_upload_batch_url_count` to `files_ext_multipart_upload_batch_url_count` in `databricks.sdk.Config`.
+* [Breaking] Rename `multipart_upload_chunk_size` to `files_ext_multipart_upload_default_part_size` in `databricks.sdk.Config`.
+* [Breaking] Rename `multipart_upload_url_expiration_duration` to `files_ext_multipart_upload_url_expiration_duration` in `databricks.sdk.Config`.
+* [Breaking] Rename `multipart_upload_max_retries` to `files_ext_multipart_upload_max_retries` in `databricks.sdk.Config`.
+* Add `files_ext_client_download_streaming_chunk_size`, `files_ext_multipart_upload_part_size_options`, `files_ext_multipart_upload_max_part_size`, `files_ext_multipart_upload_default_parallelism`, `files_ext_presigned_download_url_expiration_duration`, `files_ext_parallel_download_default_parallelism`, `files_ext_parallel_download_min_file_size`, `files_ext_parallel_download_default_part_size`, and `files_ext_parallel_download_max_retries` to `databricks.sdk.Config`.
```
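Given the renames above, code that set the old attribute names must switch to the `files_ext_*` names. A minimal sketch of overriding the new knobs, assuming they can be assigned on `Config` after client construction (they are declared as plain class attributes in the `config.py` diff below):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Before this commit:
#   w.config.multipart_upload_chunk_size = 20 * 1024 * 1024
# After the rename:
w.config.files_ext_multipart_upload_default_part_size = 20 * 1024 * 1024  # 20 MiB

# Newly added knobs, e.g. tuning parallel download behavior:
w.config.files_ext_parallel_download_default_parallelism = 16
w.config.files_ext_parallel_download_min_file_size = 100 * 1024 * 1024  # 100 MiB
```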

databricks/sdk/__init__.py

Lines changed: 6 additions & 10 deletions
Some generated files are not rendered by default.

databricks/sdk/config.py

Lines changed: 62 additions & 14 deletions
```diff
@@ -6,7 +6,7 @@
 import pathlib
 import sys
 import urllib.parse
-from typing import Dict, Iterable, Optional
+from typing import Dict, Iterable, List, Optional

 import requests

@@ -110,18 +110,27 @@ class Config:

 disable_async_token_refresh: bool = ConfigAttribute(env="DATABRICKS_DISABLE_ASYNC_TOKEN_REFRESH")

-enable_experimental_files_api_client: bool = ConfigAttribute(env="DATABRICKS_ENABLE_EXPERIMENTAL_FILES_API_CLIENT")
-files_api_client_download_max_total_recovers = None
-files_api_client_download_max_total_recovers_without_progressing = 1
+disable_experimental_files_api_client: bool = ConfigAttribute(
+    env="DATABRICKS_DISABLE_EXPERIMENTAL_FILES_API_CLIENT"
+)
+
+files_ext_client_download_streaming_chunk_size: int = 2 * 1024 * 1024  # 2 MiB
+
+# When downloading a file, the maximum number of attempts to retry downloading the whole file.
+# Default is no limit.
+files_ext_client_download_max_total_recovers: Optional[int] = None

-# File multipart upload parameters
+# When downloading a file, the maximum number of attempts to retry downloading from the same
+# offset without progressing. This avoids retrying forever when the download is not making
+# any progress. Default is 1.
+files_ext_client_download_max_total_recovers_without_progressing = 1
+
+# File multipart upload/download parameters
 # ----------------------

 # Minimal input stream size (bytes) to use multipart / resumable uploads.
 # For small files it's more efficient to make one single-shot upload request.
 # When uploading a file, the SDK will initially buffer this many bytes from the input stream.
 # This parameter can be smaller or bigger than the part size.
-multipart_upload_min_stream_size: int = 5 * 1024 * 1024
+files_ext_multipart_upload_min_stream_size: int = 50 * 1024 * 1024

 # Maximum number of presigned URLs that can be requested at a time.
 #
@@ -131,31 +140,70 @@ class Config:
 # the stream back. In case of a non-seekable stream we cannot rewind, so we'll abort
 # the upload. To reduce the chance of this, we're requesting presigned URLs one by one
 # and using them immediately.
-multipart_upload_batch_url_count: int = 1
+files_ext_multipart_upload_batch_url_count: int = 1

-# Size of the chunk to use for multipart uploads.
+# Size of the chunk to use for multipart uploads & downloads.
 #
 # The smaller the chunk, the lower the chance of network errors (or expired URLs),
 # but the more requests we'll make.
 # For AWS, minimum is 5 MiB: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
 # For GCP, minimum is 256 KiB (and the recommended multiple is also 256 KiB)
 # boto uses 8 MiB: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig
-multipart_upload_chunk_size: int = 10 * 1024 * 1024
-
-# use maximum duration of 1 hour
-multipart_upload_url_expiration_duration: datetime.timedelta = datetime.timedelta(hours=1)
+files_ext_multipart_upload_default_part_size: int = 10 * 1024 * 1024  # 10 MiB
+
+# List of multipart upload part sizes that can be automatically selected.
+files_ext_multipart_upload_part_size_options: List[int] = [
+    10 * 1024 * 1024,  # 10 MiB
+    20 * 1024 * 1024,  # 20 MiB
+    50 * 1024 * 1024,  # 50 MiB
+    100 * 1024 * 1024,  # 100 MiB
+    200 * 1024 * 1024,  # 200 MiB
+    500 * 1024 * 1024,  # 500 MiB
+    1 * 1024 * 1024 * 1024,  # 1 GiB
+    2 * 1024 * 1024 * 1024,  # 2 GiB
+    4 * 1024 * 1024 * 1024,  # 4 GiB
+]
+
+# Maximum size of a single part in multipart upload.
+# For AWS, maximum is 5 GiB: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
+# For Azure, maximum is 4 GiB: https://learn.microsoft.com/en-us/rest/api/storageservices/put-block
+# For CloudFlare R2, maximum is 5 GiB: https://developers.cloudflare.com/r2/objects/multipart-objects/
+files_ext_multipart_upload_max_part_size: int = 4 * 1024 * 1024 * 1024  # 4 GiB
+
+# Default parallel multipart upload concurrency. Set to 10 because experiment results
+# show that it gives good performance.
+files_ext_multipart_upload_default_parallelism: int = 10
+
+# The expiration duration for presigned URLs used in multipart uploads and downloads.
+# The client will request new presigned URLs if the previous one has expired. The duration
+# should be long enough to complete the upload or download of a single part.
+files_ext_multipart_upload_url_expiration_duration: datetime.timedelta = datetime.timedelta(hours=1)
+files_ext_presigned_download_url_expiration_duration: datetime.timedelta = datetime.timedelta(hours=1)
+
+# When downloading a file in parallel, how many worker threads to use.
+files_ext_parallel_download_default_parallelism: int = 10
+
+# When downloading a file, if the file size is smaller than this threshold,
+# we'll use a single-threaded download even if parallel download is enabled.
+files_ext_parallel_download_min_file_size: int = 50 * 1024 * 1024  # 50 MiB
+
+# Default chunk size to use when downloading a file in parallel. Not effective for
+# single-threaded downloads.
+files_ext_parallel_download_default_part_size: int = 10 * 1024 * 1024  # 10 MiB

 # This is not a "wall time" cutoff for the whole upload request,
 # but a maximum time between consecutive data reception events (even 1 byte) from the server
-multipart_upload_single_chunk_upload_timeout_seconds: float = 60
+files_ext_network_transfer_inactivity_timeout_seconds: float = 60

 # Cap on the number of custom retries during incremental uploads:
 # 1) multipart: upload part URL is expired, so new upload URLs must be requested to continue upload
 # 2) resumable: chunk upload produced a retryable response (or exception), so upload status must be
 #    retrieved to continue the upload.
 # In these two cases standard SDK retries (which are capped by the `retry_timeout_seconds` option) are not used.
 # Note that retry counter is reset when upload is successfully resumed.
-multipart_upload_max_retries = 3
+files_ext_multipart_upload_max_retries = 3
+
+# Cap on the number of custom retries during parallel downloads.
+files_ext_parallel_download_max_retries = 3

 def __init__(
     self,
```
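To make the download-side knobs concrete, here is a rough sketch of the parallel range-download technique they control. This is an illustration, not the SDK's actual implementation; `fetch_range` and `parallel_download` are hypothetical helpers, and the presigned URL is assumed to support HTTP Range requests:

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch_range(url: str, start: int, end: int) -> bytes:
    # Fetch bytes [start, end] of the object behind a presigned URL.
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    resp.raise_for_status()
    return resp.content


def parallel_download(
    url: str,
    file_size: int,
    out_path: str,
    part_size: int = 10 * 1024 * 1024,      # files_ext_parallel_download_default_part_size
    parallelism: int = 10,                  # files_ext_parallel_download_default_parallelism
    min_file_size: int = 50 * 1024 * 1024,  # files_ext_parallel_download_min_file_size
) -> None:
    # Split the object into fixed-size ranges; the last range may be shorter.
    ranges = [
        (offset, min(offset + part_size, file_size) - 1)
        for offset in range(0, file_size, part_size)
    ]
    # Below the size threshold, parallelism is not worth the coordination cost.
    workers = 1 if file_size < min_file_size else parallelism
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so parts are written in offset order.
        parts = pool.map(lambda r: fetch_range(url, *r), ranges)
        with open(out_path, "wb") as f:
            for part in parts:
                f.write(part)
```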
