PROMPTS FOR STRAGGLER DOWNLOAD MITIGATION LLD
=============================================

Prompt 1:
---------
I want to implement a feature in the Databricks ADBC driver: [SIMBA] Addressing the Straggling File Download Issue for Cloud Fetch.

Overview
In Cloud Fetch mode, the driver uses a thread pool with 80 threads by default (set by MaxNumResultFileDownloadThreads) to download files using pre-signed URLs generated by the server. The server caps the amount of data in the set of URLs returned per fetch at 300 MB (set by MaxBytesPerFetchRequest, hard-capped by the server at 1 GB), and each file has a maximum size of 20 MB (a server-side configuration).
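For orientation, here is the same set of limits as a minimal C# sketch; the class and constant names are assumptions made for this note, not the driver's actual symbols.

// Sketch only: the Cloud Fetch limits described above, under assumed names.
internal static class CloudFetchLimitsSketch
{
    public const int MaxNumResultFileDownloadThreads = 80;               // parallel download workers
    public const long MaxBytesPerFetchRequestBytes = 300L * 1024 * 1024; // 300 MB per fetch; server hard cap is 1 GB
    public const long MaxResultFileSizeBytes = 20L * 1024 * 1024;        // 20 MB per file (server-side setting)
    public const int MaxConsecutiveResultFileDownloadRetries = 10;       // URL re-request attempts per failed file
}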
File download
The driver receives a set of file links which are downloaded in parallel. Each such set of files is considered a batch, and all files within the batch need to be downloaded successfully before moving on to the next one. If some of the file downloads fail, the driver re-attempts them after requesting new URLs from the server, up to a maximum of 10 times. The retry count is configurable via MaxConsecutiveResultFileDownloadRetries.

If one of the file downloads fails, the driver requests new URLs starting from the offset of that file. The files preceding the offset that were downloaded successfully are skipped. The files at higher offsets than the failed one that were downloaded successfully are re-downloaded. In other words, all re-generated URLs are re-downloaded irrespective of their prior attempts.
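A minimal sketch of that batch flow, assuming hypothetical names (BatchDownloaderSketch, FetchLinksFromOffsetAsync, DownloadAllAsync) rather than the driver's real types:

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: batch download with offset-based URL re-requests, as described above.
internal abstract class BatchDownloaderSketch
{
    protected const int MaxConsecutiveResultFileDownloadRetries = 10;

    protected sealed record ResultLink(long Offset, Uri Url);

    // Supplied by the real driver: asks the server for fresh pre-signed URLs from an offset.
    protected abstract Task<IReadOnlyList<ResultLink>> FetchLinksFromOffsetAsync(long offset, CancellationToken ct);

    // Downloads all links in parallel; returns the first failed link, or null if the whole batch succeeded.
    protected abstract Task<ResultLink?> DownloadAllAsync(IReadOnlyList<ResultLink> links, CancellationToken ct);

    public async Task DownloadBatchAsync(long startOffset, CancellationToken ct)
    {
        long offset = startOffset;
        for (int attempt = 0; attempt <= MaxConsecutiveResultFileDownloadRetries; attempt++)
        {
            IReadOnlyList<ResultLink> links = await FetchLinksFromOffsetAsync(offset, ct);
            ResultLink? firstFailure = await DownloadAllAsync(links, ct);
            if (firstFailure is null) return; // batch complete, move on to the next one

            // Re-request URLs from the failed file's offset; everything at or past that
            // offset is re-downloaded regardless of earlier successful attempts.
            offset = firstFailure.Offset;
        }
        throw new InvalidOperationException("Exceeded MaxConsecutiveResultFileDownloadRetries.");
    }
}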

The driver exposes another knob, EnableAsyncQueryResultDownload, to disable parallel downloads and fall back to sequential downloads.
Pitfalls
A few customers reported issues with parallel file downloads from Azure in which a single file would experience very low download speeds, roughly 10x slower than the other concurrent file downloads, i.e., on the order of KB/s. The file transfer would eventually complete, but progress is very slow, leading to noticeable regressions. We've seen this issue rarely and have not been successful in reproducing it. However, we observed that the issue is isolated to a single file download and that subsequent batches typically complete without experiencing it again.
Proposed solution
Currently, the driver neither enforces a timeout nor cancels and retries file downloads that are slow. We would like to implement a strategy for retrying straggling file downloads.

Retry policy. This section explains how to identify a straggling file download.
The driver keeps track of how long each file transfer takes within a batch; straggler detection is based on a fresh calculation for each batch. To do so, the driver derives a per-byte download time for each completed file in the batch as the ratio between the time it took to download the file and the file's size. Once at least a given fraction of the file downloads within the batch have completed (e.g., 0.75), the driver starts identifying stragglers by computing the median per-byte download time across the completed file tasks. A straggler is a download that has been running longer than f x file_size x median_per_byte_time + padding, where f is a straggler multiplier (e.g., 1.5) and the padding adds an extra buffer of a few seconds (e.g., 5 s).
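A minimal sketch of that threshold calculation, using assumed names rather than the driver's actual code:

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch only: straggler threshold as described above, under assumed names.
internal static class StragglerDetectorSketch
{
    // Threshold in seconds = multiplier * fileSizeBytes * medianSecondsPerByte + paddingSeconds.
    // Assumes at least one completed download; callers gate on the completion fraction first.
    public static TimeSpan Threshold(
        IReadOnlyList<(long SizeBytes, TimeSpan Elapsed)> completedDownloads,
        long fileSizeBytes,
        double stragglerMultiplier = 1.5,
        double paddingSeconds = 5.0)
    {
        // Per-byte download time of every completed file in the batch, sorted for the median.
        double[] secondsPerByte = completedDownloads
            .Select(d => d.Elapsed.TotalSeconds / d.SizeBytes)
            .OrderBy(v => v)
            .ToArray();
        double medianSecondsPerByte = secondsPerByte[secondsPerByte.Length / 2]; // upper median is fine for a heuristic
        return TimeSpan.FromSeconds(stragglerMultiplier * fileSizeBytes * medianSecondsPerByte + paddingSeconds);
    }
}

For example, if the median works out to 0.5 s per MB, a 20 MB file would be flagged as a straggler once it has been downloading for more than 1.5 x 20 x 0.5 + 5 = 20 seconds.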

Cancellation mechanism. This section explains how to cancel the file download.
The timeout cannot be set proactively, as its value depends on runtime metrics such as the current progress of the file download; this is a limitation of the libcURL layer. Instead, the driver will cancel the download in between receiving chunks of the file and will re-attempt the download.
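In the C# ADBC driver, where downloads are assumed to go through HttpClient rather than libcURL, the equivalent would be a cooperative check between chunk reads. A minimal sketch under that assumption, with hypothetical names:

using System;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: cancel a transfer between chunk reads once it is flagged as a straggler.
internal static class ChunkedDownloadSketch
{
    public static async Task<bool> TryDownloadAsync(
        HttpClient http, Uri url, Stream destination,
        Func<TimeSpan, bool> isStraggler, CancellationToken ct)
    {
        DateTime started = DateTime.UtcNow;
        using HttpResponseMessage response =
            await http.GetAsync(url, HttpCompletionOption.ResponseHeadersRead, ct);
        response.EnsureSuccessStatusCode();
        await using Stream body = await response.Content.ReadAsStreamAsync();

        byte[] buffer = new byte[81920];
        int read;
        while ((read = await body.ReadAsync(buffer.AsMemory(), ct)) > 0)
        {
            await destination.WriteAsync(buffer.AsMemory(0, read), ct);
            if (isStraggler(DateTime.UtcNow - started))
                return false; // abandon this transfer; the caller re-attempts the download
        }
        return true;
    }
}

Returning false instead of throwing lets the caller treat a cancelled straggler like any other failed attempt, which lines up with the retry semantics asked for in Prompt 7 below.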

Fallback policy. This section explains how to disable parallel downloads.
If a query experiences more than a predefined number of straggler file downloads, the driver disables asynchronous download mode and continues downloading the files within a batch sequentially. This applies only to the current query.
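A minimal sketch of the per-query accounting this implies, again with assumed names:

using System.Threading;

// Sketch only: per-query straggler counting; once the configured maximum is exceeded,
// the rest of the query's batches are downloaded sequentially.
internal sealed class StragglerFallbackTrackerSketch
{
    private readonly int _maxStragglersPerQuery;
    private int _stragglerCount;

    public StragglerFallbackTrackerSketch(int maxStragglersPerQuery = 10)
        => _maxStragglersPerQuery = maxStragglersPerQuery;

    public void RecordStraggler() => Interlocked.Increment(ref _stragglerCount);

    // True once this query should stop using the parallel download pool.
    public bool ShouldFallBackToSequential =>
        Volatile.Read(ref _stragglerCount) > _maxStragglersPerQuery;
}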

Configuration                        Default  Description
EnableStragglerDownloadMitigation    0        If 1, the driver times out and retries straggler downloads. Disabled by default.
StragglerDownloadMultiplier          1.5      How many times slower a file download needs to be to be considered a straggler.
StragglerDownloadQuantile            0.6      Fraction of downloads which must be completed before enabling straggler mitigation.
StraggleDownloadPadding              5 s      Extra buffer in seconds before declaring a file download a straggler.
MaximumStragglersPerQuery            10       Maximum stragglers re-attempted per query before switching to sequential downloads.
EnableSynchronousDownloadFallback    0        If 1 and EnableStragglerDownloadMitigation is 1, the driver automatically falls back to sequential downloads once MaximumStragglersPerQuery is exceeded. Applies only to the current query.
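If these are surfaced as ADBC connection options in the C# driver, the keys might look like the following; the exact key strings and defaults here are assumptions for illustration, not the driver's published options:

// Sketch only: possible ADBC-style option keys and defaults for this feature.
internal static class StragglerMitigationOptionsSketch
{
    public const string EnableStragglerMitigationKey = "adbc.databricks.cloudfetch.enable_straggler_mitigation";
    public const string StragglerMultiplierKey       = "adbc.databricks.cloudfetch.straggler_multiplier";
    public const string StragglerQuantileKey         = "adbc.databricks.cloudfetch.straggler_quantile";
    public const string StragglerPaddingSecondsKey   = "adbc.databricks.cloudfetch.straggler_padding_seconds";
    public const string MaxStragglersPerQueryKey     = "adbc.databricks.cloudfetch.max_stragglers_per_query";

    public const bool   DefaultEnableStragglerMitigation = false;
    public const double DefaultStragglerMultiplier = 1.5;
    public const double DefaultStragglerQuantile = 0.6;
    public const int    DefaultStragglerPaddingSeconds = 5;
    public const int    DefaultMaxStragglersPerQuery = 10;
}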

These are the connection parameters for straggler download mitigation. This is already implemented in the ODBC driver, and we want to implement it in the Databricks ADBC driver as well. I want you to create a concise LLD doc for implementing this feature. Try to keep the number of classes minimal. Use DRY principles wherever possible. Keep the doc short.

Prompt 2:
---------
Remove the details on testing from the design doc. Also make sure the variable and function naming is appropriate and descriptive enough.

Prompt 3:
---------
Instead of one, create two docs: one that is more of a summary, and another that covers the integration. Refer to the PR. Also create a .txt that contains the prompts I give. https://github.com/apache/arrow-adbc/pull/3624 . There are a lot of comments on the PR. Learn from those comments what they suggest and do not make those mistakes.

Prompt 4:
---------
For connection params, follow the general ADBC repo structure. Make changes in the design doc to align with the existing implementation in the Databricks ADBC C# driver.

Prompt 5:
---------
We're aligned. Is the logging pattern defined in the design doc aligned with the general logging pattern in CloudFetch?

Prompt 6:
---------
Update the design doc accordingly.

Prompt 7:
---------
Why are we using just a single retry upon straggler identification? Instead, we should simply retry the straggler and keep the remaining behaviour the same. Basically, the straggler retry should just be one of the standard retries: in a way it ensures this download won't straggle the next time, but there could be some other error, so we will still follow the standard retry policy and just add this one extra retry.
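A minimal sketch of what that integration could look like, where a straggler cancellation simply consumes one attempt of the standard per-file retry policy (names are hypothetical):

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: a straggler cancellation is just another failed attempt in the standard retry loop.
internal static class PerFileRetrySketch
{
    public static async Task DownloadWithRetriesAsync(
        Func<CancellationToken, Task<bool>> tryDownloadAsync, // returns false if cancelled as a straggler
        int maxRetries, CancellationToken ct)
    {
        for (int attempt = 1; attempt <= maxRetries; attempt++)
        {
            try
            {
                if (await tryDownloadAsync(ct)) return; // finished normally
                // Cancelled as a straggler: this consumes the current attempt, and the
                // next iteration retries under the same policy as any other failure.
            }
            catch (HttpRequestException)
            {
                // Ordinary download failure: also just another attempt.
            }
        }
        throw new InvalidOperationException("Download failed after the configured number of retries.");
    }
}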

Prompt 8:
---------
Now add testing details to both docs as well. Follow the structure from the current repo. Also remember to take care of the comments on this PR https://github.com/apache/arrow-adbc/pull/3624 and follow the right practices. I see there are two comments saying: "we don't need this level of detail in a design doc, in stead we should focus more on interface/contract between different class objects" and "Focus on adding more class diagram and sequence diagram, etc, instead of putting big block of code into the design doc." Are we following these in our design docs? If not, modify them to follow this pattern.