[KYUUBI #7245] Fix arrow batch converter error #7246
Conversation
@echo567, please keep the PR template and fill it in seriously, especially "Was this patch authored or co-authored using generative AI tooling?"; it does matter for legal purposes.
Codecov Report

Additional details and impacted files:
```
@@            Coverage Diff             @@
##           master    #7246      +/-   ##
==========================================
  Coverage    0.00%    0.00%
==========================================
  Files         696      696
  Lines       43530    43528       -2
  Branches     5883     5881       -2
==========================================
+ Misses      43530    43528       -2
```
Sorry, I have made the changes.
The code is copied from Spark; it seems it was changed in SPARK-44657. Can we just follow that?

Okay, I made the modifications based on that Spark issue.
Hi, I've merged the code from the latest master branch. Is there anything else I need to change?
@cfmcgrady, do you want to have another look?
Pull request overview
This PR fixes a critical bug where the spark.connect.grpc.arrow.maxBatchSize configuration was not being respected when using kyuubi.operation.result.format=arrow, leading to memory overflow issues and slow data transfer when processing large datasets without explicit limits.
Key changes:
- Fixed an incorrect byte unit conversion in `SparkDatasetHelper` that was treating 4 MiB as 4 bytes (see the sketch after this list)
- Refactored the batch size limit checking logic in `KyuubiArrowConverters` to properly enforce limits even when no global row limit is set
- Introduced clear helper methods to improve code readability and maintainability
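For context, here is a minimal sketch (in Scala) of the kind of size-string handling involved, assuming a hypothetical helper and reading the value via `spark.conf.get`; the actual fix lives in `SparkDatasetHelper` and may differ in detail:
```
import org.apache.spark.network.util.JavaUtils
import org.apache.spark.sql.SparkSession

object MaxBatchSizeSketch {
  // Hypothetical helper, not the real Kyuubi method.
  def maxBatchSizeBytes(spark: SparkSession): Long = {
    // The config value is a size string such as "4m".
    val sizeStr = spark.conf.get("spark.connect.grpc.arrow.maxBatchSize", "4m")
    // Buggy behavior: treating the leading digits as the byte count yields 4.
    // Intended behavior: interpret the size string as bytes, so "4m" -> 4194304.
    JavaUtils.byteStringAsBytes(sizeStr)
  }
}
```
With this conversion, the `maxEstimatedBatchSize` reported in the logs below becomes 4194304 instead of 4.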
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala | Fixes critical bug in maxBatchSize calculation by correctly converting "4m" to 4194304 bytes instead of 4 |
| externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/execution/arrow/KyuubiArrowConverters.scala | Refactors batch iteration logic to properly enforce batch size and record count limits, replacing complex conditional logic with clear helper methods |
```
// If either limit is hit, create a batch. This implies that the limit that is hit
// first triggers the creation of a batch even if the other limit is not yet hit
// hence preferring the more restrictive limit.
```
Copilot AI, Dec 24, 2025:
The comment states "If either limit is hit, create a batch" but the logic actually continues the loop when neither limit is exceeded. The comment should be clarified to say "Continue adding rows to the batch until either limit is exceeded" or similar wording to accurately reflect the loop continuation condition rather than batch creation trigger.
Suggested change:
```
// Continue adding rows to the current batch until either limit is exceeded.
// The limit that is reached first determines the batch boundary, even if the
// other limit has not yet been reached, thus preferring the more restrictive limit.
```
### Why are the changes needed?
Control the amount of data per batch to prevent memory overflow and speed up the initial data transfer.
When `kyuubi.operation.result.format=arrow`, `spark.connect.grpc.arrow.maxBatchSize` does not work as expected.
Reproduction:
You can debug `KyuubiArrowConverters` or add the following log to line 300 of `KyuubiArrowConverters`:
```
logInfo(s"Total limit: ${limit}, rowCount: ${rowCount}, " +
s"rowCountInLastBatch:${rowCountInLastBatch}," +
s"estimatedBatchSize: ${estimatedBatchSize}," +
s"maxEstimatedBatchSize: ${maxEstimatedBatchSize}," +
s"maxRecordsPerBatch:${maxRecordsPerBatch}")
```
Test data: 1.6 million rows, 30 columns per row. Command executed:
```
bin/beeline \
-u 'jdbc:hive2://10.168.X.X:XX/default;thrift.client.max.message.size=2000000000' \
--hiveconf kyuubi.operation.result.format=arrow \
-n test -p 'testpass' \
--outputformat=csv2 -e "select * from db.table" > /tmp/test.csv
```
Log output
```
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000 maxEstimatedBatchSize: 4,maxRecordsPerBatch:10000
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000
```
Original Code
```
while (rowIter.hasNext && (
rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0 ||
estimatedBatchSize <= 0 ||
estimatedBatchSize < maxEstimatedBatchSize ||
maxRecordsPerBatch <= 0 ||
rowCountInLastBatch < maxRecordsPerBatch ||
rowCount < limit ||
limit < 0))
```
When `limit` is not set, i.e. `-1`, all data will be retrieved in a single batch. If the row count is too large, the following three problems occur (a sketch of the corrected condition follows this list):
(1) Driver/executor OOM
(2) Array OOM because the requested array length exceeds the maximum allowed
(3) Slow data transfer
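As a rough illustration of the fix, here is a minimal Scala sketch of a batching predicate with the corrected AND semantics, loosely following SPARK-44657; the names mirror the variables in the log statement above, and this is not necessarily the exact code merged in this PR:
```
// Sketch only: every configured limit is enforced, so `limit == -1`
// no longer bypasses the byte-size and record-count checks.
def shouldContinueBatch(
    hasNext: Boolean,
    rowCount: Long,              // rows emitted so far across all batches
    rowCountInLastBatch: Long,   // rows in the batch currently being built
    estimatedBatchSize: Long,    // estimated bytes in the current batch
    maxEstimatedBatchSize: Long, // e.g. 4194304 for "4m"
    maxRecordsPerBatch: Long,    // e.g. 10000
    limit: Long): Boolean = {    // -1 means no global limit
  hasNext && (
    // always take at least one row per batch
    rowCountInLastBatch == 0 ||
      // stay under the byte-size limit, if one is configured
      ((maxEstimatedBatchSize <= 0 || estimatedBatchSize < maxEstimatedBatchSize) &&
        // stay under the per-batch record limit, if one is configured
        (maxRecordsPerBatch <= 0 || rowCountInLastBatch < maxRecordsPerBatch) &&
        // stay under the global limit, if one is configured
        (limit < 0 || rowCount < limit)))
}
```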
After updating the code, the log output is as follows:
```
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch:5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch:10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
```
The `estimatedBatchSize` is slightly larger than the `maxEstimatedBatchSize`, and data is now written out in batches as expected.
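As a sanity check inferred from the logs above (not stated anywhere in the code): the buggy run reports roughly 145600000 bytes for 200000 rows, i.e. about 728 bytes per row, and 5762 rows at 728 bytes is 4194736 bytes, the first value just above the 4194304-byte (4 MiB) limit, which is why each batch closes at 5762 rows.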
Fix #7245.
### How was this patch tested?
Test data: 1.6 million rows, 30 columns per row.
```
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch:5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch:10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
```
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #7246 from echo567/fix-arrow-converter.
Closes #7245
6ef4ef1 [echo567] Merge branch 'master' into fix-arrow-converter
c9d0d18 [echo567] fix(arrow): repairing arrow based on spark
479d7e4 [echo567] fix(spark): fix arrow batch converter error
Authored-by: echo567 <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit acdb6a3)
Signed-off-by: Cheng Pan <[email protected]>
Thanks, merged to master/1.11/1.10.