
Commit 65a189c

BryanCutler authored and HyukjinKwon committed
[SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1
### What changes were proposed in this pull request?

Upgrade Apache Arrow to version 0.15.1. This includes the Java artifacts and raises the minimum required version of PyArrow as well.

Versions 0.12.0 through 0.15.1 include the following selected fixes/improvements relevant to Spark users:

* ARROW-6898 - [Java] Fix potential memory leak in ArrowWriter and several test classes
* ARROW-6874 - [Python] Memory leak in Table.to_pandas() when conversion to object dtype
* ARROW-5579 - [Java] Shade flatbuffer dependency
* ARROW-5843 - [Java] Improve the readability and performance of BitVectorHelper#getNullCount
* ARROW-5881 - [Java] Provide functionality to efficiently determine if a validity buffer is completely 1 bits/0 bits
* ARROW-5893 - [C++] Remove arrow::Column class from C++ library
* ARROW-5970 - [Java] Provide pointer to Arrow buffer
* ARROW-6070 - [Java] Avoid creating new schema before IPC sending
* ARROW-6279 - [Python] Add Table.slice method or allow slices in __getitem__
* ARROW-6313 - [Format] Tracking for ensuring flatbuffer serialized values are aligned in stream/files
* ARROW-6557 - [Python] Always return pandas.Series from Array/ChunkedArray.to_pandas, propagate field names to Series from RecordBatch, Table
* ARROW-2015 - [Java] Use Java Time and Date APIs instead of JodaTime
* ARROW-1261 - [Java] Add container type for Map logical type
* ARROW-1207 - [C++] Implement Map logical type

The full changelog can be seen at https://arrow.apache.org/release/0.15.0.html

### Why are the changes needed?

To pick up bug fixes and improvements, and to maintain compatibility with future versions of PyArrow.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests, plus manual testing with Python 3.7 and 3.8.

Closes apache#26133 from BryanCutler/arrow-upgrade-015-SPARK-29376.

Authored-by: Bryan Cutler <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
1 parent bb8b04d · commit 65a189c
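As a quick environment check against the new minimum, the version gate added to PySpark (see python/pyspark/sql/utils.py below) boils down to the following; a minimal standalone sketch, assuming pyarrow is installed locally:

```python
# Minimal sketch of the new PySpark floor: require PyArrow >= 0.15.1.
# Mirrors the check in python/pyspark/sql/utils.py; assumes pyarrow is installed.
from distutils.version import LooseVersion

import pyarrow

minimum_pyarrow_version = "0.15.1"
if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):
    raise ImportError("PyArrow >= %s must be installed; however, "
                      "your version was %s." % (minimum_pyarrow_version, pyarrow.__version__))
print("PyArrow %s satisfies the new minimum" % pyarrow.__version__)
```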

File tree

6 files changed: +18 −16 lines changed

dev/deps/spark-deps-hadoop-2.7

Lines changed: 3 additions & 4 deletions
@@ -17,9 +17,9 @@ apacheds-kerberos-codec-2.0.0-M15.jar
 api-asn1-api-1.0.0-M20.jar
 api-util-1.0.0-M20.jar
 arpack_combined_all-0.1.jar
-arrow-format-0.12.0.jar
-arrow-memory-0.12.0.jar
-arrow-vector-0.12.0.jar
+arrow-format-0.15.1.jar
+arrow-memory-0.15.1.jar
+arrow-vector-0.15.1.jar
 audience-annotations-0.5.0.jar
 automaton-1.11-8.jar
 avro-1.8.2.jar
@@ -83,7 +83,6 @@ hadoop-yarn-server-web-proxy-2.7.4.jar
 hk2-api-2.5.0.jar
 hk2-locator-2.5.0.jar
 hk2-utils-2.5.0.jar
-hppc-0.7.2.jar
 htrace-core-3.1.0-incubating.jar
 httpclient-4.5.6.jar
 httpcore-4.4.10.jar

dev/deps/spark-deps-hadoop-3.2

Lines changed: 3 additions & 4 deletions
@@ -12,9 +12,9 @@ antlr4-runtime-4.7.1.jar
 aopalliance-1.0.jar
 aopalliance-repackaged-2.5.0.jar
 arpack_combined_all-0.1.jar
-arrow-format-0.12.0.jar
-arrow-memory-0.12.0.jar
-arrow-vector-0.12.0.jar
+arrow-format-0.15.1.jar
+arrow-memory-0.15.1.jar
+arrow-vector-0.15.1.jar
 audience-annotations-0.5.0.jar
 automaton-1.11-8.jar
 avro-1.8.2.jar
@@ -96,7 +96,6 @@ hive-vector-code-gen-2.3.6.jar
 hk2-api-2.5.0.jar
 hk2-locator-2.5.0.jar
 hk2-utils-2.5.0.jar
-hppc-0.7.2.jar
 htrace-core4-4.1.0-incubating.jar
 httpclient-4.5.6.jar
 httpcore-4.4.10.jar

pom.xml

Lines changed: 2 additions & 2 deletions
@@ -200,9 +200,9 @@
     <commons-crypto.version>1.0.0</commons-crypto.version>
     <!--
     If you are changing Arrow version specification, please check ./python/pyspark/sql/utils.py,
-    ./python/run-tests.py and ./python/setup.py too.
+    and ./python/setup.py too.
     -->
-    <arrow.version>0.12.0</arrow.version>
+    <arrow.version>0.15.1</arrow.version>

     <test.java.home>${java.home}</test.java.home>
     <test.exclude.tags></test.exclude.tags>

python/pyspark/sql/utils.py

Lines changed: 5 additions & 1 deletion
@@ -160,9 +160,10 @@ def require_minimum_pyarrow_version():
     """ Raise ImportError if minimum version of pyarrow is not installed
     """
     # TODO(HyukjinKwon): Relocate and deduplicate the version specification.
-    minimum_pyarrow_version = "0.12.1"
+    minimum_pyarrow_version = "0.15.1"

     from distutils.version import LooseVersion
+    import os
     try:
         import pyarrow
         have_arrow = True
@@ -174,6 +175,9 @@ def require_minimum_pyarrow_version():
     if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):
         raise ImportError("PyArrow >= %s must be installed; however, "
                           "your version was %s." % (minimum_pyarrow_version, pyarrow.__version__))
+    if os.environ.get("ARROW_PRE_0_15_IPC_FORMAT", "0") == "1":
+        raise RuntimeError("Arrow legacy IPC format is not supported in PySpark, "
+                           "please unset ARROW_PRE_0_15_IPC_FORMAT")


 def require_test_compiled():
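With this guard in place, a session that still forces the pre-0.15 IPC format fails fast instead of exchanging incompatible Arrow stream data with executors. A hedged illustration of the new behavior, assuming PySpark and PyArrow >= 0.15.1 are installed:

```python
# Illustration: with the legacy-format flag set, the version check
# added above now raises a RuntimeError up front.
import os

from pyspark.sql.utils import require_minimum_pyarrow_version

os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
try:
    require_minimum_pyarrow_version()
except RuntimeError as e:
    print(e)  # "Arrow legacy IPC format is not supported in PySpark, ..."
finally:
    del os.environ["ARROW_PRE_0_15_IPC_FORMAT"]
```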

python/setup.py

Lines changed: 1 addition & 1 deletion
@@ -105,7 +105,7 @@ def _supports_symlinks():
 # For Arrow, you should also check ./pom.xml and ensure there are no breaking changes in the
 # binary format protocol with the Java version, see ARROW_HOME/format/* for specifications.
 _minimum_pandas_version = "0.23.2"
-_minimum_pyarrow_version = "0.12.1"
+_minimum_pyarrow_version = "0.15.1"

 try:
     # We copy the shell script to be under pyspark/python/pyspark so that the launcher scripts

sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala

Lines changed: 4 additions & 4 deletions
@@ -26,7 +26,7 @@ import org.apache.arrow.flatbuf.MessageHeader
 import org.apache.arrow.memory.BufferAllocator
 import org.apache.arrow.vector._
 import org.apache.arrow.vector.ipc.{ArrowStreamWriter, ReadChannel, WriteChannel}
-import org.apache.arrow.vector.ipc.message.{ArrowRecordBatch, MessageSerializer}
+import org.apache.arrow.vector.ipc.message.{ArrowRecordBatch, IpcOption, MessageSerializer}

 import org.apache.spark.TaskContext
 import org.apache.spark.api.java.JavaRDD
@@ -64,7 +64,7 @@ private[sql] class ArrowBatchStreamWriter(
    * End the Arrow stream, does not close output stream.
    */
   def end(): Unit = {
-    ArrowStreamWriter.writeEndOfStream(writeChannel)
+    ArrowStreamWriter.writeEndOfStream(writeChannel, new IpcOption)
   }
 }

@@ -251,8 +251,8 @@ private[sql] object ArrowConverters {
       // Only care about RecordBatch messages, skip Schema and unsupported Dictionary messages
       if (msgMetadata.getMessage.headerType() == MessageHeader.RecordBatch) {

-        // Buffer backed output large enough to hold the complete serialized message
-        val bbout = new ByteBufferOutputStream(4 + msgMetadata.getMessageLength + bodyLength)
+        // Buffer backed output large enough to hold 8-byte length + complete serialized message
+        val bbout = new ByteBufferOutputStream(8 + msgMetadata.getMessageLength + bodyLength)

         // Write message metadata to ByteBuffer output stream
         MessageSerializer.writeMessageBuffer(
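The change from 4 to 8 bytes in the buffer sizing follows the new IPC message framing from ARROW-6313 (listed in the changelog above): as of Arrow 0.15, each stream message is prefixed by a 4-byte continuation marker (0xFFFFFFFF) followed by the 4-byte little-endian metadata length, i.e. 8 bytes ahead of the flatbuffer metadata rather than 4. A hedged sketch that inspects this prefix from Python, assuming pyarrow >= 0.15 is installed:

```python
# Sketch: write one record batch in the 0.15+ stream format and check
# the 8-byte message prefix (continuation marker + metadata length).
import struct

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

stream_bytes = sink.getvalue().to_pybytes()
marker, metadata_len = struct.unpack("<Ii", stream_bytes[:8])
assert marker == 0xFFFFFFFF  # continuation marker, new in 0.15
print("first message metadata length: %d bytes" % metadata_len)
```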
