Commit c8922d9

nchammas authored and HyukjinKwon committed
[SPARK-30113][SQL][PYTHON] Expose mergeSchema option in PySpark's ORC APIs
### What changes were proposed in this pull request?

This PR is a follow-up to apache#24043 and cousin of apache#26730. It exposes the `mergeSchema` option directly in the ORC APIs.

### Why are the changes needed?

So the Python API matches the Scala API.

### Does this PR introduce any user-facing change?

Yes, it adds a new option directly in the ORC reader method signatures.

### How was this patch tested?

I tested this manually as follows:

```
>>> spark.range(3).write.orc('test-orc')
>>> spark.range(3).withColumnRenamed('id', 'name').write.orc('test-orc/nested')
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
>>> spark.conf.set('spark.sql.orc.mergeSchema', True)
>>> spark.read.orc('test-orc', recursiveFileLookup=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
```

Closes apache#26755 from nchammas/SPARK-30113-ORC-mergeSchema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
1 parent e766a32 commit c8922d9
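
The manual test in the commit message suggests a precedence rule: an explicit `mergeSchema` argument wins over the session setting `spark.sql.orc.mergeSchema`, and the session setting applies only when the argument is left out. The sketch below models that rule in plain Python. The function `effective_merge_schema` and its parameters are hypothetical names chosen for illustration and are not part of PySpark; the fallback of `False` when nothing is set is an assumption about the configuration's default.

```python
def effective_merge_schema(read_option=None, session_conf=None, default=False):
    """Hypothetical helper: model which mergeSchema value a read would use.

    read_option  -- value passed as mergeSchema=... to the reader, or None
    session_conf -- value of spark.sql.orc.mergeSchema, or None if unset
    default      -- assumed fallback when neither is set
    """
    if read_option is not None:
        # An explicit per-read option overrides the session configuration.
        return bool(read_option)
    if session_conf is not None:
        # No per-read option: fall back to the session configuration.
        return bool(session_conf)
    return default


# Mirrors the four reads in the manual test, in order.
print(effective_merge_schema(read_option=True))                       # True
print(effective_merge_schema(read_option=False))                      # False
print(effective_merge_schema(session_conf=True))                      # True
print(effective_merge_schema(read_option=False, session_conf=True))   # False
```

This is only a model of the observed behaviour; in Spark itself the resolution happens inside the ORC data source, not in the Python wrapper.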

2 files changed, 12 additions and 6 deletions


python/pyspark/sql/readwriter.py

Lines changed: 6 additions & 3 deletions
@@ -520,17 +520,20 @@ def func(iterator):
             raise TypeError("path can be only string, list or RDD")
 
     @since(1.5)
-    def orc(self, path, recursiveFileLookup=None):
+    def orc(self, path, mergeSchema=None, recursiveFileLookup=None):
         """Loads ORC files, returning the result as a :class:`DataFrame`.
 
+        :param mergeSchema: sets whether we should merge schemas collected from all
+            ORC part-files. This will override ``spark.sql.orc.mergeSchema``.
+            The default value is specified in ``spark.sql.orc.mergeSchema``.
         :param recursiveFileLookup: recursively scan a directory for files. Using this option
-            disables `partition discovery`_.
+                                    disables `partition discovery`_.
 
         >>> df = spark.read.orc('python/test_support/sql/orc_partitioned')
         >>> df.dtypes
         [('a', 'bigint'), ('b', 'int'), ('c', 'int')]
         """
-        self._set_opts(recursiveFileLookup=recursiveFileLookup)
+        self._set_opts(mergeSchema=mergeSchema, recursiveFileLookup=recursiveFileLookup)
         if isinstance(path, basestring):
             path = [path]
         return self._df(self._jreader.orc(_to_seq(self._spark._sc, path)))
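
The new keyword defaults to `None`, and the docstring says the default value then comes from `spark.sql.orc.mergeSchema`. One way this can work is for the option-forwarding helper to skip `None` values, so an omitted argument never reaches the underlying reader. The sketch below illustrates that pattern with a hypothetical `SketchReader` class. It is not PySpark's `DataFrameReader`, and the `None`-skipping behaviour of `_set_opts` is an assumption, since the helper's body is not part of this diff.

```python
class SketchReader:
    """Hypothetical stand-in for a reader; not PySpark's DataFrameReader."""

    def __init__(self):
        # Options that would be forwarded to the underlying (JVM) reader.
        self.options = {}

    def option(self, key, value):
        self.options[key] = value
        return self

    def _set_opts(self, **options):
        # Assumption: only explicitly provided (non-None) options are forwarded.
        for key, value in options.items():
            if value is not None:
                self.option(key, value)

    def orc(self, path, mergeSchema=None, recursiveFileLookup=None):
        self._set_opts(mergeSchema=mergeSchema,
                       recursiveFileLookup=recursiveFileLookup)
        # A real reader would load `path` here; the sketch returns the
        # forwarded options so the effect of each call is visible.
        return dict(self.options)


print(SketchReader().orc('test-orc'))                     # {}
print(SketchReader().orc('test-orc', mergeSchema=False))  # {'mergeSchema': False}
```

Under this pattern an explicit `mergeSchema=False` is still forwarded, because only `None` is treated as "not provided", which is consistent with the explicit `False` overriding the session setting in the manual test.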

python/pyspark/sql/streaming.py

Lines changed: 6 additions & 3 deletions
@@ -514,21 +514,24 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             raise TypeError("path can be only a single string")
 
     @since(2.3)
-    def orc(self, path, recursiveFileLookup=None):
+    def orc(self, path, mergeSchema=None, recursiveFileLookup=None):
         """Loads a ORC file stream, returning the result as a :class:`DataFrame`.
 
         .. note:: Evolving.
 
+        :param mergeSchema: sets whether we should merge schemas collected from all
+            ORC part-files. This will override ``spark.sql.orc.mergeSchema``.
+            The default value is specified in ``spark.sql.orc.mergeSchema``.
         :param recursiveFileLookup: recursively scan a directory for files. Using this option
-            disables `partition discovery`_.
+                                    disables `partition discovery`_.
 
         >>> orc_sdf = spark.readStream.schema(sdf_schema).orc(tempfile.mkdtemp())
         >>> orc_sdf.isStreaming
         True
         >>> orc_sdf.schema == sdf_schema
         True
         """
-        self._set_opts(recursiveFileLookup=recursiveFileLookup)
+        self._set_opts(mergeSchema=mergeSchema, recursiveFileLookup=recursiveFileLookup)
         if isinstance(path, basestring):
             return self._df(self._jreader.orc(path))
         else:
