
Commit 9948b86

smurakozi authored and Marcelo Vanzin committed
[SPARK-22516][SQL] Bump up Univocity version to 2.5.9
## What changes were proposed in this pull request?

A bug in the Univocity parser caused the issue in SPARK-22516. Upgrading the library from 2.5.4 to 2.5.9 fixes it:

**Executing**
```
spark.read.option("header","true").option("inferSchema", "true").option("multiLine", "true").option("comment", "g").csv("test_file_without_eof_char.csv").show()
```
**Before**
```
ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
...
Internal state when error was thrown: line=3, column=0, record=2, charIndex=31
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
```
**After**
```
+-------+-------+
|column1|column2|
+-------+-------+
|    abc|    def|
+-------+-------+
```

## How was this patch tested?

The existing `CSVSuite.commented lines in CSV data` test was extended to also parse the file in multiLine mode. The test input file was modified to include a comment in its last line.

Author: smurakozi <[email protected]>

Closes #19906 from smurakozi/SPARK-22516.
1 parent effca98 commit 9948b86

5 files changed: +17 -13 lines changed


dev/deps/spark-deps-hadoop-2.6

Lines changed: 1 addition & 1 deletion
@@ -180,7 +180,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.5.4.jar
+univocity-parsers-2.5.9.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xercesImpl-2.9.1.jar

dev/deps/spark-deps-hadoop-2.7

Lines changed: 1 addition & 1 deletion
@@ -181,7 +181,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.5.4.jar
+univocity-parsers-2.5.9.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xercesImpl-2.9.1.jar

sql/core/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@
     <dependency>
       <groupId>com.univocity</groupId>
       <artifactId>univocity-parsers</artifactId>
-      <version>2.5.4</version>
+      <version>2.5.9</version>
       <type>jar</type>
     </dependency>
     <dependency>

sql/core/src/test/resources/test-data/comments.csv

Lines changed: 1 addition & 0 deletions
@@ -4,3 +4,4 @@
 6,7,8,9,0,2015-08-21 16:58:01
 ~0,9,8,7,6,2015-08-22 17:59:02
 1,2,3,4,5,2015-08-23 18:00:42
+~ comment in last line to test SPARK-22516 - do not add empty line at the end of this file!
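
The warning in the new last line matters because, per the commit message, the failure was reproduced with a file lacking an end-of-file character after a trailing comment. The following is a minimal sketch, not part of this commit, of how one might write such a file and check that it really has no trailing newline. It uses only the JDK's `java.nio.file` API; the object name, the temp file, and the sample contents are hypothetical and are not the real `comments.csv`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object NoTrailingNewline {
  // Hypothetical contents: two data rows, then a comment as the very last line.
  // There is deliberately no "\n" after the comment.
  val contents: String =
    "1,2,3\n" +
    "4,5,6\n" +
    "~ trailing comment"

  /** Returns true if the file is non-empty and its last byte is a line feed. */
  def endsWithNewline(path: Path): Boolean = {
    val bytes = Files.readAllBytes(path)
    bytes.nonEmpty && bytes.last == '\n'.toByte
  }

  /** Writes the sample contents to a fresh temp file and returns its path. */
  def writeSample(): Path = {
    val path = Files.createTempFile("comments-sample", ".csv")
    Files.write(path, contents.getBytes(StandardCharsets.UTF_8))
    path
  }

  def main(args: Array[String]): Unit = {
    val path = writeSample()
    println(s"ends with newline: ${endsWithNewline(path)}")
    Files.delete(path)
  }
}
```

A check like this could guard against an editor silently appending a newline on save, which would stop the file from exercising the case the comment describes.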

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

Lines changed: 13 additions & 10 deletions
@@ -483,18 +483,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   }
 
   test("commented lines in CSV data") {
-    val results = spark.read
-      .format("csv")
-      .options(Map("comment" -> "~", "header" -> "false"))
-      .load(testFile(commentsFile))
-      .collect()
+    Seq("false", "true").foreach { multiLine =>
 
-    val expected =
-      Seq(Seq("1", "2", "3", "4", "5.01", "2015-08-20 15:57:00"),
-        Seq("6", "7", "8", "9", "0", "2015-08-21 16:58:01"),
-        Seq("1", "2", "3", "4", "5", "2015-08-23 18:00:42"))
+      val results = spark.read
+        .format("csv")
+        .options(Map("comment" -> "~", "header" -> "false", "multiLine" -> multiLine))
+        .load(testFile(commentsFile))
+        .collect()
 
-    assert(results.toSeq.map(_.toSeq) === expected)
+      val expected =
+        Seq(Seq("1", "2", "3", "4", "5.01", "2015-08-20 15:57:00"),
+          Seq("6", "7", "8", "9", "0", "2015-08-21 16:58:01"),
+          Seq("1", "2", "3", "4", "5", "2015-08-23 18:00:42"))
+
+      assert(results.toSeq.map(_.toSeq) === expected)
+    }
   }
 
   test("inferring schema with commented lines in CSV data") {
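
The extended test runs the same read and the same assertion once per `multiLine` value. As a rough illustration of that "one expectation, several input variants" pattern, here is a toy comment filter in plain Scala. It is not how Univocity or Spark parse CSV; `CommentSkipSketch`, `parse`, and the sample input are assumptions made up for this sketch. It checks that a trailing comment is skipped whether or not the input ends with a newline.

```scala
object CommentSkipSketch {
  /** Toy model: split on line feeds, drop empty lines and lines that
    * start with `comment`, then split the remaining lines on commas. */
  def parse(input: String, comment: Char): Seq[Seq[String]] =
    input.split("\n").toSeq
      .filter(line => line.nonEmpty && line.head != comment)
      .map(_.split(",").toSeq)

  def main(args: Array[String]): Unit = {
    // Hypothetical input whose last line is a comment with no trailing newline.
    val body = "1,2,3\n~0,9,8\n4,5,6\n~ comment in last line"
    val expected = Seq(Seq("1", "2", "3"), Seq("4", "5", "6"))

    // Same assertion for both variants, mirroring the foreach over multiLine values.
    Seq(body, body + "\n").foreach { input =>
      assert(parse(input, '~') == expected)
    }
    println("ok")
  }
}
```

Looping over the variants keeps a single `expected` value, so any divergence between the two modes shows up as a failed assertion rather than as two tests drifting apart.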
