Skip to content

Commit 7b7f16f

Browse files
yeshengmgatorsmile
authored andcommitted
[SPARK-27890][SQL] Improve SQL parser error message for character-only identifier with hyphens except those in expressions
## What changes were proposed in this pull request? Current SQL parser's error message for hyphen-connected identifiers without surrounding backquotes(e.g. hyphen-table) is confusing for end users. A possible approach to tackle this is to explicitly capture these wrong usages in the SQL parser. In this way, the end users can fix these errors more quickly. For example, for a simple query such as `SELECT * FROM test-table`, the original error message is ``` Error in SQL statement: ParseException: mismatched input '-' expecting <EOF>(line 1, pos 18) ``` which can be confusing in a large query. After the fix, the error message is: ``` Error in query: Possibly unquoted identifier test-table detected. Please consider quoting it with back-quotes as `test-table`(line 1, pos 14) == SQL == SELECT * FROM test-table --------------^^^ ``` which is easier for end users to identify the issue and fix. We safely augmented the current grammar rule to explicitly capture these error cases. The error handling logic is implemented in the SQL parsing listener `PostProcessor`. However, note that for cases such as `a - my-func(b)`, the parser can't actually tell whether this should be ``a -`my-func`(b) `` or `a - my - func(b)`. Therefore for these cases, we leave the parser as is. Also, in this patch we only provide better error messages for character-only identifiers. ## How was this patch tested? Adding new unit tests. Closes apache#24749 from yeshengm/hyphen-ident. Authored-by: Yesheng Ma <[email protected]> Signed-off-by: gatorsmile <[email protected]>
1 parent 15de6d0 commit 7b7f16f

File tree

5 files changed

+169
-35
lines changed

5 files changed

+169
-35
lines changed

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

Lines changed: 38 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -82,13 +82,15 @@ singleTableSchema
8282
statement
8383
: query #statementDefault
8484
| ctes? dmlStatementNoWith #dmlStatement
85-
| USE db=identifier #use
86-
| CREATE database (IF NOT EXISTS)? identifier
85+
| USE db=errorCapturingIdentifier #use
86+
| CREATE database (IF NOT EXISTS)? db=errorCapturingIdentifier
8787
((COMMENT comment=STRING) |
8888
locationSpec |
8989
(WITH DBPROPERTIES tablePropertyList))* #createDatabase
90-
| ALTER database identifier SET DBPROPERTIES tablePropertyList #setDatabaseProperties
91-
| DROP database (IF EXISTS)? identifier (RESTRICT | CASCADE)? #dropDatabase
90+
| ALTER database db=errorCapturingIdentifier
91+
SET DBPROPERTIES tablePropertyList #setDatabaseProperties
92+
| DROP database (IF EXISTS)? db=errorCapturingIdentifier
93+
(RESTRICT | CASCADE)? #dropDatabase
9294
| SHOW DATABASES (LIKE? pattern=STRING)? #showDatabases
9395
| createTableHeader ('(' colTypeList ')')? tableProvider
9496
((OPTIONS options=tablePropertyList) |
@@ -135,7 +137,8 @@ statement
135137
(ALTER | CHANGE) COLUMN? qualifiedName
136138
(TYPE dataType)? (COMMENT comment=STRING)? colPosition? #alterTableColumn
137139
| ALTER TABLE tableIdentifier partitionSpec?
138-
CHANGE COLUMN? identifier colType colPosition? #changeColumn
140+
CHANGE COLUMN?
141+
colName=errorCapturingIdentifier colType colPosition? #changeColumn
139142
| ALTER TABLE tableIdentifier (partitionSpec)?
140143
SET SERDE STRING (WITH SERDEPROPERTIES tablePropertyList)? #setTableSerDe
141144
| ALTER TABLE tableIdentifier (partitionSpec)?
@@ -172,20 +175,20 @@ statement
172175
| DROP TEMPORARY? FUNCTION (IF EXISTS)? qualifiedName #dropFunction
173176
| EXPLAIN (LOGICAL | FORMATTED | EXTENDED | CODEGEN | COST)?
174177
statement #explain
175-
| SHOW TABLES ((FROM | IN) db=identifier)?
178+
| SHOW TABLES ((FROM | IN) db=errorCapturingIdentifier)?
176179
(LIKE? pattern=STRING)? #showTables
177-
| SHOW TABLE EXTENDED ((FROM | IN) db=identifier)?
180+
| SHOW TABLE EXTENDED ((FROM | IN) db=errorCapturingIdentifier)?
178181
LIKE pattern=STRING partitionSpec? #showTable
179182
| SHOW TBLPROPERTIES table=tableIdentifier
180183
('(' key=tablePropertyKey ')')? #showTblProperties
181184
| SHOW COLUMNS (FROM | IN) tableIdentifier
182-
((FROM | IN) db=identifier)? #showColumns
185+
((FROM | IN) db=errorCapturingIdentifier)? #showColumns
183186
| SHOW PARTITIONS tableIdentifier partitionSpec? #showPartitions
184187
| SHOW identifier? FUNCTIONS
185188
(LIKE? (qualifiedName | pattern=STRING))? #showFunctions
186189
| SHOW CREATE TABLE tableIdentifier #showCreateTable
187190
| (DESC | DESCRIBE) FUNCTION EXTENDED? describeFuncName #describeFunction
188-
| (DESC | DESCRIBE) database EXTENDED? identifier #describeDatabase
191+
| (DESC | DESCRIBE) database EXTENDED? db=errorCapturingIdentifier #describeDatabase
189192
| (DESC | DESCRIBE) TABLE? option=(EXTENDED | FORMATTED)?
190193
tableIdentifier partitionSpec? describeColName? #describeTable
191194
| (DESC | DESCRIBE) QUERY? query #describeQuery
@@ -319,7 +322,7 @@ ctes
319322
;
320323

321324
namedQuery
322-
: name=identifier (columnAliases=identifierList)? AS? '(' query ')'
325+
: name=errorCapturingIdentifier (columnAliases=identifierList)? AS? '(' query ')'
323326
;
324327

325328
tableProvider
@@ -559,15 +562,15 @@ identifierList
559562
;
560563

561564
identifierSeq
562-
: identifier (',' identifier)*
565+
: ident+=errorCapturingIdentifier (',' ident+=errorCapturingIdentifier)*
563566
;
564567

565568
orderedIdentifierList
566569
: '(' orderedIdentifier (',' orderedIdentifier)* ')'
567570
;
568571

569572
orderedIdentifier
570-
: identifier ordering=(ASC | DESC)?
573+
: ident=errorCapturingIdentifier ordering=(ASC | DESC)?
571574
;
572575

573576
identifierCommentList
@@ -591,7 +594,7 @@ inlineTable
591594
;
592595

593596
functionTable
594-
: identifier '(' (expression (',' expression)*)? ')' tableAlias
597+
: funcName=errorCapturingIdentifier '(' (expression (',' expression)*)? ')' tableAlias
595598
;
596599

597600
tableAlias
@@ -609,19 +612,19 @@ rowFormat
609612
;
610613

611614
multipartIdentifier
612-
: parts+=identifier ('.' parts+=identifier)*
615+
: parts+=errorCapturingIdentifier ('.' parts+=errorCapturingIdentifier)*
613616
;
614617

615618
tableIdentifier
616-
: (db=identifier '.')? table=identifier
619+
: (db=errorCapturingIdentifier '.')? table=errorCapturingIdentifier
617620
;
618621

619622
functionIdentifier
620-
: (db=identifier '.')? function=identifier
623+
: (db=errorCapturingIdentifier '.')? function=errorCapturingIdentifier
621624
;
622625

623626
namedExpression
624-
: expression (AS? (identifier | identifierList))?
627+
: expression (AS? (name=errorCapturingIdentifier | identifierList))?
625628
;
626629

627630
namedExpressionSeq
@@ -788,7 +791,7 @@ colTypeList
788791
;
789792

790793
colType
791-
: identifier dataType (COMMENT STRING)?
794+
: colName=errorCapturingIdentifier dataType (COMMENT STRING)?
792795
;
793796

794797
complexColTypeList
@@ -808,18 +811,18 @@ windowClause
808811
;
809812

810813
namedWindow
811-
: identifier AS windowSpec
814+
: name=errorCapturingIdentifier AS windowSpec
812815
;
813816

814817
windowSpec
815-
: name=identifier #windowRef
816-
| '('name=identifier')' #windowRef
818+
: name=errorCapturingIdentifier #windowRef
819+
| '('name=errorCapturingIdentifier')' #windowRef
817820
| '('
818821
( CLUSTER BY partition+=expression (',' partition+=expression)*
819822
| ((PARTITION | DISTRIBUTE) BY partition+=expression (',' partition+=expression)*)?
820823
((ORDER | SORT) BY sortItem (',' sortItem)*)?)
821824
windowFrame?
822-
')' #windowDef
825+
')' #windowDef
823826
;
824827

825828
windowFrame
@@ -843,6 +846,19 @@ qualifiedName
843846
: identifier ('.' identifier)*
844847
;
845848

849+
// this rule is used for explicitly capturing wrong identifiers such as test-table, which should actually be `test-table`
850+
// replace identifier with errorCapturingIdentifier where the immediate follow symbol is not an expression, otherwise
851+
// valid expressions such as "a-b" can be recognized as an identifier
852+
errorCapturingIdentifier
853+
: identifier errorCapturingIdentifierExtra
854+
;
855+
856+
// extra left-factoring grammar
857+
errorCapturingIdentifierExtra
858+
: (MINUS identifier)+ #errorIdent
859+
| #realIdent
860+
;
861+
846862
identifier
847863
: strictIdentifier
848864
| {!ansi}? strictNonReserved

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -665,7 +665,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
665665
// Collect all window specifications defined in the WINDOW clause.
666666
val baseWindowMap = ctx.namedWindow.asScala.map {
667667
wCtx =>
668-
(wCtx.identifier.getText, typedVisit[WindowSpec](wCtx.windowSpec))
668+
(wCtx.name.getText, typedVisit[WindowSpec](wCtx.windowSpec))
669669
}.toMap
670670

671671
// Handle cases like
@@ -927,7 +927,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
927927
}
928928

929929
val tvf = UnresolvedTableValuedFunction(
930-
func.identifier.getText, func.expression.asScala.map(expression), aliases)
930+
func.funcName.getText, func.expression.asScala.map(expression), aliases)
931931
tvf.optionalMap(func.tableAlias.strictIdentifier)(aliasPlan)
932932
}
933933

@@ -1026,7 +1026,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
10261026
* Create a Sequence of Strings for an identifier list.
10271027
*/
10281028
override def visitIdentifierSeq(ctx: IdentifierSeqContext): Seq[String] = withOrigin(ctx) {
1029-
ctx.identifier.asScala.map(_.getText)
1029+
ctx.ident.asScala.map(_.getText)
10301030
}
10311031

10321032
/* ********************************************************************************************
@@ -1086,8 +1086,8 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
10861086
*/
10871087
override def visitNamedExpression(ctx: NamedExpressionContext): Expression = withOrigin(ctx) {
10881088
val e = expression(ctx.expression)
1089-
if (ctx.identifier != null) {
1090-
Alias(e, ctx.identifier.getText)()
1089+
if (ctx.name != null) {
1090+
Alias(e, ctx.name.getText)()
10911091
} else if (ctx.identifierList != null) {
10921092
MultiAlias(e, visitIdentifierList(ctx.identifierList))
10931093
} else {
@@ -1479,7 +1479,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
14791479
* Create a reference to a window frame, i.e. [[WindowSpecReference]].
14801480
*/
14811481
override def visitWindowRef(ctx: WindowRefContext): WindowSpecReference = withOrigin(ctx) {
1482-
WindowSpecReference(ctx.identifier.getText)
1482+
WindowSpecReference(ctx.name.getText)
14831483
}
14841484

14851485
/**
@@ -1958,7 +1958,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
19581958
}
19591959

19601960
StructField(
1961-
identifier.getText,
1961+
colName.getText,
19621962
cleanedDataType,
19631963
nullable = true,
19641964
builder.build())
@@ -2012,7 +2012,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
20122012
}
20132013
}
20142014

2015-
orderedIdCtx.identifier.getText
2015+
orderedIdCtx.ident.getText
20162016
})
20172017
}
20182018

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -265,6 +265,14 @@ class ParseException(
265265
*/
266266
case object PostProcessor extends SqlBaseBaseListener {
267267

268+
/** Throws error message when exiting a explicitly captured wrong identifier rule */
269+
override def exitErrorIdent(ctx: SqlBaseParser.ErrorIdentContext): Unit = {
270+
val ident = ctx.getParent.getText
271+
272+
throw new ParseException(s"Possibly unquoted identifier $ident detected. " +
273+
s"Please consider quoting it with back-quotes as `$ident`", ctx)
274+
}
275+
268276
/** Remove the back ticks from an Identifier. */
269277
override def exitQuotedIdentifier(ctx: SqlBaseParser.QuotedIdentifierContext): Unit = {
270278
replaceTokenByIdentifier(ctx, 1) { token =>

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala

Lines changed: 110 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,11 +17,20 @@
1717
package org.apache.spark.sql.catalyst.parser
1818

1919
import org.apache.spark.sql.catalyst.analysis.AnalysisTest
20+
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
2021

2122
/**
2223
* Test various parser errors.
2324
*/
2425
class ErrorParserSuite extends AnalysisTest {
26+
import CatalystSqlParser._
27+
import org.apache.spark.sql.catalyst.dsl.expressions._
28+
import org.apache.spark.sql.catalyst.dsl.plans._
29+
30+
private def assertEqual(sqlCommand: String, plan: LogicalPlan): Unit = {
31+
assert(parsePlan(sqlCommand) == plan)
32+
}
33+
2534
def intercept(sqlCommand: String, messages: String*): Unit =
2635
interceptParseException(CatalystSqlParser.parsePlan)(sqlCommand, messages: _*)
2736

@@ -72,4 +81,105 @@ class ErrorParserSuite extends AnalysisTest {
7281
intercept("select * from test where test.t is like 'test'", "mismatched input 'is' expecting")
7382
intercept("SELECT * FROM test WHERE x NOT NULL", "mismatched input 'NOT' expecting")
7483
}
84+
85+
test("hyphen in identifier - DDL tests") {
86+
val msg = "unquoted identifier"
87+
intercept("USE test-test", 1, 8, 9, msg + " test-test")
88+
intercept("CREATE DATABASE IF NOT EXISTS my-database", 1, 32, 33, msg + " my-database")
89+
intercept(
90+
"""
91+
|ALTER DATABASE my-database
92+
|SET DBPROPERTIES ('p1'='v1')""".stripMargin, 2, 17, 18, msg + " my-database")
93+
intercept("DROP DATABASE my-database", 1, 16, 17, msg + " my-database")
94+
intercept(
95+
"""
96+
|ALTER TABLE t
97+
|CHANGE COLUMN
98+
|test-col BIGINT
99+
""".stripMargin, 4, 4, 5, msg + " test-col")
100+
intercept("CREATE TABLE test (attri-bute INT)", 1, 24, 25, msg + " attri-bute")
101+
intercept(
102+
"""
103+
|CREATE TABLE IF NOT EXISTS mydb.page-view
104+
|USING parquet
105+
|COMMENT 'This is the staging page view table'
106+
|LOCATION '/user/external/page_view'
107+
|TBLPROPERTIES ('p1'='v1', 'p2'='v2')
108+
|AS SELECT * FROM src""".stripMargin, 2, 36, 37, msg + " page-view")
109+
intercept("SHOW TABLES IN hyphen-database", 1, 21, 22, msg + " hyphen-database")
110+
intercept("SHOW TABLE EXTENDED IN hyphen-db LIKE \"str\"", 1, 29, 30, msg + " hyphen-db")
111+
intercept("SHOW COLUMNS IN t FROM test-db", 1, 27, 28, msg + " test-db")
112+
intercept("DESC SCHEMA EXTENDED test-db", 1, 25, 26, msg + " test-db")
113+
intercept("ANALYZE TABLE test-table PARTITION (part1)", 1, 18, 19, msg + " test-table")
114+
intercept("LOAD DATA INPATH \"path\" INTO TABLE my-tab", 1, 37, 38, msg + " my-tab")
115+
}
116+
117+
test("hyphen in identifier - DML tests") {
118+
val msg = "unquoted identifier"
119+
// dml tests
120+
intercept("SELECT * FROM table-with-hyphen", 1, 19, 25, msg + " table-with-hyphen")
121+
// special test case: minus in expression shouldn't be treated as hyphen in identifiers
122+
intercept("SELECT a-b FROM table-with-hyphen", 1, 21, 27, msg + " table-with-hyphen")
123+
intercept("SELECT a-b AS a-b FROM t", 1, 15, 16, msg + " a-b")
124+
intercept("SELECT a-b FROM table-hyphen WHERE a-b = 0", 1, 21, 22, msg + " table-hyphen")
125+
intercept("SELECT (a - test_func(b-c)) FROM test-table", 1, 37, 38, msg + " test-table")
126+
intercept("WITH a-b AS (SELECT 1 FROM s) SELECT * FROM s;", 1, 6, 7, msg + " a-b")
127+
intercept(
128+
"""
129+
|SELECT a, b
130+
|FROM t1 JOIN t2
131+
|USING (a, b, at-tr)
132+
""".stripMargin, 4, 15, 16, msg + " at-tr"
133+
)
134+
intercept(
135+
"""
136+
|SELECT product, category, dense_rank()
137+
|OVER (PARTITION BY category ORDER BY revenue DESC) as hyphen-rank
138+
|FROM productRevenue
139+
""".stripMargin, 3, 60, 61, msg + " hyphen-rank"
140+
)
141+
intercept(
142+
"""
143+
|SELECT a, b
144+
|FROM grammar-breaker
145+
|WHERE a-b > 10
146+
|GROUP BY fake-breaker
147+
|ORDER BY c
148+
""".stripMargin, 3, 12, 13, msg + " grammar-breaker")
149+
assertEqual(
150+
"""
151+
|SELECT a, b
152+
|FROM t
153+
|WHERE a-b > 10
154+
|GROUP BY fake-breaker
155+
|ORDER BY c
156+
""".stripMargin,
157+
table("t")
158+
.where('a - 'b > 10)
159+
.groupBy('fake - 'breaker)('a, 'b)
160+
.orderBy('c.asc))
161+
intercept(
162+
"""
163+
|SELECT * FROM tab
164+
|WINDOW hyphen-window AS
165+
| (PARTITION BY a, b ORDER BY c rows BETWEEN 1 PRECEDING AND 1 FOLLOWING)
166+
""".stripMargin, 3, 13, 14, msg + " hyphen-window")
167+
intercept(
168+
"""
169+
|SELECT * FROM tab
170+
|WINDOW window_ref AS window-ref
171+
""".stripMargin, 3, 27, 28, msg + " window-ref")
172+
intercept(
173+
"""
174+
|SELECT tb.*
175+
|FROM t-a INNER JOIN tb
176+
|ON ta.a = tb.a AND ta.tag = tb.tag
177+
""".stripMargin, 3, 6, 7, msg + " t-a")
178+
intercept(
179+
"""
180+
|FROM test-table
181+
|SELECT a
182+
|SELECT b
183+
""".stripMargin, 2, 9, 10, msg + " test-table")
184+
}
75185
}

0 commit comments

Comments
 (0)