-
Notifications
You must be signed in to change notification settings - Fork 630
Expand parse without semicolons #1949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 4 commits
4b4f5a1
7c17355
b83c245
53d7930
0026bb7
75cb98e
bbd32ed
4bde5b4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -1036,8 +1036,14 @@ pub trait Dialect: Debug + Any { | |||||||||||||||||||
/// Returns true if the specified keyword should be parsed as a table factor alias. | ||||||||||||||||||||
/// When explicit is true, the keyword is preceded by an `AS` word. Parser is provided | ||||||||||||||||||||
/// to enable looking ahead if needed. | ||||||||||||||||||||
fn is_table_factor_alias(&self, explicit: bool, kw: &Keyword, parser: &mut Parser) -> bool { | ||||||||||||||||||||
explicit || self.is_table_alias(kw, parser) | ||||||||||||||||||||
/// | ||||||||||||||||||||
/// When the dialect supports statements without semicolon delimiter, actual keywords aren't parsed as aliases. | ||||||||||||||||||||
fn is_table_factor_alias(&self, explicit: bool, kw: &Keyword, _parser: &mut Parser) -> bool { | ||||||||||||||||||||
if self.supports_statements_without_semicolon_delimiter() { | ||||||||||||||||||||
kw == &Keyword::NoKeyword | ||||||||||||||||||||
} else { | ||||||||||||||||||||
explicit || self.is_table_alias(kw, _parser) | ||||||||||||||||||||
} | ||||||||||||||||||||
} | ||||||||||||||||||||
|
||||||||||||||||||||
/// Returns true if this dialect supports querying historical table data | ||||||||||||||||||||
|
@@ -1136,6 +1142,11 @@ pub trait Dialect: Debug + Any { | |||||||||||||||||||
fn supports_notnull_operator(&self) -> bool { | ||||||||||||||||||||
false | ||||||||||||||||||||
} | ||||||||||||||||||||
|
||||||||||||||||||||
/// Returns true if the dialect supports parsing statements without a semicolon delimiter. | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would it be possible to give an example here? Something like
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done 👍 |
||||||||||||||||||||
fn supports_statements_without_semicolon_delimiter(&self) -> bool { | ||||||||||||||||||||
false | ||||||||||||||||||||
} | ||||||||||||||||||||
} | ||||||||||||||||||||
|
||||||||||||||||||||
/// This represents the operators for which precedence must be defined | ||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -67,7 +67,7 @@ impl Dialect for MsSqlDialect { | |
} | ||
|
||
fn supports_connect_by(&self) -> bool { | ||
true | ||
false | ||
} | ||
|
||
fn supports_eq_alias_assignment(&self) -> bool { | ||
|
@@ -123,6 +123,10 @@ impl Dialect for MsSqlDialect { | |
true | ||
} | ||
|
||
fn supports_statements_without_semicolon_delimiter(&self) -> bool { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🤦 MS SQL!!!! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🤷♂️🤣 |
||
true | ||
} | ||
|
||
/// See <https://learn.microsoft.com/en-us/sql/relational-databases/security/authentication-access/server-level-roles> | ||
fn get_reserved_grantees_types(&self) -> &[GranteesType] { | ||
&[GranteesType::Public] | ||
|
@@ -280,6 +284,9 @@ impl MsSqlDialect { | |
) -> Result<Vec<Statement>, ParserError> { | ||
let mut stmts = Vec::new(); | ||
loop { | ||
while let Token::SemiColon = parser.peek_token_ref().token { | ||
parser.advance_token(); | ||
} | ||
if let Token::EOF = parser.peek_token_ref().token { | ||
break; | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -266,6 +266,22 @@ impl ParserOptions { | |
self.unescape = unescape; | ||
self | ||
} | ||
|
||
/// Set if semicolon statement delimiters are required. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the usecase to have the same configuration information on the It seems there are a few places in the Parser that inconsistently check the options and/or the dialect flag. If we had only one way to specify the option, there would not be any room for inconsistencies Or maybe I misunderstand what is going on here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It's similar to the existing
It seems like testing parsing without semicolons would get a lot worse if it wasn't a dialect option There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah yeah the My current thinking is be that ideally we have one way to configure the semicolon option since it would be nice to not have two knobs for the flag. In terms of which way to configure, I'm guessing it could make more sense to have the flag as a parser setting - i.e. whether or not semicolon is required is more reflecting the tool that originally processed the input file, vs an aspect of the sql language/dialect itself, so that it can differ in certain contexts for the same dialect. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This seems reasonable to me in terms of philosophy. However, I don't know how that could be implemented practically. I.e., how would you write that code in this project to be able to have the same test coverage? The benefit of setting the parse option based on the dialect configuration is:
|
||
/// | ||
/// If this option is `true`, the following SQL will not parse. If the option is `false`, the SQL will parse. | ||
/// | ||
/// ```sql | ||
/// SELECT 1 | ||
/// SELECT 2 | ||
/// ``` | ||
pub fn with_require_semicolon_stmt_delimiter( | ||
mut self, | ||
require_semicolon_stmt_delimiter: bool, | ||
) -> Self { | ||
self.require_semicolon_stmt_delimiter = require_semicolon_stmt_delimiter; | ||
self | ||
} | ||
} | ||
|
||
#[derive(Copy, Clone)] | ||
|
@@ -362,7 +378,11 @@ impl<'a> Parser<'a> { | |
state: ParserState::Normal, | ||
dialect, | ||
recursion_counter: RecursionCounter::new(DEFAULT_REMAINING_DEPTH), | ||
options: ParserOptions::new().with_trailing_commas(dialect.supports_trailing_commas()), | ||
options: ParserOptions::new() | ||
.with_trailing_commas(dialect.supports_trailing_commas()) | ||
.with_require_semicolon_stmt_delimiter( | ||
!dialect.supports_statements_without_semicolon_delimiter(), | ||
), | ||
} | ||
} | ||
|
||
|
@@ -485,10 +505,10 @@ impl<'a> Parser<'a> { | |
match self.peek_token().token { | ||
Token::EOF => break, | ||
|
||
// end of statement | ||
Token::Word(word) => { | ||
if expecting_statement_delimiter && word.keyword == Keyword::END { | ||
break; | ||
// don't expect a semicolon statement delimiter after a newline when not otherwise required | ||
Token::Whitespace(Whitespace::Newline) => { | ||
if !self.options.require_semicolon_stmt_delimiter { | ||
expecting_statement_delimiter = false; | ||
aharpervc marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
} | ||
_ => {} | ||
|
@@ -500,7 +520,7 @@ impl<'a> Parser<'a> { | |
|
||
let statement = self.parse_statement()?; | ||
stmts.push(statement); | ||
expecting_statement_delimiter = true; | ||
expecting_statement_delimiter = self.options.require_semicolon_stmt_delimiter; | ||
} | ||
Ok(stmts) | ||
} | ||
|
@@ -4541,6 +4561,18 @@ impl<'a> Parser<'a> { | |
return Ok(vec![]); | ||
} | ||
|
||
if end_token == Token::SemiColon | ||
&& self | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we also have to check the parser options here too? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It might be that we should only be checking the parser options. I'll investigate There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Switched to parser options 👍 |
||
.dialect | ||
.supports_statements_without_semicolon_delimiter() | ||
{ | ||
if let Token::Word(ref kw) = self.peek_token().token { | ||
if kw.keyword != Keyword::NoKeyword { | ||
return Ok(vec![]); | ||
} | ||
} | ||
} | ||
|
||
if self.options.trailing_commas && self.peek_tokens() == [Token::Comma, end_token] { | ||
let _ = self.consume_token(&Token::Comma); | ||
return Ok(vec![]); | ||
|
@@ -4558,6 +4590,9 @@ impl<'a> Parser<'a> { | |
) -> Result<Vec<Statement>, ParserError> { | ||
let mut values = vec![]; | ||
loop { | ||
// ignore empty statements (between successive statement delimiters) | ||
while self.consume_token(&Token::SemiColon) {} | ||
|
||
match &self.peek_nth_token_ref(0).token { | ||
Token::EOF => break, | ||
Token::Word(w) => { | ||
|
@@ -4569,7 +4604,13 @@ impl<'a> Parser<'a> { | |
} | ||
|
||
values.push(self.parse_statement()?); | ||
self.expect_token(&Token::SemiColon)?; | ||
|
||
if self.options.require_semicolon_stmt_delimiter { | ||
self.expect_token(&Token::SemiColon)?; | ||
} | ||
|
||
// ignore empty statements (between successive statement delimiters) | ||
while self.consume_token(&Token::SemiColon) {} | ||
} | ||
Ok(values) | ||
} | ||
|
@@ -16464,7 +16505,28 @@ impl<'a> Parser<'a> { | |
|
||
/// Parse [Statement::Return] | ||
fn parse_return(&mut self) -> Result<Statement, ParserError> { | ||
match self.maybe_parse(|p| p.parse_expr())? { | ||
let rs = self.maybe_parse(|p| { | ||
let expr = p.parse_expr()?; | ||
|
||
match &expr { | ||
Expr::Value(_) | ||
| Expr::Function(_) | ||
| Expr::UnaryOp { .. } | ||
| Expr::BinaryOp { .. } | ||
| Expr::Case { .. } | ||
| Expr::Cast { .. } | ||
| Expr::Convert { .. } | ||
| Expr::Subquery(_) => Ok(expr), | ||
// todo: how to retstrict to variables? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems more like a semantic check -- this crate is focused on syntax parsing. Read more hre: https://github.com/apache/datafusion-sqlparser-rs?tab=readme-ov-file#syntax-vs-semantics I think the error is inappropriate in this case There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
What do you suggest?
The difficulty here is that without semicolon tokens the return statement is ambiguous. Consider the following perfectly valid T-SQL statement (which is also a test case example) which has several variations of
If you revert this change and run that test, you get this error:
The reason is that for the first I think I had said when introducing parse_return in a previous PR that we'd have to come back and tighten it up eventually, but I can't find that discussion 😐. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @aharpervc in the example, is the ambiguity stemming from tsql not supporting There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't see this being about the if statement, but rather about the return statement. The question for the parse is, given a return keyword, how many more tokens (or similarly, which expr's) should be considered part of that ReturnStatement? For T-SQL, there is a restricted answer to that question, but it is somewhat awkward to express here.
Newlines don't really help you here, I put them in the example above for readability, but this should parse identically (and does if you run on a real SQL Server instance):
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah right, makes sense that they're not newline dependent. To clarify what I meant in the if self.peek_one_of_keywords(
// hmm noticed while enumerating this list that the only ambiguous case really
// seems to be IF - since offhand the others can't be an expression.
// So that the list might not be necessary after all.
[BEGIN, BREAK, CONTINUE, GOTO, IF, RETURN, THROW, TRY, WAITFOR, WHILE]
) {
// bare return statement
}
// else maybe_parse_expr would something like this work? |
||
Expr::Identifier(id) if id.value.starts_with('@') => Ok(expr), | ||
_ => parser_err!( | ||
"Non-returnable expression found following RETURN", | ||
p.peek_token().span.start | ||
), | ||
} | ||
})?; | ||
|
||
match rs { | ||
Some(expr) => Ok(Statement::Return(ReturnStatement { | ||
value: Some(ReturnStatementValue::Expr(expr)), | ||
})), | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,6 +25,7 @@ | |
#[cfg(not(feature = "std"))] | ||
use alloc::{ | ||
boxed::Box, | ||
format, | ||
string::{String, ToString}, | ||
vec, | ||
vec::Vec, | ||
|
@@ -186,6 +187,37 @@ impl TestedDialects { | |
statements | ||
} | ||
|
||
/// The same as [`statements_parse_to`] but it will strip semicolons from the SQL text. | ||
pub fn statements_without_semicolons_parse_to( | ||
&self, | ||
sql: &str, | ||
canonical: &str, | ||
) -> Vec<Statement> { | ||
let sql_without_semicolons = sql | ||
.replace("; ", " ") | ||
.replace(" ;", " ") | ||
.replace(";\n", "\n") | ||
.replace("\n;", "\n") | ||
.replace(";", " "); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. wouldn't this last statement (replace Can you please add comments explaining the rationale for these substitutions? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm preserving the newline/whitespace around the statement delimiter because I'm trying to avoid being opinionated on whether they're meaningful or not in this helper. i.e., the helper should only strip semicolons and should avoid other transformations. I can add a comment There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. At some point I was probably testing this in conjunction with #1809 and might have gotten it from there. Regardless, I'll simplify it here per your suggestion. |
||
let statements = self | ||
.parse_sql_statements(&sql_without_semicolons) | ||
.expect(&sql_without_semicolons); | ||
if !canonical.is_empty() && sql != canonical { | ||
assert_eq!(self.parse_sql_statements(canonical).unwrap(), statements); | ||
} else { | ||
assert_eq!( | ||
sql, | ||
statements | ||
.iter() | ||
// note: account for format_statement_list manually inserted semicolons | ||
.map(|s| s.to_string().trim_end_matches(";").to_string()) | ||
.collect::<Vec<_>>() | ||
.join("; ") | ||
); | ||
} | ||
statements | ||
} | ||
|
||
/// Ensures that `sql` parses as an [`Expr`], and that | ||
/// re-serializing the parse result produces canonical | ||
pub fn expr_parses_to(&self, sql: &str, canonical: &str) -> Expr { | ||
|
@@ -318,6 +350,43 @@ where | |
all_dialects_where(|d| !except(d)) | ||
} | ||
|
||
/// Returns all dialects that don't support statements without semicolon delimiters. | ||
/// (i.e. dialects that require semicolon delimiters.) | ||
pub fn all_dialects_requiring_semicolon_statement_delimiter() -> TestedDialects { | ||
let tested_dialects = | ||
all_dialects_except(|d| d.supports_statements_without_semicolon_delimiter()); | ||
assert_ne!(tested_dialects.dialects.len(), 0); | ||
tested_dialects | ||
} | ||
|
||
/// Returns all dialects that do support statements without semicolon delimiters. | ||
/// (i.e. dialects not requiring semicolon delimiters.) | ||
pub fn all_dialects_not_requiring_semicolon_statement_delimiter() -> TestedDialects { | ||
let tested_dialects = | ||
all_dialects_where(|d| d.supports_statements_without_semicolon_delimiter()); | ||
assert_ne!(tested_dialects.dialects.len(), 0); | ||
tested_dialects | ||
} | ||
|
||
/// Asserts an error for `parse_sql_statements`: | ||
/// - "end of statement" for dialects that require semicolon delimiters | ||
/// - "an SQL statement" for dialects that don't require semicolon delimiters. | ||
pub fn assert_err_parse_statements(sql: &str, found: &str) { | ||
assert_eq!( | ||
ParserError::ParserError(format!("Expected: end of statement, found: {found}")), | ||
all_dialects_requiring_semicolon_statement_delimiter() | ||
.parse_sql_statements(sql) | ||
.unwrap_err() | ||
); | ||
|
||
assert_eq!( | ||
ParserError::ParserError(format!("Expected: an SQL statement, found: {found}")), | ||
all_dialects_not_requiring_semicolon_statement_delimiter() | ||
.parse_sql_statements(sql) | ||
.unwrap_err() | ||
); | ||
} | ||
|
||
pub fn assert_eq_vec<T: ToString>(expected: &[&str], actual: &[T]) { | ||
assert_eq!( | ||
expected, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I followed the intent here, would you be able to provide example sqls to illustrate and how they should be parsed with vs without the flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the example sql is from the sqlparse_common test case
parse_select_with_table_alias_keyword
&parse_invalid_limit_by
, such asSELECT a FROM lineitem DECLARE
. Commenting out the new logic and running the tests shows the difficulty. My thinking here is that if semicolons are optional, we can forbid keywords as table aliases. Folks who use this option are going to be opting in to a certain level of ambiguity regardless, but having the parser do reasonable things (with test coverage) is my objective here.Looking at this code again I think this should be checking the actual configured parser option rather than the dialect's capability, similar to the other feedback comment. I will change it.