Hi,
At this time, whitespace tokens are stored in the parser, and are then filtered out in several distinct points in the parser logic, such as:
datafusion-sqlparser-rs/src/parser/mod.rs, lines 4032 to 4049 in 67684c8:

```rust
pub fn peek_tokens_with_location<const N: usize>(&self) -> [TokenWithSpan; N] {
    let mut index = self.index;
    core::array::from_fn(|_| loop {
        let token = self.tokens.get(index);
        index += 1;
        if let Some(TokenWithSpan {
            token: Token::Whitespace(_),
            span: _,
        }) = token
        {
            continue;
        }
        break token.cloned().unwrap_or(TokenWithSpan {
            token: Token::EOF,
            span: Span::empty(),
        });
    })
}
```

datafusion-sqlparser-rs/src/parser/mod.rs, lines 4055 to 4069 in 67684c8:

```rust
pub fn peek_tokens_ref<const N: usize>(&self) -> [&TokenWithSpan; N] {
    let mut index = self.index;
    core::array::from_fn(|_| loop {
        let token = self.tokens.get(index);
        index += 1;
        if let Some(TokenWithSpan {
            token: Token::Whitespace(_),
            span: _,
        }) = token
        {
            continue;
        }
        break token.unwrap_or(&EOF_TOKEN);
    })
}
```

datafusion-sqlparser-rs/src/parser/mod.rs, lines 4077 to 4094 in 67684c8:

```rust
pub fn peek_nth_token_ref(&self, mut n: usize) -> &TokenWithSpan {
    let mut index = self.index;
    loop {
        index += 1;
        match self.tokens.get(index - 1) {
            Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) => continue,
            non_whitespace => {
                if n == 0 {
                    return non_whitespace.unwrap_or(&EOF_TOKEN);
                }
                n -= 1;
            }
        }
    }
}
```

datafusion-sqlparser-rs/src/parser/mod.rs, lines 4149 to 4160 in 67684c8:

```rust
pub fn advance_token(&mut self) {
    loop {
        self.index += 1;
        match self.tokens.get(self.index - 1) {
            Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) => continue,
            _ => break,
        }
    }
}
```

datafusion-sqlparser-rs/src/parser/mod.rs, lines 4183 to 4202 in 67684c8:

```rust
/// Seek back the last one non-whitespace token.
///
/// Must be called after `next_token()`, otherwise might panic. OK to call
/// after `next_token()` indicates an EOF.
///
// TODO rename to backup_token and deprecate prev_token?
pub fn prev_token(&mut self) {
    loop {
        assert!(self.index > 0);
        self.index -= 1;
        if let Some(TokenWithSpan {
            token: Token::Whitespace(_),
            span: _,
        }) = self.tokens.get(self.index)
        {
            continue;
        }
        return;
    }
}
```
and many more.
As far as I know, SQL is not whitespace-sensitive the way Python is, so it should be safe to discard all whitespace tokens once tokenization is complete. This should:
- Reduce memory requirements, as whitespace tokens would not be stored anymore
- Significantly simplify parser logic by removing all of the whitespace-related logic from the parser
- Move the parser closer to a streaming design, though that will require many more PRs
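To illustrate the intended simplification, here is a minimal, self-contained sketch. It uses a toy `Token` enum and `Parser` struct, not the actual sqlparser-rs types: whitespace is dropped once when the parser is constructed, after which every lookahead helper collapses to a plain index into the token buffer with no skip loops.

```rust
// Hypothetical sketch, not the real sqlparser-rs API: a toy token type
// standing in for TokenWithSpan.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Whitespace(String),
    Comma,
    EOF,
}

// Sentinel returned when peeking past the end, mirroring the EOF_TOKEN
// pattern in the snippets above.
static EOF_TOKEN: Token = Token::EOF;

struct Parser {
    tokens: Vec<Token>, // invariant: contains no Whitespace variants
    index: usize,
}

impl Parser {
    // The single place where whitespace is filtered out.
    fn new(raw: Vec<Token>) -> Self {
        let tokens = raw
            .into_iter()
            .filter(|t| !matches!(t, Token::Whitespace(_)))
            .collect();
        Parser { tokens, index: 0 }
    }

    // With whitespace gone, peeking is a plain offset lookup.
    fn peek_nth(&self, n: usize) -> &Token {
        self.tokens.get(self.index + n).unwrap_or(&EOF_TOKEN)
    }

    // Advancing no longer needs to loop past whitespace.
    fn advance(&mut self) {
        self.index += 1;
    }
}

fn main() {
    // Tokens as a tokenizer might emit them for "SELECT a, b".
    let raw = vec![
        Token::Word("SELECT".into()),
        Token::Whitespace(" ".into()),
        Token::Word("a".into()),
        Token::Comma,
        Token::Whitespace(" ".into()),
        Token::Word("b".into()),
    ];
    let mut p = Parser::new(raw);
    assert_eq!(p.peek_nth(0), &Token::Word("SELECT".into()));
    assert_eq!(p.peek_nth(2), &Token::Comma);
    p.advance();
    assert_eq!(p.peek_nth(0), &Token::Word("a".into()));
}
```

The same filtering could of course happen inside the tokenizer itself (behind an option, to keep whitespace available for consumers that want comments/formatting), which is a design question for the actual PR.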
Since such a PR would require quite a bit of effort on my part, I would appreciate some feedback before moving forward with it.
@iffyio do you happen to have any opinion regarding such a refactoring?
Ciao,
Luca