Skip to content

Support parse command with Calcite#3474

Merged
penghuo merged 4 commits intoopensearch-project:mainfrom
LantaoJin:pr/issues/3463
Mar 28, 2025
Merged

Support parse command with Calcite#3474
penghuo merged 4 commits intoopensearch-project:mainfrom
LantaoJin:pr/issues/3463

Conversation

@LantaoJin
Copy link
Copy Markdown
Member

@LantaoJin LantaoJin commented Mar 25, 2025

Description

Support parse command with Calcite

Limitation:
Multiple capturing groups are not allowed in Calcite REGEXP_EXTRACT function.
| parse address '(?<streetNumber>\d+) (?<street>.+)' will fallback to V2.

Related Issues

Resolves #3463

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
context.relBuilder.peek().getRowType().getFieldNames().stream()
.map(
cur -> {
String noNumericSuffix = cur.replaceAll("\\d", "");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if original fields contains SAL, SAL1 and overriding fields contains SAL?

Seems this logic will remove the numeric suffix of SAL1 as well, which is incorrect.

Copy link
Copy Markdown
Member Author

@LantaoJin LantaoJin Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. related IT added.

RexNode sourceField = rexVisitor.analyze(node.getSourceField(), context);
ParseMethod parseMethod = node.getParseMethod();
java.util.Map<String, Literal> arguments = node.getArguments();
assert arguments.isEmpty();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add assert message for better understanding if error really happened

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin LantaoJin requested a review from qianheng-aws March 25, 2025 08:31
RexNode overrideField = null;
String alias =
((RexLiteral) ((RexCall) eval).getOperands().get(1)).getValueAs(String.class);
if (originalFieldNames.contains(alias)) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Non blocking] How about extract line 291-298 into a separate method for processing plus new columns with overriding? It should have interface like:

void projectOverride(List<RexNode> newExprs, List<String> newColumnNames, CalcitePlanContext context)

It should be reused for many places involves overriding purpose.

Copy link
Copy Markdown
Member Author

@LantaoJin LantaoJin Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

} else {
// Overriding the existing field if the alias has the same name with original field
// name.
RexNode overrideField = null;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor] This attr seems to be put in wrong place since it's only used for branch originalFieldNames.contains(alias)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
verifyDataRows(result, rows("a@a.com", "a.com", "c@c.com"));
result =
executeQuery(
"source = test | parse email '.+@(?<email>.+)' | fields email, email0, email1");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if pattern group have same name? i.e. '.+@(?.+) to .+@(?.+)'

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will fallback to V2 since multiple group matchers are unsupported now.

@LantaoJin
Copy link
Copy Markdown
Member Author

@dai-chen could you help to review this?

@penghuo penghuo merged commit 82ccc8d into opensearch-project:main Mar 28, 2025
22 checks passed
penghuo pushed a commit that referenced this pull request Jun 16, 2025
---------

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

calcite calcite migration releated

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Support Parse command with Calcite

4 participants