Commit ce94cb3

WanKun authored and cloud-fan committed
[SPARK-53552][SQL] Optimize substr SQL function
### What changes were proposed in this pull request?

In `substringSQL()`, when `pos > 0` there is no need to compute `numChars()`.

<img width="1846" height="388" alt="WeCom screenshot 96d4fc98-bce1-4b43-937c-68ca3c21e54c" src="https://github.com/user-attachments/assets/504eceee-83eb-45aa-91ab-b9c657993861" />

### Why are the changes needed?

To improve the performance of the SQL function `substr`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52308 from wankunde/substr.

Authored-by: WanKun <wankun@bilibili.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
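As an illustration of the idea in the description (not part of the commit): counting characters in a UTF-8 byte array requires a full scan, so it pays to defer that scan to the one case that needs it, a negative position. The class `LazyCharCountSketch`, its `scans` counter, and both method bodies are hypothetical stand-ins, not Spark's `UTF8String` code.

```java
// Hypothetical sketch; names and bodies are illustrative, not Spark's implementation.
final class LazyCharCountSketch {
  // Counts how often the O(n) scan runs, to make the saving observable.
  static int scans = 0;

  // Counts code points by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx).
  static int numChars(byte[] utf8) {
    scans++;
    int count = 0;
    for (byte b : utf8) {
      if ((b & 0xC0) != 0x80) {
        count++;
      }
    }
    return count;
  }

  // Maps a 1-based SQL position to a 0-based start index.
  // The character count is only computed for negative positions.
  static int startIndex(byte[] utf8, int pos) {
    if (pos > 0) {
      return pos - 1;
    }
    if (pos < 0) {
      return numChars(utf8) + pos;
    }
    return 0;
  }
}
```

With this shape, a call such as `startIndex(bytes, 2)` never touches the byte array, while `startIndex(bytes, -2)` scans it once.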
1 parent 9bd844b commit ce94cb3

1 file changed: +8 -5 lines changed

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

Lines changed: 8 additions & 5 deletions
@@ -642,9 +642,13 @@ public UTF8String substring(final int start, final int until) {
     }
 
     int j = i;
-    while (i < numBytes && c < until) {
-      i += numBytesForFirstByte(getByte(i));
-      c += 1;
+    if (until == Integer.MAX_VALUE) {
+      i = numBytes;
+    } else {
+      while (i < numBytes && c < until) {
+        i += numBytesForFirstByte(getByte(i));
+        c += 1;
+      }
     }
 
     if (i > j) {
@@ -663,9 +667,8 @@ public UTF8String substringSQL(int pos, int length) {
     // refers to element i-1 in the sequence. If a start index i is less than 0, it refers
     // to the -ith element before the end of the sequence. If a start index i is 0, it
     // refers to the first element.
-    int len = numChars();
     // `len + pos` does not overflow as `len >= 0`.
-    int start = (pos > 0) ? pos -1 : ((pos < 0) ? numChars() + pos : 0);
+    int start = (pos > 0) ? pos -1 : ((pos < 0) ? numChars() + pos : 0);
 
     int end;
     if ((long) start + length > Integer.MAX_VALUE) {
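The first hunk above can be exercised in isolation with a minimal sketch. It works on a plain `byte[]` instead of `UTF8String`, assumes well-formed UTF-8 input, and uses a hypothetical helper `bytesForFirstByte` in place of Spark's `numBytesForFirstByte`. When `until` is `Integer.MAX_VALUE`, the end offset jumps straight to the end of the buffer instead of being advanced one character at a time.

```java
// Hypothetical sketch of the fast path; not Spark's UTF8String implementation.
final class SubstringFastPathSketch {
  // Length in bytes of a UTF-8 sequence, derived from its first byte.
  // Assumes the input is well-formed UTF-8.
  static int bytesForFirstByte(byte b) {
    int v = b & 0xFF;
    if (v < 0x80) return 1;
    if (v >= 0xF0) return 4;
    if (v >= 0xE0) return 3;
    return 2;
  }

  // Returns the characters in [start, until), both given as character indices.
  static String substring(byte[] utf8, int start, int until) {
    int i = 0; // byte offset
    int c = 0; // character index
    while (i < utf8.length && c < start) {
      i += bytesForFirstByte(utf8[i]);
      c++;
    }
    int j = i; // byte offset where the substring starts
    if (until == Integer.MAX_VALUE) {
      // "Until the end": no per-character scan is needed.
      i = utf8.length;
    } else {
      while (i < utf8.length && c < until) {
        i += bytesForFirstByte(utf8[i]);
        c++;
      }
    }
    return new String(utf8, j, i - j, java.nio.charset.StandardCharsets.UTF_8);
  }
}
```

Both branches return the same result for an open-ended request; the fast path only avoids decoding the lead byte of every remaining character.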
