Commit 2407066

tanelk authored and MaxGekk committed
[SPARK-46070][SQL] Compile regex pattern in SparkDateTimeUtils.getZoneId outside the hot loop
### What changes were proposed in this pull request?
Compile the regex patterns used in `SparkDateTimeUtils.getZoneId` outside of the method, which can be called for each dataset row.

### Why are the changes needed?
`String.replaceFirst` internally does `Pattern.compile(regex).matcher(this).replaceFirst(replacement)`. `Pattern.compile` is a very expensive method that should not be called in a loop.

When using a method like `from_utc_timestamp` with a non-literal timezone, `SparkDateTimeUtils.getZoneId` is called for each row. In one of my use cases, adding `from_utc_timestamp` increased the runtime from 15 min to 6 h.

### Does this PR introduce _any_ user-facing change?
Performance improvement.

### How was this patch tested?
Existing UTs

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43976 from tanelk/SPARK-46070_precompile_regex.

Authored-by: Tanel Kiis <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
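
To make the equivalence described above concrete, here is a minimal sketch that applies the same replacement both ways. The object and method names (`ReplaceFirstSketch`, `perCall`, `precompiled`) are hypothetical and are not part of Spark; only the regex and replacement strings are taken from this commit.

```scala
import java.util.regex.Pattern

// Hypothetical helper for illustration only; not part of Spark.
object ReplaceFirstSketch {
  private val regex = "(\\+|\\-)(\\d):"
  private val replacement = "$10$2:"

  // String.replaceFirst compiles the regex again on every call.
  def perCall(ids: Seq[String]): Seq[String] =
    ids.map(_.replaceFirst(regex, replacement))

  // The Pattern is compiled once and reused; only a new Matcher is created per input.
  private val compiled: Pattern = Pattern.compile(regex)

  def precompiled(ids: Seq[String]): Seq[String] =
    ids.map(id => compiled.matcher(id).replaceFirst(replacement))
}
```

Both methods should return the same strings, for example `Seq("+08:00", "-05:30", "UTC")` for the input `Seq("+8:00", "-5:30", "UTC")`. A compiled `Pattern` is immutable and safe to share between threads, while each `Matcher` is not, which is why the sketch shares the pattern but creates a matcher per input.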
1 parent 6e4d75a commit 2407066

1 file changed: +8 -5 lines changed

sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala

Lines changed: 8 additions & 5 deletions
@@ -20,6 +20,7 @@ import java.sql.{Date, Timestamp}
 import java.time.{Instant, LocalDate, LocalDateTime, LocalTime, ZonedDateTime, ZoneId, ZoneOffset}
 import java.util.TimeZone
 import java.util.concurrent.TimeUnit.{MICROSECONDS, NANOSECONDS}
+import java.util.regex.Pattern

 import scala.util.control.NonFatal

@@ -36,12 +37,14 @@ trait SparkDateTimeUtils {

   final val TimeZoneUTC = TimeZone.getTimeZone("UTC")

+  final val singleHourTz = Pattern.compile("(\\+|\\-)(\\d):")
+  final val singleMinuteTz = Pattern.compile("(\\+|\\-)(\\d\\d):(\\d)$")
+
   def getZoneId(timeZoneId: String): ZoneId = {
-    val formattedZoneId = timeZoneId
-      // To support the (+|-)h:mm format because it was supported before Spark 3.0.
-      .replaceFirst("(\\+|\\-)(\\d):", "$10$2:")
-      // To support the (+|-)hh:m format because it was supported before Spark 3.0.
-      .replaceFirst("(\\+|\\-)(\\d\\d):(\\d)$", "$1$2:0$3")
+    // To support the (+|-)h:mm format because it was supported before Spark 3.0.
+    var formattedZoneId = singleHourTz.matcher(timeZoneId).replaceFirst("$10$2:")
+    // To support the (+|-)hh:m format because it was supported before Spark 3.0.
+    formattedZoneId = singleMinuteTz.matcher(formattedZoneId).replaceFirst("$1$2:0$3")

     ZoneId.of(formattedZoneId, ZoneId.SHORT_IDS)
   }
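
For readers who want to see what the two patterns do to an input, the following is a standalone sketch of the patched logic. It assumes a hypothetical wrapper object `ZoneIdSketch` in place of the `SparkDateTimeUtils` trait, and it uses two `val`s where the commit uses a reassigned `var`; the patterns and replacement strings are copied from the diff.

```scala
import java.time.ZoneId
import java.util.regex.Pattern

// Hypothetical standalone wrapper; in Spark the method lives in the SparkDateTimeUtils trait.
object ZoneIdSketch {
  // Compiled once when the object is initialized, then reused by every call.
  private val singleHourTz = Pattern.compile("(\\+|\\-)(\\d):")
  private val singleMinuteTz = Pattern.compile("(\\+|\\-)(\\d\\d):(\\d)$")

  def getZoneId(timeZoneId: String): ZoneId = {
    // Pad a single-digit hour: (+|-)h:mm becomes (+|-)0h:mm.
    val hourPadded = singleHourTz.matcher(timeZoneId).replaceFirst("$10$2:")
    // Pad a single-digit trailing minute: (+|-)hh:m becomes (+|-)hh:0m.
    val minutePadded = singleMinuteTz.matcher(hourPadded).replaceFirst("$1$2:0$3")
    ZoneId.of(minutePadded, ZoneId.SHORT_IDS)
  }
}
```

With this sketch, `ZoneIdSketch.getZoneId("+8:00").getId` should be `"+08:00"` and `ZoneIdSketch.getZoneId("-03:5").getId` should be `"-03:05"`, while an id such as `"UTC"` matches neither pattern and passes through unchanged. In the replacement `"$10$2:"`, Java reads `$10` as group 1 followed by a literal `0` because the pattern has only two capturing groups.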
