
[SPARK-55144][SS] Introduce new state format version for performant stream-stream join #53930

Open

HeartSaVioR wants to merge 4 commits into apache:master from HeartSaVioR:SPARK-55144-on-top-of-SPARK-55129

Conversation

@HeartSaVioR (Contributor) commented Jan 23, 2026

What changes were proposed in this pull request?

This PR proposes to implement the new state format for stream-stream join, built on the new event-time-aware state key encoding.

The new state format eliminates the need for a full scan during eviction and when emitting unmatched rows. The eviction overhead should be bounded by the actual number of state rows to be evicted (indirectly driven by how far the watermark advances), but with the existing state format we perform a full scan, which can take more than 2 seconds over 1,000,000 rows even when there are zero rows to evict. With the new state format, the eviction overhead is bounded by the number of rows actually evicted, taking around 30 ms or less over 1,000,000 rows when there are zero rows to evict.

To achieve the above, we make a drastic change to the data structure: we move away from the logical array and introduce a secondary index in addition to the main data.

Each side of the join will use two (virtual) column families (four in total), as follows:

  • KeyWithTsToValuesStore
    • Primary data store
    • (key, event time) -> values
    • each element in values consists of (value, matched)
  • TsWithKeyTypeStore
    • Secondary index for efficient eviction
    • (event time, key) -> empty value (configured as multi-values)
      • numValues is tracked as the number of elements on the value side; a new element is added whenever a new value is appended to the values in the primary data store
      • This is to track the number of deleted rows accurately. It is optional, but the metric has proven useful, so we keep it as it is.

As the key formats imply, KeyWithTsToValuesStore will use TimestampAsPostfixKeyStateEncoderSpec, and TsWithKeyTypeStore will use TimestampAsPrefixKeyStateEncoderSpec.

The granularity of the event-time timestamp is 1 millisecond, in line with the granularity of watermark advancement. This granularity acts as a knob controlling the number of keys versus the number of values per key, trading off the granularity of watermark-based eviction against the size of the key space (which may impact performance).
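To make the two layouts concrete, here is a minimal sketch of how the serialized join key and the event-time timestamp could be combined, assuming a fixed 8-byte big-endian encoding of non-negative millisecond timestamps so that lexicographic byte order matches numeric timestamp order. The object and method names are hypothetical; this is not the actual encoder implementation:

```scala
import java.nio.ByteBuffer

// Illustration only: hypothetical helpers for the two key layouts.
// Assumes non-negative timestamps encoded big-endian, so unsigned
// lexicographic byte order equals numeric timestamp order.
object JoinKeyLayoutSketch {
  private val TsBytes = java.lang.Long.BYTES

  // KeyWithTsToValuesStore: (key, event time) -> values
  def keyWithTsPostfix(keyBytes: Array[Byte], eventTimeMs: Long): Array[Byte] =
    ByteBuffer.allocate(keyBytes.length + TsBytes)
      .put(keyBytes)
      .putLong(eventTimeMs) // timestamp as postfix: rows for one key sort by time
      .array()

  // TsWithKeyTypeStore: (event time, key) -> empty value
  def tsWithKeyPrefix(keyBytes: Array[Byte], eventTimeMs: Long): Array[Byte] =
    ByteBuffer.allocate(TsBytes + keyBytes.length)
      .putLong(eventTimeMs) // timestamp as prefix: eviction becomes a bounded
      .put(keyBytes)        // range scan up to the watermark
      .array()
}
```

With the prefix layout, eviction reduces to a range scan from the smallest timestamp up to the watermark, which is what keeps the eviction cost proportional to the number of evicted rows rather than the total state size.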

There are several follow-ups to this state format implementation, which can be addressed on top of it:

  • further optimizations with RocksDB offerings: WriteBatch (for batched writes), MGET, etc.
  • retrieving matched rows with the "scope" of timestamps (in time-interval join)
    • while the format is ready to support timestamp-ordered scans, this needs another state store API to define the range of keys to scan, which requires some effort

Why are the changes needed?

The cost of full-scan-based eviction is a major obstacle to lowering the latency of stream-stream join. In addition, the logic for maintaining the logical array is complicated to maintain, and its performance characteristics are less predictable given how deletion at a random index works (the value at the last index is moved into the deleted slot).
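To illustrate the difference in eviction cost, here is a rough sketch of the two strategies over a hypothetical timestamp-ordered index (the interface and names are assumptions for illustration; the real implementation goes through the state store API):

```scala
// Hypothetical view of the secondary index: (eventTimeMs, keyBytes) entries,
// iterable in ascending event-time order. Illustration only.
trait OrderedIndexSketch {
  def orderedIterator(): Iterator[(Long, Array[Byte])]
}

// Existing format, conceptually: every state row is visited to find the
// evictable ones, so the cost is O(total rows) even when nothing is evicted.
def evictByFullScan(
    allRows: Iterator[(Long, Array[Byte])],
    watermarkMs: Long): Seq[Array[Byte]] =
  allRows.filter { case (ts, _) => ts < watermarkMs }.map(_._2).toSeq

// New format, conceptually: the index is timestamp-ordered, so iteration can
// stop at the first entry at or beyond the watermark; cost is O(evicted rows).
def evictByRangeScan(index: OrderedIndexSketch, watermarkMs: Long): Seq[Array[Byte]] =
  index.orderedIterator().takeWhile { case (ts, _) => ts < watermarkMs }.map(_._2).toSeq
```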

Does this PR introduce any user-facing change?

No. At this point, this state format is not integrated with the actual stream-stream join operator; follow-up work is needed to integrate it and make the change user-facing.

How was this patch tested?

New UT suites, plus a refactoring of the existing suite to test both the time-window and time-interval cases.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions

JIRA Issue Information

=== Improvement SPARK-55144 ===
Summary: Introduce new state format version for performant stream-stream join
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@HeartSaVioR (Contributor Author)

NOTE: This is on top of PR #53911.

@HeartSaVioR (Contributor Author)

TODO tickets:

yet to cover the case of a regular join where there is no event time in either the join condition or the value

We have an idea to try (replacing the GET-and-PUT pattern for the count in the secondary index with a blind MERGE), though it may require a broader change due to the issue below.

https://issues.apache.org/jira/browse/SPARK-55131

After resolving the above, we can try the blind MERGE and check the performance improvement.

retrieving matched rows with the "scope" of timestamps (in time-interval join)

https://issues.apache.org/jira/browse/SPARK-55147

further optimizations with RocksDB offerings: WriteBatch (for batched writes), MGET, etc.

https://issues.apache.org/jira/browse/SPARK-55148

joinStateManager.get(key)
}

// FIXME: doc!
Contributor

Can we add comments for this?

Contributor Author

My bad, thanks for the reminder. Will do.

* @throws UnsupportedOperationException if called on an encoder that doesn't support event time
* as postfix.
*/
def encodeKeyForEventTimeAsPostfix(row: UnsafeRow, eventTime: Long): Array[Byte]
Contributor

Could we make this more generic? I don't think we should call out eventTime as such - we could just name it longType?

@HeartSaVioR (Contributor Author) commented Jan 25, 2026

I'd say we shouldn't generalize too much - this is coupled with the state store API change, and I'm not sure we want to introduce an API that just says it handles an additional long type. It should carry enough meaning to justify itself.

While I think "event time" has enough potential usages, "timestamp" is fine for me if "event time" sounds too narrow. I'd still want to keep the semantics of "time" here.

* @throws UnsupportedOperationException if called on an encoder that doesn't support event time
* as postfix.
*/
def decodeKeyForEventTimeAsPostfix(bytes: Array[Byte]): (UnsafeRow, Long)
Contributor

Same here as above

Contributor Author

Same here.
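For illustration, a hypothetical decode for the postfix layout sketched in the description: since the timestamp occupies a fixed 8 bytes at the end of the encoded key, it can be split off without knowing the key schema (the real decodeKeyForEventTimeAsPostfix returns an UnsafeRow rather than raw bytes):

```scala
import java.nio.ByteBuffer
import java.util.Arrays

// Illustration only: split an encoded (key ++ eventTime) byte array back into
// the serialized key and the big-endian 8-byte event time at the end.
def splitPostfixKey(encoded: Array[Byte]): (Array[Byte], Long) = {
  val boundary = encoded.length - java.lang.Long.BYTES
  val eventTime = ByteBuffer.wrap(encoded, boundary, java.lang.Long.BYTES).getLong
  (Arrays.copyOfRange(encoded, 0, boundary), eventTime)
}
```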

}
}
StructType(remainingSchema)
case _ =>
Contributor

When is this possible?

Contributor Author

New encoder specs for event time would be bound here. It's not handled at this point, and I filed a TODO ticket for it.

new StateStoreIterator(iter, rocksDbIter.closeIfNeeded)
}

class RocksDBEventTimeAwareStateOperations(cfName: String)
Contributor

Why not rename this to something more generic?

Contributor Author

Same here.

val eventTimeColsSet = eventTimeCols.map(_._1.exprId).toSet
if (eventTimeColsSet.size > 1) {
throw new AnalysisException(
errorClass = "_LEGACY_ERROR_TEMP_3077",
Contributor

Can we add a new error class for this?

Contributor Author

That was copied over from existing code IIRC - maybe file a JIRA ticket and handle it altogether?

@HeartSaVioR (Contributor Author)

@anishshri-db
Btw, just a reminder: there is another PR with a narrower change - #53911.
This PR is on top of that one and is probably quite large to review all at once.

@HeartSaVioR force-pushed the SPARK-55144-on-top-of-SPARK-55129 branch 2 times, most recently from a2c1977 to c6618d8 on February 2, 2026 13:09
@HeartSaVioR (Contributor Author)

TODO Update:

https://issues.apache.org/jira/browse/SPARK-55131

This is now the first PR in the stack. We can now update the logic here to replace the GET-and-PUT pattern for the count in the secondary index with a blind MERGE.
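To sketch the idea with a hypothetical counter-store interface (illustration only; the real change depends on the merge support tracked in SPARK-55131): a GET-and-PUT read-modify-write of the count can be replaced by a blind MERGE that pushes the addition down to the storage engine, in the spirit of RocksDB's merge operator:

```scala
// Hypothetical counter store, used only to contrast the two write patterns.
trait CounterStoreSketch {
  def get(key: Array[Byte]): Option[Long]
  def put(key: Array[Byte], value: Long): Unit
  def merge(key: Array[Byte], delta: Long): Unit // blind write, no prior read
}

// GET-and-PUT: one read plus one write per increment.
def incrementReadModifyWrite(store: CounterStoreSketch, key: Array[Byte]): Unit = {
  val current = store.get(key).getOrElse(0L)
  store.put(key, current + 1L)
}

// Blind MERGE: a single write; the engine folds the deltas together later
// (on read or compaction), as a RocksDB merge operator would.
def incrementBlind(store: CounterStoreSketch, key: Array[Byte]): Unit =
  store.merge(key, 1L)
```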

joinStateManager.get(key)
}

// FIXME: doc!
Contributor Author

Self review: add code comment

def convertToValueRow(value: UnsafeRow, matched: Boolean): UnsafeRow
}

class StreamingSymmetricHashJoinValueRowConverterFormatV1(
Contributor Author

Self review: add code comment

override def convertToValueRow(value: UnsafeRow, matched: Boolean): UnsafeRow = value
}

class StreamingSymmetricHashJoinValueRowConverterFormatV2(
Contributor Author

Self review: add code comment
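Based on the description's note that each element in values consists of (value, matched), a minimal sketch of what a format-v2 conversion could look like: append the matched flag to the value row as a trailing boolean column. The class below is an assumption for illustration, not the PR's actual implementation, though it uses real Spark catalyst APIs:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow, UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{BooleanType, StructField, StructType}

// Illustration only: store the user value plus a trailing "matched" boolean.
class ValueRowConverterV2Sketch(valueSchema: StructType) {
  private val storedSchema = valueSchema.add(StructField("matched", BooleanType))
  private val projection = UnsafeProjection.create(storedSchema)
  private val joinedRow = new JoinedRow()

  def convertToValueRow(value: UnsafeRow, matched: Boolean): UnsafeRow = {
    val matchedRow: InternalRow = new GenericInternalRow(Array[Any](matched))
    // Serialize (value, matched) into a single UnsafeRow with the stored schema.
    projection(joinedRow(value, matchedRow))
  }
}
```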

}
}

object StreamingSymmetricHashJoinValueRowConverter {
Contributor Author

Self review: add code comment

import org.apache.spark.sql.types.{BooleanType, DataType, LongType, NullType, StructField, StructType}
import org.apache.spark.util.NextIterator

trait SymmetricHashJoinStateManager {
Contributor Author

Self review: add code comment

)
// */

/*
Contributor Author

Self review: revert it

// We want to collect instance metrics from both state stores
keyWithIndexToValueMetrics.instanceMetrics ++ keyToNumValuesMetrics.instanceMetrics
)
*/
Contributor Author

Self review: revert it

KeyWithIndexToValueType
} else {
throw new IllegalArgumentException(s"Unknown join store name: $storeName")
// TODO: Add support of KeyWithTsToValuesType and TsWithKeyType
Contributor Author

Self review: may need a TODO JIRA ticket?

// State key is the partition key
new NoopStatePartitionKeyExtractor(stateKeySchema)
} else {
// TODO: Add support of KeyWithTsToValuesType and TsWithKeyType
Contributor Author

Self review: may need a TODO JIRA ticket?

}
}

case class KeyAndTsToValuePair(
Contributor Author

Self review: add code comment

HeartSaVioR added a commit that referenced this pull request Feb 18, 2026
…StateStore API

### What changes were proposed in this pull request?

This PR proposes to introduce iterator/prefixScan with multi-values in the StateStore API.

### Why are the changes needed?

The functionality is missing from the StateStore API, so when the caller sets multi-values for a specific CF, that CF doesn't support scanning through the data. The new functionality will be used in the new state format version for stream-stream join, specifically SPARK-55144 (#53930).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New UTs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: claude-4.5-sonnet

The above was used to create a new test suite. All other parts were not generated by an LLM.

Closes #54278 from HeartSaVioR/SPARK-55494.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
@HeartSaVioR force-pushed the SPARK-55144-on-top-of-SPARK-55129 branch from de4d53d to ec48b97 on February 19, 2026 07:22
@nyaapa (Contributor) left a comment

🎉

@eason-yuchen-liu (Contributor) left a comment

Made one pass, except for StreamingSymmetricHashJoinValueRowConverter.scala and the tests.

case _ =>
// Need a strategy about bucketing when event time is not available
// - first attempt: random bucketing
random.nextInt(bucketSizeForNoEventTime)
Contributor

Is it OK for extractEventTimeFn to return a non-deterministic result? Will it create a problem when we want to fetch the exact key in the future?

Contributor Author

We always scan through all buckets to find all the values associated with the key. Unlike time-interval join, where we could scope the timestamp range during scanning, this case needs to read all the values, so it's simply a trade-off between "a smaller number of buckets with more elements per bucket" and "a larger number of buckets with fewer elements per bucket".

Contributor Author

There is no operation where we have to look up a specific element.
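A sketch of the trade-off being described (hypothetical names; the diff's "first attempt" is random bucketing): with no event time available, a row is written to a random bucket, and reads enumerate every bucket for the key since no exact-element lookup is ever needed:

```scala
import scala.collection.mutable
import scala.util.Random

// Illustration only: random bucketing when no event time is available.
// The bucket id stands in for the timestamp portion of the state key.
class RandomBucketingSketch[V](bucketSizeForNoEventTime: Int) {
  private val random = new Random()
  private val data = mutable.Map.empty[(String, Int), Vector[V]]

  // Write: pick a random bucket. More buckets -> fewer values per bucket.
  def put(key: String, value: V): Unit = {
    val bucket = random.nextInt(bucketSizeForNoEventTime)
    data((key, bucket)) = data.getOrElse((key, bucket), Vector.empty) :+ value
  }

  // Read: scan all buckets for the key; the non-deterministic bucket choice
  // is harmless because every bucket is visited.
  def get(key: String): Iterator[V] =
    (0 until bucketSizeForNoEventTime).iterator
      .flatMap(b => data.getOrElse((key, b), Vector.empty))
}
```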

override def get(key: UnsafeRow): Iterator[UnsafeRow] = {
keyWithTsToValues.getValues(key).flatMap { result =>
result.values.map(_.value)
}.iterator
Contributor

redundant?

Contributor Author

Not sure - let me check with IDE...

Seq(StructField("dummy", NullType, nullable = true))
)

private val stateStoreCkptId: Option[String] = None
Contributor

Can you explain why this is hardcoded?

Contributor Author
Ah, nice catch. I think I left this for the integration work but forgot to leave a TODO comment. Let me do this...

)

private val stateStoreCkptId: Option[String] = None
private val handlerSnapshotOptions: Option[HandlerSnapshotOptions] = None
Contributor

same

Contributor Author

Same here.
