[SPARK-55144][SS] Introduce new state format version for performant stream-stream join #53930
HeartSaVioR wants to merge 4 commits into apache:master
Conversation
JIRA Issue Information: === Improvement SPARK-55144 === (This comment was automatically generated by GitHub Actions.)

NOTE: This is on top of the PR - #53911
TODO tickets:
- https://issues.apache.org/jira/browse/SPARK-55131 - We have an idea to try replacing the GET-and-PUT pattern for the count in the secondary index with a blind MERGE, though it may require a broader change due to this issue. After it is resolved, we can try blind MERGE and check the performance improvement (see the sketch after this list).
- https://issues.apache.org/jira/browse/SPARK-55147
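As a rough illustration of the GET-and-PUT vs. blind-MERGE idea above, here is a minimal in-memory sketch; plain maps stand in for the state store (in RocksDB the MERGE operator folds operands at read/compaction time, so the write path needs no base read at all), and every name is illustrative rather than the PR's actual code:

```scala
import scala.collection.mutable

object CountMaintenanceSketch {
  // Base values, standing in for the committed state of the secondary index.
  val counts = mutable.Map.empty[String, Long]

  // GET-and-PUT: every count update costs a read plus a write.
  def incrementGetPut(key: String): Unit = {
    val current = counts.getOrElse(key, 0L) // GET
    counts.update(key, current + 1L)        // PUT
  }

  // Blind MERGE: the write path only appends a delta operand; operands are
  // folded into the base value lazily, at read time.
  val pendingDeltas = mutable.Map.empty[String, List[Long]]

  def incrementBlindMerge(key: String): Unit =
    pendingDeltas.update(key, 1L :: pendingDeltas.getOrElse(key, Nil))

  def readCount(key: String): Long =
    counts.getOrElse(key, 0L) + pendingDeltas.getOrElse(key, Nil).sum
}
```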
    joinStateManager.get(key)
  }

  // FIXME: doc!
Can we add comments for this?

My bad, thanks for the reminder. Will do.
   * @throws UnsupportedOperationException if called on an encoder that doesn't support event time
   * as postfix.
   */
  def encodeKeyForEventTimeAsPostfix(row: UnsafeRow, eventTime: Long): Array[Byte]
Could we make this more generic? I don't think we should call out eventTime as such - can we just name it longType?

I'd say we shouldn't generalize too much - this is coupled with the state store API change, and I'm not sure we want to introduce an API that merely says it handles an additional long type. The name should carry enough meaning. While I think "event time" has enough potential for other usages, "timestamp" is fine with me if event time sounds too tight. I'd still want to keep the semantics of "time" here.
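For intuition about the postfix encoding discussed here, a minimal sketch assuming non-negative event times and a fixed-width big-endian long, so the binary sort order within a key prefix matches timestamp order; this is illustrative only, not the PR's actual encoder:

```scala
import java.nio.ByteBuffer
import java.util.Arrays

object EventTimePostfixSketch {
  // Append the event time as a fixed-width big-endian long. For non-negative
  // timestamps this keeps lexicographic byte order aligned with time order.
  def encodeWithEventTimePostfix(keyBytes: Array[Byte], eventTime: Long): Array[Byte] = {
    val buffer = ByteBuffer.allocate(keyBytes.length + java.lang.Long.BYTES)
    buffer.put(keyBytes)
    buffer.putLong(eventTime) // ByteBuffer writes big-endian by default
    buffer.array()
  }

  // Split the encoded bytes back into the original key bytes and event time.
  def decodeEventTimePostfix(bytes: Array[Byte]): (Array[Byte], Long) = {
    val keyLen = bytes.length - java.lang.Long.BYTES
    val keyBytes = Arrays.copyOfRange(bytes, 0, keyLen)
    val eventTime = ByteBuffer.wrap(bytes, keyLen, java.lang.Long.BYTES).getLong
    (keyBytes, eventTime)
  }
}
```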
   * @throws UnsupportedOperationException if called on an encoder that doesn't support event time
   * as postfix.
   */
  def decodeKeyForEventTimeAsPostfix(bytes: Array[Byte]): (UnsafeRow, Long)
      }
    }
    StructType(remainingSchema)
  case _ =>
When is this possible?

New encoder specs for event time would be bound here. It's not handled at this point, and I filed a TODO ticket for it.
    new StateStoreIterator(iter, rocksDbIter.closeIfNeeded)
  }

class RocksDBEventTimeAwareStateOperations(cfName: String)
Why not rename this to something more generic?
    val eventTimeColsSet = eventTimeCols.map(_._1.exprId).toSet
    if (eventTimeColsSet.size > 1) {
      throw new AnalysisException(
        errorClass = "_LEGACY_ERROR_TEMP_3077",
Can we add a new error class for this?

That was copied over from existing code IIRC - maybe file a JIRA ticket and handle it all together?
@anishshri-db

Force-pushed a2c1977 to c6618d8.
TODO Update: https://issues.apache.org/jira/browse/SPARK-55131 - this is now the first PR of the stacked PRs. We can now update the logic here to replace the GET-and-PUT pattern for the count in the secondary index with a blind MERGE.
Force-pushed 1a22c78 to cfaab40.
    joinStateManager.get(key)
  }

  // FIXME: doc!

Self review: add code comment
  def convertToValueRow(value: UnsafeRow, matched: Boolean): UnsafeRow
}

class StreamingSymmetricHashJoinValueRowConverterFormatV1(

Self review: add code comment
  override def convertToValueRow(value: UnsafeRow, matched: Boolean): UnsafeRow = value
}

class StreamingSymmetricHashJoinValueRowConverterFormatV2(

Self review: add code comment
  }
}

object StreamingSymmetricHashJoinValueRowConverter {

Self review: add code comment
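As a reading aid for the converter hierarchy above, here is a toy model (not the PR's code): the V1 override above passes the value row through untouched, while a V2-style format carries the matched flag as an extra trailing field; the UnsafeRow plumbing is replaced with plain Scala types, and whether V2 stores the flag exactly this way is an assumption:

```scala
// Toy stand-in for UnsafeRow-based value rows.
case class ValueRow(fields: Vector[Any])

trait ValueRowConverterModel {
  def convertToValueRow(value: ValueRow, matched: Boolean): ValueRow
  def extractValueAndMatched(stored: ValueRow): (ValueRow, Boolean)
}

// V1: the matched flag is not encoded in the value row at all.
object FormatV1Model extends ValueRowConverterModel {
  def convertToValueRow(value: ValueRow, matched: Boolean): ValueRow = value
  def extractValueAndMatched(stored: ValueRow): (ValueRow, Boolean) =
    (stored, false) // no flag stored; "matched" must be tracked elsewhere
}

// V2-style: the matched flag rides along as an extra trailing field.
object FormatV2Model extends ValueRowConverterModel {
  def convertToValueRow(value: ValueRow, matched: Boolean): ValueRow =
    ValueRow(value.fields :+ matched)
  def extractValueAndMatched(stored: ValueRow): (ValueRow, Boolean) =
    (ValueRow(stored.fields.init), stored.fields.last.asInstanceOf[Boolean])
}
```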
import org.apache.spark.sql.types.{BooleanType, DataType, LongType, NullType, StructField, StructType}
import org.apache.spark.util.NextIterator

trait SymmetricHashJoinStateManager {

Self review: add code comment
  )
  // */

  /*

Self review: revert it
      // We want to collect instance metrics from both state stores
      keyWithIndexToValueMetrics.instanceMetrics ++ keyToNumValuesMetrics.instanceMetrics
    )
  */

Self review: revert it
      KeyWithIndexToValueType
    } else {
      throw new IllegalArgumentException(s"Unknown join store name: $storeName")
      // TODO: Add support of KeyWithTsToValuesType and TsWithKeyType

Self review: may need a TODO JIRA ticket?
    // State key is the partition key
    new NoopStatePartitionKeyExtractor(stateKeySchema)
  } else {
    // TODO: Add support of KeyWithTsToValuesType and TsWithKeyType

Self review: may need a TODO JIRA ticket?
  }
}

case class KeyAndTsToValuePair(

Self review: add code comment
…StateStore API

### What changes were proposed in this pull request?
This PR proposes to introduce iterator/prefixScan with multi-values in the StateStore API.

### Why are the changes needed?
The functionality is missing on the StateStore API, so when the caller sets multi-values for a specific CF, that CF doesn't support scanning through the data. The new functionality will be used in the new state format version in stream-stream join, specifically SPARK-55144 (#53930).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UTs.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: claude-4.5-sonnet
The above is used for creating a new test suite. All other parts aren't generated by LLM.

Closes #54278 from HeartSaVioR/SPARK-55494.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Force-pushed de4d53d to ec48b97.
eason-yuchen-liu left a comment:

Had one pass, except for StreamingSymmetricHashJoinValueRowConverter.scala and the tests.
      case _ =>
        // Need a strategy about bucketing when event time is not available
        // - first attempt: random bucketing
        random.nextInt(bucketSizeForNoEventTime)
Is it OK for extractEventTimeFn to return a non-deterministic result? Will it create a problem when we want to fetch the exact key in the future?

We always scan through all buckets to figure out all the values associated with the key. Unlike time interval join, where we could scope the timestamp range during scanning, this case needs to read all the values, so it's simply a trade-off of "smaller number of buckets with more elements per bucket" vs "larger number of buckets with fewer elements per bucket".

There is no operation where we have to look up a specific element.
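To make that trade-off concrete, a hedged sketch of the bucketing idea with illustrative names (not the PR's actual code): writes pick a random bucket when no event time is available, and reads always union across all buckets, so correctness never depends on the bucket choice:

```scala
import scala.collection.mutable
import scala.util.Random

class NoEventTimeBucketingSketch(bucketSizeForNoEventTime: Int) {
  private val random = new Random()
  // bucket id -> values stored under that bucket for a single join key
  private val buckets = mutable.Map.empty[Int, List[String]]

  // Write path: non-deterministic bucket assignment is fine, because no
  // operation ever needs to address one specific bucket.
  def put(value: String): Unit = {
    val bucket = random.nextInt(bucketSizeForNoEventTime)
    buckets.update(bucket, value :: buckets.getOrElse(bucket, Nil))
  }

  // Read path: always scan every bucket, so the bucket count only trades
  // "fewer buckets with more values each" against "more buckets with fewer".
  def getAll: Iterator[String] =
    (0 until bucketSizeForNoEventTime).iterator
      .flatMap(b => buckets.getOrElse(b, Nil))
}
```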
    override def get(key: UnsafeRow): Iterator[UnsafeRow] = {
      keyWithTsToValues.getValues(key).flatMap { result =>
        result.values.map(_.value)
      }.iterator

Not sure - let me check with the IDE...
      Seq(StructField("dummy", NullType, nullable = true))
    )

    private val stateStoreCkptId: Option[String] = None

Can you explain why this is hardcoded?

Ahh, nice catch. I think I left this for the integration but forgot to leave a TODO comment. Let me do this...
    )

    private val stateStoreCkptId: Option[String] = None
    private val handlerSnapshotOptions: Option[HandlerSnapshotOptions] = None
What changes were proposed in this pull request?
This PR proposes to implement the new state format for stream-stream join, based on the new state key encoding with event time awareness.

The new state format focuses on eliminating the need for a full scan during eviction and when populating unmatched rows. The overhead of eviction should be bound to the actual number of state rows to be evicted (indirectly impacted by the amount of watermark advancement), but with the existing state format we have been doing a full scan, which could take more than 2 seconds over 1,000,000 rows even if there are zero rows to be evicted. The overhead of eviction with the new state format is bound to the actual number of state rows to be evicted, taking around 30ms or even less over 1,000,000 rows when there are zero rows to be evicted.
To achieve the above, we make a drastic change to the data structure, moving away from the logical array, and introduce a secondary index in addition to the main data.

Each side of the join will use two (virtual) column families (4 column families in total):

- KeyWithTsToValues: the main data, keyed by the join key with the event-time timestamp as a postfix
- TsWithKey: the secondary index, keyed by the event-time timestamp as a prefix followed by the join key

As the format of the key part implies, KeyWithTsToValuesStore will use TimestampAsPostfixKeyStateEncoderSpec, and TsWithKeyTypeStore will use TimestampAsPrefixKeyStateEncoderSpec. The granularity of the timestamp for event time is 1 millisecond, in line with the granularity of watermark advancement. This acts as a knob controlling the number of keys vs the number of values per key, trading off the granularity of eviction based on watermark advancement against the size of the key space (which may impact performance).
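To illustrate why eviction cost becomes proportional to the evicted rows rather than to the total state size, here is a minimal in-memory model of the two-family layout described above; a sorted map stands in for the timestamp-prefixed secondary index, and all names and types are illustrative rather than the PR's actual classes:

```scala
import scala.collection.mutable

object EvictionCostModel {
  // Secondary index: sorted by (eventTimeMs, joinKey), mirroring a
  // timestamp-as-prefix key encoding where a range scan can stop early.
  val tsWithKey = mutable.TreeMap.empty[(Long, String), Unit]
  // Main data: (joinKey, eventTimeMs) -> value rows for that key and time.
  val keyWithTsToValues = mutable.Map.empty[(String, Long), List[String]]

  def put(key: String, eventTimeMs: Long, value: String): Unit = {
    tsWithKey.update((eventTimeMs, key), ())
    keyWithTsToValues.update((key, eventTimeMs),
      value :: keyWithTsToValues.getOrElse((key, eventTimeMs), Nil))
  }

  // Eviction walks the secondary index in timestamp order and stops at the
  // watermark, so the cost tracks the number of evicted rows, not the total
  // state size - the core property claimed for the new format.
  def evictBelowWatermark(watermarkMs: Long): Unit = {
    val expired = tsWithKey.rangeUntil((watermarkMs, "")).keys.toList
    expired.foreach { case (ts, key) =>
      tsWithKey.remove((ts, key))
      keyWithTsToValues.remove((key, ts))
    }
  }
}
```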
There are several follow-ups to this state format implementation, which can be addressed on top of this; see the TODO tickets listed above.
Why are the changes needed?
The cost of eviction based on a full scan is a severe obstacle to making stream-stream join lower latency. Also, the logic for maintaining the logical array is complicated to maintain, and its performance characteristics are less predictable given the behavior of deleting an element at a random index (moving the value at the last index into the deleted slot).
Does this PR introduce any user-facing change?
No. At this point, this state format is not integrated with the actual stream-stream join operator; follow-up work is needed for the integration to finally make the change user-facing.
How was this patch tested?
New UT suites, plus refactoring of the existing suite to test both the time window and time interval cases.
Was this patch authored or co-authored using generative AI tooling?
No.