@ericm-db commented Dec 6, 2025

What changes were proposed in this pull request?

This PR decouples the versioning of OffsetSeqMetadata from that of OffsetLog, allowing the two to evolve independently while maintaining backward
compatibility with existing streaming checkpoints.

Key changes:

  1. Introduced OffsetSeqMetadataBase trait to abstract over metadata versions (V1 and V2)
  2. Added OffsetSeqMetadataV2 with new fields:
    - sourceMetadataInfo: Map of source metadata keyed by sourceId
    - controlBatchInfo: Information about control batches (e.g., rewind batches)
  3. Added OffsetSeqV2 class for VERSION_2 offset sequences
  4. Added STREAMING_OFFSET_LOG_FORMAT_VERSION config to control offset log versioning
  5. Updated OffsetSeqLog to support both VERSION_1 and VERSION_2 serialization formats
  6. Updated method signatures throughout the codebase to accept base types (OffsetSeqMetadataBase, OffsetSeqBase) instead of concrete types
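
The trait-based abstraction described above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the type names and the two V2 field names come from the list above, while the remaining fields (`batchWatermarkMs`, `batchTimestampMs`, `conf`), the `version` member, the value types of the new fields, and the `MetadataSketch.describe` helper are assumptions made for illustration.

```scala
// Simplified, hypothetical stand-ins for the types named in this PR.
// Only the type names, sourceMetadataInfo, and controlBatchInfo come from
// the PR description; everything else here is assumed.
sealed trait OffsetSeqMetadataBase {
  def version: Int
  def batchWatermarkMs: Long
  def batchTimestampMs: Long
  def conf: Map[String, String]
}

// V1 metadata: the existing shape, unchanged.
final case class OffsetSeqMetadata(
    batchWatermarkMs: Long = 0L,
    batchTimestampMs: Long = 0L,
    conf: Map[String, String] = Map.empty
) extends OffsetSeqMetadataBase {
  val version: Int = 1
}

// V2 metadata: the V1 fields plus the two new ones.
final case class OffsetSeqMetadataV2(
    batchWatermarkMs: Long = 0L,
    batchTimestampMs: Long = 0L,
    conf: Map[String, String] = Map.empty,
    // Keyed by sourceId; the String value type is an assumption.
    sourceMetadataInfo: Map[String, String] = Map.empty,
    // Control batch details (e.g. a rewind batch); the shape is an assumption.
    controlBatchInfo: Option[String] = None
) extends OffsetSeqMetadataBase {
  val version: Int = 2
}

object MetadataSketch {
  // Callers accept the base type and match only when they need
  // version-specific fields.
  def describe(m: OffsetSeqMetadataBase): String = m match {
    case v1: OffsetSeqMetadata =>
      s"v1 watermark=${v1.batchWatermarkMs}"
    case v2: OffsetSeqMetadataV2 =>
      s"v2 watermark=${v2.batchWatermarkMs} sources=${v2.sourceMetadataInfo.size}"
  }
}
```

Because the trait is sealed in this sketch, the compiler can flag any match that forgets a metadata version, which is one way the type-safety benefit could show up in practice.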

Why are the changes needed?

Previously, the OffsetSeqMetadata version was tightly coupled to the OffsetLog version, so any change to the metadata format required
bumping the entire offset log version. That made it difficult to evolve metadata independently.

This change enables:

  • Independent versioning of offset metadata without breaking checkpoint compatibility
  • Future extensibility for adding new metadata fields (e.g., source metadata, control batch information)
  • Better type safety through trait-based abstraction

Does this PR introduce any user-facing change?

No. The changes are backward compatible:

  • Existing checkpoints continue to work (VERSION_1 is the default)
  • The new VERSION_2 format is only used when explicitly configured via spark.sql.streaming.offsetLog.formatVersion=2
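
The opt-in behavior above can be sketched as a small resolver. This is an illustration under stated assumptions, not the PR's implementation: the config key is the one quoted above, but `OffsetLogVersionSketch`, `resolveVersion`, and the use of a plain `Map` in place of Spark's configuration object are hypothetical, and how unsupported values are actually handled is not stated in this description.

```scala
// Hypothetical helper; a plain Map stands in for Spark's SQL configuration.
object OffsetLogVersionSketch {
  // Config key as quoted in the PR description.
  val FormatVersionKey = "spark.sql.streaming.offsetLog.formatVersion"
  val Version1 = 1
  val Version2 = 2

  // Returns VERSION_1 when the key is unset, the configured value when it
  // is 1 or 2, and throws otherwise (rejecting other values is an assumption;
  // a non-numeric value fails with NumberFormatException from toInt).
  def resolveVersion(conf: Map[String, String]): Int =
    conf.get(FormatVersionKey).map(_.trim.toInt) match {
      case None => Version1
      case Some(v) if v == Version1 || v == Version2 => v
      case Some(other) =>
        throw new IllegalArgumentException(
          s"Unsupported offset log format version: $other")
    }
}
```

In a real session the key would presumably be set through the usual runtime config API, for example `spark.conf.set("spark.sql.streaming.offsetLog.formatVersion", "2")`, before the streaming query starts.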

How was this patch tested?

  • Added unit tests for OffsetSeqMetadataV2 serialization/deserialization
  • Added tests for VERSION_2 offset log format with control batch information

Was this patch authored or co-authored using generative AI tooling?

No
