Add transactional batch support for Cosmos DB Spark connector #47402
Conversation
Pull request overview
This PR implements atomic transactional batch operations for the Cosmos DB Spark connector, enabling users to execute multiple operations within the same logical partition atomically with all-or-nothing semantics.
Key Changes:
- Adds a `writeTransactionalBatch` API to `CosmosItemsDataSource` that converts flat DataFrame schemas to batch operations using the same pattern as standard Cosmos writes
- Implements `TransactionalBatchWriter` and `TransactionalBatchPartitionExecutor` classes following the `CosmosReadManyReader` pattern for consistency
- Provides comprehensive integration tests covering atomic creation, rollback on failure, simplified schemas with default upsert, operation ordering, and update scenarios
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/CosmosItemsDataSource.scala | Adds writeTransactionalBatch API methods and createBatchOperationExtraction function to convert DataFrame rows to batch operations |
| sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/TransactionalBatchWriter.scala | Implements core transactional batch execution logic with partition-based grouping and atomic execution |
| sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/CosmosConfig.scala | Adds ItemTransactionalBatch to the ItemWriteStrategy enumeration |
| sdk/cosmos/azure-cosmos-spark_3/docs/configuration-reference.md | Documents the transactional batch API with usage patterns, schema requirements, and constraints |
| sdk/cosmos/azure-cosmos-spark_3/src/test/scala/com/azure/cosmos/spark/TransactionalBatchITest.scala | Provides integration tests for atomic operations, rollback scenarios, simplified schemas, and operation ordering |
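Since the change adds `ItemTransactionalBatch` to the `ItemWriteStrategy` enumeration, here is a hedged sketch of how the strategy might be selected through the connector's existing write options. Whether the generic DataFrame writer path honors the new value (rather than only the dedicated `writeTransactionalBatch` API) is an assumption, and `df`, `cosmosEndpoint`, and `cosmosMasterKey` are placeholders.

```scala
// Sketch only: assumes the new strategy value is selectable via spark.cosmos.write.strategy.
df.write
  .format("cosmos.oltp")
  .option("spark.cosmos.accountEndpoint", cosmosEndpoint) // placeholder
  .option("spark.cosmos.accountKey", cosmosMasterKey)     // placeholder
  .option("spark.cosmos.database", "SampleDatabase")
  .option("spark.cosmos.container", "SampleContainer")
  .option("spark.cosmos.write.strategy", "ItemTransactionalBatch") // new enum value in this PR
  .mode("Append")
  .save()
```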
API Change Check: APIView identified API-level changes in this PR and created the following API reviews.
This commit implements atomic transactional batch operations for the Cosmos DB Spark connector, enabling users to execute multiple operations within the same logical partition atomically. All operations succeed together or fail together, ensuring data consistency.

CosmosItemsDataSource.scala: Adds the writeTransactionalBatch API with support for flat DataFrame schemas following the same pattern as standard Cosmos writes. The implementation converts DataFrame rows to JSON documents using CosmosRowConverter, automatically handling metadata columns like operationType (excluded from documents) and partitionKey (normalized to "pk" for Cosmos DB). When the operationType column is omitted, operations default to upsert. The extraction logic ensures proper Spark serialization by creating non-serializable objects within lambda closures rather than capturing them.

TransactionalBatchWriter.scala (new file): Implements the core batch execution logic following the CosmosReadManyReader pattern for consistency. Groups operations by partition key and executes them atomically using the Cosmos DB Batch API. Handles all five operation types (create, upsert, replace, delete, read) and returns detailed results including status codes, success indicators, result documents for read operations, and error messages for failures. Manages Cosmos client lifecycle, metadata cache broadcasting, and proper resource cleanup after execution.

TransactionalBatchITest.scala (new file): Provides comprehensive integration tests covering atomic creation, rollback on failure, simplified schemas with default upsert operations, operation ordering preservation, and update operations. Tests verify both successful scenarios and failure modes, including duplicate key conflicts that trigger atomic rollback. All tests use flat column DataFrames matching the standard Cosmos write pattern rather than JSON strings.

CosmosConfig.scala: Adds ItemTransactionalBatch to the ItemWriteStrategy enumeration to enable configuration and routing of transactional batch operations within the connector's write strategy framework.

configuration-reference.md: Documents the new transactional batch API with usage patterns, schema requirements, supported operations, constraints, and example code. Explains both simplified usage (default upsert) and explicit operation types, along with the atomic nature of batch operations and the 100-operation limit per partition key.
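The closure-serialization detail above can be illustrated with a generic sketch (not the PR's actual extraction code, which uses CosmosRowConverter): helpers such as a Jackson `ObjectMapper` are created inside `mapPartitions` on the executors instead of being captured from the driver.

```scala
// Generic sketch of the "create inside the closure" pattern described above.
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def toJsonStrings(df: DataFrame): RDD[String] = {
  df.rdd.mapPartitions { rows =>
    // Instantiated per partition on the executor rather than captured from the driver's scope.
    val mapper = new ObjectMapper()
    rows.map { row =>
      val node = mapper.createObjectNode()
      row.schema.fields.zipWithIndex.foreach { case (field, i) =>
        if (!row.isNullAt(i)) node.put(field.name, row.get(i).toString)
      }
      mapper.writeValueAsString(node)
    }
  }
}
```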
…s/spark/TransactionalBatchWriter.scala Co-authored-by: Copilot <[email protected]>
…s/spark/CosmosItemsDataSource.scala Co-authored-by: Copilot <[email protected]>
…al partition keys Added toJson() method to PartitionKeyDefinition to enable JSON serialization for Spark broadcast variables, since the class contains complex nested objects that cannot be directly serialized. The Spark connector broadcasts partition key definitions from driver to executors, requiring this serialization capability. Also updated TransactionalBatchWriter to use PartitionKeyBuilder with pattern matching for correct hierarchical partition key construction, fixing Tuple2 type errors in the previous size-based approach.
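For reference, a minimal sketch of the `PartitionKeyBuilder` pattern-matching approach mentioned above; the helper name and the fallback branch are assumptions, not the PR's actual code.

```scala
import com.azure.cosmos.models.{PartitionKey, PartitionKeyBuilder}

// Hypothetical helper: builds a (possibly hierarchical) partition key from the extracted
// partition key values by matching on each value's runtime type.
def buildPartitionKey(values: Seq[Any]): PartitionKey = {
  val builder = new PartitionKeyBuilder()
  values.foreach {
    case s: String           => builder.add(s)
    case n: java.lang.Number => builder.add(n.doubleValue())
    case b: Boolean          => builder.add(b)
    case null                => builder.addNullValue()
    case other               => builder.add(other.toString) // fallback; assumption for unsupported types
  }
  builder.build()
}
```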
* - The partition key column name must match the container's partition key path (e.g., "pk" if the path is "/pk").
* - The "id" column (String) is required.
* - An optional "operationType" column (String) can be provided to specify the operation ("create", "replace", "upsert", "delete", "read") for each row.
*   If not provided, the default operation is "upsert".
Shouldn't the etag pre-condition be taken into account as well, if it is specified in the input DF?
*
* Output DataFrame schema:
* - id: String
* - partition key column: String (name matches container's partition key path)
Why do we need an output DF? If any transaction fails, I think we should fail the entire operation - otherwise too much complexity leaks into user code.
I thought for observability? Users can see which partition keys had successful batch commits, useful for audit, etc. But you are right, the entire Spark job fails if any transaction fails, so I guess there's no partial success handling to be done in user code. We should just return Unit instead?
val defaultOperationExtraction = createBatchOperationExtraction(df, partitionKeyDefinition)

// Execute the batch with the pre-initialized client states
batchWriter.writeTransactionalBatchWithClientStates(df.rdd, defaultOperationExtraction, (clientMetadataBroadcast, partitionKeyDefBroadcast))
I think you have to ensure Spark partitioning by PK - and also order by it to have a predictable order (so that a change in PK means you can trigger flushing to the backend)?
I am removing broadcast per Annie, but we can discuss the order by as well.
Ordering within each partition is already handled by grouping operations by partition key value in TransactionalBatchPartitionExecutor, but I guess you mean explicit repartitioning like df.repartition($"partitionKeyColumn") is needed to make batching predictable, right?
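A minimal sketch of the repartition/sort idea being discussed here, assuming a single partition key column named "pk" (not part of the PR):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def prepareForBatching(df: DataFrame): DataFrame =
  df.repartition(col("pk"))           // co-locate rows that share a logical partition key
    .sortWithinPartitions(col("pk"))  // make PK changes contiguous so a batch can be flushed on each change
```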
// Group operations by partition key
// Use the full partition key values sequence as the key
private val operationsByPartitionKey: mutable.Map[Seq[Any], mutable.ArrayBuffer[BatchOperation]] = {
Anything you do in a map requires pushing the entire DF into memory - this is not going to work at scale. So the whole grouping by PK etc. needs to be pushed into the DF, so Spark can distribute the work and do the partitioning (to ensure all docs with the same PK value end up on the same executor) and then the grouping/ordering. That way you can simply look at the data row by row - and any change in PK value means you need to send the transactional batch. Also, you will need a construct like the one in BulkWriter.scala that first separates the incoming data by physical partition - to allow each TransactionalWriter to send transactional batches to all physical partitions.
Let us discuss this in a call - everything else looks like a good start - but getting this right is crucial to make this work at scale and be efficient (which means every Spark partition alone needs to be able to fully saturate the CDB backend - across all partitions).
Also, retries need to be added here - 408s will be very common, for example.
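A minimal sketch of the streaming, flush-on-PK-change approach described above; the `BatchOperation` case class and the `flush` callback are simplified stand-ins, not the connector's actual types.

```scala
import scala.collection.mutable.ArrayBuffer

case class BatchOperation(partitionKeyValues: Seq[Any], operationType: String, json: String)

// Assumes `rows` is already sorted by partition key within the Spark partition.
def executeStreaming(rows: Iterator[BatchOperation])(flush: Seq[BatchOperation] => Unit): Unit = {
  val buffer = ArrayBuffer.empty[BatchOperation]
  var currentPk: Option[Seq[Any]] = None

  rows.foreach { op =>
    // A change in partition key value flushes the buffered batch, so only one
    // partition key's operations are held in memory at a time.
    if (currentPk.exists(_ != op.partitionKeyValues)) {
      flush(buffer.toSeq)
      buffer.clear()
    }
    currentPk = Some(op.partitionKeyValues)
    buffer += op
  }

  if (buffer.nonEmpty) flush(buffer.toSeq)
}
```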
…d-buffer with streaming iterator to handle large datasets at scale. Add a new test to check that deliberately interleaved operations across 3 partition keys are grouped properly (reducing transitions from 6 to 2)
Description
This PR adds transactional batch operation support to the Azure Cosmos DB Spark 3 connector, enabling atomic multi-document operations within a single partition. The implementation follows the same pattern as standard Cosmos writes, converting flat DataFrame rows to JSON documents while supporting batch-specific metadata.
Key Features
- Optional `operationType` column (defaults to "upsert" if not provided)

Usage Examples
Example 1: Basic Upsert Operations (No Operation Type Specified)
When you don't specify an `operationType` column, all operations default to "upsert", as in the sketch below. The returned output DataFrame follows the schema described under "Output DataFrame Schema".
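A hedged sketch of what this could look like; the exact `writeTransactionalBatch` signature and configuration handling in this PR may differ, and the container's partition key path is assumed to be "/pk".

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("transactional-batch-sketch").getOrCreate()
import spark.implicits._

// Flat columns only; no operationType column, so every row defaults to "upsert".
val df = Seq(
  ("item-1", "tenant-A", "first document"),
  ("item-2", "tenant-A", "second document")
).toDF("id", "pk", "description")

val cosmosConfig = Map(
  "spark.cosmos.accountEndpoint" -> sys.env("COSMOS_ENDPOINT"),
  "spark.cosmos.accountKey"      -> sys.env("COSMOS_KEY"),
  "spark.cosmos.database"        -> "SampleDatabase",
  "spark.cosmos.container"       -> "SampleContainer"
)

// Hypothetical call shape; both rows share pk "tenant-A", so they commit as one atomic batch.
val results = CosmosItemsDataSource.writeTransactionalBatch(df, cosmosConfig)
results.show()
```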
Example 2: Financial Instrument Temporal Versioning (Hierarchical Partition Keys)
This example demonstrates hierarchical partition keys for temporal versioning of financial instruments. The partition key consists of two levels, `PermId` (instrument identifier) and `SourceId` (data source), as sketched below.
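A hedged sketch, reusing `spark` and `cosmosConfig` from Example 1 and assuming a container with hierarchical partition key paths ["/PermId", "/SourceId"]:

```scala
import spark.implicits._

// Both rows share the same (PermId, SourceId) pair, so they can be committed in one atomic batch.
val instrumentVersions = Seq(
  ("version-2024-01-01", "PERM123", "SRC-A", 101.25),
  ("version-2024-01-02", "PERM123", "SRC-A", 102.40)
).toDF("id", "PermId", "SourceId", "price")

// Same hypothetical call shape as in Example 1.
CosmosItemsDataSource.writeTransactionalBatch(instrumentVersions, cosmosConfig)
```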
Note: In this example, `PermId` and `SourceId` together form the hierarchical partition key (2 levels). All operations in the batch must use the same partition key values to maintain atomicity.

Example 3: Mixed Operations with Operation Type Column
Specify different operations per row using the `operationType` column, as in the sketch below.
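A hedged sketch with explicit operation types, again reusing `spark` and `cosmosConfig` from Example 1; supported values per this PR are "create", "upsert", "replace", "delete", and "read".

```scala
import spark.implicits._

// All rows share pk "tenant-A"; the operationType column is metadata and is not stored.
val mixedOps = Seq(
  ("item-1", "tenant-A", "new document",     "create"),
  ("item-2", "tenant-A", "updated document", "replace"),
  ("item-3", "tenant-A", null,               "delete"),
  ("item-4", "tenant-A", null,               "read")
).toDF("id", "pk", "description", "operationType")

CosmosItemsDataSource.writeTransactionalBatch(mixedOps, cosmosConfig)
```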
Input DataFrame Schema
Your DataFrame should have flat columns representing document properties: an `id` column (String), the partition key column(s) whose name(s) match the container's partition key path(s), any document property columns, and an optional `operationType` column (String).

Note: For hierarchical partition keys, include all partition key path columns (e.g., `PermId`, `SourceId`). The `operationType` column is metadata and is not included in the stored document.

Output DataFrame Schema
The returned DataFrame contains, for each operation, the `id` and partition key column(s), plus the status code, a success indicator, the result document for read operations, and an error message for failures.
Constraints
- All operations in a single batch must target the same logical partition key value(s).
- At most 100 operations are supported per partition key batch.
- The partition key column name(s) must match the container's partition key path(s), and an `id` column (String) is required.
Error Handling
The API returns detailed results for each operation: the status code, a success indicator, the result document for read operations, and an error message for failures. Because batches are atomic, a single failing operation (for example, a duplicate key conflict) causes the entire batch for that partition key to roll back and the Spark job to fail.
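A hedged sketch of consuming those results; the column names used below (`isSuccess`, `statusCode`, `errorMessage`) are assumptions based on the description above, not confirmed names from the PR.

```scala
import org.apache.spark.sql.functions.col

// `results` is the output DataFrame from Example 1.
val failed = results.filter(col("isSuccess") === false)
if (failed.count() > 0) {
  failed.select("id", "statusCode", "errorMessage").show(truncate = false)
}
```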