
Commit ed584ea

Merge pull request #120948 from kim-ale/patch-7
Add AddFileAction Size Limitation
2 parents 4ca4337 + c1a3a52 commit ed584ea

File tree

1 file changed (+9 −2 lines)


articles/stream-analytics/write-to-delta-lake.md

Lines changed: 9 additions & 2 deletions
@@ -84,8 +84,7 @@ At the failure of schema conversion, the job behavior will follow the [output da
 ### Delta Log checkpoints

-
-The Stream Analytics job will create Delta Log checkpoints periodically.
+The Stream Analytics job creates [Delta Log checkpoints](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1) periodically in the V1 format. Delta Log checkpoints are snapshots of the Delta table and typically contain the names of the data files generated by the Stream Analytics job. If the number of data files is large, this leads to large checkpoints, which can cause memory issues in the Stream Analytics job.

 ## Limitations
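The added paragraph above refers to V1 checkpoints. Per the Delta transaction protocol, a V1 checkpoint is a single Parquet file named by the zero-padded table version, and a `_delta_log/_last_checkpoint` file records which checkpoint is current. A minimal sketch of that naming scheme (the JSON values here are invented sample data, not output from a real table):

```python
import json

# A _last_checkpoint entry is a small JSON document; "size" counts the
# actions stored in the checkpoint (sample values for illustration only).
last_checkpoint = '{"version": 25, "size": 100000}'
info = json.loads(last_checkpoint)

# V1 checkpoints are a single Parquet file named by the 20-digit,
# zero-padded table version inside _delta_log/.
checkpoint_file = f"{info['version']:020d}.checkpoint.parquet"
print(checkpoint_file)  # 00000000000000000025.checkpoint.parquet
print(info["size"])     # number of actions snapshotted in the checkpoint
```

The larger the `size` (the number of actions, including one Add File action per retained data file), the larger the checkpoint Parquet file the job must read and write.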

@@ -100,6 +99,14 @@ The Stream Analytics job will create Delta Log checkpoints periodically.
 - Writing to existing tables of Writer Version 7 or above with writer features will fail.
   - Example: Writing to existing tables with [Deletion Vectors](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vectors) enabled will fail.
   - The exceptions here are the [changeDataFeed and appendOnly Writer Features](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#valid-feature-names-in-table-features).
+- When a Stream Analytics job writes a batch of data to a Delta Lake table, it can generate multiple [Add File Actions](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file). When too many Add File Actions are generated for a single batch, a Stream Analytics job can become stuck.
+  - The number of Add File Actions generated is determined by several factors:
+    - The size of the batch, which is determined by the data volume and the batching parameters [Minimum Rows and Maximum Time](https://learn.microsoft.com/azure/stream-analytics/blob-storage-azure-data-lake-gen2-output#output-configuration).
+    - The cardinality of the [partition column values](https://learn.microsoft.com/azure/stream-analytics/write-to-delta-lake#delta-lake-configuration) of the batch.
+  - To reduce the number of Add File Actions generated for a batch:
+    - Reduce the batching configurations [Minimum Rows and Maximum Time](https://learn.microsoft.com/azure/stream-analytics/blob-storage-azure-data-lake-gen2-output#output-configuration).
+    - Reduce the cardinality of the [partition column values](https://learn.microsoft.com/azure/stream-analytics/write-to-delta-lake#delta-lake-configuration) by tweaking the input data or choosing a different partition column.
+- Stream Analytics jobs can only read and write single-part V1 checkpoints. Multi-part checkpoints and the Checkpoint V2 format aren't supported.

 ## Next steps