[SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
### What changes were proposed in this pull request?
This patch prevents the cleanup operation in FileStreamSource if the source files belong to the output of a FileStreamSink. This is needed because the output of FileStreamSink can be read by multiple Spark queries, and those queries list the files via the sink's metadata log, which won't reflect the cleanup.
To keep the logic simple, the patch only handles the case where the source path (without a glob pattern) refers to the output directory of a FileStreamSink, by checking whether FileStreamSource leverages the metadata directory to list the source files.
### Why are the changes needed?
Without this patch, if end users turn on the cleanup option with a path that is the output of a FileStreamSink, the metadata log and the available files may get out of sync, which may break other queries reading the path.
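To make the scenario concrete, here is a minimal sketch in Scala of the setup this patch guards against; the paths, app name, and checkpoint locations are made up for illustration and are not taken from the PR:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructType, TimestampType}

val spark = SparkSession.builder().appName("file-sink-as-source").getOrCreate()

// Query A: writes its output through the file stream sink, which also maintains
// a _spark_metadata log under the output directory.
val writer = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("parquet")
  .option("path", "/data/output")              // hypothetical sink output path
  .option("checkpointLocation", "/chk/writer") // hypothetical checkpoint dir
  .start()

// Query B: reads the sink's output directory back as a file source with cleanup
// enabled. Files are listed via the sink's metadata log, so deleting them would
// leave the log referencing missing files; with this patch the cleanup is
// skipped in this situation.
val rateSchema = new StructType()
  .add("timestamp", TimestampType)
  .add("value", LongType)

val readBack = spark.readStream
  .format("parquet")
  .schema(rateSchema)
  .option("cleanSource", "delete")
  .load("/data/output")
```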
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added UT.
Closes apache#26590 from HeartSaVioR/SPARK-29953.
Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
docs/structured-streaming-programming-guide.md (1 addition & 1 deletion)
@@ -551,7 +551,7 @@ Here are the details of all the sources in Spark.
  When "archive" is provided, additional option <code>sourceArchiveDir</code> must be provided as well. The value of "sourceArchiveDir" must have 2 subdirectories (so depth of directory is greater than 2). e.g. <code>/archived/here</code>. This will ensure archived files are never included as new source files.<br/>
  Spark will move source files respecting their own path. For example, if the path of source file is <code>/a/b/dataset.txt</code> and the path of archive directory is <code>/archived/here</code>, file will be moved to <code>/archived/here/a/b/dataset.txt</code>.<br/>
  NOTE: Both archiving (via moving) or deleting completed files will introduce overhead (slow down) in each micro-batch, so you need to understand the cost for each operation in your file system before enabling this option. On the other hand, enabling this option will reduce the cost to list source files which can be an expensive operation.<br/>
- NOTE 2: The source path should not be used from multiple sources or queries when enabling this option.<br/>
+ NOTE 2: The source path should not be used from multiple sources or queries when enabling this option. Similarly, you must ensure the source path doesn't match any files in the output directory of a file stream sink.<br/>
  NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query.
  <br/><br/>
  For file-format-specific options, see the related methods in <code>DataStreamReader</code>
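For reference, here is a rough sketch of the "archive" cleanup mode the guide describes above; the source directory, schema, and checkpoint location are illustrative assumptions, not taken from the guide:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("file-source-archive").getOrCreate()

// Read text files from /a/b and archive each file after it is processed.
// With cleanSource=archive, a completed file such as /a/b/dataset.txt is moved
// to /archived/here/a/b/dataset.txt, preserving its original path.
val lines = spark.readStream
  .format("text")
  .schema(new StructType().add("value", StringType))
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "/archived/here") // must be nested deep enough, per the note above
  .load("/a/b")

val query = lines.writeStream
  .format("console")
  .option("checkpointLocation", "/chk/archive-demo") // hypothetical checkpoint dir
  .start()
```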