Merged
@@ -150,7 +150,7 @@ case class MergeIntoPaimonTable(
         case _ => false
       }
     }
-    if (hasUpdate(matchedActions)) {
+    if (hasUpdate(matchedActions) || notMatchedActions.nonEmpty) {
Contributor:
Can you explain this line in detail?
Contributor Author (@Aitozi, Feb 27, 2025):
When there is a matched action or a not-matched action, we should evaluate on the file splits after the inner join; otherwise, all the source rows would be treated as not matched by the target.

Contributor Author (@Aitozi):
The following test verifies this:

  test(s"Paimon MergeInto: only insert") {
    withTable("source", "target") {

      Seq((1, 100, "c11"), (3, 300, "c33")).toDF("a", "b", "c").createOrReplaceTempView("source")

      createTable("target", "a INT, b INT, c STRING", Seq("a"))
      spark.sql("INSERT INTO target values (1, 10, 'c1'), (2, 20, 'c2')")

      spark.sql(s"""
                   |MERGE INTO target
                   |USING source
                   |ON target.a = source.a
                   |WHEN NOT MATCHED
                   |THEN INSERT (a, b, c) values (a, b, c)
                   |""".stripMargin)

      checkAnswer(
        spark.sql("SELECT * FROM target ORDER BY a, b"),
        Row(1, 10, "c1") :: Row(2, 20, "c2") :: Row(3, 300, "c33") :: Nil)
    }
  }

Contributor Author (@Aitozi):
Hi @YannByron, I have another thought on this question and have opened another PR.

I think we should not use this condition to move the untouched files into the touched file set, since that increases the rewrite input/output size.

Instead, when the not-matched action is non-empty, these files should stay in the untouched set, and we only need to read the untouched files when a not-matched-by-source action is present:

touchedFilePathsSet ++= findTouchedFiles(
  targetDS.join(sourceDS, toColumn(mergeCondition), "inner"),
  sparkSession)
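The idea behind touched-file discovery can be sketched with plain collections (a simplified model with hypothetical names like `TargetRow` and `touchedFiles`; Paimon's real implementation works on Spark Datasets and file splits): each target row carries the data file it came from, and the inner join with the source keeps only the files that contain at least one matched row.

```scala
// Simplified model of touched-file discovery. TargetRow and touchedFiles
// are illustrative names, not Paimon's actual API.
object TouchedFilesSketch {
  case class TargetRow(key: Int, file: String)

  // A file is "touched" iff the inner join between target and source keeps
  // at least one of its rows, i.e. some source key matches a row in it.
  def touchedFiles(target: Seq[TargetRow], sourceKeys: Set[Int]): Set[String] =
    target.filter(r => sourceKeys.contains(r.key)).map(_.file).toSet

  def main(args: Array[String]): Unit = {
    val target = Seq(TargetRow(1, "f0"), TargetRow(2, "f0"), TargetRow(3, "f1"))
    // Source keys 1 and 4: only f0 holds a matched row, so only f0 is touched.
    println(touchedFiles(target, Set(1, 4)))
  }
}
```

With an insert-only merge, a source key that matches nothing (like 4 above) touches no file at all, which is why untouched files need not be rewritten.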
@@ -172,10 +172,15 @@ case class MergeIntoPaimonTable(

     // Add FILE_TOUCHED_COL to mark the row as coming from the touched file, if the row has not been
     // modified and was from touched file, it should be kept too.
-    val targetDSWithFileTouchedCol = createDataset(sparkSession, touchedFileRelation)
+    val touchedDsWithFileTouchedCol = createDataset(sparkSession, touchedFileRelation)
       .withColumn(FILE_TOUCHED_COL, lit(true))
-      .union(createDataset(sparkSession, unTouchedFileRelation)
-        .withColumn(FILE_TOUCHED_COL, lit(false)))
+    val targetDSWithFileTouchedCol = if (notMatchedBySourceActions.nonEmpty) {
+      touchedDsWithFileTouchedCol.union(
+        createDataset(sparkSession, unTouchedFileRelation)
+          .withColumn(FILE_TOUCHED_COL, lit(false)))
+    } else {
+      touchedDsWithFileTouchedCol
+    }

val toWriteDS =
constructChangedRows(sparkSession, targetDSWithFileTouchedCol).drop(ROW_KIND_COL)
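The effect of the new branch can be modeled without Spark (a hedged sketch; `ScanRow` and `rowsToEvaluate` are illustrative names): rows from untouched files enter the evaluation only when a NOT MATCHED BY SOURCE action exists, so an insert-only or update-only merge never reads them.

```scala
// Simplified model of the conditional union in the diff above.
object ScanPlanSketch {
  case class ScanRow(id: Int, fromTouchedFile: Boolean)

  // Mirrors the diff: start from touched-file rows; append untouched-file
  // rows only if a NOT MATCHED BY SOURCE action has to inspect them.
  def rowsToEvaluate(
      touched: Seq[ScanRow],
      untouched: Seq[ScanRow],
      hasNotMatchedBySource: Boolean): Seq[ScanRow] =
    if (hasNotMatchedBySource) touched ++ untouched else touched

  def main(args: Array[String]): Unit = {
    val touched = Seq(ScanRow(1, fromTouchedFile = true))
    val untouched = Seq(ScanRow(2, fromTouchedFile = false))
    // Without a NOT MATCHED BY SOURCE action only the touched rows survive.
    println(rowsToEvaluate(touched, untouched, hasNotMatchedBySource = false).size)
  }
}
```

This keeps the rewrite input proportional to the touched files in the common case, which is the size reduction the author's comment argues for.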