
Iceberg extension quietly ignores delete markers resulting in incorrect data #18858

@SamWheating

Description


If an Iceberg table using merge-on-read updates or deletes is ingested into Druid, the deleted rows will be ingested as well, because the extension reads the data files without applying the accompanying delete files.

As a simple example, we can create a quick Iceberg table using Spark:

import org.apache.spark.sql.functions.{current_timestamp, hours}
import spark.implicits._

val df = Seq(
    ("store_a", 1, 100),
    ("store_a", 2, 200),
    ("store_b", 3, 300),
    ("store_b", 4, 400),
).toDF("store_id", "item_count", "price_total")

// Create an hourly-partitioned table with merge-on-read updates enabled.
df.withColumn("ts", current_timestamp()).
    writeTo("demo.test_database.checkouts").
    using("iceberg").
    partitionedBy(hours($"ts")).
    tableProperty("write.update.mode", "merge-on-read").
    create()

Then update the table:

UPDATE demo.test_database.checkouts SET price_total = 0 WHERE store_id = 'store_a'
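
Because the table is in merge-on-read mode, this UPDATE writes delete files rather than rewriting the data files. As a quick sanity check (assuming an Iceberg version recent enough to expose the delete_files metadata table), you can see them from Spark:

spark.sql("""
  SELECT content, file_path, record_count
  FROM demo.test_database.checkouts.delete_files
""").show(truncate = false)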

Ingesting the table into Druid then picks up 6 rows instead of the expected 4, because both the pre-update and post-update versions of the store_a records are ingested. With rollup enabled, the double-counting is visible in the count and sum_price_total columns:

SELECT * FROM "checkouts"

{"__time":"2025-12-19T00:00:00.000Z","store_id":"store_a","count":4,"sum_item_count":6,"sum_price_total":300}
{"__time":"2025-12-19T00:00:00.000Z","store_id":"store_b","count":2,"sum_item_count":7,"sum_price_total":700}

This feels like a potential hazard which isn't explicitly called out in the documentation.

Ideally we would handle the delete markers and properly materialize the data, but that's a pretty big overhaul. As a simpler interim fix, should we just fail the ingestion if delete markers are present in the target partitions? A sketch of that check is below.
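
For illustration, a minimal guard using the Iceberg Java API from Scala; failIfDeletesPresent is a hypothetical helper name, and where exactly it would hook into the extension's scan planning is an open question:

import org.apache.iceberg.Table
import scala.jdk.CollectionConverters._

// Hypothetical guard: refuse to ingest a table whose current scan
// includes delete files, since they would otherwise be silently ignored.
def failIfDeletesPresent(table: Table): Unit = {
  val tasks = table.newScan().planFiles()
  try {
    val hasDeletes = tasks.asScala.exists(task => !task.deletes().isEmpty)
    if (hasDeletes) {
      throw new IllegalStateException(
        s"Table ${table.name()} has delete files that this ingestion " +
          "would ignore; compact or rewrite the table first.")
    }
  } finally {
    tasks.close()
  }
}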

Happy to help with the implementation here, or at least with updating the documentation to make this clearer; let me know what you think is the best path forward.
