feat: Ensure MOR table works with Lance base files and Avro log files #17768
Conversation
Running into issues with some tests, hence draft state for now.
// scalastyle:on

val writeConfig = client.getConfig
if (writeConfig.getRecordMerger.getRecordType == HoodieRecordType.SPARK && tableType == MERGE_ON_READ && writeConfig.getLogDataBlockFormat.orElse(HoodieLogBlockType.AVRO_DATA_BLOCK) != HoodieLogBlockType.PARQUET_DATA_BLOCK) {
Can you explain why this is needed for lance but not for the other file formats?
Originally, when I was running tests for MOR tables, I ran into the following exception while this condition was in place:
java.lang.UnsupportedOperationException: org.apache.hudi.DefaultSparkRecordMerger only support parquet log.
at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:513)
at org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$write$1(HoodieSparkSqlWriter.scala:193)
at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:211)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:133)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:171)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
My assumption for why this condition exists in Hudi is that getAvroBytes was originally not implemented in HoodieSparkRecord; the comment linked below says that only Parquet logs are supported for Spark records, not Avro:
https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java#L121
Since this change now implements getAvroBytes, the restriction that the Spark record merger only supports Parquet log blocks (and not Avro) no longer seems necessary.
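For context, a sketch of the write-side knob this guard keys off. The log data block format resolves to Avro when unset (getLogDataBlockFormat.orElse(AVRO_DATA_BLOCK)), so before this change one way to satisfy the guard for Spark-native records on MOR was to explicitly configure Parquet log data blocks. The config key below is my assumption of the relevant setting and may differ by Hudi version.

```java
// Sketch only (not PR code): write options that select Parquet log data blocks,
// which is the combination the old guard required for Spark-native records on MOR.
import java.util.HashMap;
import java.util.Map;

class ParquetLogBlockOptions {
  static Map<String, String> options() {
    Map<String, String> opts = new HashMap<>();
    // Assumed config key; when left unset the format defaults to Avro data blocks.
    opts.put("hoodie.logfile.data.block.format", "parquet");
    return opts;
  }
}
```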
Why does the DefaultSparkRecordMerger get hit here? That path should be dead now that we have merge mode
Looks like we need to write through as Row the whole way for now, since there is no Avro implementation.
+1. I can take a closer pass at why we need getAvroBytes now, and at how Parquet vs. Avro logs work for existing tables.
Currently the following tests were failing on latest master for the MOR case; however, they pass when run against the original feature branch (with older versions of Hudi and Lance): https://github.com/onehouseinc/hudi-internal/pull/1657. For now I have downgraded the dependencies in order to have the tests pass.
@rahil-c what is the issue with the newer version?
Force-pushed from 3b0709d to b12bddb.
  UTF8String recordKey = UTF8String.fromString(key.getRecordKey());
  updateRecordMetadata(row, recordKey, key.getPartitionPath(), getWrittenRecordCount());
- super.write(row);
+ super.write(row.copy());
This is required when running bulk insert if the incoming dataframe has more than one row. A follow-up is filed to fix the underlying issue: #17808
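For anyone reading along, a minimal sketch (not Hudi code) of the row-reuse hazard the copy() guards against: Spark's scan iterators typically reuse a single mutable InternalRow instance across next() calls, so code that holds on to a row beyond the current iteration has to copy it first.

```java
// Sketch only: buffering InternalRows from a Spark iterator. Without copy(), every
// buffered element can alias the same reused row object, so all entries silently
// end up with the values of the last row seen.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.sql.catalyst.InternalRow;

class RowBufferingSketch {
  static List<InternalRow> buffer(Iterator<InternalRow> rows) {
    List<InternalRow> buffered = new ArrayList<>();
    while (rows.hasNext()) {
      // Defensive copy; dropping copy() reproduces the more-than-one-row symptom.
      buffered.add(rows.next().copy());
    }
    return buffered;
  }
}
```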
so is this an existing issue for Lance writer in general, unrelated to MoR?
Yes
Force-pushed from b12bddb to 78829f1.
@vinothchandar Was wondering if you could review/sign off on this when you get a chance?
vinothchandar left a comment
Took one pass. Will take a deeper pass in an IDE.
  public ByteArrayOutputStream getAvroBytes(HoodieSchema recordSchema, Properties props) throws IOException {
-   throw new UnsupportedOperationException();
+   // Convert Spark InternalRow to Avro GenericRecord
+   if (data == null) {
This change is not Lance-specific, so I'd love to understand why it becomes necessary.
@vinothchandar
Originally I hit the following exception in TestLanceDataSource#testBasicUpsertModifyExistingRow when trying to upsert an existing row for the MOR case (where there should have been a Lance base file and an Avro log file).
You are correct, however, that it is not a Lance-specific issue: even after switching the test setup to use "PARQUET" by changing this line https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala#L70
I was hitting the same issue below:
Caused by: org.apache.hudi.exception.HoodieAppendException: Failed while appending records to /var/folders/lm/0j1q1s_n09b4wgqkdqbzpbkm0000gn/T/junit-11448262777148643233/dataset/test_lance_upsert_merge_on_read/.3169035e-e73a-49ec-be8f-c7045242bf56-0_20260115220744098.log.1_0-38-60
at org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:511)
at org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:470)
at org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:82)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:358)
... 35 more
Caused by: java.lang.UnsupportedOperationException
at org.apache.hudi.common.model.HoodieSparkRecord.getAvroBytes(HoodieSparkRecord.java:331)
at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock.serializeRecords(HoodieAvroDataBlock.java:122)
at org.apache.hudi.common.table.log.block.HoodieDataBlock.getContentBytes(HoodieDataBlock.java:132)
at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlocks(HoodieLogFormatWriter.java:147)
at org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:503)
... 38 more
When examining the frames of the stack trace, I can see that it goes through the upsert path into HoodieAppendHandle, which attempts to write a log file in HoodieAppendHandle#appendDataAndDeleteBlocks; see the following code pointer: https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java#L503
The actual block seems to be the HoodieAvroDataBlock, which contains a method called serializeRecords (hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java, line 122 at 30029e3):
ByteArrayOutputStream data = s.getAvroBytes(schema, props);
The actual record type being used in this case is HoodieSparkRecord, which did not have a getAvroBytes implementation; hence I implemented it for now.
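For reference, a rough sketch of the shape such an implementation could take, not the actual PR code: the Spark-backed record would first be converted to an Avro GenericRecord (that conversion step is omitted below), and the result is then binary-encoded into a ByteArrayOutputStream, which is what HoodieAvroDataBlock#serializeRecords consumes.

```java
// Sketch only: binary-encode an Avro GenericRecord into a ByteArrayOutputStream,
// roughly what a getAvroBytes implementation needs to return. The Spark
// InternalRow -> GenericRecord conversion is assumed to have happened already.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

class AvroBytesSketch {
  static ByteArrayOutputStream toAvroBytes(GenericRecord avroRecord, Schema schema) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    writer.write(avroRecord, encoder); // Avro binary encoding of the record
    encoder.flush();                   // flush encoder buffers into the stream
    return out;
  }
}
```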

vinothchandar left a comment
Approving the PR to unblock.
But I'd love for us to just write Lance files in the log, like we do for Parquet today, and avoid Avro logs as the default MoR write path here, given there can be "blobs" in the log files too, and Avro cannot handle those easily.


Describe the issue this Pull Request addresses
Issue #17626
This adds support for the MOR table type and checks that bulk insert, insert, update, and delete work successfully on this table type by generating Avro log files alongside Lance base files.
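For illustration, a rough sketch of the kind of MOR upsert the new tests exercise; the base-file-format option key/value and the field names below are assumptions for illustration and may not match what the PR actually uses.

```java
// Sketch only: upsert into a MOR table via the Hudi Spark datasource.
// "hoodie.table.base.file.format" = "LANCE" and the record key / precombine
// field names are assumptions, not taken from the PR.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

class LanceMorUpsertSketch {
  static void upsert(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "test_lance_upsert_merge_on_read")
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "id")  // assumed field
        .option("hoodie.datasource.write.precombine.field", "ts") // assumed field
        .option("hoodie.table.base.file.format", "LANCE")         // assumed key/value
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```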
Summary and Changelog
Implements getAvroBytes in HoodieSparkRecord
Impact
None
Risk Level
low
Documentation Update
none
Contributor's checklist