
Conversation

@rahil-c
Collaborator

@rahil-c rahil-c commented Jan 2, 2026

Describe the issue this Pull Request addresses

Issue #17626

This PR seeks to add support for the MOR table type, and adds checks to ensure bulk insert, insert, update, and delete work successfully on this table type by generating Avro log files alongside Lance base files.

Summary and Changelog

  • Implemented getAvroBytes in HoodieSparkRecord
  • Parameterized tests for MOR

Impact

None

Risk Level

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Jan 2, 2026
@rahil-c
Collaborator Author

rahil-c commented Jan 2, 2026

Running into issues with some tests, hence the draft state for now.

@rahil-c rahil-c changed the title Ensure MOR table works, with lance base files and avro logs file feat: Ensure MOR table works, with lance base files and avro logs file Jan 2, 2026
// scalastyle:on

val writeConfig = client.getConfig
if (writeConfig.getRecordMerger.getRecordType == HoodieRecordType.SPARK && tableType == MERGE_ON_READ && writeConfig.getLogDataBlockFormat.orElse(HoodieLogBlockType.AVRO_DATA_BLOCK) != HoodieLogBlockType.PARQUET_DATA_BLOCK) {
Contributor

Can you explain why this is needed for lance but not for the other file formats?

Collaborator Author

@rahil-c rahil-c Jan 2, 2026

Originally, when I was running tests for MOR tables, I ran into the following exception while this condition was present.

java.lang.UnsupportedOperationException: org.apache.hudi.DefaultSparkRecordMerger only support parquet log.

	at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:513)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$write$1(HoodieSparkSqlWriter.scala:193)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:211)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:133)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:171)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)

My assumption for why this condition is in Hudi is that getAvroBytes was not originally implemented in HoodieSparkRecord, as I see the following comment saying that only parquet logs are supported for Spark and not Avro.
https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java#L121

https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java#L333

Since this change now implements getAvroBytes, the restriction that the Spark merger only supports parquet log blocks (and not Avro) no longer seems necessary.
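For illustration, here is a minimal, hypothetical sketch (not code from this PR) of the kind of write the old guard effectively required: a MOR write that forces parquet log blocks through the hoodie.logfile.data.block.format option. The table name, columns, and path below are made up.

import org.apache.spark.sql.{SaveMode, SparkSession}

// Standalone illustration only; names and paths are hypothetical.
val spark = SparkSession.builder()
  .appName("mor-log-block-format-example")
  .master("local[2]")
  .getOrCreate()

val basePath = "/tmp/hudi_mor_log_format_example"
val df = spark.range(0, 10).selectExpr(
  "id",
  "cast(id as long) as ts",
  "concat('val_', id) as name",
  "'p1' as part")

df.write.format("hudi")
  .option("hoodie.table.name", "mor_log_format_example")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "part")
  // With the old guard, a SPARK-record-type merger on a MOR table only worked
  // when log data blocks were forced to parquet; otherwise the write failed with
  // "DefaultSparkRecordMerger only support parquet log."
  .option("hoodie.logfile.data.block.format", "parquet")
  .mode(SaveMode.Overwrite)
  .save(basePath)

With getAvroBytes implemented, forcing parquet log blocks should no longer be mandatory, which is the restriction being questioned here.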

Contributor

Why does the DefaultSparkRecordMerger get hit here? That path should be dead now that we have merge mode

Contributor

Looks like we need to write through as Row the whole way for now since there is no avro implementation

Member

+1. I can take a closer pass at why we need getAvroBytes now, and how parquet log vs avro log works for existing tables.

@rahil-c
Collaborator Author

rahil-c commented Jan 2, 2026

Currently the following tests were failing on the latest master for the MOR case:
[screenshot: failing MOR test results]

However, when running the original feature branch (older version of Hudi and older version of Lance), https://github.com/onehouseinc/hudi-internal/pull/1657, the results were as follows:
[screenshot: test results on the original feature branch]

For now, I have downgraded the dependencies in order to have the tests pass.

@rahil-c rahil-c marked this pull request as ready for review January 2, 2026 18:40
@the-other-tim-brown
Contributor

the-other-tim-brown commented Jan 2, 2026

For now, I have downgraded the dependencies in order to have the tests pass.

@rahil-c what is the issue with the newer version?

@the-other-tim-brown the-other-tim-brown force-pushed the rahil/hudi-lance-spark-datasource-crud-mor branch from 3b0709d to b12bddb Compare January 8, 2026 16:04
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Jan 8, 2026
UTF8String recordKey = UTF8String.fromString(key.getRecordKey());
updateRecordMetadata(row, recordKey, key.getPartitionPath(), getWrittenRecordCount());
super.write(row);
super.write(row.copy());
Contributor

This is required when running bulk insert if the incoming dataframe has more than one row. A follow-up is filed to fix the underlying issue: #17808
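To illustrate the general pitfall behind the .copy() (not necessarily the exact root cause tracked in the follow-up): Spark reuses row instances between iterator calls, so a writer that buffers rows before flushing must copy them, or every buffered entry ends up pointing at the same mutated object. A minimal sketch using UnsafeProjection, which reuses its output row in the same way:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String
import scala.collection.mutable.ArrayBuffer

val schema = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))
val toUnsafe = UnsafeProjection.create(schema)

// Simulate a writer that buffers rows before flushing them to a file.
val aliased = ArrayBuffer[InternalRow]()  // stores the reused row directly
val copied  = ArrayBuffer[InternalRow]()  // stores row.copy()

Seq((1L, "a"), (2L, "b"), (3L, "c")).foreach { case (id, name) =>
  // UnsafeProjection returns the same mutable row instance on every call.
  val row = toUnsafe(InternalRow(id, UTF8String.fromString(name)))
  aliased += row
  copied  += row.copy()
}

println(aliased.map(_.getLong(0)))  // ArrayBuffer(3, 3, 3): all entries see the last row
println(copied.map(_.getLong(0)))   // ArrayBuffer(1, 2, 3): copies are independent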

Member

So is this an existing issue for the Lance writer in general, unrelated to MoR?

Contributor

Yes

@the-other-tim-brown the-other-tim-brown force-pushed the rahil/hudi-lance-spark-datasource-crud-mor branch from b12bddb to 78829f1 Compare January 13, 2026 02:36
@hudi-bot
Collaborator

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@rahil-c
Collaborator Author

rahil-c commented Jan 15, 2026

@vinothchandar I was wondering if you could review/sign off on this when you get a chance?

Member

@vinothchandar vinothchandar left a comment

Took 1 pass. Will take a deeper pass in an IDE.

public ByteArrayOutputStream getAvroBytes(HoodieSchema recordSchema, Properties props) throws IOException {
throw new UnsupportedOperationException();
// Convert Spark InternalRow to Avro GenericRecord
if (data == null) {
Member

This change is not Lance specific, so I'd love to understand why it becomes necessary.

Collaborator Author

@rahil-c rahil-c Jan 16, 2026

@vinothchandar
Originally I hit the following exception in TestLanceDataSource#testBasicUpsertModifyExistingRow when trying to upsert an existing row for the MOR case (where there is a Lance base file and the update should go to an Avro log file).

You are correct, however, that it's not a Lance-specific issue. Even after switching the test setup to use "PARQUET" by changing this line https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala#L70
I was hitting the same issue below:

Caused by: org.apache.hudi.exception.HoodieAppendException: Failed while appending records to /var/folders/lm/0j1q1s_n09b4wgqkdqbzpbkm0000gn/T/junit-11448262777148643233/dataset/test_lance_upsert_merge_on_read/.3169035e-e73a-49ec-be8f-c7045242bf56-0_20260115220744098.log.1_0-38-60
	at org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:511)
	at org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:470)
	at org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:82)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:358)
	... 35 more
Caused by: java.lang.UnsupportedOperationException
	at org.apache.hudi.common.model.HoodieSparkRecord.getAvroBytes(HoodieSparkRecord.java:331)
	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock.serializeRecords(HoodieAvroDataBlock.java:122)
	at org.apache.hudi.common.table.log.block.HoodieDataBlock.getContentBytes(HoodieDataBlock.java:132)
	at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlocks(HoodieLogFormatWriter.java:147)
	at org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:503)
	... 38 more

When examining the frames of the stack trace, I can see that it is going through the upsert path to HoodieAppendHandle
[screenshot: debugger frames showing the upsert path]
and attempts to write a log file in HoodieAppendHandle#appendDataAndDeleteBlocks, at the following code pointer: https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java#L503

The actual block seems to be the HoodieAvroDataBlock:
[screenshot: debugger showing HoodieAvroDataBlock]

which contains a method called serializeRecords

ByteArrayOutputStream data = s.getAvroBytes(schema, props);

The actual record type being used in this case is HoodieSparkRecord, which did not have a getAvroBytes implementation, hence why I implemented it for now.
[screenshot: debugger showing HoodieSparkRecord]
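For reference, a minimal sketch of what producing Avro bytes for a single record looks like with plain Avro APIs, i.e. the shape of result (a ByteArrayOutputStream) that serializeRecords consumes from getAvroBytes. The schema and record below are made up for illustration; this is not the actual HoodieSparkRecord implementation, which additionally has to convert the InternalRow into an Avro record first.

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

// Hypothetical schema and record, for illustration only.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Example","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"name","type":"string"}
    |]}""".stripMargin)
val record: GenericRecord = new GenericData.Record(schema)
record.put("id", 1L)
record.put("name", "a")

// Serialize one Avro record into a ByteArrayOutputStream using a binary encoder.
def toAvroBytes(rec: GenericRecord, schema: Schema): ByteArrayOutputStream = {
  val out = new ByteArrayOutputStream()
  val writer = new GenericDatumWriter[GenericRecord](schema)
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  writer.write(rec, encoder)
  encoder.flush()
  out
}

println(s"serialized ${toAvroBytes(record, schema).size()} bytes")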

Member

@vinothchandar vinothchandar left a comment

Approving the PR to unblock.

But I'd love for us to just write Lance files in the log, like we do for parquet today, and avoid Avro logs as the default MoR write path here, given that there can be "blobs" in the log files too and Avro cannot handle that easily.

@the-other-tim-brown the-other-tim-brown merged commit d292847 into apache:master Jan 16, 2026
72 checks passed
