@the-other-tim-brown
Contributor

@the-other-tim-brown the-other-tim-brown commented Jan 16, 2026

Describe the issue this Pull Request addresses

The Lance writer keeps incoming records in an intermediate list and also writes them out to a buffer inside the Arrow writer. We can remove the intermediate list and write the values directly to the Arrow writer. This should avoid the extra row copy operations flagged in #17768 (comment)

Summary and Changelog

  • Define an interface within HoodieBaseLanceWriter that abstracts the engine-specific implementation.
  • Update the writer flow to use this interface to write each record as it arrives instead of buffering records in a list.
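
The two changes above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `BatchWriter`, `DirectWriter`, and the `onBatch` callback are hypothetical names, and the Arrow and Lance types are left out so the example runs on the JDK alone. It only shows the shape of the flow, where each record is handed to the engine-specific writer as it arrives and the batch is finalized once `batchSize` rows have been written.

```java
import java.util.function.IntConsumer;

/** Hypothetical engine-specific hook; the interface in the PR may look different. */
interface BatchWriter<T> {
    /** Copy one record straight into the columnar batch. */
    void write(T record);

    /** Finalize the current batch (for example, set its row count). */
    void finishBatch();

    /** Clear the batch so it can be reused for the next set of rows. */
    void reset();
}

/** Hypothetical base writer that forwards records without an intermediate list. */
class DirectWriter<T> implements AutoCloseable {
    private final int batchSize;
    private final BatchWriter<T> batchWriter;
    private final IntConsumer onBatch; // stand-in for "write the finished batch to the file"
    private int rowsInBatch = 0;

    DirectWriter(int batchSize, BatchWriter<T> batchWriter, IntConsumer onBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.batchWriter = batchWriter;
        this.onBatch = onBatch;
    }

    /** Writes the record immediately; flushes when the batch is full. */
    public void write(T record) {
        batchWriter.write(record);
        rowsInBatch++;
        if (rowsInBatch >= batchSize) {
            flush();
        }
    }

    /** Finalizes and emits the current batch, if it holds any rows. */
    private void flush() {
        if (rowsInBatch == 0) {
            return;
        }
        batchWriter.finishBatch();
        onBatch.accept(rowsInBatch);
        batchWriter.reset();
        rowsInBatch = 0;
    }

    /** Flushes any partially filled batch. */
    @Override
    public void close() {
        flush();
    }
}
```

In this sketch the only per-record state held by the base writer is a row counter, so memory use is bounded by the batch held inside the engine-specific writer rather than by a second list of records.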

Impact

Reduces memory overhead by no longer holding records in an intermediate list before they are written to the Arrow writer.

Risk Level

Low. The writer is already covered by existing tests.

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Jan 16, 2026
root = VectorSchemaRoot.create(getArrowSchema(), allocator);
}
// Finalize the arrow writer (sets row count on VectorSchemaRoot)
arrowWriter.finish();
Collaborator

nit: Instead of finish(), should the interface method be renamed to something clearer, like setRowCount()? Or does finish() handle other functionality besides setting the row count?

Contributor Author

It also performs some operations on the fields. I just made it match the implementation's name, but I think something like closeBatch or finishBatch may be more descriptive.

Collaborator

Makes sense; either of those names sounds good to me.

@the-other-tim-brown the-other-tim-brown marked this pull request as ready for review January 16, 2026 18:17
@the-other-tim-brown the-other-tim-brown force-pushed the 17808-remove-extra-buffering branch from 65fea30 to c16af1e Compare January 16, 2026 19:08
@the-other-tim-brown the-other-tim-brown force-pushed the 17808-remove-extra-buffering branch from c16af1e to 3b56819 Compare January 19, 2026 18:25
private final BufferAllocator allocator;
private final int batchSize;
/**
* -- GETTER --
Collaborator

[nit] Do we need this comment here? Or are the @Getter annotation and the field name already self-explanatory?

Contributor Author

The -- GETTER -- marker makes Lombok add the Javadoc to the generated getter method, but I agree this field is self-explanatory.

@hudi-bot
Collaborator

CI report:


@nsivabalan nsivabalan merged commit 56c78c9 into apache:master Jan 20, 2026
72 checks passed
@the-other-tim-brown the-other-tim-brown deleted the 17808-remove-extra-buffering branch January 20, 2026 21:54