[Data] - Port over changes from lance-ray into Ray Data #60497

Open
myandpr wants to merge 1 commit into ray-project:master from myandpr:migrate-lance-ray

Conversation

@myandpr
Member

@myandpr myandpr commented Jan 26, 2026

Description

Port lance-ray datasink features into Ray Data LanceDatasink: write retry, Lance namespaces, and driver-side commit flow.

Related issues

Link related issues: "Fixes #60147", "Closes #60147", or "Related to #60147".

Additional information

Implementation details

  • Write retry
    • Logic/parameters (from lance-ray): retry on LanceError(IO) + DataContext.retried_io_errors, with max_attempts=10 and max_backoff_s=32.
    • Execution framework (Ray-native): use ray._common.retry.call_with_retry to wrap write_fragments.
  • Lance namespaces (from lance-ray)
    • Add table_id / namespace_impl / namespace_properties to LanceDatasink.
    • Resolve/declare tables via DescribeTableRequest / DeclareTableRequest / CreateEmptyTableRequest.
    • Merge user storage_options with namespace-provided options.
    • Create and pass storage_options_provider to write_fragments, LanceDataset, and commit.
    • New helper module: python/ray/data/_internal/datasource/lance_utils.py.
  • Driver commit (aligned with lance-ray)
    • Keep the same driver-side commit behavior as lance-ray: collect fragments and use the last observed schema from write results, then commit.
    • This does not implement true schema merge/unification, since lance-ray itself doesn’t either.
  • High-level API
    • Dataset.write_lance() now accepts and forwards table_id / namespace_impl / namespace_properties.
  • Tests
    • Added a parameterized test to verify write_lance() forwards namespace arguments.
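
To illustrate the write-retry item above, here is a minimal, self-contained sketch of retrying on matching error messages with capped exponential backoff. It does not call ray._common.retry.call_with_retry; retry_with_backoff, flaky_write, and the simulated error text are hypothetical stand-ins, and only the max_attempts=10 and max_backoff_s=32 defaults come from the description.

```python
import random
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    *,
    match: Sequence[str],
    max_attempts: int = 10,
    max_backoff_s: float = 32.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only when the error message contains a pattern in match.

    Hypothetical helper for illustration; not Ray's implementation.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            retryable = any(pattern in str(exc) for pattern in match)
            # Give up on non-matching errors or once attempts are exhausted.
            if not retryable or attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter, capped at max_backoff_s.
            sleep(min(2 ** attempt, max_backoff_s) * random.random())
    raise AssertionError("unreachable")


# Simulated writer that fails twice with an IO-style error, then succeeds.
attempts = {"count": 0}


def flaky_write() -> List[str]:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("LanceError(IO): simulated transient failure")
    return ["fragment-0"]


# A no-op sleep keeps the demonstration fast.
fragments = retry_with_backoff(
    flaky_write, match=["LanceError(IO)"], sleep=lambda seconds: None
)
```

Errors whose message matches none of the patterns are re-raised immediately, so a schema mismatch, for example, would not be retried in this sketch.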

Testing

  • python -m pytest python/ray/data/tests/datasource/test_lance.py -q

Notes

  • Requires a lance/pylance version that supports storage_options_provider (validated locally with 1.0.3).
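
As a rough illustration of merging user storage_options with namespace-provided options, the sketch below copies the user options and then applies the namespace options on top, which matches the merged_storage_options.update(describe_response.storage_options) call shown in the diff. The function name merge_storage_options and the option keys are assumptions made for the example, not part of the PR.

```python
from typing import Dict, Optional


def merge_storage_options(
    user_options: Optional[Dict[str, str]],
    namespace_options: Optional[Dict[str, str]],
) -> Dict[str, str]:
    """Return a new dict; namespace-provided values win on key conflicts.

    Hypothetical helper for illustration only.
    """
    # Copy so the caller's dict is never mutated.
    merged = dict(user_options or {})
    if namespace_options:
        merged.update(namespace_options)
    return merged


# Illustrative keys only; real option names depend on the storage backend.
user = {"region": "us-east-1", "timeout": "30s"}
from_namespace = {"region": "us-west-2"}
merged = merge_storage_options(user, from_namespace)
```

Giving the namespace the last word is a design choice: the namespace is what resolved the table location, so its options are assumed to be the ones that actually reach that location.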

@myandpr myandpr requested a review from a team as a code owner January 26, 2026 16:32
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces support for Lance namespaces, write retry mechanisms, and driver-side commit flow to Ray Data's LanceDatasink. The changes involve modifying the LanceDatasink and _write_fragment functions to handle new parameters related to namespaces and retries, and adding a new utility module lance_utils.py for namespace management. The pydoclint-baseline.txt file was updated to remove previously reported docstring issues, indicating improved documentation. A new test case was added to verify the correct passing of namespace arguments. Overall, the changes are well-structured and align with the described features. The introduction of call_with_retry and namespace handling enhances the robustness and flexibility of Lance integration.

@myandpr
Member Author

myandpr commented Jan 26, 2026

Hey @goutamvenkat-anyscale, I implemented the #60147 migration (retry/namespace/driver commit). Could you take a quick look at the PR’s approach when you have a moment? I would love your review and to hear whether it aligns with your expectations. Thanks!

@ray-gardener ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Jan 26, 2026
@alexeykudinkin alexeykudinkin added the go (add ONLY when ready to merge, run all tests) label Feb 3, 2026
@myandpr
Member Author

myandpr commented Feb 4, 2026

note: CI failure looks infra-related (Docker client/daemon API mismatch) and likely tied to recent infra updates, not this PR’s changes.

Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale left a comment

Thanks for the change. Left a few comments.

uri: str,
uri: Optional[str] = None,
schema: Optional[pa.Schema] = None,
mode: Literal["create", "append", "overwrite"] = "create",
Contributor

Let's use SaveMode enum

Member Author

Updated.

describe_request = DescribeTableRequest(id=table_id)
describe_response = namespace.describe_table(describe_request)
self.uri = describe_response.location
if describe_response.storage_options:
    merged_storage_options.update(describe_response.storage_options)
Contributor

Append and overwrite seem to be functionally the same?

Member Author

@myandpr myandpr Feb 6, 2026

Thanks for calling this out. This branch was ported from lance-ray and keeps the same mode semantics (source: lance_ray/datasink.py, _BaseLanceDatasink.__init__, append/overwrite handling: https://github.com/lance-format/lance-ray/blob/342949e6ee0f7cfe2355951addfccaae57e39301/lance_ray/datasink.py#L79). They are similar when the table already exists, but behavior differs when it does not: append should fail, while overwrite falls back to _declare_table_with_fallback; commit behavior also differs (LanceOperation.Append vs LanceOperation.Overwrite).

The mode-handling branch has now been removed.
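
The append/overwrite difference described in this reply can be sketched with an in-memory stand-in. Everything here is hypothetical: FakeNamespace, resolve_table_uri, the local SaveMode enum, and the memory:// locations are illustrative only and are not the lance-ray or Ray Data API. The CREATE branch is simplified and does not check for an existing table.

```python
from enum import Enum
from typing import Dict, List, Tuple


class SaveMode(Enum):
    """Local stand-in for the save-mode enum mentioned in the review."""

    CREATE = "create"
    APPEND = "append"
    OVERWRITE = "overwrite"


class FakeNamespace:
    """In-memory stand-in for a namespace that maps table ids to locations."""

    def __init__(self) -> None:
        self.tables: Dict[Tuple[str, ...], str] = {}

    def describe_table(self, table_id: List[str]) -> str:
        key = tuple(table_id)
        if key not in self.tables:
            raise KeyError(f"table {table_id} does not exist")
        return self.tables[key]

    def declare_table(self, table_id: List[str]) -> str:
        key = tuple(table_id)
        self.tables[key] = "memory://" + "/".join(key)
        return self.tables[key]


def resolve_table_uri(
    namespace: FakeNamespace, table_id: List[str], mode: SaveMode
) -> str:
    """Resolve a table location following the semantics described above."""
    if mode is SaveMode.CREATE:
        return namespace.declare_table(table_id)
    try:
        # Both APPEND and OVERWRITE first look for an existing table.
        return namespace.describe_table(table_id)
    except KeyError:
        if mode is SaveMode.OVERWRITE:
            # OVERWRITE falls back to declaring the missing table.
            return namespace.declare_table(table_id)
        # APPEND to a missing table is an error.
        raise


ns = FakeNamespace()
overwrite_uri = resolve_table_uri(ns, ["db", "events"], SaveMode.OVERWRITE)
append_uri = resolve_table_uri(ns, ["db", "events"], SaveMode.APPEND)
```

Once the table exists, both modes resolve to the same location; they differ only when it is missing, and again later at commit time.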

Comment on lines +257 to +262
captured = {}

class _FakeLanceDatasink:
    def __init__(self, path, **kwargs):
        captured["path"] = path
        captured["kwargs"] = kwargs

def _fake_write_datasink(self, datasink, **kwargs):
    captured["datasink"] = datasink
    captured["write_kwargs"] = kwargs

monkeypatch.setattr(ray.data.dataset, "LanceDatasink", _FakeLanceDatasink)
monkeypatch.setattr(ray.data.Dataset, "write_datasink", _fake_write_datasink)
Contributor

Let's create a test fixture to create a fake lancedb

Member Author

Great point. I updated the test to use a fixture-backed fake namespace/LanceDB setup.

self.table_id = table_id
has_namespace_storage_options = True

if mode == "append":
Contributor

Can we separate the different mode handling in a different PR?

table_id: Optional[List[str]] = None,
*args: Any,
schema: Optional[pa.Schema] = None,
mode: Literal["create", "append", "overwrite"] = "create",
Contributor

The default value conflicts with the docstring, which says: "mode: The write mode. Default is 'append'. Choices are 'append', 'create', 'overwrite'."
https://github.com/ray-project/ray/pull/60497/changes#diff-79935e3b17cc6e14906191f95168a2caa7a4aaf4d5c50064a5bd75b4138e9afcR290

assert captured["kwargs"]["namespace_properties"] == namespace_properties
assert isinstance(captured["datasink"], _FakeLanceDatasink)


Contributor

Member Author

Thanks for sharing this, it’s very helpful. For this PR, I likely need to lean toward a fixture-based fake LanceDB testing style, but I’ll use your approach as a base and adapt/refactor it accordingly.

@myandpr myandpr force-pushed the migrate-lance-ray branch 2 times, most recently from a786aac to 6e68942 February 7, 2026 02:45
describe_response = namespace.describe_table(describe_request)
self.uri = describe_response.location
if describe_response.storage_options:
    merged_storage_options.update(describe_response.storage_options)

OVERWRITE mode lacks fallback when table doesn't exist

Medium Severity

When using namespaces, SaveMode.OVERWRITE is treated the same as SaveMode.APPEND - both require the table to exist via describe_table. According to the PR discussion, OVERWRITE mode should fall back to _declare_table_with_fallback when the table doesn't exist, allowing it to create the table. Currently, the else branch handles both APPEND and OVERWRITE identically, so OVERWRITE on a non-existent table would fail instead of creating it.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: yaommen <myanstu@163.com>
@myandpr
Member Author

myandpr commented Feb 8, 2026

note: CI failure looks not related to this PR (https://app.readthedocs.com/projects/anyscale-ray/builds/3732139/)

Labels

community-contribution (Contributed by the community), data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

Development

Successfully merging this pull request may close these issues.

[Data] - Port over changes from lance-ray into Ray Data

4 participants