@eddyxu eddyxu commented Dec 24, 2025

Add the Lance format as one of the packaged_modules.

import datasets

ds = datasets.load_dataset("org/lance_repo", split="train")

# Or

ds = datasets.load_dataset("./local/data.lance")

@eddyxu
Author

eddyxu commented Dec 24, 2025

Mentioned #7863 as well

@zhe-thoughts

@pdames for vis

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq
Member

lhoestq commented Dec 29, 2025

Cool! I notice that the current implementation doesn't support streaming because of the symlink hack.

I believe you can do something like this instead:

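import lance
import pyarrow as pa
# Key is assumed to be importable from datasets.builder

def _generate_tables(self, paths: list[str]):
    for path in paths:
        # Open each path directly as a Lance dataset
        ds = lance.dataset(path)
        for frag_idx, fragment in enumerate(ds.get_fragments()):
            # Read the fragment batch by batch instead of loading it whole
            for batch_idx, batch in enumerate(
                fragment.to_batches(columns=self.config.columns, batch_size=self.config.batch_size)
            ):
                table = pa.Table.from_batches([batch])
                table = self._cast_table(table)
                # (fragment index, batch index) identifies each yielded table
                yield Key(frag_idx, batch_idx), table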

Note that path can be a local path, but also an hf:// URI.
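To illustrate the keying pattern in the suggestion above without depending on Lance, here is a minimal standard-library sketch. Fragments are modeled as plain lists of rows, and `to_batches` and `generate_batches` are hypothetical stand-ins, not Lance or datasets APIs. It only shows that nesting the two `enumerate` loops in a generator yields batches lazily with a unique (fragment index, batch index) key each.

```python
from typing import Iterator

Row = dict
Batch = list[Row]


def to_batches(fragment: list[Row], batch_size: int) -> Iterator[Batch]:
    """Hypothetical stand-in for a fragment reader: slice rows lazily."""
    for start in range(0, len(fragment), batch_size):
        yield fragment[start:start + batch_size]


def generate_batches(
    fragments: list[list[Row]], batch_size: int
) -> Iterator[tuple[tuple[int, int], Batch]]:
    """Yield ((frag_idx, batch_idx), batch) pairs one batch at a time."""
    for frag_idx, fragment in enumerate(fragments):
        for batch_idx, batch in enumerate(to_batches(fragment, batch_size)):
            yield (frag_idx, batch_idx), batch


# Two toy "fragments" with three and two rows respectively
fragments = [
    [{"id": 0}, {"id": 1}, {"id": 2}],
    [{"id": 3}, {"id": 4}],
]

keys = []
rows = []
for key, batch in generate_batches(fragments, batch_size=2):
    keys.append(key)
    rows.extend(batch)
```

Because the function is a generator, a consumer can stop after the first batch without the remaining fragments ever being read, which is the property a streaming loader relies on.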
