Commit 42c11f5

Merge branch 'main' into patch-21
2 parents a12c854 + 025593f commit 42c11f5


76 files changed: +2278 −1161 lines changed

.github/workflows/ci.yml

Lines changed: 5 additions & 9 deletions
```diff
@@ -21,7 +21,7 @@ jobs:
     - name: Set up Python
       uses: actions/setup-python@v5
       with:
-        python-version: "3.9"
+        python-version: "3.10"
     - name: Install dependencies
       run: |
        python -m pip install --upgrade pip
@@ -49,18 +49,18 @@ jobs:
      run: |
        sudo apt update
        sudo apt install -y ffmpeg
-    - name: Set up Python 3.9
+    - name: Set up Python 3.10
      uses: actions/setup-python@v5
      with:
-       python-version: "3.9"
+       python-version: "3.10"
     - name: Setup conda env (windows)
       if: ${{ matrix.os == 'windows-latest' }}
       uses: conda-incubator/setup-miniconda@v2
       with:
        auto-update-conda: true
        miniconda-version: "latest"
        activate-environment: test
-       python-version: "3.9"
+       python-version: "3.10"
     - name: Setup FFmpeg (windows)
       if: ${{ matrix.os == 'windows-latest' }}
       run: conda install "ffmpeg=7.0.1" -c conda-forge
@@ -165,11 +165,7 @@ jobs:
     - name: Install uv
       run: pip install --upgrade uv
     - name: Install dependencies
-      run: |
-        uv pip install --system "datasets[tests_numpy2] @ ."
-        # TODO: remove once transformers v5 / huggingface_hub v1 are released officially
-        uv pip uninstall --system transformers huggingface_hub
-        uv pip install --system --prerelease=allow git+https://github.com/huggingface/transformers.git
+      run: uv pip install --system "datasets[tests_numpy2] @ ."
     - name: Print dependencies
       run: pip list
```

.github/workflows/release-conda.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -25,7 +25,7 @@ jobs:
         auto-update-conda: true
         auto-activate-base: false
         activate-environment: "build-datasets"
-        python-version: 3.9
+        python-version: 3.10
         channels: huggingface

     - name: Setup conda env
```

docs/source/image_dataset.mdx

Lines changed: 87 additions & 0 deletions
For more details on the WebDataset format and the python library, please check the [WebDataset documentation](https://webdataset.github.io/webdataset).

## Lance

[Lance](https://lance.org) is an open multimodal lakehouse table format. Lance tables can natively store not only text and scalar values, but also large binary objects (blobs) such as images, audio, and video alongside your tabular data.

Starting from image files on disk plus associated metadata (for example, captions and dimensions), you can write a self-contained Lance dataset to a local `*.lance` directory. The resulting table can store your metadata columns alongside an `image` column containing the encoded image bytes.

For example, you might start with metadata like:

```text
{'caption': 'Cordelia and Dudley on their wedding day last year', 'height': 315, 'width': 233}
{'caption': 'Statistics on challenges for automation in 2021', 'height': 299, 'width': 701}
```

You can define a `pyarrow` schema for your metadata and image bytes, build a table, and write it as a Lance dataset:

```python
import lance
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("caption", pa.utf8()),
        pa.field("height", pa.int32()),
        pa.field("width", pa.int32()),
        # ... add any additional metadata columns you want here ...
        pa.field("image", pa.binary()),
    ]
)

# Provide image files alongside metadata
rows = [
    {
        "image_path": "/path/to/images/0001.jpg",
        "caption": "Cordelia and Dudley on their wedding day last year",
        "height": 315,
        "width": 233,
    },
    {
        "image_path": "/path/to/images/0002.jpg",
        "caption": "Statistics on challenges for automation in 2021",
        "height": 299,
        "width": 701,
    },
]

image_bytes = []
for r in rows:
    with open(r["image_path"], "rb") as f:
        image_bytes.append(f.read())

table = pa.table(
    {
        "caption": [r["caption"] for r in rows],
        "height": [r["height"] for r in rows],
        "width": [r["width"] for r in rows],
        "image": image_bytes,
    },
    schema=schema,
)

ds = lance.write_dataset(
    table,
    "./images.lance",
    schema=schema,
    mode="create",
)
```

Here's a representative view of what a Lance table storing images might look like (the `image` column contains encoded bytes):

```text
+-----------------------------------------------+--------+-------+-----+------------------------+
| caption                                       | height | width | ... | image                  |
+-----------------------------------------------+--------+-------+-----+------------------------+
| "Cordelia and Dudley on their wedding ..."    | 315    | 233   | ... | b"\xff\xd8\xff...\xd9" |
| "Statistics on challenges for automation ..." | 299    | 701   | ... | b"\xff\xd8\xff...\xd9" |
+-----------------------------------------------+--------+-------+-----+------------------------+
```

Using this approach, you can store arbitrarily large image datasets in Lance. The resulting `images.lance/` directory with its `*.lance` files can be uploaded to the Hugging Face Hub, just like the other examples above. See the [`lance-format/laion-1m`](https://huggingface.co/datasets/lance-format/laion-1m) dataset on the Hub for an example of a Lance image dataset.
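If you want to push such a directory to the Hub yourself, a minimal sketch could look like the following (the repo id `username/my-lance-images` is a placeholder, and this assumes you are already logged in with `huggingface_hub`):

```python
from huggingface_hub import HfApi

api = HfApi()

# Create the dataset repo if it doesn't exist yet (placeholder repo id)
api.create_repo("username/my-lance-images", repo_type="dataset", exist_ok=True)

# Upload the whole Lance directory as a single folder
api.upload_folder(
    folder_path="./images.lance",
    path_in_repo="images.lance",
    repo_id="username/my-lance-images",
    repo_type="dataset",
)
```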
For more details on working with Lance datasets, see the [Lance documentation](https://lance.org).

docs/source/image_load.mdx

Lines changed: 33 additions & 0 deletions
## Lance

[Lance](https://lance.org) is an open multimodal lakehouse table format. Lance tables can natively store not only text and scalar values, but also large binary objects (blobs) such as images, audio, and video alongside your tabular data.

Lance keeps your metadata and image blobs together in one place, while still letting you efficiently scan only the metadata columns you care about without loading image bytes. When you're ready, you can fetch a small subset of rows (including the image blobs) and write them directly to files on your local filesystem.

```python
from pathlib import Path

from datasets import load_dataset

# Stream the dataset as a Hugging Face dataset
ds = load_dataset(
    "lance-format/laion-1m",
    split="train",
    streaming=True
)

dir_name = "laion_samples"
Path(dir_name).mkdir(exist_ok=True)

for idx, row in enumerate(ds.take(3)):
    with open(f"{dir_name}/{idx}.jpg", "wb") as f:
        f.write(row["image"])
```

In this example, the `image` column contains the encoded image bytes, so you can write them directly to `.jpg` files.

> [!NOTE]
> The `datasets` API doesn't currently push down operations to the Lance table, so for larger datasets it may be slow.
> For now, you'll get much better performance using the `lance` Python package directly. See the
> documentation on [the Hub](https://huggingface.co/docs/datasets-lance) for usage examples.
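As a rough sketch of that direct approach (assuming you first download the dataset repository locally; the `.lance` directory name inside the repo is a placeholder here, not the actual layout of `lance-format/laion-1m`):

```python
import lance
from huggingface_hub import snapshot_download

# Download the dataset repository locally
local_dir = snapshot_download("lance-format/laion-1m", repo_type="dataset")

# Open the Lance dataset; "data.lance" is a placeholder directory name
ds = lance.dataset(f"{local_dir}/data.lance")

# Scan only the metadata column you care about, without reading the image bytes
captions = ds.to_table(columns=["caption"], limit=3)
print(captions)
```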
## Image decoding

By default, images are decoded sequentially as `PIL.Images` when you iterate on a dataset.

docs/source/installation.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # Installation

-Before you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.9+**.
+Before you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.10+**.

 > [!TIP]
 > If you want to use 🤗 Datasets with TensorFlow or PyTorch, you'll need to install them separately. Refer to the [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2-packages-are-available) or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) for the specific install command for your framework.
```

docs/source/loading.mdx

Lines changed: 20 additions & 0 deletions
For now only the Arrow streaming format is supported. The Arrow IPC file format (also known as Feather V2) is not supported.

### Lance

[Lance](https://lance.org) is an open multimodal lakehouse table format for AI. Lance tables can natively store not only text and scalar values, but also large binary objects (blobs) such as images, audio, and video alongside your tabular data.

```py
>>> from datasets import load_dataset
>>> lance_base_url = "lance-format/laion-1m"
```

To stream the dataset without copying it to your local machine, specify the `streaming=True` parameter:

```py
ds = load_dataset(lance_base_url, split="train", streaming=True)
# Take the first three rows
for row in ds.take(3):
    print(row["caption"], row["image"])
```

This will return the image caption and the image bytes in a single request.
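For example, since the `image` column holds raw encoded bytes, you could decode a row into a `PIL` image like this (a minimal sketch, assuming Pillow is installed):

```py
import io

from PIL import Image

for row in ds.take(1):
    img = Image.open(io.BytesIO(row["image"]))
    print(row["caption"], img.size)
```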
## HDF5 files

[HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:

docs/source/package_reference/main_classes.mdx

Lines changed: 1 addition & 0 deletions
```diff
@@ -176,6 +176,7 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
     - skip
     - take
     - shard
+    - reshard
     - repeat
     - to_csv
     - to_pandas
```

docs/source/stream.mdx

Lines changed: 15 additions & 1 deletion
The sentence "If your dataset has `dataset.num_shards==1`, you should chunk it using [`IterableDataset.skip`] and [`IterableDataset.take`] instead." is replaced with:

To increase the number of shards of a dataset, you can use [`IterableDataset.reshard`]:

```py
>>> dataset.reshard()
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 3600
})
```

The resharding mechanism depends on the dataset file format. For Parquet, for example, resharding uses row groups instead of one file per shard. See how it works for every format in the [`IterableDataset.reshard`] documentation.

If your dataset has `dataset.num_shards==1` even after resharding, you should chunk it using [`IterableDataset.skip`] and [`IterableDataset.take`] instead.

## Interleave

docs/source/use_with_pytorch.mdx

Lines changed: 3 additions & 0 deletions
Otherwise, each node keeps 1 example out of `world_size`, skipping the other examples.

This can also be combined with a `torch.utils.data.DataLoader` if you want each node to use multiple workers to load the data.

> [!WARNING]
> If you shuffle your iterable dataset in a distributed setup, make sure to set a fixed `seed` in [`IterableDataset.shuffle`] so that every node uses the same shuffled list of shards and knows which shards it should skip.
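A minimal sketch of that pattern might look like this (the data files path is a placeholder, and the rank and world size are read from the environment variables set by `torchrun`):

```py
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

ds = load_dataset("parquet", data_files="data/*.parquet", split="train", streaming=True)

# Fixed seed: every node shuffles the shard list identically
ds = ds.shuffle(seed=42, buffer_size=1000)

# Each node then keeps only its own shards
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```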

docs/source/video_dataset.mdx

Lines changed: 120 additions & 0 deletions
## Lance

[Lance](https://lance.org) is an open multimodal lakehouse table format. Lance tables can natively store not only text and scalar values, but also large binary objects (blobs) such as images, audio, and video alongside your tabular data.

Lance provides a [blob API](https://lance.org/guide/blob/) that makes it convenient to store and retrieve large blobs in Lance datasets. The following example shows how to efficiently browse metadata without loading the heavier video blobs, then fetch the relevant video blobs on demand.

Here's a representative view of what a Lance table storing videos might look like (the `video_blob` column uses Lance's blob encoding):

```text
+------------------------------------------+-----------------+-----+------------------------------------+
| caption                                  | aesthetic_score | ... | video_blob                         |
+------------------------------------------+-----------------+-----+------------------------------------+
| "a breathtaking view of a mounta..."     | 5.2401          | ... | {position: 0, size: 4873879}       |
| "a captivating view of the sun, b..."    | 5.2401          | ... | {position: 4873920, size: 3370571} |
+------------------------------------------+-----------------+-----+------------------------------------+
```

### Write a Lance dataset from raw video files

Starting from raw video files on disk plus associated metadata (for example, captions and scores), you can write a self-contained Lance dataset to a local `*.lance` directory (a Lance dataset is a directory on disk, and it's common to name it with a `.lance` suffix):

```py
import lance
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("caption", pa.utf8()),
        pa.field("aesthetic_score", pa.float64()),
        pa.field(
            "video_blob",
            pa.large_binary(),
            metadata={"lance-encoding:blob": "true"},
        ),
    ]
)

# Provide video files alongside metadata
rows = [
    {
        "video_path": "/path/to/videos/0001.mp4",
        "caption": "a breathtaking view of a mountainous landscape ...",
        "aesthetic_score": 5.240138053894043,
    },
    {
        "video_path": "/path/to/videos/0002.mp4",
        "caption": "a captivating view of the sun, bathed in hues ...",
        "aesthetic_score": 5.240137100219727,
    },
]

video_bytes = []
for r in rows:
    with open(r["video_path"], "rb") as f:
        video_bytes.append(f.read())

table = pa.table(
    {
        "caption": [r["caption"] for r in rows],
        "aesthetic_score": [r["aesthetic_score"] for r in rows],
        "video_blob": video_bytes,
    },
    schema=schema,
)

ds = lance.write_dataset(
    table,
    "./videos.lance",
    schema=schema,
    mode="create",
)
```

This stores your metadata and video bytes together inside `videos.lance/`, so you can move/copy a single directory without having to keep separate `*.mp4` files in sync.
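As a rough sketch of the browsing half of the workflow described above (scanning the lightweight metadata columns without reading the video blobs; fetching the blob bytes themselves goes through the blob API described in the Lance documentation):

```py
import lance

ds = lance.dataset("./videos.lance")

# Scan only the metadata columns; the video blobs are not read here
meta = ds.to_table(columns=["caption", "aesthetic_score"])
print(meta)

# Narrow down to the rows you actually want, e.g. with a filter
best = ds.to_table(columns=["caption"], filter="aesthetic_score > 5.24")
```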
267+
268+
You can upload the resulting `videos.lance/` directory to the Hub (for example with `huggingface_hub.HfApi.upload_folder`) and share it as a
269+
dataset repository, keeping the metadata and videos together as a single artifact.
270+
271+
> [!TIP]
272+
> Lance datasets scale to very large sizes (terabytes and beyond) since the data is stored in a columnar format on disk.
273+
> See the [blob API](https://lance.org/guide/blob/) guide for the latest information on best practices for storing and retrieving
274+
> large blobs in Lance.
275+
276+
When writing large datasets, it's typically best to limit the size of each individual `*.lance` file to a few gigabytest at most.
277+
Simply gather the data via an iterator and specify the `max_bytes_per_file` parameter when writing the dataset:
278+
279+
```python
280+
MAX_BYTES_PER_FILE = 5 * 1024 * 1024 * 1024 # ~5 GB per file
281+
282+
# Write as Lance dataset with file size limits for each *.lance file
283+
ds = lance.write_dataset(
284+
table,
285+
"./videos.lance",
286+
schema=schema,
287+
mode="create",
288+
max_bytes_per_file=MAX_BYTES_PER_FILE,
289+
)
290+
```
291+
292+
For more details on working with Lance datasets, see the [Lance documentation](https://lance.org).
