Commit 42c11f5

Merge branch 'main' into patch-21
2 parents a12c854 + 025593f commit 42c11f5


76 files changed: +2278 −1161 lines changed

.github/workflows/ci.yml

Lines changed: 5 additions & 9 deletions
```diff
@@ -21,7 +21,7 @@ jobs:
     - name: Set up Python
       uses: actions/setup-python@v5
       with:
-        python-version: "3.9"
+        python-version: "3.10"
     - name: Install dependencies
       run: |
        python -m pip install --upgrade pip
@@ -49,18 +49,18 @@ jobs:
      run: |
        sudo apt update
        sudo apt install -y ffmpeg
-    - name: Set up Python 3.9
+    - name: Set up Python 3.10
      uses: actions/setup-python@v5
      with:
-       python-version: "3.9"
+       python-version: "3.10"
     - name: Setup conda env (windows)
       if: ${{ matrix.os == 'windows-latest' }}
       uses: conda-incubator/setup-miniconda@v2
       with:
        auto-update-conda: true
        miniconda-version: "latest"
        activate-environment: test
-       python-version: "3.9"
+       python-version: "3.10"
     - name: Setup FFmpeg (windows)
       if: ${{ matrix.os == 'windows-latest' }}
       run: conda install "ffmpeg=7.0.1" -c conda-forge
@@ -165,11 +165,7 @@ jobs:
     - name: Install uv
       run: pip install --upgrade uv
     - name: Install dependencies
-      run: |
-        uv pip install --system "datasets[tests_numpy2] @ ."
-        # TODO: remove once transformers v5 / huggingface_hub v1 are released officially
-        uv pip uninstall --system transformers huggingface_hub
-        uv pip install --system --prerelease=allow git+https://github.com/huggingface/transformers.git
+      run: uv pip install --system "datasets[tests_numpy2] @ ."
     - name: Print dependencies
       run: pip list
```

.github/workflows/release-conda.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -25,7 +25,7 @@ jobs:
         auto-update-conda: true
         auto-activate-base: false
         activate-environment: "build-datasets"
-        python-version: 3.9
+        python-version: 3.10
         channels: huggingface

     - name: Setup conda env
```

docs/source/image_dataset.mdx

Lines changed: 87 additions & 0 deletions
For more details on the WebDataset format and the python library, please check the [WebDataset documentation](https://webdataset.github.io/webdataset).

## Lance

[Lance](https://lance.org) is an open multimodal lakehouse table format. Lance tables can natively store not only text and scalar values, but also large binary objects (blobs) such as images, audio, and video alongside your tabular data.

Starting from image files on disk plus associated metadata (for example, captions and dimensions), you can write a self-contained Lance dataset to a local `*.lance` directory. The resulting table can store your metadata columns alongside an `image` column containing the encoded image bytes.

For example, you might start with metadata like:

```text
{'caption': 'Cordelia and Dudley on their wedding day last year', 'height': 315, 'width': 233}
{'caption': 'Statistics on challenges for automation in 2021', 'height': 299, 'width': 701}
```

You can define a `pyarrow` schema for your metadata and image bytes, build a table, and write it as a Lance dataset:

```python
import lance
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("caption", pa.utf8()),
        pa.field("height", pa.int32()),
        pa.field("width", pa.int32()),
        # ... add any additional metadata columns you want here ...
        pa.field("image", pa.binary()),
    ]
)

# Provide image files alongside metadata
rows = [
    {
        "image_path": "/path/to/images/0001.jpg",
        "caption": "Cordelia and Dudley on their wedding day last year",
        "height": 315,
        "width": 233,
    },
    {
        "image_path": "/path/to/images/0002.jpg",
        "caption": "Statistics on challenges for automation in 2021",
        "height": 299,
        "width": 701,
    },
]

image_bytes = []
for r in rows:
    with open(r["image_path"], "rb") as f:
        image_bytes.append(f.read())

table = pa.table(
    {
        "caption": [r["caption"] for r in rows],
        "height": [r["height"] for r in rows],
        "width": [r["width"] for r in rows],
        "image": image_bytes,
    },
    schema=schema,
)

ds = lance.write_dataset(
    table,
    "./images.lance",
    schema=schema,
    mode="create",
)
```

Here's a representative view of what a Lance table storing images might look like (the `image` column contains encoded bytes):

```text
+-----------------------------------------------+--------+-------+-----+------------------------+
| caption                                       | height | width | ... | image                  |
+-----------------------------------------------+--------+-------+-----+------------------------+
| "Cordelia and Dudley on their wedding ..."    | 315    | 233   | ... | b"\xff\xd8\xff...\xd9" |
| "Statistics on challenges for automation ..." | 299    | 701   | ... | b"\xff\xd8\xff...\xd9" |
+-----------------------------------------------+--------+-------+-----+------------------------+
```

Using this approach, you can store arbitrarily large image datasets in Lance. The resulting `images.lance/` directory with its `*.lance` files can be uploaded to the Hugging Face Hub, just like the other examples above. See the [`lance-format/laion-1m`](https://huggingface.co/datasets/lance-format/laion-1m) dataset on the Hub for an example of a Lance image dataset.
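If you want to push such a directory to the Hub yourself, a minimal sketch could look like the following (the repo id `username/my-lance-images` is a placeholder, and this assumes you are already logged in with `huggingface_hub`):

```python
from huggingface_hub import HfApi

api = HfApi()

# Create the dataset repo if it doesn't exist yet (placeholder repo id)
api.create_repo("username/my-lance-images", repo_type="dataset", exist_ok=True)

# Upload the whole Lance directory as a single folder
api.upload_folder(
    folder_path="./images.lance",
    path_in_repo="images.lance",
    repo_id="username/my-lance-images",
    repo_type="dataset",
)
```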
For more details on working with Lance datasets, see the [Lance documentation](https://lance.org).

docs/source/image_load.mdx

Lines changed: 33 additions & 0 deletions
## Lance

[Lance](https://lance.org) is an open multimodal lakehouse table format. Lance tables can natively store not only text and scalar values, but also large binary objects (blobs) such as images, audio, and video alongside your tabular data.

Lance keeps your metadata and image blobs together in one place, while still letting you efficiently scan only the metadata columns you care about without loading image bytes. When you're ready, you can fetch a small subset of rows (including the image blobs) and write them directly to files on your local filesystem.

```python
from pathlib import Path

from datasets import load_dataset

# Stream the dataset as a Hugging Face dataset
ds = load_dataset(
    "lance-format/laion-1m",
    split="train",
    streaming=True
)

dir_name = "laion_samples"
Path(dir_name).mkdir(exist_ok=True)

for idx, row in enumerate(ds.take(3)):
    with open(f"{dir_name}/{idx}.jpg", "wb") as f:
        f.write(row["image"])
```

In this example, the `image` column contains the encoded image bytes, so you can write them directly to `.jpg` files.

> [!NOTE]
> The `datasets` API doesn't currently push down operations to the Lance table, so for larger datasets it may be slow.
> For now, you'll get much better performance using the `lance` Python package directly. See the
> documentation on [the Hub](https://huggingface.co/docs/datasets-lance) for usage examples.
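As a rough sketch of that direct approach (assuming you first download the dataset repository locally; the `.lance` directory name inside the repo is a placeholder here, not the actual layout of `lance-format/laion-1m`):

```python
import lance
from huggingface_hub import snapshot_download

# Download the dataset repository locally
local_dir = snapshot_download("lance-format/laion-1m", repo_type="dataset")

# Open the Lance dataset; "data.lance" is a placeholder directory name
ds = lance.dataset(f"{local_dir}/data.lance")

# Scan only the metadata column you care about, without reading the image bytes
captions = ds.to_table(columns=["caption"], limit=3)
print(captions)
```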
## Image decoding

By default, images are decoded sequentially as `PIL.Images` when you iterate on a dataset.

docs/source/installation.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # Installation

-Before you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.9+**.
+Before you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.10+**.

 > [!TIP]
 > If you want to use 🤗 Datasets with TensorFlow or PyTorch, you'll need to install them separately. Refer to the [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2-packages-are-available) or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) for the specific install command for your framework.
```

docs/source/loading.mdx

Lines changed: 20 additions & 0 deletions
For now only the Arrow streaming format is supported. The Arrow IPC file format (also known as Feather V2) is not supported.

### Lance

[Lance](https://lance.org) is an open multimodal lakehouse table format for AI. Lance tables can natively store not only text and scalar values, but also large binary objects (blobs) such as images, audio, and video alongside your tabular data.

```py
>>> from datasets import load_dataset
>>> lance_base_url = "lance-format/laion-1m"
```

To stream the dataset without copying it to your local machine, specify the `streaming=True` parameter:

```py
ds = load_dataset(lance_base_url, split="train", streaming=True)
# Take the first three rows
for row in ds.take(3):
    print(row["caption"], row["image"])
```

This will return the image caption and the image bytes in a single request.
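For example, since the `image` column holds raw encoded bytes, you could decode a row into a `PIL` image like this (a minimal sketch, assuming Pillow is installed):

```py
import io

from PIL import Image

for row in ds.take(1):
    img = Image.open(io.BytesIO(row["image"]))
    print(row["caption"], img.size)
```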
## HDF5 files

[HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:

docs/source/package_reference/main_classes.mdx

Lines changed: 1 addition & 0 deletions
```diff
@@ -176,6 +176,7 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
     - skip
     - take
     - shard
+    - reshard
     - repeat
     - to_csv
     - to_pandas
```

docs/source/stream.mdx

Lines changed: 15 additions & 1 deletion
The sentence "If your dataset has `dataset.num_shards==1`, you should chunk it using [`IterableDataset.skip`] and [`IterableDataset.take`] instead." is replaced with:

To increase the number of shards of a dataset, you can use [`IterableDataset.reshard`]:

```py
>>> dataset.reshard()
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 3600
})
```

The resharding mechanism depends on the dataset file format. For Parquet, for example, resharding uses row groups instead of one file per shard. See how it works for every format in the [`IterableDataset.reshard`] documentation.

If your dataset has `dataset.num_shards==1` even after resharding, you should chunk it using [`IterableDataset.skip`] and [`IterableDataset.take`] instead.

## Interleave

docs/source/use_with_pytorch.mdx

Lines changed: 3 additions & 0 deletions
Otherwise, each node keeps 1 example out of `world_size`, skipping the other examples.

This can also be combined with a `torch.utils.data.DataLoader` if you want each node to use multiple workers to load the data.

> [!WARNING]
> If you shuffle your iterable dataset in a distributed setup, make sure to set a fixed `seed` in [`IterableDataset.shuffle`] so that every node uses the same shuffled list of shards and knows which shards it should skip.
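A minimal sketch of that pattern might look like this (the data files path is a placeholder, and the rank and world size are read from the environment variables set by `torchrun`):

```py
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

ds = load_dataset("parquet", data_files="data/*.parquet", split="train", streaming=True)

# Fixed seed: every node shuffles the shard list identically
ds = ds.shuffle(seed=42, buffer_size=1000)

# Each node then keeps only its own shards
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```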

docs/source/video_dataset.mdx

Lines changed: 120 additions & 0 deletions
## Lance

[Lance](https://lance.org) is an open multimodal lakehouse table format. Lance tables can natively store not only text and scalar values, but also large binary objects (blobs) such as images, audio, and video alongside your tabular data.

Lance provides a [blob API](https://lance.org/guide/blob/) that makes it convenient to store and retrieve large blobs in Lance datasets. The following example shows how to efficiently browse metadata without loading the heavier video blobs, then fetch the relevant video blobs on demand.

Here's a representative view of what a Lance table storing videos might look like (the `video_blob` column uses Lance's blob encoding):

```text
+------------------------------------------+-----------------+-----+------------------------------------+
| caption                                  | aesthetic_score | ... | video_blob                         |
+------------------------------------------+-----------------+-----+------------------------------------+
| "a breathtaking view of a mounta..."     | 5.2401          | ... | {position: 0, size: 4873879}       |
| "a captivating view of the sun, b..."    | 5.2401          | ... | {position: 4873920, size: 3370571} |
+------------------------------------------+-----------------+-----+------------------------------------+
```

### Write a Lance dataset from raw video files

Starting from raw video files on disk plus associated metadata (for example, captions and scores), you can write a self-contained Lance dataset to a local `*.lance` directory (a Lance dataset is a directory on disk, and it's common to name it with a `.lance` suffix):

```py
import lance
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("caption", pa.utf8()),
        pa.field("aesthetic_score", pa.float64()),
        pa.field(
            "video_blob",
            pa.large_binary(),
            metadata={"lance-encoding:blob": "true"},
        ),
    ]
)

# Provide video files alongside metadata
rows = [
    {
        "video_path": "/path/to/videos/0001.mp4",
        "caption": "a breathtaking view of a mountainous landscape ...",
        "aesthetic_score": 5.240138053894043,
    },
    {
        "video_path": "/path/to/videos/0002.mp4",
        "caption": "a captivating view of the sun, bathed in hues ...",
        "aesthetic_score": 5.240137100219727,
    },
]

video_bytes = []
for r in rows:
    with open(r["video_path"], "rb") as f:
        video_bytes.append(f.read())

table = pa.table(
    {
        "caption": [r["caption"] for r in rows],
        "aesthetic_score": [r["aesthetic_score"] for r in rows],
        "video_blob": video_bytes,
    },
    schema=schema,
)

ds = lance.write_dataset(
    table,
    "./videos.lance",
    schema=schema,
    mode="create",
)
```

This stores your metadata and video bytes together inside `videos.lance/`, so you can move/copy a single directory without having to keep separate `*.mp4` files in sync.
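As a rough sketch of the browsing half of the workflow described above (scanning the lightweight metadata columns without reading the video blobs; fetching the blob bytes themselves goes through the blob API described in the Lance documentation):

```py
import lance

ds = lance.dataset("./videos.lance")

# Scan only the metadata columns; the video blobs are not read here
meta = ds.to_table(columns=["caption", "aesthetic_score"])
print(meta)

# Narrow down to the rows you actually want, e.g. with a filter
best = ds.to_table(columns=["caption"], filter="aesthetic_score > 5.24")
```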
267+
268+
You can upload the resulting `videos.lance/` directory to the Hub (for example with `huggingface_hub.HfApi.upload_folder`) and share it as a
269+
dataset repository, keeping the metadata and videos together as a single artifact.
270+
271+
> [!TIP]
272+
> Lance datasets scale to very large sizes (terabytes and beyond) since the data is stored in a columnar format on disk.
273+
> See the [blob API](https://lance.org/guide/blob/) guide for the latest information on best practices for storing and retrieving
274+
> large blobs in Lance.
275+
276+
When writing large datasets, it's typically best to limit the size of each individual `*.lance` file to a few gigabytest at most.
277+
Simply gather the data via an iterator and specify the `max_bytes_per_file` parameter when writing the dataset:
278+
279+
```python
280+
MAX_BYTES_PER_FILE = 5 * 1024 * 1024 * 1024 # ~5 GB per file
281+
282+
# Write as Lance dataset with file size limits for each *.lance file
283+
ds = lance.write_dataset(
284+
table,
285+
"./videos.lance",
286+
schema=schema,
287+
mode="create",
288+
max_bytes_per_file=MAX_BYTES_PER_FILE,
289+
)
290+
```
291+
292+
For more details on working with Lance datasets, see the [Lance documentation](https://lance.org).
