**`docs/source/image_dataset.mdx`** (87 additions, 0 deletions)
@@ -208,3 +208,90 @@ f18b91585c4d3f3e.json
```
For more details on the WebDataset format and the Python library, please check the [WebDataset documentation](https://webdataset.github.io/webdataset).

## Lance

[Lance](https://lance.org) is an open multimodal lakehouse table format. Lance tables can natively store not only text and scalar values, but also large binary objects (blobs) such as images, audio, and video alongside your tabular data.

Starting from image files on disk plus associated metadata (for example, captions and dimensions), you can write a self-contained Lance dataset to a local `*.lance` directory. The resulting table can store your metadata columns alongside an `image` column containing the encoded image bytes.

For example, you might start with metadata like:

```text
{'caption': 'Cordelia and Dudley on their wedding day last year', 'height': 315, 'width': 233}
{'caption': 'Statistics on challenges for automation in 2021', 'height': 299, 'width': 701}
```

You can define a `pyarrow` schema for your metadata and image bytes, build a table, and write it as a Lance dataset:
```python
import lance
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("caption", pa.utf8()),
        pa.field("height", pa.int32()),
        pa.field("width", pa.int32()),
        # ... add any additional metadata columns you want here ...
        pa.field("image", pa.binary()),
    ]
)

# Provide image files alongside metadata
rows = [
    {
        "image_path": "/path/to/images/0001.jpg",
        "caption": "Cordelia and Dudley on their wedding day last year",
        "height": 315,
        "width": 233,
    },
    {
        "image_path": "/path/to/images/0002.jpg",
        "caption": "Statistics on challenges for automation in 2021",
        "height": 299,
        "width": 701,
    },
]

image_bytes = []
for r in rows:
    with open(r["image_path"], "rb") as f:
        image_bytes.append(f.read())

table = pa.table(
    {
        "caption": [r["caption"] for r in rows],
        "height": [r["height"] for r in rows],
        "width": [r["width"] for r in rows],
        "image": image_bytes,
    },
    schema=schema,
)

ds = lance.write_dataset(
    table,
    "./images.lance",
    schema=schema,
    mode="create",
)
```
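To double-check the result, you can open the dataset again and decode one of the stored images. The snippet below is only a quick sketch: it assumes the `./images.lance` path from above and that Pillow is installed.

```python
import io

import lance
from PIL import Image

ds = lance.dataset("./images.lance")

# Read the table back and inspect the first row
row = ds.to_table().to_pylist()[0]
print(row["caption"], row["height"], row["width"])

# Decode the stored bytes back into an image
image = Image.open(io.BytesIO(row["image"]))
print(image.size)  # should match (width, height) from the metadata
```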
Here's a representative view of what a Lance table storing images might look like (the `image` column contains encoded bytes). The rendering below is only an illustrative sketch based on the two rows written above, with the byte strings truncated:
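```text
caption                                              height  width  image
---------------------------------------------------  ------  -----  ----------------------
Cordelia and Dudley on their wedding day last year      315    233  b'\xff\xd8\xff\xe0...'
Statistics on challenges for automation in 2021         299    701  b'\xff\xd8\xff\xe0...'
```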
**`docs/source/installation.md`** (1 addition, 1 deletion)
@@ -1,6 +1,6 @@
# Installation
- Before you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.9+**.
+ Before you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.10+**.
> [!TIP]
> If you want to use 🤗 Datasets with TensorFlow or PyTorch, you'll need to install them separately. Refer to the [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2-packages-are-available) or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) for the specific install command for your framework.
**`docs/source/loading.mdx`** (20 additions, 0 deletions)
@@ -169,6 +169,26 @@ The cache directory to store intermediate processing results will be the Arrow f
For now only the Arrow streaming format is supported. The Arrow IPC file format (also known as Feather V2) is not supported.
### Lance

[Lance](https://lance.org) is an open multimodal lakehouse table format for AI. Lance tables can natively store not only text and scalar values, but also large binary objects (blobs) such as images, audio, and video alongside your tabular data.

```py
>>> from datasets import load_dataset
>>> lance_base_url = "lance-format/laion-1m"
```

To stream the dataset without copying it to your local machine, specify the `streaming=True` parameter:
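For example, a minimal sketch of what this can look like (the exact arguments for a Lance-backed Hub repository, including the split name, are assumptions here):

```py
>>> # hypothetical call; the exact arguments for Lance-backed repos may differ
>>> dataset = load_dataset(lance_base_url, split="train", streaming=True)
>>> sample = next(iter(dataset))  # fetches a single example on the fly
```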
This will return the image caption and the image bytes in a single request.
## HDF5 files
[HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:
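For example, a rough sketch by analogy with the CSV loader (the `"hdf5"` builder name and the file path are assumptions here):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("hdf5", data_files="my_file.h5")
```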
**`docs/source/stream.mdx`** (15 additions, 1 deletion)
@@ -182,7 +182,21 @@ IterableDataset({
})
```

- If your dataset has `dataset.num_shards==1`, you should chunk it using [`IterableDataset.skip`] and [`IterableDataset.take`] instead.

To increase the number of shards of a dataset, you can use [`IterableDataset.reshard`]:

```py
>>> dataset.reshard()
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 3600
})
```

The resharding mechanism depends on the dataset file format. For example, for Parquet it reshards using row groups instead of having one file per shard. See how it works for every format in [`IterableDataset.reshard`]'s documentation.

If your dataset has `dataset.num_shards==1` even after resharding, you should chunk it using [`IterableDataset.skip`] and [`IterableDataset.take`] instead.
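For example, a rough sketch that splits a single-shard dataset into four chunks (the chunk size is arbitrary here and assumed to divide the dataset evenly):

```py
>>> chunk_size = 10_000  # assumed number of examples per chunk
>>> chunks = [dataset.skip(i * chunk_size).take(chunk_size) for i in range(4)]
```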
**`docs/source/use_with_pytorch.mdx`** (3 additions, 0 deletions)
@@ -255,3 +255,6 @@ then the shards are evenly assigned across the nodes, which is the most optimize
Otherwise, each node keeps 1 example out of `world_size`, skipping the other examples.
This can also be combined with a `torch.utils.data.DataLoader` if you want each node to use multiple workers to load the data.
> [!WARNING]
> If you shuffle your iterable dataset in a distributed setup, make sure to set a fixed `seed` in [`IterableDataset.shuffle`] so that every node uses the same shuffled list of shards and knows which shards it should skip.
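For example, a minimal sketch (the dataset repository, buffer size, and batch size are placeholders; the rank and world size are read from the environment variables set by `torchrun`):

```python
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

rank = int(os.environ.get("RANK", 0))              # set by torchrun
world_size = int(os.environ.get("WORLD_SIZE", 1))  # set by torchrun

dataset = load_dataset("my_org/my_dataset", split="train", streaming=True)  # placeholder repo
dataset = dataset.shuffle(seed=42, buffer_size=1000)  # same fixed seed on every node
dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)
```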