
Commit 1d8c6cf

lhoestq, davanstrien, and julien-c authored

[Datasets] Add PyArrow docs (#1839)

* add pyarrow docs
* Apply suggestions from code review
* reorder integrated libraries table
* don't fix row group size

---------

Co-authored-by: Daniel van Strien <[email protected]>
Co-authored-by: Julien Chaumond <[email protected]>

1 parent 6f5c0fe commit 1d8c6cf

File tree

2 files changed: +233 -0 lines changed

docs/hub/datasets-libraries.md

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ The table below summarizes the supported libraries and their level of integration
 | [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. |||
 | [Pandas](./datasets-pandas) | Python data analysis toolkit. |||
 | [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. |||
+| [PyArrow](./datasets-pyarrow) | Apache Arrow is a columnar format and a toolbox for fast data interchange and in-memory analytics. |||
 | [Spark](./datasets-spark) | Real-time, large-scale data processing tool in a distributed environment. |||
 | [WebDataset](./datasets-webdataset) | Library to write I/O pipelines for large datasets. |||

docs/hub/datasets-pyarrow.md

Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@

# PyArrow

[Arrow](https://github.com/apache/arrow) is a columnar format and a toolbox for fast data interchange and in-memory analytics.
Since PyArrow supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub.
It is especially useful for [Parquet](https://parquet.apache.org/) data, since Parquet is the most common file format on Hugging Face.
Indeed, Parquet is particularly efficient thanks to its structure, typing, metadata and compression.

## Load a Table

You can load data from local files or from remote storage like Hugging Face Datasets. PyArrow supports many formats, including CSV, JSON and, most importantly, Parquet:

```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table("path/to/data.parquet")
```

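The same goes for the other supported formats; for example, here is a minimal sketch reading a CSV file with `pyarrow.csv` (the `path/to/data.csv` path is a placeholder):

```python
>>> from pyarrow import csv
>>> table = csv.read_csv("path/to/data.csv")
```
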
To load a file from Hugging Face, the path needs to start with `hf://`. For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet` as a pyarrow Table (it requires `pyarrow>=21.0`):

```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> table
pyarrow.Table
text: string
label: int64
----
text: [["I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it (... 1542 chars omitted)", ...],...,[..., "The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritan (... 221 chars omitted)"]]
label: [[0,0,0,0,0,...,0,0,0,0,0],...,[1,1,1,1,1,...,1,1,1,1,1]]
```

If you don't want to load the full Parquet data, you can get the Parquet metadata or load the file row group by row group instead:

```python
>>> import pyarrow.parquet as pq
>>> pf = pq.ParquetFile("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> pf.metadata
<pyarrow._parquet.FileMetaData object at 0x1171b4090>
  created_by: parquet-cpp-arrow version 12.0.0
  num_columns: 2
  num_rows: 25000
  num_row_groups: 25
  format_version: 2.6
  serialized_size: 62036
>>> for i in range(pf.num_row_groups):
...     table = pf.read_row_group(i)
...     ...
```

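You can also read only a subset of the columns; here is a minimal sketch loading just the `label` column shown above:

```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table(
...     "hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet",
...     columns=["label"],
... )
```
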
For more information on Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).

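For instance, `HfFileSystem` can list the Parquet files of a dataset repository before you read them; a minimal sketch:

```python
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()
>>> fs.glob("datasets/stanfordnlp/imdb/**/*.parquet")  # returns the repository's Parquet file paths
```
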
## Save a Table

You can save a pyarrow Table with `pyarrow.parquet.write_table`, either to a local file or directly to Hugging Face.

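For example, here is a minimal sketch writing a small in-memory Table to a local Parquet file:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small Table built in memory for illustration
table = pa.table({"text": ["hello", "world"], "label": [0, 1]})

# Write it to a local Parquet file
pq.write_table(table, "data.parquet")
```
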
To save the Table on Hugging Face, you first need to [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

```
huggingface-cli login
```

Then you can [create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:

```python
from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in PyArrow:

```python
import pyarrow.parquet as pq

pq.write_table(table, "hf://datasets/username/my_dataset/imdb.parquet", use_content_defined_chunking=True)

# or write in separate files if the dataset has train/validation/test splits
pq.write_table(table_train, "hf://datasets/username/my_dataset/train.parquet", use_content_defined_chunking=True)
pq.write_table(table_valid, "hf://datasets/username/my_dataset/validation.parquet", use_content_defined_chunking=True)
pq.write_table(table_test, "hf://datasets/username/my_dataset/test.parquet", use_content_defined_chunking=True)
```

We use `use_content_defined_chunking=True` to enable faster uploads and downloads from Hugging Face thanks to Xet deduplication (it requires `pyarrow>=21.0`).

<Tip>

Content defined chunking (CDC) makes the Parquet writer chunk the data pages in a way that duplicate data is chunked and compressed identically.
Without CDC, pages are chunked arbitrarily, so duplicate data is impossible to detect once compressed.
Thanks to CDC, Parquet uploads and downloads from Hugging Face are faster, since duplicate data is uploaded or downloaded only once.

</Tip>

Find more information about Xet [here](https://huggingface.co/join/xet).

## Use Images

You can load a folder with a metadata file containing a field for the names or paths to the images, structured like this:

```
Example 1:              Example 2:
folder/                 folder/
├── metadata.parquet    ├── metadata.parquet
├── img000.png          └── images
├── img001.png              ├── img000.png
...                         ...
└── imgNNN.png              └── imgNNN.png
```

You can iterate over the image paths like this:

```python
from pathlib import Path
import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
table = pq.read_table(folder_path / "metadata.parquet")
for file_name in table["file_name"].to_pylist():
    image_path = folder_path / file_name
    ...
```

Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save this dataset to Hugging Face, and the Dataset Viewer shows both the metadata and the images.

```python
from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_image_dataset",
    repo_type="dataset",
)
```

### Embed Images inside Parquet

PyArrow has a binary type which allows storing image bytes in Arrow tables. This makes it possible to save the dataset as a single Parquet file containing both the images (bytes and path) and the samples metadata:

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Embed the image bytes in Arrow
image_array = pa.array([
    {
        "bytes": (folder_path / file_name).read_bytes(),
        "path": file_name,
    }
    for file_name in table["file_name"].to_pylist()
])
table = table.append_column("image", image_array)

# (Optional) Set the HF Image type for the Dataset Viewer and the `datasets` library
features = {"image": {"_type": "Image"}}  # or using datasets.Features(...).to_dict()
# The schema metadata value must be a string, so the features dict is JSON-encoded
schema_metadata = {"huggingface": json.dumps({"dataset_info": {"features": features}})}
table = table.replace_schema_metadata(schema_metadata)

# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
```

Setting the Image type in the Arrow schema metadata allows other libraries and the Hugging Face Dataset Viewer to know that the "image" column contains images and not just binary data.

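To check the result, here is a minimal sketch that reads the images back from the Parquet file and decodes the first one (it assumes the Pillow package is installed):

```python
import io

import pyarrow.parquet as pq
from PIL import Image  # assumes Pillow is installed

table = pq.read_table("data.parquet")
first = table["image"][0].as_py()  # a dict with "bytes" and "path"
image = Image.open(io.BytesIO(first["bytes"]))
```
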
## Use Audios

You can load a folder with a metadata file containing a field for the names or paths to the audio files, structured like this:

```
Example 1:              Example 2:
folder/                 folder/
├── metadata.parquet    ├── metadata.parquet
├── rec000.wav          └── audios
├── rec001.wav              ├── rec000.wav
...                         ...
└── recNNN.wav              └── recNNN.wav
```

You can iterate over the audio paths like this:

```python
from pathlib import Path
import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
table = pq.read_table(folder_path / "metadata.parquet")
for file_name in table["file_name"].to_pylist():
    audio_path = folder_path / file_name
    ...
```

Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-audio#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and the audio.

```python
from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_audio_dataset",
    repo_type="dataset",
)
```

### Embed Audio inside Parquet

PyArrow has a binary type which allows storing audio bytes in Arrow tables. This makes it possible to save the dataset as a single Parquet file containing both the audio (bytes and path) and the samples metadata:

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Embed the audio bytes in Arrow
audio_array = pa.array([
    {
        "bytes": (folder_path / file_name).read_bytes(),
        "path": file_name,
    }
    for file_name in table["file_name"].to_pylist()
])
table = table.append_column("audio", audio_array)

# (Optional) Set the HF Audio type for the Dataset Viewer and the `datasets` library
features = {"audio": {"_type": "Audio"}}  # or using datasets.Features(...).to_dict()
# The schema metadata value must be a string, so the features dict is JSON-encoded
schema_metadata = {"huggingface": json.dumps({"dataset_info": {"features": features}})}
table = table.replace_schema_metadata(schema_metadata)

# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
```

Setting the Audio type in the Arrow schema metadata enables other libraries and the Hugging Face Dataset Viewer to recognise that the "audio" column contains audio data, not just binary data.

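Similarly, here is a minimal sketch that reads the audio back from the Parquet file and decodes the first recording (it assumes the soundfile package is installed):

```python
import io

import pyarrow.parquet as pq
import soundfile as sf  # assumes the soundfile package is installed

table = pq.read_table("data.parquet")
first = table["audio"][0].as_py()  # a dict with "bytes" and "path"
array, sampling_rate = sf.read(io.BytesIO(first["bytes"]))
```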
