# Pandas

[Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit.
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub.

## Load a DataFrame

You can load data from local files or from remote storage like Hugging Face Datasets. Pandas supports many formats including CSV, JSON and Parquet:

```python
>>> import pandas as pd
>>> df = pd.read_csv("path/to/data.csv")
```
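
The same works for the other formats, for example (the file paths here are just placeholders):

```python
>>> df = pd.read_json("path/to/data.json")        # JSON
>>> df = pd.read_parquet("path/to/data.parquet")  # Parquet
```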

To load a file from Hugging Face, the path needs to start with `hf://`. For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet`:

```python
>>> import pandas as pd
>>> df = pd.read_parquet("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> df
text label
0 I rented I AM CURIOUS-YELLOW from my video sto... 0
1 "I Am Curious: Yellow" is a risible and preten... 0
2 If only to avoid making this type of film in t... 0
3 This film was probably inspired by Godard's Ma... 0
4 Oh, brother...after hearing about this ridicul... 0
... ... ...
24995 A hit at the time but now better categorised a... 1
24996 I love this movie like no other. Another time ... 1
24997 This film and it's sequel Barry Mckenzie holds... 1
24998 'The Adventures Of Barry McKenzie' started lif... 1
24999 The story centers around Barry McKenzie who mu... 1
```
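
Since the repository contains multiple Parquet files, you may want to list them first. Here is a minimal sketch using `HfFileSystem`; the glob pattern assumes the `plain_text` layout shown above:

```python
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()
>>> fs.glob("datasets/stanfordnlp/imdb/plain_text/*.parquet")  # e.g. ['datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet', ...]
```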

For more information on Hugging Face paths and how they are implemented, please refer to the [client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).

## Save a DataFrame

You can save a pandas DataFrame using `to_csv/to_json/to_parquet` to a local file or to Hugging Face directly.
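
For example, to write local files (the formats and file names here are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})
df.to_csv("data.csv", index=False)
df.to_json("data.json")
df.to_parquet("data.parquet")
```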

To save the DataFrame on Hugging Face, you first need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

```
huggingface-cli login
```

Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in Pandas:
```python
import pandas as pd

df.to_parquet("hf://datasets/username/my_dataset/data.parquet")
df.to_parquet("hf://datasets/username/my_dataset/imdb.parquet")

# or write in separate files if the dataset has train/validation/test splits
df_train.to_parquet("hf://datasets/username/my_dataset/train.parquet")
df_valid.to_parquet("hf://datasets/username/my_dataset/validation.parquet")
df_test.to_parquet("hf://datasets/username/my_dataset/test.parquet")
```

This creates a dataset repository `username/my_dataset` containing your Pandas dataset in Parquet format.
You can reload it later:
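
```python
import pandas as pd

df = pd.read_parquet("hf://datasets/username/my_dataset/imdb.parquet")

# or read from separate files if the dataset has train/validation/test splits
df_train = pd.read_parquet("hf://datasets/username/my_dataset/train.parquet")
df_valid = pd.read_parquet("hf://datasets/username/my_dataset/validation.parquet")
df_test = pd.read_parquet("hf://datasets/username/my_dataset/test.parquet")
```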
## Use Images

You can load a folder containing images together with a metadata file that has a field with the image file names or paths, structured like this:

```
Example 1: Example 2:
folder/ folder/
├── metadata.csv ├── metadata.csv
├── img000.png └── images
├── img001.png ├── img000.png
... ...
└── imgNNN.png └── imgNNN.png
```

You can iterate over the image paths like this:

```python
import pandas as pd

folder_path = "path/to/folder/"
df = pd.read_csv(folder_path + "metadata.csv")
for image_path in (folder_path + df["file_name"]):
    ...
```
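
For example, a minimal sketch that opens each image with Pillow (assuming the files are in a format `PIL` can read):

```python
import pandas as pd
from PIL import Image

folder_path = "path/to/folder/"
df = pd.read_csv(folder_path + "metadata.csv")
for image_path in (folder_path + df["file_name"]):
    image = Image.open(image_path)  # lazy: pixel data is decoded on first use
```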

Since the dataset is in a supported structure (a `metadata.csv` file with a `file_name` field), you can upload it to Hugging Face, and the Dataset Viewer shows both the metadata and the images:

```python
from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_image_dataset",
    repo_type="dataset",
)
```

### Image methods and Parquet

Using [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods), you can enable `PIL.Image` methods on an image column. It also makes it possible to save the dataset as a single Parquet file containing both the images and the metadata:

```python
import pandas as pd
from pandas_image_methods import PILMethods

df = pd.read_parquet("hf://datasets/username/my_dataset/data.parquet")
pd.api.extensions.register_series_accessor("pil")(PILMethods)

df["image"] = (folder_path + df["file_name"]).pil.open()
df.to_parquet("data.parquet")
```

All the `PIL.Image` methods are available, e.g.

```python
df["image"] = df["image"].pil.rotate(90)
```
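
You can reload the dataset from the Parquet file later. This sketch assumes the `pil` accessor can also open the images stored by `to_parquet` (the library is designed to round-trip them):

```python
import pandas as pd
from pandas_image_methods import PILMethods

pd.api.extensions.register_series_accessor("pil")(PILMethods)

df = pd.read_parquet("data.parquet")
df["image"] = df["image"].pil.open()
```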

## Use Audios

You can load a folder containing audio files together with a metadata file that has a field with the audio file names or paths, structured like this:

```
Example 1: Example 2:
folder/ folder/
├── metadata.csv ├── metadata.csv
├── rec000.wav └── audios
├── rec001.wav ├── rec000.wav
... ...
└── recNNN.wav └── recNNN.wav
```

You can iterate over the audio paths like this:

```python
import pandas as pd

folder_path = "path/to/folder/"
df = pd.read_csv(folder_path + "metadata.csv")
for audio_path in (folder_path + df["file_name"]):
    ...
```
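
For example, a minimal sketch that decodes each recording with `soundfile` (assuming the files are in a format `soundfile` supports, such as WAV):

```python
import pandas as pd
import soundfile as sf

folder_path = "path/to/folder/"
df = pd.read_csv(folder_path + "metadata.csv")
for audio_path in (folder_path + df["file_name"]):
    data, samplerate = sf.read(audio_path)  # samples as a numpy array + sampling rate
```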

Since the dataset is in a supported structure (a `metadata.csv` file with a `file_name` field), you can upload it to Hugging Face, and the Dataset Viewer shows both the metadata and the audio files:

```python
from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_audio_dataset",
    repo_type="dataset",
)
```

### Audio methods and Parquet

Using [pandas-audio-methods](https://github.com/lhoestq/pandas-audio-methods), you can enable `soundfile` methods on an audio column. It also makes it possible to save the dataset as a single Parquet file containing both the audio and the metadata:

```python
import pandas as pd
from pandas_audio_methods import SFMethods

pd.api.extensions.register_series_accessor("sf")(SFMethods)

df["audio"] = (folder_path + df["file_name"]).sf.open()
df.to_parquet("data.parquet")
```

This makes it easy to use the audio data with `librosa`, e.g. for resampling:

```python
df["audio"] = [librosa.load(audio, sr=16_000) for audio in df["audio"]]
df["audio"] = df["audio"].sf.write()
```
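
As with images, you can reload the audio from the Parquet file later. This sketch assumes the `sf` accessor can also open the audio stored by `to_parquet` (the library is designed to round-trip it):

```python
import pandas as pd
from pandas_audio_methods import SFMethods

pd.api.extensions.register_series_accessor("sf")(SFMethods)

df = pd.read_parquet("data.parquet")
df["audio"] = df["audio"].sf.open()
```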

## Use Transformers

You can use `transformers` pipelines on pandas DataFrames to classify or generate text, images, etc.
This section shows a few examples with `tqdm` for progress bars.

<Tip>

Pipelines don't accept a `tqdm` object as input, but you can use a Python generator instead, in the form `x for x in tqdm(...)`.

</Tip>

### Text Classification

```python
from transformers import pipeline
from tqdm import tqdm

pipe = pipeline("text-classification", model="clapAI/modernBERT-base-multilingual-sentiment")

# Compute labels
df["label"] = [y["label"] for y in pipe(x for x in tqdm(df["text"]))]
# Compute labels and scores
df[["label", "score"]] = [(y["label"], y["score"]) for y in pipe(x for x in tqdm(df["text"]))]
```

### Text Generation

```python
from transformers import pipeline
from tqdm import tqdm

pipe = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")
prompt = "What is the main topic of this sentence? REPLY IN LESS THAN 3 WORDS. Sentence: '{}'"

# Compute outputs: each generated chat contains the user message at index 0 and the assistant reply at index 1
df["output"] = [y["generated_text"][1]["content"] for y in pipe([{"role": "user", "content": prompt.format(x)}] for x in tqdm(df["text"]))]
```