From 278bc94e27d92d803db6ccb556a6fd448a2a1848 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest 
Date: Wed, 8 Jan 2025 13:28:17 +0100
Subject: [PATCH 1/4] More pandas docs

---
 docs/hub/datasets-pandas.md | 124 ++++++++++++++++++++++++++++++++----
 1 file changed, 113 insertions(+), 11 deletions(-)

diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md
index a0d3c3a09..e623c3257 100644
--- a/docs/hub/datasets-pandas.md
+++ b/docs/hub/datasets-pandas.md
@@ -1,9 +1,44 @@
 # Pandas
 
 [Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit.
-Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
+Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub.
 
-First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:
+## Load a DataFrame
+
+You can load data from local files or from remote storage like Hugging Face Datasets. Pandas supports many formats including CSV, JSON and Paequet:
+
+```python
+>>> import pandas as pd
+>>> df = pd.read_csv("path/to/data.csv")
+```
+
+To load a file from Hugging Face, the path needs to start with `hf://`. For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet`:
+
+```python
+>>> import pandas as pd
+>>> df = pd.read_parquet("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
+>>> df
+                                                    text  label
+0      I rented I AM CURIOUS-YELLOW from my video sto...      0
+1      "I Am Curious: Yellow" is a risible and preten...      0
+2      If only to avoid making this type of film in t...      0
+3      This film was probably inspired by Godard's Ma...      0
+4      Oh, brother...after hearing about this ridicul...      0
+...                                                  ...    ...
+24995  A hit at the time but now better categorised a...      1
+24996  I love this movie like no other. Another time ...      1
+24997  This film and it's sequel Barry Mckenzie holds...      1
+24998  'The Adventures Of Barry McKenzie' started lif...      1
+24999  The story centers around Barry McKenzie who mu...      1
+```
+
+To have more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
+
+## Save a DataFrame
+
+You can save a pandas DataFrame using `to_csv`/`to_json`/`to_parquet`, either to a local file or directly to Hugging Face.
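+
+For example, a local save could look like this (a minimal sketch with made-up example data):
+
+```python
+import pandas as pd
+
+# hypothetical example data
+df = pd.DataFrame({"text": ["good movie", "bad movie"], "label": [1, 0]})
+df.to_parquet("data.parquet")  # or df.to_csv("data.csv") / df.to_json("data.json")
+```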
+
+To save the DataFrame on Hugging Face, you first need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:
 
 ```
 huggingface-cli login
@@ -22,7 +57,7 @@ Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_s
 
 ```python
 import pandas as pd
-df.to_parquet("hf://datasets/username/my_dataset/data.parquet")
+df.to_parquet("hf://datasets/username/my_dataset/imdb.parquet")
 
 # or write in separate files if the dataset has train/validation/test splits
 df_train.to_parquet("hf://datasets/username/my_dataset/train.parquet")
@@ -30,18 +65,85 @@ df_valid.to_parquet("hf://datasets/username/my_dataset/validation.parquet")
 df_test.to_parquet("hf://datasets/username/my_dataset/test.parquet")
 ```
 
-This creates a dataset repository `username/my_dataset` containing your Pandas dataset in Parquet format.
-You can reload it later:
+## Use Images
+
+From a metadata file containing a "file_name" field for the names or paths to the images:
+
+```
+data/                 data/
+├── metadata.csv      ├── metadata.csv
+├── img000.png        └── images
+├── img001.png            ├── img000.png
+...                       ...
+└── imgNNN.png            └── imgNNN.png
+```
 
 ```python
 import pandas as pd
-df = pd.read_parquet("hf://datasets/username/my_dataset/data.parquet")
+folder_path = "path/to/data/"
+df = pd.read_csv(folder_path + "metadata.csv")
+for image_path in (folder_path + df["file_name"]):
+    ...
+```
 
-# or read from separate files if the dataset has train/validation/test splits
-df_train = pd.read_parquet("hf://datasets/username/my_dataset/train.parquet")
-df_valid = pd.read_parquet("hf://datasets/username/my_dataset/validation.parquet")
-df_test = pd.read_parquet("hf://datasets/username/my_dataset/test.parquet")
+Since the dataset is in a supported structure, you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on HF.
+
+```python
+from huggingface_hub import HfApi
+api = HfApi()
+
+api.upload_folder(
+    folder_path=folder_path,
+    repo_id="username/my_image_dataset",
+    repo_type="dataset",
+)
 ```
 
-To have more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
+Using [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods), you can enable `PIL.Image` methods on an image column. It also enables saving the dataset as one single Parquet file containing both the images and the metadata:
+
+```python
+import pandas as pd
+from pandas_image_methods import PILMethods
+
+pd.api.extensions.register_series_accessor("pil")(PILMethods)
+
+df["image"] = (folder_path + df["file_name"]).pil.open()
+df.to_parquet("data.parquet")
+```
+
+All the `PIL.Image` methods are available, e.g.
+
+```python
+df["image"] = df["image"].pil.rotate(90)
+```
+
+## Use Transformers
+
+You can use `transformers` pipelines on pandas DataFrames to classify, generate text, images, etc.
+This section shows a few examples.
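+
+The examples below assume a DataFrame `df` with a `"text"` column, such as the IMDb DataFrame loaded earlier; a minimal stand-in could be:
+
+```python
+import pandas as pd
+
+# hypothetical example data
+df = pd.DataFrame({"text": ["I loved this movie.", "This was a waste of time."]})
+```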
+
+### Text Classification
+
+```python
+from transformers import pipeline
+from tqdm import tqdm
+
+pipe = pipeline("text-classification", model="clapAI/modernBERT-base-multilingual-sentiment")
+
+# Compute labels
+df["label"] = [y["label"] for y in pipe(x for x in tqdm(df["text"]))]
+
+# Compute labels and scores
+df[["label", "score"]] = [(y["label"], y["score"]) for y in pipe(x for x in tqdm(df["text"]))]
+```
+
+### Text Generation
+
+```python
+from transformers import pipeline
+from tqdm import tqdm
+
+pipe = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")
+prompt = "What is the main topic of this sentence? REPLY IN LESS THAN 3 WORDS. Sentence: '{}'"
+df["output"] = [y["generated_text"][1]["content"] for y in pipe([{"role": "user", "content": prompt.format(x)}] for x in tqdm(df["text"]))]
+```

From f63a5c668395da8f9f389046e5fd9a26d4eb85c0 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest 
Date: Thu, 9 Jan 2025 15:07:12 +0100
Subject: [PATCH 2/4] add audio

---
 docs/hub/datasets-pandas.md | 68 ++++++++++++++++++++++++++++++++++---
 1 file changed, 64 insertions(+), 4 deletions(-)

diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md
index e623c3257..4ac96b560 100644
--- a/docs/hub/datasets-pandas.md
+++ b/docs/hub/datasets-pandas.md
@@ -67,10 +67,11 @@ df_test.to_parquet("hf://datasets/username/my_dataset/test.parquet")
 
 ## Use Images
 
-From a metadata file containing a "file_name" field for the names or paths to the images:
+From a folder with a metadata file containing a "file_name" field for the names or paths to the images:
 
 ```
-data/                 data/
+Example 1:            Example 2:
+folder/               folder/
 ├── metadata.csv      ├── metadata.csv
 ├── img000.png        └── images
 ├── img001.png            ├── img000.png
 ...                       ...
 └── imgNNN.png            └── imgNNN.png
 ```
 
 ```python
 import pandas as pd
 
-folder_path = "path/to/data/"
+folder_path = "path/to/folder/"
 df = pd.read_csv(folder_path + "metadata.csv")
 for image_path in (folder_path + df["file_name"]):
     ...
 ```
 
-Since the dataset is in a supported structure, you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on HF.
+Since the dataset is in a supported structure, you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face.
 
 ```python
 from huggingface_hub import HfApi
@@ -100,6 +101,8 @@ api.upload_folder(
 )
 ```
 
+### Image methods and Parquet
+
 Using [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods), you can enable `PIL.Image` methods on an image column. It also enables saving the dataset as one single Parquet file containing both the images and the metadata:
 
 ```python
@@ -118,6 +121,63 @@ All the `PIL.Image` methods are available, e.g.
 
 df["image"] = df["image"].pil.rotate(90)
 ```
 
+## Use Audios
+
+From a folder with a metadata file containing a "file_name" field for the names or paths to the audios:
+
+```
+Example 1:            Example 2:
+folder/               folder/
+├── metadata.csv      ├── metadata.csv
+├── rec000.wav        └── audios
+├── rec001.wav            ├── rec000.wav
+...                       ...
+└── recNNN.wav            └── recNNN.wav
+```
+
+```python
+import pandas as pd
+
+folder_path = "path/to/folder/"
+df = pd.read_csv(folder_path + "metadata.csv")
+for audio_path in (folder_path + df["file_name"]):
+    ...
+```
+
+Since the dataset is in a supported structure, you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and audios on Hugging Face.
+
+```python
+from huggingface_hub import HfApi
+api = HfApi()
+
+api.upload_folder(
+    folder_path=folder_path,
+    repo_id="username/my_audio_dataset",
+    repo_type="dataset",
+)
+```
+
+### Audio methods and Parquet
+
+Using [pandas-audio-methods](https://github.com/lhoestq/pandas-audio-methods), you can enable `soundfile` methods on an audio column. It also enables saving the dataset as one single Parquet file containing both the audios and the metadata:
+
+```python
+import pandas as pd
+from pandas_audio_methods import SFMethods
+
+pd.api.extensions.register_series_accessor("sf")(SFMethods)
+
+df["audio"] = (folder_path + df["file_name"]).sf.open()
+df.to_parquet("data.parquet")
+```
+
+This makes it easy to use with `librosa`, e.g. for resampling:
+
+```python
+import librosa
+
+df["audio"] = [librosa.load(audio, sr=16_000) for audio in df["audio"]]
+df["audio"] = df["audio"].sf.write()
+```
+
 ## Use Transformers
 
 You can use `transformers` pipelines on pandas DataFrames to classify, generate text, images, etc.

From 170848fe28fefcb98d7d769943cc3063d294380b Mon Sep 17 00:00:00 2001
From: Quentin Lhoest 
Date: Thu, 9 Jan 2025 15:15:39 +0100
Subject: [PATCH 3/4] minor changes

---
 docs/hub/datasets-pandas.md | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md
index 4ac96b560..1c8575cf4 100644
--- a/docs/hub/datasets-pandas.md
+++ b/docs/hub/datasets-pandas.md
@@ -5,7 +5,7 @@ Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write
 
 ## Load a DataFrame
 
-You can load data from local files or from remote storage like Hugging Face Datasets. Pandas supports many formats including CSV, JSON and Paequet:
+You can load data from local files or from remote storage like Hugging Face Datasets. Pandas supports many formats including CSV, JSON and Parquet:
 
 ```python
 >>> import pandas as pd
 >>> df = pd.read_csv("path/to/data.csv")
@@ -67,7 +67,7 @@ df_test.to_parquet("hf://datasets/username/my_dataset/test.parquet")
 
 ## Use Images
 
-From a folder with a metadata file containing a "file_name" field for the names or paths to the images:
+You can load a folder with a metadata file containing a field for the names or paths to the images, structured like this:
 
 ```
 Example 1:            Example 2:
 folder/               folder/
 ├── metadata.csv      ├── metadata.csv
 ├── img000.png        └── images
 ├── img001.png            ├── img000.png
 ...                       ...
 └── imgNNN.png            └── imgNNN.png
 ```
 
+You can iterate on the image paths like this:
+
 ```python
 import pandas as pd
 
 folder_path = "path/to/folder/"
 df = pd.read_csv(folder_path + "metadata.csv")
 for image_path in (folder_path + df["file_name"]):
     ...
 ```
 
-Since the dataset is in a supported structure, you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face.
+Since the dataset is in a supported structure ("metadata.csv" file with "file_name" field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face.
 
 ```python
 from huggingface_hub import HfApi
@@ -123,7 +125,7 @@ df["image"] = df["image"].pil.rotate(90)
 
 ## Use Audios
 
-From a folder with a metadata file containing a "file_name" field for the names or paths to the audios:
+You can load a folder with a metadata file containing a field for the names or paths to the audios, structured like this:
 
 ```
 Example 1:            Example 2:
 folder/               folder/
 ├── metadata.csv      ├── metadata.csv
 ├── rec000.wav        └── audios
 ├── rec001.wav            ├── rec000.wav
 ...                       ...
 └── recNNN.wav            └── recNNN.wav
 ```
 
+You can iterate on the audio paths like this:
+
 ```python
 import pandas as pd
 
 folder_path = "path/to/folder/"
 df = pd.read_csv(folder_path + "metadata.csv")
 for audio_path in (folder_path + df["file_name"]):
     ...
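     # For example, read the samples and sampling rate here
     # (illustrative only, assumes the `soundfile` package):
     # data, sampling_rate = soundfile.read(audio_path)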
 ```
 
-Since the dataset is in a supported structure, you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and audios on Hugging Face.
+Since the dataset is in a supported structure ("metadata.csv" file with "file_name" field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and audios on Hugging Face.
 
 ```python
 from huggingface_hub import HfApi
@@ -181,7 +185,13 @@ df["audio"] = df["audio"].sf.write()
 
 ## Use Transformers
 
 You can use `transformers` pipelines on pandas DataFrames to classify, generate text, images, etc.
-This section shows a few examples.
+This section shows a few examples with `tqdm` for progress bars.
+
+<Tip>
+
+Pipelines don't accept a `tqdm` object as input but you can use a Python generator instead, in the form `x for x in tqdm(...)`.
+
+</Tip>
 
 ### Text Classification

From 4f18dd603a19478c2272b04e7dca062f0e824eae Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Thu, 9 Jan 2025 16:14:39 +0100
Subject: [PATCH 4/4] Apply suggestions from code review

Co-authored-by: Daniel van Strien 

---
 docs/hub/datasets-pandas.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md
index 1c8575cf4..6e6d3fdbf 100644
--- a/docs/hub/datasets-pandas.md
+++ b/docs/hub/datasets-pandas.md
@@ -32,7 +32,7 @@ To load a file from Hugging Face, the path needs to start with `hf://`. For exam
 24999 The story centers around Barry McKenzie who mu... 1
 ```
 
-To have more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
+For more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
 
 ## Save a DataFrame
@@ -90,7 +90,7 @@ for image_path in (folder_path + df["file_name"]):
     ...
 ```
 
-Since the dataset is in a supported structure ("metadata.csv" file with "file_name" field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face.
+Since the dataset is in a supported structure (a `metadata.csv` file with a `file_name` field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face.
@@ -148,7 +148,7 @@ for audio_path in (folder_path + df["file_name"]):
     ...
 ```
 
-Since the dataset is in a supported structure ("metadata.csv" file with "file_name" field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and audios on Hugging Face.
+Since the dataset is in a supported structure (a `metadata.csv` file with a `file_name` field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and audio.
 
 ```python
 from huggingface_hub import HfApi