
Commit 1210ee8

Add 'Audio dataset' doc page (#1360)
* add link to task pages
* add the first part
* format
* add webdataset and parquet
* image -> audio
1 parent d8a676f commit 1210ee8

3 files changed: +220 −2

docs/hub/datasets-audio.md

Lines changed: 218 additions & 0 deletions
# Audio Dataset

This guide will show you how to configure your dataset repository with audio files. You can find accompanying examples of repositories in this [Audio datasets examples collection](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607).

A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub.

---

Additional information about your audio files, such as transcriptions, is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`).

Alternatively, audio files can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format.

## Only audio files

If your dataset consists of only one audio column, you can simply store your audio files at the root:

```plaintext
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
└── 4.wav
```
or in a subdirectory:

```plaintext
my_dataset_repository/
└── audio
    ├── 1.wav
    ├── 2.wav
    ├── 3.wav
    └── 4.wav
```

Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including AIFF, FLAC, MP3, OGG and WAV.

```plaintext
my_dataset_repository/
└── audio
    ├── 1.aiff
    ├── 2.ogg
    ├── 3.mp3
    └── 4.flac
```
If you have several splits, you can put your audio files into directories named accordingly:

```plaintext
my_dataset_repository/
├── train
│   ├── 1.wav
│   └── 2.wav
└── test
    ├── 3.wav
    └── 4.wav
```

See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits.
## Additional columns

If there is additional information you'd like to include about your dataset, like the transcription, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different audio tasks like [text-to-speech](https://huggingface.co/tasks/text-to-speech) or [automatic speech recognition](https://huggingface.co/tasks/automatic-speech-recognition).

```plaintext
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── metadata.csv
```

Your `metadata.csv` file must have a `file_name` column which links audio files with their metadata:

```csv
file_name,animal
1.wav,cat
2.wav,cat
3.wav,dog
4.wav,dog
```

You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`:

```jsonl
{"file_name": "1.wav","animal": "cat"}
{"file_name": "2.wav","animal": "cat"}
{"file_name": "3.wav","animal": "dog"}
{"file_name": "4.wav","animal": "dog"}
```
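Before uploading, it can be worth checking that every `file_name` entry actually resolves to a file. The following is a minimal sketch of such a check (the `check_metadata` helper is hypothetical, not part of any Hub tooling); it resolves paths relative to the directory containing the metadata file:

```python
import csv
from pathlib import Path

def check_metadata(repo_root: str, metadata_name: str = "metadata.csv") -> list[str]:
    """Return the `file_name` entries in the metadata file that do not
    resolve to an existing file (hypothetical validation helper)."""
    root = Path(repo_root)
    missing = []
    with open(root / metadata_name, newline="") as f:
        for row in csv.DictReader(f):
            # Entries are resolved relative to the metadata file's directory.
            if not (root / row["file_name"]).is_file():
                missing.append(row["file_name"])
    return missing
```

An empty return value means every metadata row points at a real audio file.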
## Relative paths

The metadata file must be located either in the same directory as the audio files it links to, or in any parent directory, as in this example:

```plaintext
my_dataset_repository/
└── test
    ├── audio
    │   ├── 1.wav
    │   ├── 2.wav
    │   ├── 3.wav
    │   └── 4.wav
    └── metadata.csv
```

In this case, the `file_name` column must be a full relative path to the audio files, not just the filename:

```csv
file_name,animal
audio/1.wav,cat
audio/2.wav,cat
audio/3.wav,dog
audio/4.wav,dog
```
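Writing those relative paths by hand is error-prone for large repositories. As a sketch, a script can walk the tree and emit repo-relative `file_name` values (the `write_metadata` helper and the placeholder `animal` value are illustrative assumptions, not Hub tooling):

```python
import csv
from pathlib import Path

AUDIO_SUFFIXES = {".wav", ".flac", ".mp3", ".ogg", ".aiff"}

def write_metadata(repo_root: str, label: str = "unknown") -> int:
    """Collect audio files under `repo_root` and write a metadata.csv at the
    root whose `file_name` column holds repo-relative POSIX paths."""
    root = Path(repo_root)
    rows = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.suffix in AUDIO_SUFFIXES
    )
    with open(root / "metadata.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "animal"])
        for rel in rows:
            writer.writerow([rel, label])  # placeholder label per file
    return len(rows)
```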
The metadata file cannot be put in a subdirectory of the directory containing the audio files.

In this example, the `test` directory is used to set the name of the test split. See [File names and splits](./datasets-file-names-and-splits) for more information.
## Audio classification

For audio classification datasets, you can also use a simple setup: use directories to name the audio classes. Store your audio files in a directory structure like:

```plaintext
my_dataset_repository/
├── cat
│   ├── 1.wav
│   └── 2.wav
└── dog
    ├── 3.wav
    └── 4.wav
```

The dataset created with this structure contains two columns: `audio` and `label` (with values `cat` and `dog`).

You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information):

```plaintext
my_dataset_repository/
├── test
│   ├── cat
│   │   └── 2.wav
│   └── dog
│       └── 4.wav
└── train
    ├── cat
    │   └── 1.wav
    └── dog
        └── 3.wav
```
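The split and label columns derived from this layout follow a simple `<split>/<label>/<file>` convention, which can be reproduced with plain `pathlib`. A minimal sketch (the `infer_examples` helper is hypothetical and assumes WAV files only):

```python
from pathlib import Path

def infer_examples(repo_root: str):
    """Yield (split, label, file_name) triples from a <split>/<label>/<file>
    directory layout, the way the label column is derived from dir names."""
    root = Path(repo_root)
    for p in sorted(root.glob("*/*/*.wav")):
        split, label = p.relative_to(root).parts[:2]
        yield split, label, p.name
```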
You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header:

```yaml
configs:
- config_name: default # Name of the dataset subset, if applicable.
  drop_labels: true
```
## Large scale datasets

### WebDataset format

The [WebDataset](./datasets-webdataset) format is well suited for large scale audio datasets (see [AlienKevin/sbs_cantonese](https://huggingface.co/datasets/AlienKevin/sbs_cantonese) for example).
It consists of TAR archives containing audio files and their metadata and is optimized for streaming. It is useful if you have a large number of audio files and want streaming data loaders for large scale training.

```plaintext
my_dataset_repository/
├── train-0000.tar
├── train-0001.tar
├── ...
└── train-1023.tar
```

To make a WebDataset TAR archive, create a directory containing the audio files and metadata files to be archived, and create the TAR archive using e.g. the `tar` command.
Archives are generally around 1GB each.
Make sure each audio file and metadata pair share the same file prefix, for example:

```plaintext
train-0000/
├── 000.flac
├── 000.json
├── 001.flac
├── 001.json
├── ...
├── 999.flac
└── 999.json
```
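The archiving step can also be done from Python with the standard `tarfile` module. Adding members in sorted name order keeps each `000.flac`/`000.json` pair adjacent in the shard, which WebDataset-style readers rely on to group files sharing a prefix into one sample. A minimal sketch (the `make_shard` helper is illustrative):

```python
import tarfile
from pathlib import Path

def make_shard(src_dir: str, out_tar: str) -> list[str]:
    """Pack every file directly under `src_dir` into one TAR shard,
    adding members in sorted name order so paired files stay adjacent."""
    names = sorted(p.name for p in Path(src_dir).iterdir() if p.is_file())
    with tarfile.open(out_tar, "w") as tar:
        for name in names:
            tar.add(Path(src_dir) / name, arcname=name)
    return names
```

For example, `make_shard("train-0000", "train-0000.tar")` would produce a shard whose member order is `000.flac`, `000.json`, `001.flac`, `001.json`, and so on.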
Note that for user convenience and to enable the [Dataset Viewer](./datasets-viewer), every dataset hosted on the Hub is automatically converted to Parquet format, up to 5GB.
Read more about it in the [Parquet format](./datasets-viewer#access-the-parquet-files) documentation.

### Parquet format

Instead of uploading the audio files and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file.
This is useful if you have a large number of audio files, if you want to embed multiple audio columns, or if you want to store additional information about the audio in the same file.
Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

```plaintext
my_dataset_repository/
└── train.parquet
```

Audio columns are of type _struct_, with a binary field `"bytes"` for the audio data and a string field `"path"` for the audio file name or path.
You should specify the feature types of the columns directly in YAML in the README header, for example:

```yaml
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: caption
    dtype: string
```

Alternatively, Parquet files with audio data can be created using the `datasets` library by setting the column type to `Audio()` and using the `.to_parquet(...)` method or `.push_to_hub(...)`. You can find a guide on loading audio datasets in `datasets` [here](../datasets/audio_load).

docs/hub/datasets-data-files-configuration.md

Lines changed: 1 addition & 1 deletion

We provide two guides that you can check out:

- [How to create an image dataset](./datasets-image) ([example datasets](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65))

Removed:

- [How to create an audio dataset](https://huggingface.co/docs/datasets/audio_dataset)

Added:

- [How to create an audio dataset](./datasets-audio) ([example datasets](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607))

docs/hub/datasets-image.md

Lines changed: 1 addition & 1 deletion

## Additional columns

Removed:

If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection.

Added:

If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like [text captioning](https://huggingface.co/tasks/image-to-text) or [object detection](https://huggingface.co/tasks/object-detection).
