# Audio Dataset

This guide will show you how to configure your dataset repository with audio files. You can find accompanying examples of repositories in this [Audio datasets examples collection](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607).

A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub.

Additional information about your audio files, such as transcriptions, is automatically loaded as long as you include it in a metadata file (`metadata.csv`/`metadata.jsonl`).

Alternatively, audio files can be stored in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format.

## Only audio files

If your dataset only consists of one column with audio, you can simply store your audio files at the root:

```plaintext
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
└── 4.wav
```

or in a subdirectory:

```plaintext
my_dataset_repository/
└── audio
    ├── 1.wav
    ├── 2.wav
    ├── 3.wav
    └── 4.wav
```

Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including AIFF, FLAC, MP3, OGG and WAV.

```plaintext
my_dataset_repository/
└── audio
    ├── 1.aiff
    ├── 2.ogg
    ├── 3.mp3
    └── 4.flac
```

If you have several splits, you can put your audio files into directories named accordingly:

```plaintext
my_dataset_repository/
├── train
│   ├── 1.wav
│   └── 2.wav
└── test
    ├── 3.wav
    └── 4.wav
```

See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits.
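
A repository with any of these layouts can then be loaded with the `datasets` library. Here is a minimal sketch, assuming the repository is hosted under the hypothetical ID `username/my_dataset_repository`:

```python
from datasets import load_dataset

# "username/my_dataset_repository" is a placeholder repository ID.
dataset = load_dataset("username/my_dataset_repository")

print(dataset["train"][0])  # first example, with its audio column
```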

## Additional columns

If there is additional information you'd like to include about your dataset, like the transcription, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different audio tasks like [text-to-speech](https://huggingface.co/tasks/text-to-speech) or [automatic speech recognition](https://huggingface.co/tasks/automatic-speech-recognition).

```plaintext
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── metadata.csv
```

Your `metadata.csv` file must have a `file_name` column which links each audio file to its metadata:

```csv
file_name,animal
1.wav,cat
2.wav,cat
3.wav,dog
4.wav,dog
```

You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`:

```jsonl
{"file_name": "1.wav","animal": "cat"}
{"file_name": "2.wav","animal": "cat"}
{"file_name": "3.wav","animal": "dog"}
{"file_name": "4.wav","animal": "dog"}
```
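
When a metadata file is present, its columns are added to the dataset next to the audio column. As a rough sketch of loading a local copy of the layout above with the `audiofolder` builder (the directory name is the one from the example):

```python
from datasets import load_dataset

# Point the "audiofolder" builder at the local directory; metadata.csv
# (or metadata.jsonl) is picked up automatically.
dataset = load_dataset("audiofolder", data_dir="my_dataset_repository")

print(dataset["train"][0]["animal"])  # e.g. "cat"
```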

## Relative paths

The metadata file must be located either in the same directory as the audio files it links to, or in any parent directory, like in this example:

```plaintext
my_dataset_repository/
└── test
    ├── audio
    │   ├── 1.wav
    │   ├── 2.wav
    │   ├── 3.wav
    │   └── 4.wav
    └── metadata.csv
```

In this case, the `file_name` column must be a full relative path to the audio files, not just the filename:

```csv
file_name,animal
audio/1.wav,cat
audio/2.wav,cat
audio/3.wav,dog
audio/4.wav,dog
```

The metadata file cannot be put in subdirectories of the directory containing the audio files.

In this example, the `test` directory sets the name of the test split. See [File names and splits](./datasets-file-names-and-splits) for more information.

## Audio classification

For audio classification datasets, you can also use a simple setup: use directories to name the audio classes. Store your audio files in a directory structure like:

```plaintext
my_dataset_repository/
├── cat
│   ├── 1.wav
│   └── 2.wav
└── dog
    ├── 3.wav
    └── 4.wav
```

The dataset created with this structure contains two columns: `audio` and `label` (with values `cat` and `dog`).

You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information):

```plaintext
my_dataset_repository/
├── test
│   ├── cat
│   │   └── 2.wav
│   └── dog
│       └── 4.wav
└── train
    ├── cat
    │   └── 1.wav
    └── dog
        └── 3.wav
```
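
When loaded, the directory names become the label classes. A small sketch of what this might look like with the `audiofolder` builder (local directory name taken from the example above):

```python
from datasets import load_dataset

dataset = load_dataset("audiofolder", data_dir="my_dataset_repository")

# The "label" feature is inferred from the directory names.
print(dataset["train"].features["label"].names)  # ["cat", "dog"]
print(dataset["train"][0]["label"])              # e.g. 0 for "cat"
```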

You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header:

```yaml
configs:
  - config_name: default # Name of the dataset subset, if applicable.
    drop_labels: true
```

## Large scale datasets

### WebDataset format

The [WebDataset](./datasets-webdataset) format is well suited for large scale audio datasets (see [AlienKevin/sbs_cantonese](https://huggingface.co/datasets/AlienKevin/sbs_cantonese) for example).
It consists of TAR archives containing audio files and their metadata and is optimized for streaming. It is useful if you have a large number of audio files and want streaming data loaders for large scale training.

```plaintext
my_dataset_repository/
├── train-0000.tar
├── train-0001.tar
├── ...
└── train-1023.tar
```

To make a WebDataset TAR archive, put the audio files and metadata files to be archived in a directory and create the TAR archive from it, e.g. with the `tar` command.
Each archive is generally around 1GB in size.
Make sure each audio file and its metadata file share the same file prefix, for example:

```plaintext
train-0000/
├── 000.flac
├── 000.json
├── 001.flac
├── 001.json
├── ...
├── 999.flac
└── 999.json
```
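
As an alternative to the `tar` command, here is a minimal sketch of packing one such directory into a shard with Python's standard `tarfile` module (directory and archive names taken from the example above):

```python
import tarfile
from pathlib import Path

# Add files flat (no leading directory) and sorted by name so that each
# audio/metadata pair (e.g. 000.flac and 000.json) stays adjacent in the
# archive, which is what WebDataset loaders expect.
with tarfile.open("train-0000.tar", "w") as archive:
    for path in sorted(Path("train-0000").iterdir()):
        archive.add(str(path), arcname=path.name)
```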

Note that for user convenience and to enable the [Dataset Viewer](./datasets-viewer), every dataset hosted on the Hub is automatically converted to Parquet format, up to 5GB.
Read more about it in the [Parquet format](./datasets-viewer#access-the-parquet-files) documentation.

### Parquet format

Instead of uploading the audio files and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file.
This is useful if you have a large number of audio files, if you want to embed multiple audio columns, or if you want to store additional information about the audio in the same file.
Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

```plaintext
my_dataset_repository/
└── train.parquet
```

Audio columns are of type _struct_, with a binary field `"bytes"` for the audio data and a string field `"path"` for the audio file name or path.
You should specify the feature types of the columns directly in YAML in the README header, for example:

```yaml
dataset_info:
  features:
    - name: audio
      dtype: audio
    - name: caption
      dtype: string
```

Alternatively, Parquet files with Audio data can be created using the `datasets` library by setting the column type to `Audio()` and using the `.to_parquet(...)` method or `.push_to_hub(...)`. You can find a guide on loading audio datasets in `datasets` [here](../datasets/audio_load).
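
As a rough sketch of that workflow, assuming four local WAV files and a placeholder repository ID:

```python
from datasets import Audio, Dataset

# Hypothetical local files and labels; cast the column to the Audio feature.
dataset = Dataset.from_dict({
    "audio": ["1.wav", "2.wav", "3.wav", "4.wav"],
    "caption": ["cat", "cat", "dog", "dog"],
}).cast_column("audio", Audio())

dataset.to_parquet("train.parquet")  # write a local Parquet file
# or push directly to the Hub (placeholder repository ID):
# dataset.push_to_hub("username/my_dataset_repository")
```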