|
1 | | -# Adding new datasets |
| 1 | +# Uploading datasets |
2 | 2 |
|
3 | | -Any Hugging Face user can create a dataset! You can start by [creating your dataset repository](https://huggingface.co/new-dataset) and choosing one of the following methods to upload your dataset: |
| 3 | +The [Hub](https://huggingface.co/datasets) is home to an extensive collection of community-curated and research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away! |
4 | 4 |
|
5 | | -* [Add files manually to the repository through the UI](https://huggingface.co/docs/datasets/upload_dataset#upload-your-files) |
6 | | -* [Push files with the `push_to_hub` method from 🤗 Datasets](https://huggingface.co/docs/datasets/upload_dataset#upload-from-python) |
7 | | -* [Use Git to commit and push your dataset files](https://huggingface.co/docs/datasets/share#clone-the-repository) |
| 5 | +Start by [creating a Hugging Face Hub account](https://huggingface.co/join) if you don't have one yet. |
8 | 6 |
|
9 | | -While in many cases it's possible to just add raw data to your dataset repo in any supported formats (JSON, CSV, Parquet, text, images, audio files, …), for some large datasets you may want to [create a loading script](https://huggingface.co/docs/datasets/dataset_script#create-a-dataset-loading-script). This script defines the different configurations and splits of your dataset, as well as how to download and process the data. |
| 7 | +## Upload using the Hub UI |
10 | 8 |
|
11 | | -## Datasets outside a namespace |
| 9 | +The Hub's web-based interface allows users without any developer experience to upload a dataset. |
12 | 10 |
|
13 | | -Datasets outside a namespace are maintained by the Hugging Face team. Unlike the naming convention used for community datasets (`username/dataset_name` or `org/dataset_name`), datasets outside a namespace can be referenced directly by their name (e.g. [`glue`](https://huggingface.co/datasets/glue)). If you find that an improvement is needed, use their "Community" tab to open a discussion or submit a PR on the Hub to propose edits. |
| 11 | +### Create a repository |
| 12 | + |
| 13 | +A repository hosts all your dataset files, including the revision history, which makes it possible to store more than one version of a dataset. |
| 14 | + |
| 15 | +1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset). |
| 16 | +2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization. |
| 17 | + |
| 18 | +<div class="flex justify-center"> |
| 19 | + <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/create_repo.png"/> |
| 20 | +</div> |
| 21 | + |
| 22 | +### Upload dataset |
| 23 | + |
| 24 | +1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg` among many others (see full list [here](./datasets-viewer-configure.md)). |
| 25 | + |
| 26 | +<div class="flex justify-center"> |
| 27 | + <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/upload_files.png"/> |
| 28 | +</div> |
| 29 | + |
| 30 | +2. Drag and drop your dataset files. |
| 31 | + |
| 32 | +<div class="flex justify-center"> |
| 33 | + <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/commit_files.png"/> |
| 34 | +</div> |
| 35 | + |
| 36 | +3. After you upload your dataset files, they are stored in your dataset repository. |
| 37 | + |
| 38 | +<div class="flex justify-center"> |
| 39 | + <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/files_stored.png"/> |
| 40 | +</div> |
| 41 | + |
| 42 | +### Create a Dataset card |
| 43 | + |
| 44 | +Adding a Dataset card is super valuable for helping users find your dataset and understand how to use it responsibly. |
| 45 | + |
| 46 | +1. Click on **Create Dataset Card** to create a [Dataset card](./datasets-cards). This button creates a `README.md` file in your repository. |
| 47 | + |
| 48 | +<div class="flex justify-center"> |
| 49 | + <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/dataset_card.png"/> |
| 50 | +</div> |
| 51 | + |
| 52 | +2. At the top, you'll see the **Metadata UI** with several fields to select from such as license, language, and task categories. These are the most important tags for helping users discover your dataset on the Hub (when applicable). When you select an option for a field, it will be automatically added to the top of the dataset card. |
| 53 | + |
| 54 | + You can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), which list the complete set of allowed tags, including optional ones like `annotations_creators`, to help you choose the ones that are useful for your dataset. |
| 55 | + |
| 56 | +<div class="flex justify-center"> |
| 57 | + <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/metadata_ui.png"/> |
| 58 | +</div> |
| 59 | + |
| 60 | +3. Write your dataset documentation in the Dataset Card to introduce your dataset to the community and help users understand what is inside: the use cases and limitations, where the data comes from, important ethical considerations, and any other relevant details. |
| 61 | + |
| 62 | + You can click on the **Import dataset card template** link at the top of the editor to automatically create a dataset card template. For a detailed example of what a good Dataset card should look like, take a look at the [CNN DailyMail Dataset card](https://huggingface.co/datasets/cnn_dailymail). |
| 63 | + |
| 64 | +### Dataset Viewer |
| 65 | + |
| 66 | +The [Dataset Viewer](./datasets-viewer) lets users see what the data actually looks like before they download it. |
| 67 | +It is enabled by default for all public datasets. |
| 68 | + |
| 69 | +Make sure the Dataset Viewer shows your data correctly; if it doesn't, [configure the Dataset Viewer](./datasets-viewer-configure). |
| 70 | + |
| 71 | +## Using the `huggingface_hub` client library |
| 72 | + |
| 73 | +The rich feature set of the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more. |
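
As a rough illustration of that workflow, the sketch below creates a dataset repository and uploads a local folder with `HfApi.create_repo` and `HfApi.upload_folder`. The helper names, the `my-username/my-dataset` repo ID, and the `./data` folder are hypothetical placeholders, and the sketch assumes you have installed `huggingface_hub` and already logged in with a write token.

```python
from pathlib import Path


def build_repo_id(namespace: str, dataset_name: str) -> str:
    """Join a user or organization name and a dataset name into a repo ID."""
    return f"{namespace}/{dataset_name}"


def upload_dataset_folder(repo_id: str, folder: Path, private: bool = False) -> None:
    """Create the dataset repo if it does not exist, then upload `folder` to it."""
    # Imported lazily so build_repo_id stays usable without the library installed.
    from huggingface_hub import HfApi

    api = HfApi()  # assumes a token is already stored locally from a previous login
    # repo_type="dataset" targets a dataset repository rather than a model repository.
    api.create_repo(repo_id=repo_id, repo_type="dataset", private=private, exist_ok=True)
    # Uploads every file under `folder` in a single commit.
    api.upload_folder(folder_path=str(folder), repo_id=repo_id, repo_type="dataset")
```

A call might look like `upload_dataset_folder(build_repo_id("my-username", "my-dataset"), Path("./data"))`. Passing `exist_ok=True` keeps the function safe to re-run when the repository already exists.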
| 74 | + |
| 75 | +## Using other libraries |
| 76 | + |
| 77 | +Some libraries like [🤗 Datasets](https://huggingface.co/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub. |
| 78 | +See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information. |
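
For instance, a minimal sketch of uploading from 🤗 Datasets and Pandas might look like the following. It assumes `datasets`, `pandas`, `pyarrow`, and `huggingface_hub` are installed and that you are logged in; the helper names and the repo ID used in the usage note are hypothetical. The Pandas variant relies on the `hf://` filesystem paths that `huggingface_hub` registers.

```python
def hub_path(repo_id: str, filename: str) -> str:
    """Build an hf:// path pointing at a file inside a dataset repository."""
    return f"hf://datasets/{repo_id}/{filename}"


def push_with_datasets(columns: dict[str, list], repo_id: str) -> None:
    """Upload column-oriented data with the push_to_hub method from 🤗 Datasets."""
    from datasets import Dataset  # lazy import: only needed when this path is used

    Dataset.from_dict(columns).push_to_hub(repo_id)


def push_with_pandas(columns: dict[str, list], repo_id: str, filename: str = "train.parquet") -> None:
    """Write a DataFrame as Parquet directly to the Hub through an hf:// path."""
    import pandas as pd  # lazy import: also requires pyarrow and huggingface_hub

    pd.DataFrame(columns).to_parquet(hub_path(repo_id, filename))
```

For example, `push_with_datasets({"text": ["hello", "world"], "label": [0, 1]}, "my-username/my-dataset")` would upload a two-row dataset to that (hypothetical) repository.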
| 79 | + |
| 80 | +## Using Git |
| 81 | + |
| 82 | +Since dataset repos are just Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets. |
| 83 | + |
| 84 | +## File formats |
| 85 | + |
| 86 | +The Hub natively supports multiple file formats: |
| 87 | + |
| 88 | +- CSV (.csv, .tsv) |
| 89 | +- JSON Lines, JSON (.jsonl, .json) |
| 90 | +- Parquet (.parquet) |
| 91 | +- Text (.txt) |
| 92 | +- Images (.png, .jpg, etc.) |
| 93 | +- Audio (.wav, .mp3, etc.) |
| 94 | + |
| 95 | +It also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz). |
| 96 | + |
| 97 | +Image and audio resources can also have additional metadata files. See the [Data files Configuration](./datasets-data-files-configuration) page for details on image and audio datasets. |
| 98 | + |
| 99 | +You may want to convert your files to these formats to benefit from all the Hub features. |
| 100 | +Other formats and structures may not be recognized by the Hub. |
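
Before uploading, you could run a quick local check of file names against the extensions listed above. The sketch below is only an approximation: the image and audio lists in this section end in "etc.", so the sets here are incomplete, and `is_listed_format` is a hypothetical helper rather than part of any Hub library. It also assumes that a bare archive such as `images.zip` is acceptable.

```python
from pathlib import Path

# Extensions named explicitly in this section (image and audio lists are not exhaustive).
DATA_EXTENSIONS = {
    ".csv", ".tsv", ".jsonl", ".json", ".parquet", ".txt",
    ".png", ".jpg", ".wav", ".mp3",
}
COMPRESSION_EXTENSIONS = {".zip", ".gz", ".zst", ".bz2", ".lz4", ".xz"}


def is_listed_format(filename: str) -> bool:
    """Return True if the file name ends in one of the extensions listed above."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    if not suffixes:
        return False
    if suffixes[-1] in COMPRESSION_EXTENSIONS:
        # Accept a bare archive (images.zip); otherwise check the inner extension (train.csv.gz).
        return len(suffixes) == 1 or suffixes[-2] in DATA_EXTENSIONS
    return suffixes[-1] in DATA_EXTENSIONS
```

With these sets, `train.csv.gz` and `Photo.JPG` pass the check, while `notes.docx` and `archive.tar.gz` do not.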