diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index 8738cf547..2c7994ed3 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -78,9 +78,9 @@ Since dataset repos are Git repositories, you can use Git to push your data file The Hub natively supports multiple file formats: +- Parquet (.parquet) - CSV (.csv, .tsv) - JSON Lines, JSON (.jsonl, .json) -- Parquet (.parquet) - Arrow streaming format (.arrow) - Text (.txt) - Images (.png, .jpg, etc.) @@ -96,7 +96,7 @@ Other formats and structures may not be recognized by the Hub. ### Which file format should I use? -For most types of datasets, Parquet is the recommended format due to its efficient compression, rich typing, and since a variety of tools supports this format with optimized read and batched operations. Alternatively, CSV or JSON Lines/JSON can be used for tabular data (prefer JSON Lines for nested data). Although easy to parse compared to Parquet, these formats are not recommended for data larger than several GBs. For image and audio datasets, uploading raw files is the most practical for most use cases since it's easy to access individual files. For large scale image and audio datasets streaming, [WebDataset](https://github.com/webdataset/webdataset) should be preferred over raw image and audio files to avoid the overhead of accessing individual files. Though for more general use cases involving analytics, data filtering or metadata parsing, Parquet is the recommended option for large scale image and audio datasets. +For most types of datasets, **Parquet** is the recommended format due to its efficient compression, rich typing, and the variety of tools that support it with optimized read and batched operations. Alternatively, CSV or JSON Lines/JSON can be used for tabular data (prefer JSON Lines for nested data). Although easier to parse than Parquet, these formats are not recommended for data larger than several GBs.
For image and audio datasets, uploading the raw files is the most practical approach for most use cases, since it makes individual files easy to access. For streaming large-scale image and audio datasets, [WebDataset](https://github.com/webdataset/webdataset) is preferred over raw image and audio files because it avoids the overhead of accessing individual files. For more general use cases involving analytics, data filtering, or metadata parsing, Parquet is the recommended option for large-scale image and audio datasets. ### Dataset Viewer diff --git a/docs/hub/datasets-downloading.md b/docs/hub/datasets-downloading.md index 520b624c8..97b29064e 100644 --- a/docs/hub/datasets-downloading.md +++ b/docs/hub/datasets-downloading.md @@ -2,7 +2,7 @@ ## Integrated libraries -If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in dataset library" button on the dataset page to see how to do so. For example, [`samsum`](https://huggingface.co/datasets/samsum?library=true) shows how to do so with 🤗 Datasets below. +If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use this dataset" button on the dataset page to see how to do so. For example, [`samsum`](https://huggingface.co/datasets/Samsung/samsum?library=datasets) shows how to do so with 🤗 Datasets below.
diff --git a/docs/hub/datasets-usage.md b/docs/hub/datasets-usage.md index e15e167ca..ab8fc61ce 100644 --- a/docs/hub/datasets-usage.md +++ b/docs/hub/datasets-usage.md @@ -1,6 +1,6 @@ # Using 🤗 Datasets -Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the [**Use in dataset library** button](https://huggingface.co/datasets/samsum?library=true) to copy the code to load a dataset. +Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the [**Use this dataset** button](https://huggingface.co/datasets/nyu-mll/glue?library=datasets) to copy the code to load a dataset. First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using: