
Commit 070425c

Add extra guidance for sharing large datasets on the Hub (#1288)
* docs: add extra guidance for sharing large datasets on the Hub
* wording improvements
* Update docs/hub/repositories-recommendations.md (Co-authored-by: Omar Sanseviero <[email protected]>)
* Update docs/hub/repositories-recommendations.md (Co-authored-by: Omar Sanseviero <[email protected]>)
* make all requirements mandatory
* formatting

Co-authored-by: Omar Sanseviero <[email protected]>
1 parent efb313a commit 070425c

File tree

1 file changed: +19 -0 lines changed


docs/hub/repositories-recommendations.md

Lines changed: 19 additions & 0 deletions
@@ -57,3 +57,22 @@ happen (in rare cases) that even if the timeout is raised client-side, the process
completed server-side. This can be checked manually by browsing the repo on the Hub. To prevent this timeout, we recommend
adding around 50-100 files per commit.
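To stay under this threshold, an upload can be split into batches of files, one commit per batch. A minimal sketch, assuming `huggingface_hub`'s `create_commit` API; the repo id and file names below are placeholders:

```python
from itertools import islice

def batched(paths, batch_size=50):
    """Yield successive batches of at most `batch_size` paths."""
    it = iter(paths)
    while batch := list(islice(it, batch_size)):
        yield batch

# Hypothetical upload loop (requires `huggingface_hub` and authentication):
# from huggingface_hub import HfApi, CommitOperationAdd
# api = HfApi()
# for i, batch in enumerate(batched(all_local_files, 50)):
#     api.create_commit(
#         repo_id="username/my-dataset",  # placeholder repo id
#         repo_type="dataset",
#         operations=[CommitOperationAdd(path_in_repo=p, path_or_fileobj=p) for p in batch],
#         commit_message=f"Upload batch {i}",
#     )

# Chunking 120 files into batches of 50 yields batches of 50, 50, and 20:
batches = list(batched([f"shard-{i:05d}.parquet" for i in range(120)], 50))
```

Each `create_commit` call then stays comfortably within the recommended 50-100 files per commit.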

## Sharing large datasets on the Hub
One key way in which Hugging Face supports the machine learning ecosystem is by hosting datasets on the Hub, including very large ones. To ensure we can effectively support the open-source ecosystem, we ask that you let us know in advance, via datasets@huggingface.co or on [our Discord](http://hf.co/join/discord), if you are uploading a dataset in the hundreds of GBs (or TBs) range.
When you get in touch with us, please let us know:
- What the dataset is, and who/what it is likely to be useful for.
- The size of the dataset.
- The format you plan to use for sharing it.
To host a large dataset on the Hub, we require the following:
- A dataset card: we want to ensure that the community can use your dataset effectively, and a dataset card is one of the key ways to enable this. This [guidance](docs/hub/datasets-cards.md) provides an overview of how to write one.
- You are sharing the dataset to enable community reuse. If you anticipate that a dataset won't see any further reuse, other platforms are likely a better fit.
- You must follow the repository limitations outlined above.
- Use file formats that are well integrated with the Hugging Face ecosystem. We have good support for [Parquet](https://huggingface.co/docs/datasets/v2.19.0/en/loading#parquet) and [WebDataset](https://huggingface.co/docs/datasets/v2.19.0/en/loading#webdataset), which are often good choices for sharing large datasets efficiently. Using these formats will also ensure the dataset viewer works for your dataset.
- Avoid custom loading scripts. In our experience, datasets that require custom code to use often end up with limited reuse.
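The WebDataset format mentioned above is, at its core, a sequence of tar shards in which each sample's files share a basename (the "key"), with one extension per field. A minimal standard-library sketch; the shard name, keys, and fields here are illustrative:

```python
import io
import json
import tarfile

def write_webdataset_shard(shard_path, samples):
    """Write samples as a WebDataset-style tar shard.

    Each sample is stored as adjacent tar members sharing a basename
    ("key") with one extension per field, e.g. 000000.txt and 000000.json.
    """
    with tarfile.open(shard_path, "w") as tar:
        for i, sample in enumerate(samples):
            key = f"{i:06d}"
            for ext, payload in sample.items():
                data = payload if isinstance(payload, bytes) else payload.encode("utf-8")
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

# Two illustrative samples, each with a text field and a JSON metadata field.
samples = [
    {"txt": "hello world", "json": json.dumps({"label": 0})},
    {"txt": "second example", "json": json.dumps({"label": 1})},
]
write_webdataset_shard("shard-000000.tar", samples)

with tarfile.open("shard-000000.tar") as tar:
    names = tar.getnames()
```

Because samples are stored contiguously, tooling such as the `webdataset` library can stream them out of shards sequentially, which is part of what makes the format efficient for large datasets.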
Please get in touch with us if any of these requirements are difficult for you to meet because of the type of data or domain you are working in.
