
Commit 41220f0

some small corrections to the Datasets docs (#1105)
1 parent f6fc645 commit 41220f0

11 files changed (+28, -25 lines)

docs/hub/datasets-adding.md

Lines changed: 5 additions & 5 deletions
@@ -12,7 +12,7 @@ The Hub's web-based interface allows users without any developer experience to u
 
 A repository hosts all your dataset files, including the revision history, making storing more than one dataset version possible.
 
-1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset).
+1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset).
 2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization.
 
 <div class="flex justify-center">
@@ -21,7 +21,7 @@ A repository hosts all your dataset files, including the revision history, makin
 
 ### Upload dataset
 
-1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg` among many others (see full list [here](./datasets-viewer-configure.md)).
+1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, image and other data extensions such as `.csv`, `.mp3`, and `.jpg` (see the full list of [File formats](#file-formats)).
 
 <div class="flex justify-center">
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/upload_files.png"/>
@@ -70,7 +70,7 @@ Make sure the Dataset Viewer correctly shows your data, or [Configure the Datase
 
 ## Using the `huggingface_hub` client library
 
-The rich features set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Model Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more.
+The rich feature set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more.
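
As a minimal sketch of that workflow (the repo id `username/my_dataset` and the file name `train.csv` are placeholders, and a prior `huggingface-cli login` is assumed):

```python
from huggingface_hub import HfApi

def upload_dataset_file(repo_id: str, local_path: str) -> None:
    """Create a dataset repo if needed and upload a single file to it."""
    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=local_path,  # local file to upload
        path_in_repo=local_path,     # destination path inside the repo
        repo_id=repo_id,
        repo_type="dataset",
    )

# upload_dataset_file("username/my_dataset", "train.csv")  # placeholder values
```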
 
 ## Using other libraries
 
@@ -79,7 +79,7 @@ See the list of [Libraries supported by the Datasets Hub](./datasets-libraries)
 
 ## Using Git
 
-Since dataset repos are just Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.
+Since dataset repos are Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.
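
One possible flow with the `git` CLI, sketched with a hypothetical repo id (Git LFS is assumed for large files):

```shell
# Clone a (hypothetical) dataset repo, add a data file, and push it back
git lfs install
git clone https://huggingface.co/datasets/username/my_dataset
cd my_dataset
git add train.csv
git commit -m "Add training data"
git push
```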
 
 ## File formats
 
@@ -94,7 +94,7 @@ The Hub natively supports multiple file formats:
 
 It also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz).
 
-Image and audio resources can also have additional metadata files, see the [Data files Configuration](./datasets-data-files-configuration) on image and audio datasets.
+Image and audio resources can also have additional metadata files; see the [Data files Configuration](./datasets-data-files-configuration#image-and-audio-datasets) page on image and audio datasets.
 
 You may want to convert your files to these formats to benefit from all the Hub features.
 Other formats and structures may not be recognized by the Hub.
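
Since unrecognized formats may not get full Hub support, converting is often worthwhile; a small stdlib sketch converting a CSV file to JSON Lines (the function name is illustrative):

```python
import csv
import json

def csv_to_jsonl(csv_path: str, jsonl_path: str) -> int:
    """Convert a CSV file to JSON Lines; returns the number of records written."""
    with open(csv_path, newline="") as src, open(jsonl_path, "w") as dst:
        n = 0
        for row in csv.DictReader(src):
            dst.write(json.dumps(row) + "\n")
            n += 1
    return n
```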

docs/hub/datasets-cards.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 
 Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.
 
-You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub. Tags are defined in a YAML metadata section at the top of the `README.md` file.
+You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, as well as [data files configuration](./datasets-manual-configuration.md) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.
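
A hedged illustration of such a YAML section (the values are made up; keys like `license`, `language`, and `tags` are standard dataset card metadata):

```yaml
---
license: mit
language:
- en
tags:
- translation
size_categories:
- 1K<n<10K
---
```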
 
 ## Dataset card metadata
 
docs/hub/datasets-dask.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 # Dask
 
 [Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
-Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
+Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
 
 First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:
@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
 HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
 ```
 
-Finally, you can use Hugging Face paths in Dask:
+Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in Dask:
 
 ```python
 import dask.dataframe as dd
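# A hedged sketch of how such an example can continue: reading and writing on
# the Hub via hf:// paths ("username/my_dataset" is a placeholder repo id;
# requires huggingface_hub installed and a prior login).

def roundtrip_on_hub(repo_id: str) -> None:
    # Read a CSV from the Hub and write it back as Parquet, both over hf://
    df = dd.read_csv(f"hf://datasets/{repo_id}/data.csv")
    df.to_parquet(f"hf://datasets/{repo_id}/data.parquet")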

docs/hub/datasets-downloading.md

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@
 
 ## Integrated libraries
 
-If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in _Library_" button on the dataset page to see how to do so. For example, `samsum` shows how to do so with 🤗 Datasets below.
+If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in dataset library" button on the dataset page to see how to do so. For example, [`samsum`](https://huggingface.co/datasets/samsum?library=true) shows how to do so with 🤗 Datasets below.
 
 <div class="flex justify-center">
 <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-usage.png"/>
@@ -16,7 +16,7 @@ If a dataset on the Hub is tied to a [supported library](./datasets-libraries),
 
 ## Using the Hugging Face Client Library
 
-You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to create, delete, update and retrieve information from repos. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas.
+You can use the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub) library to create, delete, update and retrieve information from repos. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas.
 
 ```py
 from huggingface_hub import hf_hub_download
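# A hedged sketch of the loading helper this example builds toward (the repo id
# and filename are placeholders):
import pandas as pd

def load_csv_from_hub(repo_id: str, filename: str) -> "pd.DataFrame":
    # Download one file from a dataset repo into the local cache, then load it
    path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    return pd.read_csv(path)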
@@ -32,7 +32,7 @@ dataset = pd.read_csv(
 
 ## Using Git
 
-Since all datasets on the dataset Hub are Git repositories, you can clone the datasets locally by running:
+Since all datasets on the Hub are Git repositories, you can clone the datasets locally by running:
 
 ```bash
 git lfs install

docs/hub/datasets-duckdb.md

Lines changed: 4 additions & 2 deletions
@@ -1,7 +1,7 @@
 # DuckDB
 
 [DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system.
-Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
+Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
 
 First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:
@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
 HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
 ```
 
-Finally, you can use Hugging Face paths in DuckDB:
+Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in DuckDB:
 
 ```python
 >>> from huggingface_hub import HfFileSystem
@@ -39,3 +39,5 @@ You can reload it later:
 >>> duckdb.register_filesystem(fs)
 >>> df = duckdb.query("SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;").df()
 ```
+
+For more information on Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system).

docs/hub/datasets-file-names-and-splits.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 
 To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files.
 
-This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer.
+This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Datasets Hub features like the Dataset Viewer.
 A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a dataset viewer on its page on the Hub.
 
 Note that if none of the structures below suits your case, you can have more control over how you define splits and subsets with the [Manual Configuration](./datasets-manual-configuration).
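
For instance (a hypothetical repository; file names containing `train` and `test` are one of the supported patterns for defining those splits):

```
my_dataset/
├── README.md
├── train.csv
└── test.csv
```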

docs/hub/datasets-gated.md

Lines changed: 5 additions & 4 deletions
@@ -12,9 +12,10 @@ The User Access request dialog can be modified to include additional text and ch
 ---
 extra_gated_prompt: "You agree to not attempt to determine the identity of individuals in this dataset"
 extra_gated_fields:
-  Company: text
-  Country: text
-  I agree to use this dataset for non-commercial use ONLY: checkbox
+  Name: text
+  Affiliation: text
+  Email: text
+  I agree to not attempt to determine the identity of speakers in this dataset: checkbox
 ---
 ```
2021

@@ -73,4 +74,4 @@ In some cases, you might also want to modify the text in the heading of the gate
 extra_gated_heading: "Acknowledge license to accept the repository"
 extra_gated_button_content: "Acknowledge license"
 ---
-```
+```

docs/hub/datasets-libraries.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Libraries
 
-The Dataset Hub has support for several libraries in the Open Source ecosystem.
+The Datasets Hub has support for several libraries in the Open Source ecosystem.
 Thanks to the [huggingface_hub Python library](../huggingface_hub), it's easy to enable sharing your datasets on the Hub.
 We're happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward.

docs/hub/datasets-overview.md

Lines changed: 3 additions & 3 deletions
@@ -2,13 +2,13 @@
 
 ## Datasets on the Hub
 
-The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a Dataset Preview to showcase the data.
+The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a [Dataset Viewer](./datasets-viewer) to showcase the data.
 
-Each dataset is a [Git repository](./repositories), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Structure your repository guide](https://huggingface.co/docs/datasets/repository_structure). Following the supported repo structure will ensure that your repository will have a preview on its dataset page on the Hub.
+Each dataset is a [Git repository](./repositories), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure will ensure that the dataset page on the Hub will have a Viewer.
 
 ## Search for datasets
 
-Like models and Spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets). There's a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you.
+Like models and Spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets). There are a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you.
 
 <div class="flex justify-center">
 <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-main.png"/>

docs/hub/datasets-pandas.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 # Pandas
 
 [Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit.
-Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
+Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
 
 First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:
@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
 HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
 ```
 
-Finally, you can use Hugging Face paths in Pandas:
+Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in Pandas:
 
 ```python
 import pandas as pd
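# A hedged sketch of how such an example can continue ("username/my_dataset" is
# a placeholder repo id; requires huggingface_hub installed and a prior login):

def read_hub_csv(repo_id: str) -> "pd.DataFrame":
    # Read a CSV file directly from the Hub via an hf:// path
    return pd.read_csv(f"hf://datasets/{repo_id}/data.csv")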
