docs/hub/datasets-adding.md (5 additions & 5 deletions)
@@ -12,7 +12,7 @@ The Hub's web-based interface allows users without any developer experience to u
A repository hosts all your dataset files, including the revision history, making it possible to store more than one dataset version.
- 1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset).
+ 1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset).
2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization.
<div class="flex justify-center">
@@ -21,7 +21,7 @@ A repository hosts all your dataset files, including the revision history, makin
### Upload dataset
- 1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg` among many others (see full list [here](./datasets-viewer-configure.md)).
+ 1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, image and other data extensions such as `.csv`, `.mp3`, and `.jpg` (see the full list of [File formats](#file-formats)).
@@ -70,7 +70,7 @@ Make sure the Dataset Viewer correctly shows your data, or [Configure the Datase
## Using the `huggingface_hub` client library
- The rich feature set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Model Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more.
+ The rich feature set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more.
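For instance, here is a minimal sketch of creating a dataset repository and uploading a local folder with `huggingface_hub` (the repo id `username/my_dataset` and the `data/` folder are hypothetical):

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already authenticated, e.g. via `huggingface-cli login`

# Create a new (hypothetical) dataset repository
api.create_repo(repo_id="username/my_dataset", repo_type="dataset")

# Upload all files from a local folder to the repository
api.upload_folder(
    folder_path="data/",
    repo_id="username/my_dataset",
    repo_type="dataset",
)
```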
## Using other libraries
@@ -79,7 +79,7 @@ See the list of [Libraries supported by the Datasets Hub](./datasets-libraries)
## Using Git
- Since dataset repos are just Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.
+ Since dataset repos are Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.
It also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz).
- Image and audio resources can also have additional metadata files, see the [Data files Configuration](./datasets-data-files-configuration) on image and audio datasets.
+ Image and audio resources can also have additional metadata files, see the [Data files Configuration](./datasets-data-files-configuration#image-and-audio-datasets) on image and audio datasets.
You may want to convert your files to these formats to benefit from all the Hub features.
Other formats and structures may not be recognized by the Hub.
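As an illustration of such a conversion, here is a minimal pandas sketch (the file names are hypothetical) that rewrites a CSV file as Parquet before uploading it:

```python
import pandas as pd  # pandas needs pyarrow (or fastparquet) installed to write Parquet

# Load an existing CSV file and save it in a Hub-friendly columnar format
df = pd.read_csv("data.csv")
df.to_parquet("data.parquet", index=False)
```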
docs/hub/datasets-cards.md (1 addition & 1 deletion)
@@ -4,7 +4,7 @@
Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.
- You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub. Tags are defined in a YAML metadata section at the top of the `README.md` file.
+ You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration.md) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.
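For example, a minimal sketch of writing that YAML metadata programmatically with `huggingface_hub` (the repo id and tag values are hypothetical):

```python
from huggingface_hub import DatasetCard, DatasetCardData

# Metadata that will be rendered as the YAML block at the top of README.md
card_data = DatasetCardData(
    license="mit",
    language="en",
    task_categories=["text-classification"],
)

# Wrap the metadata in a dataset card and push it to a (hypothetical) dataset repo
card = DatasetCard(f"---\n{card_data.to_yaml()}\n---\n\n# My dataset\n")
card.push_to_hub("username/my_dataset")
```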
docs/hub/datasets-dask.md (2 additions & 2 deletions)
@@ -1,7 +1,7 @@
# Dask
[Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
- Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
+ Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:
@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
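Once logged in, reading and writing through `hf://` paths with Dask could look like this minimal sketch (the repo id and file names are hypothetical):

```python
import dask.dataframe as dd

# Read a Parquet file hosted in a (hypothetical) dataset repo on the Hub
df = dd.read_parquet("hf://datasets/username/my_dataset/data.parquet")

# ... lazy transformations with Dask go here ...

# Write the result back to the repo as one or more Parquet files
# (assumes you have write access to the repository)
df.to_parquet("hf://datasets/username/my_dataset/processed")
```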
docs/hub/datasets-downloading.md (3 additions & 3 deletions)
@@ -2,7 +2,7 @@
## Integrated libraries
- If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in _Library_" button on the dataset page to see how to do so. For example, `samsum` shows how to do so with 🤗 Datasets below.
+ If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in dataset library" button on the dataset page to see how to do so. For example, [`samsum`](https://huggingface.co/datasets/samsum?library=true) shows how to do so with 🤗 Datasets below.
@@ -16,7 +16,7 @@ If a dataset on the Hub is tied to a [supported library](./datasets-libraries),
## Using the Hugging Face Client Library
- You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to create, delete, update and retrieve information from repos. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas.
+ You can use the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub) library to create, delete, update and retrieve information from repos. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas.
```py
from huggingface_hub import hf_hub_download
@@ -32,7 +32,7 @@ dataset = pd.read_csv(
## Using Git
- Since all datasets on the dataset Hub are Git repositories, you can clone the datasets locally by running:
+ Since all datasets on the Hub are Git repositories, you can clone the datasets locally by running:
docs/hub/datasets-duckdb.md (4 additions & 2 deletions)
@@ -1,7 +1,7 @@
# DuckDB
[DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system.
- Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
+ Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:
@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
- Finally, you can use Hugging Face paths in DuckDB:
+ Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in DuckDB:
```python
>>> from huggingface_hub import HfFileSystem
@@ -39,3 +39,5 @@ You can reload it later:
>>> duckdb.register_filesystem(fs)
>>> df = duckdb.query("SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;").df()
```
+ For more information on Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system).
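Putting the pieces above together, a self-contained sketch (the repo id and file name are hypothetical) could look like:

```python
import duckdb
from huggingface_hub import HfFileSystem

# Register the Hugging Face filesystem so DuckDB can resolve hf:// paths
fs = HfFileSystem()
duckdb.register_filesystem(fs)

# Query a Parquet file from a (hypothetical) dataset repo on the Hub
df = duckdb.query(
    "SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;"
).df()
print(df)
```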
docs/hub/datasets-file-names-and-splits.md (1 addition & 1 deletion)
@@ -2,7 +2,7 @@
To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files.
- This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer.
+ This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Datasets Hub features like the Dataset Viewer.
A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a dataset viewer on its page on the Hub.
Note that if none of the structures below suits your case, you can have more control over how you define splits and subsets with the [Manual Configuration](./datasets-manual-configuration).
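As an illustration, a minimal sketch (the repo id and file names are hypothetical) of uploading files whose names follow the split convention, so the Hub can detect the `train` and `test` splits automatically:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already authenticated

# Files named train.csv and test.csv let the Hub infer the corresponding splits
for local_file in ["train.csv", "test.csv"]:
    api.upload_file(
        path_or_fileobj=local_file,
        path_in_repo=local_file,
        repo_id="username/my_dataset",
        repo_type="dataset",
    )
```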
docs/hub/datasets-overview.md (3 additions & 3 deletions)
@@ -2,13 +2,13 @@
## Datasets on the Hub
- The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a Dataset Preview to showcase the data.
+ The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a [Dataset Viewer](./datasets-viewer) to showcase the data.
- Each dataset is a [Git repository](./repositories), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Structure your repository guide](https://huggingface.co/docs/datasets/repository_structure). Following the supported repo structure will ensure that your repository will have a preview on its dataset page on the Hub.
+ Each dataset is a [Git repository](./repositories), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure will ensure that the dataset page on the Hub will have a Viewer.
## Search for datasets
- Like models and Spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets). There's a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you.
+ Like models and spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets). There's a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you.
docs/hub/datasets-pandas.md (2 additions & 2 deletions)
@@ -1,7 +1,7 @@
# Pandas
[Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit.
- Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
+ Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:
@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
- Finally, you can use Hugging Face paths in Pandas:
+ Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in Pandas:
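A minimal sketch of such usage (the repo id and file names are hypothetical):

```python
import pandas as pd  # with huggingface_hub installed, pandas resolves hf:// paths via fsspec

# Read a CSV file from a (hypothetical) dataset repo on the Hub
df = pd.read_csv("hf://datasets/username/my_dataset/data.csv")

# Write it back to the repo as a Parquet file (assumes write access to the repo)
df.to_parquet("hf://datasets/username/my_dataset/data.parquet", index=False)
```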