CONTRIBUTING.md: 2 additions, 1 deletion
@@ -126,7 +126,8 @@ A [more complete guide](https://github.com/huggingface/datasets/blob/master/ADD_
 6. Finally, take some time to document your dataset for other users. Each dataset should be accompanied by a `README.md` dataset card in its directory, which describes the data and contains tags for the supported languages and tasks so that the dataset is easily discoverable. You can find information on how to fill out the card, either manually or by using our [web app](https://huggingface.co/datasets/card-creator/), in the following [guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md).
-7. If all tests pass, your dataset works correctly. Awesome! You can now follow steps 6, 7 and 8 of the section [*How to contribute to 🤗Datasets?*](#how-to-contribute-to-🤗Datasets). If you experience problems with the dummy data tests, you might want to take a look at the section *Help for dummy data tests* below.
+
+7. If all tests pass, your dataset works correctly. Awesome! You can now follow steps 6, 7 and 8 of the section [*How to contribute to 🤗 Datasets?*](#how-to-contribute-to-Datasets). If you experience problems with the dummy data tests, you might want to take a look at the section *Help for dummy data tests* below.
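
One quick way to run that smoke test before opening the PR is to point `load_dataset` at your local loading script. This is a minimal editorial sketch, not part of the diff above, and `datasets/my_dataset/my_dataset.py` is a hypothetical placeholder for your own dataset's script:

```python
from datasets import load_dataset

# Load from the local loading script instead of the Hub
# ("./datasets/my_dataset/my_dataset.py" is a placeholder path).
dataset = load_dataset("./datasets/my_dataset/my_dataset.py")

# If the script is correct, this prints the splits and their sizes.
print(dataset)
```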
README.md

-`🤗Datasets` is a lightweight library providing **two** main features:
+🤗 Datasets is a lightweight library providing **two** main features:
 
 - **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/pandas/PyTorch/TensorFlow/JAX),
 - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like `tokenized_dataset = dataset.map(tokenize_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
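
Concretely, the two features above compose in a few lines. This is an editorial sketch, not part of the diff; the `tokenize_example` function is a stand-in for your own preprocessing:

```python
from datasets import load_dataset

# One-line dataloader: downloads, caches and prepares SQuAD.
squad_dataset = load_dataset("squad")

# A stand-in preprocessing function; in practice you would call a tokenizer here.
def tokenize_example(example):
    example["num_words"] = len(example["question"].split())
    return example

# Efficient pre-processing: map runs over every split and caches the result.
tokenized_dataset = squad_dataset.map(tokenize_example)
```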
-`🤗Datasets` also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
+🤗 Datasets also provides access to 15+ evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
 
-`🤗Datasets` has many additional interesting features:
-- Thrive on large datasets: `🤗Datasets` naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
+🤗 Datasets has many additional interesting features:
+- Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM limitations; all datasets are memory-mapped using an efficient zero-serialization-cost backend (Apache Arrow).
 - Smart caching: never wait for your data to process several times.
 - Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
 - Built-in interoperability with NumPy, pandas, PyTorch, TensorFlow 2 and JAX.
 
-`🤗Datasets` originated from a fork of the awesome [`TensorFlow Datasets`](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between `🤗Datasets` and `tfds` can be found in the section [Main differences between `🤗Datasets` and `tfds`](#main-differences-between-datasets-and-tfds).
+🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets), and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between-datasets-and-tfds).
 
 # Installation
 
 ## With pip
 
-`🤗Datasets` can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance)
+🤗 Datasets can be installed from PyPI and has to be installed in a virtual environment (venv or conda, for instance):
 
 ```bash
 pip install datasets
 ```
 
 ## With conda
 
-`🤗Datasets` can be installed using conda as follows:
+🤗 Datasets can be installed using conda as follows:
@@ -72,13 +72,13 @@ For more details on installation, check the installation page in the documentati
 
 ## Installation to use with PyTorch/TensorFlow/pandas
 
-If you plan to use `🤗Datasets` with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.
+If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.
 
 For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html
 
 # Usage
 
-`🤗Datasets` is made to be very simple to use. The main methods are:
+🤗 Datasets is made to be very simple to use. The main methods are:
 
 - `datasets.list_datasets()` to list the available datasets
 - `datasets.load_dataset(dataset_name, **kwargs)` to instantiate a dataset
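
As a usage note (editorial, not part of the diff), these entry points chain together as in the following sketch; the dataset name is just an example:

```python
import datasets

# Discover what is available on the Hub.
print(datasets.list_datasets()[:5])

# Instantiate a dataset by name; keyword arguments select things
# like a configuration or a split.
squad = datasets.load_dataset("squad", split="validation")
print(squad)
```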
@@ -117,11 +117,11 @@ For more details on using the library, check the quick tour page in the document
 - Loading a dataset: https://huggingface.co/docs/datasets/loading_datasets.html
 - What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html
-- Processing data with `🤗Datasets`: https://huggingface.co/docs/datasets/processing.html
+- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/processing.html
 - Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html
 - etc.
 
-Another introduction to `🤗Datasets` is the tutorial on Google Colab here:
+Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)
 
 # Add a new dataset to the Hub
@@ -132,17 +132,17 @@ You will find [the step-by-step guide here](https://github.com/huggingface/datas
 
 You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share_dataset.html).
 
-# Main differences between `🤗Datasets` and `tfds`
+# Main differences between 🤗 Datasets and `tfds`
 
-If you are familiar with the great `Tensorflow Datasets`, here are the main differences between `🤗Datasets` and `tfds`:
-- the scripts in `🤗Datasets` are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
-- `🤗Datasets` also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/).
-- the backend serialization of `🤗Datasets` is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
-- the user-facing dataset object of `🤗Datasets` is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache.
+If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`:
+- the scripts in 🤗 Datasets are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
+- 🤗 Datasets also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric, for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/).
+- the backend serialization of 🤗 Datasets is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverages Python dataclasses for info and features, with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
+- the user-facing dataset object of 🤗 Datasets is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache.
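
To make the metrics point above concrete (an editorial sketch, not part of the diff): a dynamically installed metric script pairs with a dataset through the same loading pattern, using the `load_metric` API of this era (since moved to the separate `evaluate` library):

```python
from datasets import load_dataset, load_metric

# Both the dataset script and the metric script are fetched and cached on demand.
squad = load_dataset("squad", split="validation")
squad_metric = load_metric("squad")

# Toy sanity check: "predict" the reference answer for the first question.
example = squad[0]
predictions = [{"id": example["id"], "prediction_text": example["answers"]["text"][0]}]
references = [{"id": example["id"], "answers": example["answers"]}]
print(squad_metric.compute(predictions=predictions, references=references))
# Expected: exact_match and f1 of 100.0 for this single example.
```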
 
 # Disclaimers
 
-Similar to `TensorFlow Datasets`, `🤗Datasets` is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
+Similar to TensorFlow Datasets, 🤗 Datasets is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
 
 If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a [GitHub issue](https://github.com/huggingface/datasets/issues/new). Thanks for your contribution to the ML community!
datasets/norne/README.md: 1 addition, 1 deletion
@@ -238,7 +238,7 @@ To access these reduced versions of the dataset, you can use the configs `bokmaa
 NorNE was created as a collaboration between [Schibsted Media Group](https://schibsted.com/), [Språkbanken](https://www.nb.no/forskning/sprakbanken/) at the [National Library of Norway](https://www.nb.no) and the [Language Technology Group](https://www.mn.uio.no/ifi/english/research/groups/ltg/) at the University of Oslo.
 
-NorNE was added to Huggingface Datasets by the AI-Lab at the National Library of Norway.
+NorNE was added to 🤗 Datasets by the AI-Lab at the National Library of Norway.
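
As a usage note (editorial, not part of the diff), loading one of the NorNE configurations follows the standard pattern; `bokmaal` is one of the config names this card describes:

```python
from datasets import load_dataset

# Load the Bokmål portion of NorNE; other configs (e.g. "nynorsk", or the
# reduced tag-set variants mentioned in the card) select other views.
norne = load_dataset("norne", "bokmaal")
print(norne["train"][0])
```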
docs/source/exploring.rst: 1 addition, 1 deletion
@@ -190,7 +190,7 @@ Up to now, the rows/batches/columns returned when querying the elements of the d
 Sometimes we would like to have more sophisticated objects returned by our dataset, for instance NumPy arrays or PyTorch tensors instead of Python lists.
 
-🤗Datasets provides a way to do that through what is called a ``format``.
+🤗 Datasets provides a way to do that through what is called a ``format``.
 
 While the internal storage of the dataset is always the Apache Arrow format, by setting a specific format on a dataset, you can filter some columns and cast the output of :func:`datasets.Dataset.__getitem__` to NumPy/pandas/PyTorch/TensorFlow, on-the-fly.
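
A short sketch of what setting a format looks like in practice (an editorial illustration using a toy in-memory dataset, not part of the diff):

```python
from datasets import Dataset

# A tiny in-memory dataset to demonstrate formats.
dataset = Dataset.from_dict({"x": [[1.0, 2.0], [3.0, 4.0]], "label": [0, 1]})

# Return NumPy arrays for the listed columns; the underlying storage
# stays Arrow, and the cast happens on-the-fly in __getitem__.
dataset.set_format(type="numpy", columns=["x", "label"])
print(type(dataset[0]["x"]))  # <class 'numpy.ndarray'>

# Restore plain Python objects for all columns.
dataset.reset_format()
```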
docs/source/index.rst: 5 additions, 5 deletions
@@ -5,17 +5,17 @@ Datasets and evaluation metrics for natural language processing
 
 Compatible with NumPy, Pandas, PyTorch and TensorFlow
 
-🤗Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).
+🤗 Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).
 
-🤗Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):
+🤗 Datasets has many interesting features (besides easy sharing of and access to datasets/metrics):
 
 Built-in interoperability with NumPy, pandas, PyTorch and TensorFlow 2
 Lightweight and fast with a transparent and pythonic API
-Strive on large datasets: 🤗Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
+Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM limitations; all datasets are memory-mapped on drive by default.
 Smart caching: never wait for your data to process several times
-🤗Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗Datasets viewer.
+🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.
 
-🤗Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗Datasets and tfds can be found in the section Main differences between 🤗Datasets and tfds.
+🤗 Datasets originated from a fork of the awesome TensorFlow Datasets, and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section Main differences between 🤗 Datasets and `tfds`.
docs/source/installation.md: 8 additions, 8 deletions
@@ -1,21 +1,21 @@
 # Installation
 
-🤗Datasets is tested on Python 3.6+.
+🤗 Datasets is tested on Python 3.6+.
 
-You should install 🤗Datasets in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
+You should install 🤗 Datasets in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
 unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). Create a virtual environment with the version of Python you're going to use and activate it.
 
-Now, if you want to use 🤗Datasets, you can install it with pip. If you'd like to play with the examples, you must install it from source.
+Now, if you want to use 🤗 Datasets, you can install it with pip. If you'd like to play with the examples, you must install it from source.
 
 ## Installation with pip
 
-🤗Datasets can be installed using pip as follows:
+🤗 Datasets can be installed using pip as follows:
 
 ```bash
 pip install datasets
 ```
 
-To check 🤗Datasets is properly installed, run the following command:
+To check that 🤗 Datasets is properly installed, run the following command:
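
The check command itself falls between the two hunks shown here; as an editorial illustration, a Python equivalent consistent with the expected output below would be roughly:

```python
from datasets import load_dataset

# Downloads SQuAD v1 on first use, then prints the first training example;
# the expected output is the record shown in the next hunk.
squad = load_dataset("squad", split="train")
print(squad[0])
```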
@@ -27,7 +27,7 @@ It should download version 1 of the [Stanford Question Answering Dataset](https:
 {'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}
 ```
 
-If you want to use the 🤗Datasets library with TensorFlow 2.0 or PyTorch, you will need to install these seperately.
+If you want to use the 🤗 Datasets library with TensorFlow 2.0 or PyTorch, you will need to install them separately.
 Please refer to the [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available)
 and/or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform.