
Commit 4aff493

mariosasko, stas00, and lhoestq authored
More consistent naming (#2611)
* More consistent naming
* Update datasets/norne/README.md
  Co-authored-by: Stas Bekman <[email protected]>
* Fix anchor
  Co-authored-by: Stas Bekman <[email protected]>
* Remove backticks in name
* Remove backticks in anchor
* Replace Tensorflow with TensorFlow
* more 🤗

Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
1 parent dcc2cf1 commit 4aff493


15 files changed: +81 -80 lines


CONTRIBUTING.md

Lines changed: 2 additions & 1 deletion
@@ -126,7 +126,8 @@ A [more complete guide](https://github.com/huggingface/datasets/blob/master/ADD_
 
 6. Finally, take some time to document your dataset for other users. Each dataset should be accompanied by a `README.md` dataset card in its directory which describes the data and contains tags representing languages and tasks supported to be easily discoverable. You can find information on how to fill out the card either manually or by using our [web app](https://huggingface.co/datasets/card-creator/) in the following [guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md).
 
-7. If all tests pass, your dataset works correctly. Awesome! You can now follow steps 6, 7 and 8 of the section [*How to contribute to 🤗Datasets?*](#how-to-contribute-to-🤗Datasets). If you experience problems with the dummy data tests, you might want to take a look at the section *Help for dummy data tests* below.
+7. If all tests pass, your dataset works correctly. Awesome! You can now follow steps 6, 7 and 8 of the section [*How to contribute to 🤗 Datasets?*](#how-to-contribute-to-Datasets). If you experience problems with the dummy data tests, you might want to take a look at the section *Help for dummy data tests* below.
+
 
 
 ### Help for dummy data tests

README.md

Lines changed: 19 additions & 19 deletions
@@ -25,7 +25,7 @@
     <a href="https://zenodo.org/badge/latestdoi/250213286"><img src="https://zenodo.org/badge/250213286.svg" alt="DOI"></a>
 </p>
 
-`🤗Datasets` is a lightweight library providing **two** main features:
+🤗 Datasets is a lightweight library providing **two** main features:
 
 - **one-line dataloaders for many public datasets**: one liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
 - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like `tokenized_dataset = dataset.map(tokenize_exemple)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
@@ -38,29 +38,29 @@
     <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/master/docs/source/imgs/course_banner.png"></a>
 </h3>
 
-`🤗Datasets` also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
+🤗 Datasets also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
 
-`🤗Datasets` has many additional interesting features:
-- Thrive on large datasets: `🤗Datasets` naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
+🤗 Datasets has many additional interesting features:
+- Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
 - Smart caching: never wait for your data to process several times.
 - Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
 - Built-in interoperability with NumPy, pandas, PyTorch, Tensorflow 2 and JAX.
 
-`🤗Datasets` originated from a fork of the awesome [`TensorFlow Datasets`](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between `🤗Datasets` and `tfds` can be found in the section [Main differences between `🤗Datasets` and `tfds`](#main-differences-between-datasets-and-tfds).
+🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between-datasets-and-tfds).
 
 # Installation
 
 ## With pip
 
-`🤗Datasets` can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance)
+🤗 Datasets can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance)
 
 ```bash
 pip install datasets
 ```
 
 ## With conda
 
-`🤗Datasets` can be installed using conda as follows:
+🤗 Datasets can be installed using conda as follows:
 
 ```bash
 conda install -c huggingface -c conda-forge datasets
@@ -72,13 +72,13 @@ For more details on installation, check the installation page in the documentati
 
 ## Installation to use with PyTorch/TensorFlow/pandas
 
-If you plan to use `🤗Datasets` with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.
+If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.
 
 For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html
 
 # Usage
 
-`🤗Datasets` is made to be very simple to use. The main methods are:
+🤗 Datasets is made to be very simple to use. The main methods are:
 
 - `datasets.list_datasets()` to list the available datasets
 - `datasets.load_dataset(dataset_name, **kwargs)` to instantiate a dataset
@@ -106,7 +106,7 @@ squad_metric = load_metric('squad')
 # Process the dataset - add a column with the length of the context texts
 dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})
 
-# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗Transformers library)
+# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

@@ -117,11 +117,11 @@ For more details on using the library, check the quick tour page in the document
 
 - Loading a dataset https://huggingface.co/docs/datasets/loading_datasets.html
 - What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html
-- Processing data with `🤗Datasets`: https://huggingface.co/docs/datasets/processing.html
+- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/processing.html
 - Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html
 - etc.
 
-Another introduction to `🤗Datasets` is the tutorial on Google Colab here:
+Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)
 
 # Add a new dataset to the Hub
@@ -132,17 +132,17 @@ You will find [the step-by-step guide here](https://github.com/huggingface/datas
 
 You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share_dataset.html).
 
-# Main differences between `🤗Datasets` and `tfds`
+# Main differences between 🤗 Datasets and `tfds`
 
-If you are familiar with the great `Tensorflow Datasets`, here are the main differences between `🤗Datasets` and `tfds`:
-- the scripts in `🤗Datasets` are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
-- `🤗Datasets` also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/).
-- the backend serialization of `🤗Datasets` is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
-- the user-facing dataset object of `🤗Datasets` is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache.
+If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`:
+- the scripts in 🤗 Datasets are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
+- 🤗 Datasets also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/).
+- the backend serialization of 🤗 Datasets is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
+- the user-facing dataset object of 🤗 Datasets is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache.
 
 # Disclaimers
 
-Similar to `TensorFlow Datasets`, `🤗Datasets` is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
+Similar to TensorFlow Datasets, 🤗 Datasets is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
 
 If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a [GitHub issue](https://github.com/huggingface/datasets/issues/new). Thanks for your contribution to the ML community!
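
The README hunks above walk through the library's basic workflow: listing datasets, loading SQuAD and its metric, and processing with `map()` and a 🤗 Transformers tokenizer. For reference, a minimal end-to-end sketch of that workflow, assuming only the `datasets` and `transformers` APIs named in the diff, looks like this; it is an illustration rather than part of the commit:

```python
# Minimal sketch of the workflow described in the README hunks above.
# Assumes `datasets` and `transformers` are installed; the names used here
# (load_dataset, load_metric, Dataset.map, AutoTokenizer) appear in the diff.
from datasets import list_datasets, load_dataset, load_metric
from transformers import AutoTokenizer

print(len(list_datasets()))  # number of datasets available on the Hub
squad_dataset = load_dataset("squad", split="train")
squad_metric = load_metric("squad")

# Add a column with the length of each context text
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Tokenize the context texts with a 🤗 Transformers tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x["context"]), batched=True)
```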

datasets/newsroom/README.md

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ And additional features:
 - compression_bin: low, medium, high.
 
 This dataset can be downloaded upon requests. Unzip all the contents
-"train.jsonl, dev.josnl, test.jsonl" to the tfds folder.
+"train.jsonl, dev.josnl, test.jsonl" to the `tfds` folder.
 
 ### Supported Tasks and Leaderboards
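
Since Newsroom is distributed on request rather than downloaded automatically, loading it presumably follows the usual manual-download pattern in 🤗 Datasets, where the directory containing the extracted files is passed via `data_dir`. This is a hedged sketch with a placeholder path, not text from the dataset card:

```python
# Hypothetical loading call for a manual-download dataset: the extracted
# train.jsonl / dev.jsonl / test.jsonl files live in a local directory that
# is passed through `data_dir` (the path below is a placeholder).
from datasets import load_dataset

newsroom = load_dataset("newsroom", data_dir="/path/to/extracted/newsroom")
print(newsroom["train"][0])
```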

datasets/norne/README.md

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ To access these reduced versions of the dataset, you can use the configs `bokmaa
 
 NorNE was created as a collaboration between [Schibsted Media Group](https://schibsted.com/), [Språkbanken](https://www.nb.no/forskning/sprakbanken/) at the [National Library of Norway](https://www.nb.no) and the [Language Technology Group](https://www.mn.uio.no/ifi/english/research/groups/ltg/) at the University of Oslo.
 
-NorNE was added to Huggingface Datasets by the AI-Lab at the National Library of Norway.
+NorNE was added to 🤗 Datasets by the AI-Lab at the National Library of Norway.
 
 ### Licensing Information

docs/source/exploring.rst

Lines changed: 1 addition & 1 deletion
@@ -190,7 +190,7 @@ Up to now, the rows/batches/columns returned when querying the elements of the d
 
 Sometimes we would like to have more sophisticated objects returned by our dataset, for instance NumPy arrays or PyTorch tensors instead of python lists.
 
-🤗Datasets provides a way to do that through what is called a ``format``.
+🤗 Datasets provides a way to do that through what is called a ``format``.
 
 While the internal storage of the dataset is always the Apache Arrow format, by setting a specific format on a dataset, you can filter some columns and cast the output of :func:`datasets.Dataset.__getitem__` in NumPy/pandas/PyTorch/TensorFlow, on-the-fly.
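
To make the ``format`` idea concrete, here is a small sketch using the `set_format`/`reset_format` methods of `datasets.Dataset`; the `length` column is just an example added via `map()` and is not part of the documentation page being diffed:

```python
# Sketch of the `format` mechanism: storage stays Apache Arrow, but the
# output of __getitem__ is cast on the fly to the requested type.
from datasets import load_dataset

dataset = load_dataset("squad", split="train")
dataset = dataset.map(lambda x: {"length": len(x["context"])})

# Only the selected columns are returned, as NumPy objects instead of Python ints
dataset.set_format(type="numpy", columns=["length"])
print(type(dataset[0]["length"]))

dataset.reset_format()  # back to plain Python objects
```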

docs/source/index.rst

Lines changed: 5 additions & 5 deletions
@@ -5,17 +5,17 @@ Datasets and evaluation metrics for natural language processing
 
 Compatible with NumPy, Pandas, PyTorch and TensorFlow
 
-🤗Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).
+🤗 Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).
 
-🤗Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):
+🤗 Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):
 
 Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
 Lightweight and fast with a transparent and pythonic API
-Strive on large datasets: 🤗Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
+Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
 Smart caching: never wait for your data to process several times
-🤗Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗Datasets viewer.
+🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.
 
-🤗Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗Datasets and tfds can be found in the section Main differences between 🤗Datasets and tfds.
+🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section Main differences between 🤗 Datasets and `tfds`.
 
 Contents
 ---------------------------------

docs/source/installation.md

Lines changed: 8 additions & 8 deletions
@@ -1,21 +1,21 @@
 # Installation
 
-🤗Datasets is tested on Python 3.6+.
+🤗 Datasets is tested on Python 3.6+.
 
-You should install 🤗Datasets in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
+You should install 🤗 Datasets in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
 unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). Create a virtual environment with the version of Python you're going to use and activate it.
 
-Now, if you want to use 🤗Datasets, you can install it with pip. If you'd like to play with the examples, you must install it from source.
+Now, if you want to use 🤗 Datasets, you can install it with pip. If you'd like to play with the examples, you must install it from source.
 
 ## Installation with pip
 
-🤗Datasets can be installed using pip as follows:
+🤗 Datasets can be installed using pip as follows:
 
 ```bash
 pip install datasets
 ```
 
-To check 🤗Datasets is properly installed, run the following command:
+To check 🤗 Datasets is properly installed, run the following command:
 
 ```bash
 python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"
@@ -27,7 +27,7 @@ It should download version 1 of the [Stanford Question Answering Dataset](https:
 {'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}
 ```
 
-If you want to use the 🤗Datasets library with TensorFlow 2.0 or PyTorch, you will need to install these seperately.
+If you want to use the 🤗 Datasets library with TensorFlow 2.0 or PyTorch, you will need to install these seperately.
 Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available)
 and/or [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform.

@@ -48,11 +48,11 @@ Again, you can run:
 python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"
 ```
 
-to check 🤗Datasets is properly installed.
+to check 🤗 Datasets is properly installed.
 
 ## With conda
 
-🤗Datasets can be installed using conda as follows:
+🤗 Datasets can be installed using conda as follows:
 
 ```bash
 conda install -c huggingface -c conda-forge datasets
