
Commit bb66b6c

Fixes in docs (#7620)
* fixes in docs
* docstrings
1 parent 9dd00c4 commit bb66b6c

File tree

19 files changed: +118 additions, -65 deletions


docs/source/about_dataset_features.mdx

Lines changed: 9 additions & 9 deletions
@@ -10,10 +10,10 @@ Let's have a look at the features of the MRPC dataset from the GLUE benchmark:
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train')
 >>> dataset.features
-{'idx': Value(dtype='int32', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
+{'idx': Value(dtype='int32'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
 }
 ```

@@ -38,11 +38,11 @@ If your data type contains a list of objects, then you want to use the [`Sequenc
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('rajpurkar/squad', split='train')
 >>> dataset.features
-{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
- 'context': Value(dtype='string', id=None),
- 'id': Value(dtype='string', id=None),
- 'question': Value(dtype='string', id=None),
- 'title': Value(dtype='string', id=None)}
+{'answers': Sequence(feature={'text': Value(dtype='string'), 'answer_start': Value(dtype='int32')}, length=-1),
+ 'context': Value(dtype='string'),
+ 'id': Value(dtype='string'),
+ 'question': Value(dtype='string'),
+ 'title': Value(dtype='string')}
 ```

 The `answers` field is constructed using the [`Sequence`] feature because it contains two subfields, `text` and `answer_start`, which are lists of `string` and `int32`, respectively.
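For reference, declaring such a nested schema by hand looks roughly like the following — a minimal sketch using the field names from the SQuAD example above, not part of this commit:

```python
from datasets import Features, Sequence, Value

# Sketch: a SQuAD-like schema where "answers" holds lists of strings and ints
features = Features({
    "answers": Sequence({"text": Value("string"), "answer_start": Value("int32")}),
    "context": Value("string"),
    "id": Value("string"),
    "question": Value("string"),
    "title": Value("string"),
})
```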

docs/source/access.mdx

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ You can combine row and column name indexing to return a specific value at a pos
 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
 ```

-Indexing order doesn't matter. Indexing by the column name first returns a [`Column`] object that you can index as usual with row indices as usual:
+Indexing order doesn't matter. Indexing by the column name first returns a [`Column`] object that you can index as usual with row indices:

 ```py
 >>> import time
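To illustrate the [`Column`] behavior this change refers to, a minimal sketch — the dataset id is assumed here for illustration:

```python
from datasets import load_dataset

# Sketch: column-first indexing (dataset id assumed for illustration)
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

texts = dataset["text"]  # a Column object rather than a plain list
print(texts[0])          # indexing the Column by row gives the same value as dataset[0]["text"]
```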

docs/source/index.mdx

Lines changed: 2 additions & 2 deletions
@@ -2,9 +2,9 @@
 
 <img class="float-left !m-0 !border-0 !dark:border-0 !shadow-none !max-w-lg w-[150px]" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/datasets_logo.png"/>
 
-🤗 Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
+🤗 Datasets is a library for easily accessing and sharing AI datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
 
-Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the [Hugging Face Hub](https://huggingface.co/datasets), allowing you to easily load and share a dataset with the wider machine learning community.
+Load a dataset in a single line of code, and use our powerful data processing and streaming methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the [Hugging Face Hub](https://huggingface.co/datasets), allowing you to easily load and share a dataset with the wider machine learning community.
 
 Find your dataset today on the [Hugging Face Hub](https://huggingface.co/datasets), and take an in-depth look inside of it with the live viewer.

docs/source/load_hub.mdx

Lines changed: 2 additions & 2 deletions
@@ -20,8 +20,8 @@ Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 n
 
 # Inspect dataset features
 >>> ds_builder.info.features
-{'label': ClassLabel(names=['neg', 'pos'], id=None),
- 'text': Value(dtype='string', id=None)}
+{'label': ClassLabel(names=['neg', 'pos']),
+ 'text': Value(dtype='string')}
 ```

 If you're happy with the dataset, then load it with [`load_dataset`]:
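A minimal sketch of that final step — the dataset id is assumed from the Rotten Tomatoes example used in this guide:

```python
from datasets import load_dataset

# Sketch: load the split after inspecting it with the builder
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
print(dataset[0])
```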

docs/source/loading.mdx

Lines changed: 2 additions & 2 deletions
@@ -417,6 +417,6 @@ Now when you look at your dataset features, you can see it uses the custom label
 
 ```py
 >>> dataset['train'].features
-{'text': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}
+{'text': Value(dtype='string'),
+ 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'])}
 ```
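For context, custom labels like these come from passing a `Features` object at load time — roughly like the following sketch; the CSV path is hypothetical:

```python
from datasets import ClassLabel, Features, Value, load_dataset

# Sketch: supply custom label names when loading a local file (path is hypothetical)
emotion_features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["sadness", "joy", "love", "anger", "fear", "surprise"]),
})
dataset = load_dataset("csv", data_files="train.csv", features=emotion_features)
```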

docs/source/package_reference/main_classes.mdx

Lines changed: 4 additions & 0 deletions
@@ -112,6 +112,8 @@ The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table.
 
 [[autodoc]] datasets.is_caching_enabled
 
+[[autodoc]] datasets.Column
+
 ## DatasetDict
 
 Dictionary with split names as keys ('train', 'test' for example), and `Dataset` objects as values.
@@ -200,6 +202,8 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
 - supervised_keys
 - version
 
+[[autodoc]] datasets.IterableColumn
+
 ## IterableDatasetDict
 
 Dictionary with split names as keys ('train', 'test' for example), and `IterableDataset` objects as values.
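The newly documented `datasets.Column` and `datasets.IterableColumn` are what column-name indexing returns; a rough sketch of how they might be used, with the dataset id assumed for illustration:

```python
from datasets import load_dataset

# Sketch: Column from a map-style dataset, IterableColumn from a streaming one
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
col = ds["text"]                  # datasets.Column
print(col[0])

ids = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
for text in ids["text"]:          # datasets.IterableColumn is iterable
    print(text)
    break
```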

docs/source/process.mdx

Lines changed: 22 additions & 16 deletions
@@ -223,21 +223,21 @@ The [`~Dataset.cast`] function transforms the feature type of one or more column
 
 ```py
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'idx': Value(dtype='int32', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'idx': Value(dtype='int32')}
 
 >>> from datasets import ClassLabel, Value
 >>> new_features = dataset.features.copy()
 >>> new_features["label"] = ClassLabel(names=["negative", "positive"])
 >>> new_features["idx"] = Value("int64")
 >>> dataset = dataset.cast(new_features)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['negative', 'positive'], id=None),
- 'idx': Value(dtype='int64', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['negative', 'positive']),
+ 'idx': Value(dtype='int64')}
 ```
 
 <Tip>
@@ -250,11 +250,11 @@ Use the [`~Dataset.cast_column`] function to change the feature type of a single
 
 ```py
 >>> dataset.features
-{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
+{'audio': Audio(sampling_rate=44100, mono=True)}
 
 >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
 >>> dataset.features
-{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
+{'audio': Audio(sampling_rate=16000, mono=True)}
 ```
 
 ### Flatten
@@ -265,11 +265,11 @@ Sometimes a column can be a nested structure of several types. Take a look at th
 >>> from datasets import load_dataset
 >>> dataset = load_dataset("rajpurkar/squad", split="train")
 >>> dataset.features
-{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
- 'context': Value(dtype='string', id=None),
- 'id': Value(dtype='string', id=None),
- 'question': Value(dtype='string', id=None),
- 'title': Value(dtype='string', id=None)}
+{'answers': Sequence(feature={'text': Value(dtype='string'), 'answer_start': Value(dtype='int32')}, length=-1),
+ 'context': Value(dtype='string'),
+ 'id': Value(dtype='string'),
+ 'question': Value(dtype='string'),
+ 'title': Value(dtype='string')}
 ```
 
 The `answers` field contains two subfields: `text` and `answer_start`. Use the [`~Dataset.flatten`] function to extract the subfields into their own separate columns:
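A short sketch of what that flattening step produces, using the SQuAD column names from the example above:

```python
from datasets import load_dataset

# Sketch: flatten the nested "answers" field into top-level columns
dataset = load_dataset("rajpurkar/squad", split="train")
flat = dataset.flatten()
print(flat.column_names)
# expected to include 'answers.text' and 'answers.answer_start' instead of a nested 'answers'
```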
@@ -810,12 +810,18 @@ The example below uses the [`pydub`](http://pydub.com/) package to open an audio
 
 Once your dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].
 
-Save your dataset by providing the name of the dataset repository on Hugging Face you wish to save it to to [`~IterableDataset.push_to_hub`]:
+Save your dataset by providing the name of the dataset repository on Hugging Face you wish to save it to to [`~Dataset.push_to_hub`]:
 
 ```python
 encoded_dataset.push_to_hub("username/my_dataset")
 ```
 
+You can use multiple processes to upload it in parallel. This is especially useful if you want to speed up the process:
+
+```python
+dataset.push_to_hub("username/my_dataset", num_proc=8)
+```
+
 Use the [`load_dataset`] function to reload the dataset (in streaming mode or not):
 
 ```python

docs/source/quickstart.mdx

Lines changed: 6 additions & 5 deletions
@@ -312,9 +312,9 @@ Use the [`~Dataset.map`] function to speed up processing by applying your tokeni
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0,
-'input_ids': array([ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102]),
-'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
-'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
+'input_ids': [ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 0, 0, ...],
+'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...],
+'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...]}
 ```
 
 **4**. Rename the `label` column to `labels`, which is the expected input name in [BertForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForSequenceClassification):
@@ -327,12 +327,13 @@ Use the [`~Dataset.map`] function to speed up processing by applying your tokeni
 
 <frameworkcontent>
 <pt>
-Use the [`~Dataset.set_format`] function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):
+Use the [`~Dataset.with_format`] function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):
 
 ```py
 >>> import torch
 
->>> dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
+>>> dataset = dataset.select_columns(["input_ids", "token_type_ids", "attention_mask", "labels"])
+>>> dataset = dataset.with_format(type="torch")
 >>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
 ```
 </pt>

docs/source/stream.mdx

Lines changed: 16 additions & 10 deletions
@@ -241,21 +241,21 @@ When you need to remove one or more columns, give [`IterableDataset.remove_colum
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train', streaming=True)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'idx': Value(dtype='int32', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'idx': Value(dtype='int32')}
 
 >>> from datasets import ClassLabel, Value
 >>> new_features = dataset.features.copy()
 >>> new_features["label"] = ClassLabel(names=['negative', 'positive'])
 >>> new_features["idx"] = Value('int64')
 >>> dataset = dataset.cast(new_features)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['negative', 'positive'], id=None),
- 'idx': Value(dtype='int64', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['negative', 'positive']),
+ 'idx': Value(dtype='int64')}
 ```
 
 <Tip>
@@ -268,11 +268,11 @@ Use [`IterableDataset.cast_column`] to change the feature type of just one colum
 
 ```py
 >>> dataset.features
-{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
+{'audio': Audio(sampling_rate=44100, mono=True)}
 
 >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
 >>> dataset.features
-{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
+{'audio': Audio(sampling_rate=16000, mono=True)}
 ```
 
 ## Map
@@ -517,6 +517,12 @@ Save your dataset by providing the name of the dataset repository on Hugging Fac
 dataset.push_to_hub("username/my_dataset")
 ```
 
+If the dataset consists of multiple shards (`dataset.num_shards > 1`), you can use multiple processes to upload it in parallel. This is especially useful if you applied `map()` or `filter()` steps since they will run faster in parallel:
+
+```python
+dataset.push_to_hub("username/my_dataset", num_proc=8)
+```
+
 Use the [`load_dataset`] function to reload the dataset:
 
 ```python

setup.py

Lines changed: 1 addition & 1 deletion
@@ -237,7 +237,7 @@
 
 setup(
     name="datasets",
-    version="3.6.0.dev0",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="4.0.0.dev0",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
     description="HuggingFace community-driven open-source library of datasets",
     long_description=open("README.md", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
