
Commit bb66b6c

Fixes in docs (#7620)
* fixes in docs
* docstrings
1 parent 9dd00c4 commit bb66b6c

File tree

19 files changed: +118 additions, -65 deletions


docs/source/about_dataset_features.mdx

Lines changed: 9 additions & 9 deletions
@@ -10,10 +10,10 @@ Let's have a look at the features of the MRPC dataset from the GLUE benchmark:
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train')
 >>> dataset.features
-{'idx': Value(dtype='int32', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
+{'idx': Value(dtype='int32'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
 }
 ```

@@ -38,11 +38,11 @@ If your data type contains a list of objects, then you want to use the [`Sequenc
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('rajpurkar/squad', split='train')
 >>> dataset.features
-{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
- 'context': Value(dtype='string', id=None),
- 'id': Value(dtype='string', id=None),
- 'question': Value(dtype='string', id=None),
- 'title': Value(dtype='string', id=None)}
+{'answers': Sequence(feature={'text': Value(dtype='string'), 'answer_start': Value(dtype='int32')}, length=-1),
+ 'context': Value(dtype='string'),
+ 'id': Value(dtype='string'),
+ 'question': Value(dtype='string'),
+ 'title': Value(dtype='string')}
 ```

 The `answers` field is constructed using the [`Sequence`] feature because it contains two subfields, `text` and `answer_start`, which are lists of `string` and `int32`, respectively.
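For reference, declaring such a nested schema by hand looks roughly like the following — a minimal sketch using the field names from the SQuAD example above, not part of this commit:

```python
from datasets import Features, Sequence, Value

# Sketch: a SQuAD-like schema where "answers" holds lists of strings and ints
features = Features({
    "answers": Sequence({"text": Value("string"), "answer_start": Value("int32")}),
    "context": Value("string"),
    "id": Value("string"),
    "question": Value("string"),
    "title": Value("string"),
})
```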

docs/source/access.mdx

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ You can combine row and column name indexing to return a specific value at a pos
 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
 ```

-Indexing order doesn't matter. Indexing by the column name first returns a [`Column`] object that you can index as usual with row indices as usual:
+Indexing order doesn't matter. Indexing by the column name first returns a [`Column`] object that you can index as usual with row indices:

 ```py
 >>> import time
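To illustrate the [`Column`] behavior this change refers to, a minimal sketch — the dataset id is assumed here for illustration:

```python
from datasets import load_dataset

# Sketch: column-first indexing (dataset id assumed for illustration)
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

texts = dataset["text"]  # a Column object rather than a plain list
print(texts[0])          # indexing the Column by row gives the same value as dataset[0]["text"]
```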

docs/source/index.mdx

Lines changed: 2 additions & 2 deletions
@@ -2,9 +2,9 @@
 
 <img class="float-left !m-0 !border-0 !dark:border-0 !shadow-none !max-w-lg w-[150px]" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/datasets_logo.png"/>
 
-🤗 Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
+🤗 Datasets is a library for easily accessing and sharing AI datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
 
-Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the [Hugging Face Hub](https://huggingface.co/datasets), allowing you to easily load and share a dataset with the wider machine learning community.
+Load a dataset in a single line of code, and use our powerful data processing and streaming methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the [Hugging Face Hub](https://huggingface.co/datasets), allowing you to easily load and share a dataset with the wider machine learning community.
 
 Find your dataset today on the [Hugging Face Hub](https://huggingface.co/datasets), and take an in-depth look inside of it with the live viewer.

docs/source/load_hub.mdx

Lines changed: 2 additions & 2 deletions
@@ -20,8 +20,8 @@ Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 n
 
 # Inspect dataset features
 >>> ds_builder.info.features
-{'label': ClassLabel(names=['neg', 'pos'], id=None),
- 'text': Value(dtype='string', id=None)}
+{'label': ClassLabel(names=['neg', 'pos']),
+ 'text': Value(dtype='string')}
 ```

 If you're happy with the dataset, then load it with [`load_dataset`]:
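A minimal sketch of that final step — the dataset id is assumed from the Rotten Tomatoes example used in this guide:

```python
from datasets import load_dataset

# Sketch: load the split after inspecting it with the builder
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
print(dataset[0])
```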

docs/source/loading.mdx

Lines changed: 2 additions & 2 deletions
@@ -417,6 +417,6 @@ Now when you look at your dataset features, you can see it uses the custom label
 
 ```py
 >>> dataset['train'].features
-{'text': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}
+{'text': Value(dtype='string'),
+ 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'])}
 ```
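For context, custom labels like these come from passing a `Features` object at load time — roughly like the following sketch; the CSV path is hypothetical:

```python
from datasets import ClassLabel, Features, Value, load_dataset

# Sketch: supply custom label names when loading a local file (path is hypothetical)
emotion_features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["sadness", "joy", "love", "anger", "fear", "surprise"]),
})
dataset = load_dataset("csv", data_files="train.csv", features=emotion_features)
```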

docs/source/package_reference/main_classes.mdx

Lines changed: 4 additions & 0 deletions
@@ -112,6 +112,8 @@ The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table.
 
 [[autodoc]] datasets.is_caching_enabled
 
+[[autodoc]] datasets.Column
+
 ## DatasetDict
 
 Dictionary with split names as keys ('train', 'test' for example), and `Dataset` objects as values.
@@ -200,6 +202,8 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
 - supervised_keys
 - version
 
+[[autodoc]] datasets.IterableColumn
+
 ## IterableDatasetDict
 
 Dictionary with split names as keys ('train', 'test' for example), and `IterableDataset` objects as values.
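The newly documented `datasets.Column` and `datasets.IterableColumn` are what column-name indexing returns; a rough sketch of how they might be used, with the dataset id assumed for illustration:

```python
from datasets import load_dataset

# Sketch: Column from a map-style dataset, IterableColumn from a streaming one
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
col = ds["text"]                  # datasets.Column
print(col[0])

ids = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
for text in ids["text"]:          # datasets.IterableColumn is iterable
    print(text)
    break
```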

docs/source/process.mdx

Lines changed: 22 additions & 16 deletions
@@ -223,21 +223,21 @@ The [`~Dataset.cast`] function transforms the feature type of one or more column
 
 ```py
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'idx': Value(dtype='int32', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'idx': Value(dtype='int32')}
 
 >>> from datasets import ClassLabel, Value
 >>> new_features = dataset.features.copy()
 >>> new_features["label"] = ClassLabel(names=["negative", "positive"])
 >>> new_features["idx"] = Value("int64")
 >>> dataset = dataset.cast(new_features)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['negative', 'positive'], id=None),
- 'idx': Value(dtype='int64', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['negative', 'positive']),
+ 'idx': Value(dtype='int64')}
 ```
 
 <Tip>
@@ -250,11 +250,11 @@ Use the [`~Dataset.cast_column`] function to change the feature type of a single
 
 ```py
 >>> dataset.features
-{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
+{'audio': Audio(sampling_rate=44100, mono=True)}
 
 >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
 >>> dataset.features
-{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
+{'audio': Audio(sampling_rate=16000, mono=True)}
 ```
 
 ### Flatten
@@ -265,11 +265,11 @@ Sometimes a column can be a nested structure of several types. Take a look at th
 >>> from datasets import load_dataset
 >>> dataset = load_dataset("rajpurkar/squad", split="train")
 >>> dataset.features
-{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
- 'context': Value(dtype='string', id=None),
- 'id': Value(dtype='string', id=None),
- 'question': Value(dtype='string', id=None),
- 'title': Value(dtype='string', id=None)}
+{'answers': Sequence(feature={'text': Value(dtype='string'), 'answer_start': Value(dtype='int32')}, length=-1),
+ 'context': Value(dtype='string'),
+ 'id': Value(dtype='string'),
+ 'question': Value(dtype='string'),
+ 'title': Value(dtype='string')}
 ```
 
 The `answers` field contains two subfields: `text` and `answer_start`. Use the [`~Dataset.flatten`] function to extract the subfields into their own separate columns:
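A short sketch of what that flattening step produces, using the SQuAD column names from the example above:

```python
from datasets import load_dataset

# Sketch: flatten the nested "answers" field into top-level columns
dataset = load_dataset("rajpurkar/squad", split="train")
flat = dataset.flatten()
print(flat.column_names)
# expected to include 'answers.text' and 'answers.answer_start' instead of a nested 'answers'
```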
@@ -810,12 +810,18 @@ The example below uses the [`pydub`](http://pydub.com/) package to open an audio
 
 Once your dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].
 
-Save your dataset by providing the name of the dataset repository on Hugging Face you wish to save it to to [`~IterableDataset.push_to_hub`]:
+Save your dataset by providing the name of the dataset repository on Hugging Face you wish to save it to to [`~Dataset.push_to_hub`]:
 
 ```python
 encoded_dataset.push_to_hub("username/my_dataset")
 ```
 
+You can use multiple processes to upload it in parallel. This is especially useful if you want to speed up the process:
+
+```python
+dataset.push_to_hub("username/my_dataset", num_proc=8)
+```
+
 Use the [`load_dataset`] function to reload the dataset (in streaming mode or not):
 
 ```python

docs/source/quickstart.mdx

Lines changed: 6 additions & 5 deletions
@@ -312,9 +312,9 @@ Use the [`~Dataset.map`] function to speed up processing by applying your tokeni
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0,
-'input_ids': array([ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102]),
-'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
-'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
+'input_ids': [ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 0, 0, ...],
+'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...],
+'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...]}
 ```
 
 **4**. Rename the `label` column to `labels`, which is the expected input name in [BertForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForSequenceClassification):
@@ -327,12 +327,13 @@ Use the [`~Dataset.map`] function to speed up processing by applying your tokeni
 
 <frameworkcontent>
 <pt>
-Use the [`~Dataset.set_format`] function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):
+Use the [`~Dataset.with_format`] function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):
 
 ```py
 >>> import torch
 
->>> dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
+>>> dataset = dataset.select_columns(["input_ids", "token_type_ids", "attention_mask", "labels"])
+>>> dataset = dataset.with_format(type="torch")
 >>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
 ```
 </pt>

docs/source/stream.mdx

Lines changed: 16 additions & 10 deletions
@@ -241,21 +241,21 @@ When you need to remove one or more columns, give [`IterableDataset.remove_colum
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train', streaming=True)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'idx': Value(dtype='int32', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'idx': Value(dtype='int32')}
 
 >>> from datasets import ClassLabel, Value
 >>> new_features = dataset.features.copy()
 >>> new_features["label"] = ClassLabel(names=['negative', 'positive'])
 >>> new_features["idx"] = Value('int64')
 >>> dataset = dataset.cast(new_features)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['negative', 'positive'], id=None),
- 'idx': Value(dtype='int64', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['negative', 'positive']),
+ 'idx': Value(dtype='int64')}
 ```
 
 <Tip>
@@ -268,11 +268,11 @@ Use [`IterableDataset.cast_column`] to change the feature type of just one colum
 
 ```py
 >>> dataset.features
-{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
+{'audio': Audio(sampling_rate=44100, mono=True)}
 
 >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
 >>> dataset.features
-{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
+{'audio': Audio(sampling_rate=16000, mono=True)}
 ```
 
 ## Map
@@ -517,6 +517,12 @@ Save your dataset by providing the name of the dataset repository on Hugging Fac
 dataset.push_to_hub("username/my_dataset")
 ```
 
+If the dataset consists of multiple shards (`dataset.num_shards > 1`), you can use multiple processes to upload it in parallel. This is especially useful if you applied `map()` or `filter()` steps since they will run faster in parallel:
+
+```python
+dataset.push_to_hub("username/my_dataset", num_proc=8)
+```
+
 Use the [`load_dataset`] function to reload the dataset:
 
 ```python

setup.py

Lines changed: 1 addition & 1 deletion
@@ -237,7 +237,7 @@
 
 setup(
     name="datasets",
-    version="3.6.0.dev0",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="4.0.0.dev0",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
     description="HuggingFace community-driven open-source library of datasets",
     long_description=open("README.md", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
