
Commit cfcdfce

More Parquet streaming docs (#7803)
* more parquet stream arg docs
* minor
* minor
1 parent 4e18df1 commit cfcdfce

2 files changed: +19 -2 lines changed

docs/source/stream.mdx

Lines changed: 16 additions & 1 deletion
@@ -19,7 +19,8 @@ For example, the English split of the [HuggingFaceFW/fineweb](https://huggingfac
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('HuggingFaceFW/fineweb', split='train', streaming=True)
 >>> print(next(iter(dataset)))
-{'text': "How AP reported in all formats from tornado-stricken regionsMarch 8, 2012\nWhen the first serious bout of tornadoes of 2012 blew through middle America in the middle of the night, they touched down in places hours from any AP bureau...
+{'text': 'How AP reported in all formats from tornado-stricken regionsMarch 8, 2012\nWhen the first serious bout of tornadoes of 2012 blew through middle America in the middle of the night, they touched down in places hours from any AP bureau...', ...,
+'language_score': 0.9721424579620361, 'token_count': 717}
 ```
 
 Dataset streaming also lets you work with a dataset made of local files without doing any conversion.
@@ -29,6 +30,7 @@ This is especially helpful when:
 - You don't want to wait for an extremely large local dataset to be converted to Arrow.
 - The converted files size would exceed the amount of available disk space on your computer.
 - You want to quickly explore just a few samples of a dataset.
+- You want to load only certain columns or efficiently filter a Parquet dataset.
 
 For example, you can stream a local dataset of hundreds of compressed JSONL files like [oscar-corpus/OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) to use it instantly:
 
@@ -40,6 +42,19 @@ For example, you can stream a local dataset of hundreds of compressed JSONL file
 {'id': 0, 'text': 'Founded in 2015, Golden Bees is a leading programmatic recruitment platform dedicated to employers, HR agencies and job boards. The company has developed unique HR-custom technologies and predictive algorithms to identify and attract the best candidates for a job opportunity.', ...
 ```
 
+Parquet is a columnar format that allows you to stream and load only a subset of columns and ignore unwanted columns. Parquet also stores metadata such as column statistics (at the file and row group level), enabling efficient filtering. Use the `columns` and `filters` arguments of [`datasets.packaged_modules.parquet.ParquetConfig`] to stream Parquet datasets, select columns, and apply filters:
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset('HuggingFaceFW/fineweb', split='train', streaming=True, columns=["url", "date"])
+>>> print(next(iter(dataset)))
+{'url': 'http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions', 'date': '2013-05-18T05:48:54Z'}
+>>> dataset = load_dataset('HuggingFaceFW/fineweb', split='train', streaming=True, filters=[("language_score", ">=", 0.99)])
+>>> print(next(iter(dataset)))
+{'text': 'Everyone wishes for something. And lots of people believe they know how to make their wishes come true with magical thinking.\nWhat is it? "Magical thinking is a belief in forms of causation, with no known physical basis," said Professor Emily Pronin of Princeton...', ...,
+'language_score': 0.9900368452072144, 'token_count': 716}
+```
+
 Loading a dataset in streaming mode creates a new dataset type instance (instead of the classic [`Dataset`] object), known as an [`IterableDataset`].
 This special type of dataset has its own set of processing methods shown below.
 
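The new docs show `filters` as a list of DNF tuples. The same predicate can be written as a composable `pyarrow` expression; the sketch below assumes `filters` also accepts a `pyarrow.dataset.Expression`, as pyarrow's own Parquet readers do, and the `token_count` bound is illustrative, not from this commit:

```py
# Sketch only: combine two predicates as a single pyarrow Expression.
import pyarrow.compute as pc

from datasets import load_dataset

# Parentheses are required: `&` binds tighter than the comparisons.
# Both predicates can be pushed down to Parquet row-group statistics.
expr = (pc.field("language_score") >= 0.99) & (pc.field("token_count") < 1000)

dataset = load_dataset(
    "HuggingFaceFW/fineweb",
    split="train",
    streaming=True,
    filters=expr,
)
```

Expressions avoid the nested-list DNF syntax when mixing AND and OR conditions.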
src/datasets/packaged_modules/parquet/parquet.py

Lines changed: 3 additions & 1 deletion
@@ -32,10 +32,12 @@ class ParquetConfig(datasets.BuilderConfig):
             If possible the predicate will be pushed down to exploit the partition information
             or internal metadata found in the data source, e.g. Parquet statistics.
             Otherwise filters the loaded RecordBatches before yielding them.
-        fragment_scan_options (`pyarrow.dataset.ParquetFragmentScanOptions`)
+        fragment_scan_options (`pyarrow.dataset.ParquetFragmentScanOptions`, *optional*)
             Scan-specific options for Parquet fragments.
             This is especially useful to configure buffering and caching.
 
+            <Added version="4.2.0"/>
+
     Example:
 
     Load a subset of columns:
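The newly tagged `fragment_scan_options` argument wraps `pyarrow.dataset.ParquetFragmentScanOptions`. A minimal sketch of the buffering hint in the docstring, assuming config kwargs are forwarded to `ParquetConfig` the same way `columns` and `filters` are above; the `pre_buffer` choice is illustrative, not from this commit:

```py
# Sketch only: tune Parquet scan buffering while streaming.
import pyarrow.dataset as ds

from datasets import load_dataset

# pre_buffer coalesces row-group reads into fewer, larger requests,
# which tends to help on high-latency remote storage.
scan_options = ds.ParquetFragmentScanOptions(pre_buffer=True)

dataset = load_dataset(
    "HuggingFaceFW/fineweb",
    split="train",
    streaming=True,
    fragment_scan_options=scan_options,
)
```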
