
Commit cfcdfce

More Parquet streaming docs (#7803)
* more parquet stream arg docs
* minor
* minor
1 parent 4e18df1 commit cfcdfce

2 files changed: +19 -2 lines changed

docs/source/stream.mdx

Lines changed: 16 additions & 1 deletion
@@ -19,7 +19,8 @@ For example, the English split of the [HuggingFaceFW/fineweb](https://huggingfac
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('HuggingFaceFW/fineweb', split='train', streaming=True)
 >>> print(next(iter(dataset)))
-{'text': "How AP reported in all formats from tornado-stricken regionsMarch 8, 2012\nWhen the first serious bout of tornadoes of 2012 blew through middle America in the middle of the night, they touched down in places hours from any AP bureau...
+{'text': 'How AP reported in all formats from tornado-stricken regionsMarch 8, 2012\nWhen the first serious bout of tornadoes of 2012 blew through middle America in the middle of the night, they touched down in places hours from any AP bureau...', ...,
+'language_score': 0.9721424579620361, 'token_count': 717}
 ```
 
 Dataset streaming also lets you work with a dataset made of local files without doing any conversion.
@@ -29,6 +30,7 @@ This is especially helpful when:
 - You don't want to wait for an extremely large local dataset to be converted to Arrow.
 - The converted files size would exceed the amount of available disk space on your computer.
 - You want to quickly explore just a few samples of a dataset.
+- You want to load only certain columns or efficiently filter a Parquet dataset.
 
 For example, you can stream a local dataset of hundreds of compressed JSONL files like [oscar-corpus/OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) to use it instantly:
 
@@ -40,6 +42,19 @@ For example, you can stream a local dataset of hundreds of compressed JSONL file
 {'id': 0, 'text': 'Founded in 2015, Golden Bees is a leading programmatic recruitment platform dedicated to employers, HR agencies and job boards. The company has developed unique HR-custom technologies and predictive algorithms to identify and attract the best candidates for a job opportunity.', ...
 ```
 
+Parquet is a columnar format that allows you to stream and load only a subset of columns and ignore unwanted columns. Parquet also stores metadata such as column statistics (at the file and row group level), enabling efficient filtering. Use the `columns` and `filters` arguments of [`datasets.packaged_modules.parquet.ParquetConfig`] to stream Parquet datasets, select columns, and apply filters:
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset('HuggingFaceFW/fineweb', split='train', streaming=True, columns=["url", "date"])
+>>> print(next(iter(dataset)))
+{'url': 'http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions', 'date': '2013-05-18T05:48:54Z'}
+>>> dataset = load_dataset('HuggingFaceFW/fineweb', split='train', streaming=True, filters=[("language_score", ">=", 0.99)])
+>>> print(next(iter(dataset)))
+{'text': 'Everyone wishes for something. And lots of people believe they know how to make their wishes come true with magical thinking.\nWhat is it? "Magical thinking is a belief in forms of causation, with no known physical basis," said Professor Emily Pronin of Princeton...', ...,
+'language_score': 0.9900368452072144, 'token_count': 716}
+```
+
 Loading a dataset in streaming mode creates a new dataset type instance (instead of the classic [`Dataset`] object), known as an [`IterableDataset`].
 This special type of dataset has its own set of processing methods shown below.
 
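The new docs show `filters` as a list of DNF tuples. The same predicate can be written as a composable `pyarrow` expression; the sketch below assumes `filters` also accepts a `pyarrow.dataset.Expression`, as pyarrow's own Parquet readers do, and the `token_count` bound is illustrative, not from this commit:

```py
# Sketch only: combine two predicates as a single pyarrow Expression.
import pyarrow.compute as pc

from datasets import load_dataset

# Parentheses are required: `&` binds tighter than the comparisons.
# Both predicates can be pushed down to Parquet row-group statistics.
expr = (pc.field("language_score") >= 0.99) & (pc.field("token_count") < 1000)

dataset = load_dataset(
    "HuggingFaceFW/fineweb",
    split="train",
    streaming=True,
    filters=expr,
)
```

Expressions avoid the nested-list DNF syntax when mixing AND and OR conditions.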
src/datasets/packaged_modules/parquet/parquet.py

Lines changed: 3 additions & 1 deletion
@@ -32,10 +32,12 @@ class ParquetConfig(datasets.BuilderConfig):
             If possible the predicate will be pushed down to exploit the partition information
             or internal metadata found in the data source, e.g. Parquet statistics.
             Otherwise filters the loaded RecordBatches before yielding them.
-        fragment_scan_options (`pyarrow.dataset.ParquetFragmentScanOptions`)
+        fragment_scan_options (`pyarrow.dataset.ParquetFragmentScanOptions`, *optional*)
             Scan-specific options for Parquet fragments.
             This is especially useful to configure buffering and caching.
 
+            <Added version="4.2.0"/>
+
     Example:
 
     Load a subset of columns:
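The newly tagged `fragment_scan_options` argument wraps `pyarrow.dataset.ParquetFragmentScanOptions`. A minimal sketch of the buffering hint in the docstring, assuming config kwargs are forwarded to `ParquetConfig` the same way `columns` and `filters` are above; the `pre_buffer` choice is illustrative, not from this commit:

```py
# Sketch only: tune Parquet scan buffering while streaming.
import pyarrow.dataset as ds

from datasets import load_dataset

# pre_buffer coalesces row-group reads into fewer, larger requests,
# which tends to help on high-latency remote storage.
scan_options = ds.ParquetFragmentScanOptions(pre_buffer=True)

dataset = load_dataset(
    "HuggingFaceFW/fineweb",
    split="train",
    streaming=True,
    fragment_scan_options=scan_options,
)
```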
