{'text': "How AP reported in all formats from tornado-stricken regionsMarch 8, 2012\nWhen the first serious bout of tornadoes of 2012 blew through middle America in the middle of the night, they touched down in places hours from any AP bureau...
+ {'text': 'How AP reported in all formats from tornado-stricken regionsMarch 8, 2012\nWhen the first serious bout of tornadoes of 2012 blew through middle America in the middle of the night, they touched down in places hours from any AP bureau...', ...,
Dataset streaming also lets you work with a dataset made of local files without doing any conversion.
@@ -29,6 +30,7 @@ This is especially helpful when:
- You don't want to wait for an extremely large local dataset to be converted to Arrow.
- The size of the converted files would exceed the amount of available disk space on your computer.
- You want to quickly explore just a few samples of a dataset.
+ - You want to load only certain columns or efficiently filter a Parquet dataset.
For example, you can stream a local dataset of hundreds of compressed JSONL files like [oscar-corpus/OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) to use it instantly:
@@ -40,6 +42,19 @@ For example, you can stream a local dataset of hundreds of compressed JSONL file
{'id': 0, 'text': 'Founded in 2015, Golden Bees is a leading programmatic recruitment platform dedicated to employers, HR agencies and job boards. The company has developed unique HR-custom technologies and predictive algorithms to identify and attract the best candidates for a job opportunity.', ...
```
+ Parquet is a columnar format that lets you stream and load only a subset of columns, ignoring the rest. Parquet also stores metadata such as column statistics (at the file and row group level), which enables efficient filtering. Use the `columns` and `filters` arguments of [`datasets.packaged_modules.parquet.ParquetConfig`] to stream Parquet datasets, select columns, and apply filters:
{'text': 'Everyone wishes for something. And lots of people believe they know how to make their wishes come true with magical thinking.\nWhat is it? "Magical thinking is a belief in forms of causation, with no known physical basis," said Professor Emily Pronin of Princeton...', ...,