Spark enables real-time, large-scale data processing in a distributed environment.
In particular you can use `pyspark_huggingface` to access Hugging Face datasets repositories in PySpark.
## Installation
To be able to read and write to Hugging Face Datasets, you need to install the `pyspark_huggingface` library:
```
pip install pyspark_huggingface
```
This will also install required dependencies like `huggingface_hub` for authentication, and `pyarrow` for reading and writing datasets.
## Authentication
You can use the CLI for example:

```
huggingface-cli login
```
It's also possible to provide your Hugging Face token with the `HF_TOKEN` environment variable or by passing the `token` option to the Spark context builder.
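For example, a minimal sketch of the environment-variable approach (the token value is a placeholder):

```python
import os

# Assumption: setting HF_TOKEN in the environment before reading lets
# pyspark_huggingface authenticate through huggingface_hub.
os.environ["HF_TOKEN"] = "hf_xxx"  # placeholder token
```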
For more details about authentication, check out [this guide](https://huggingface.co/docs/huggingface_hub/quick-start#authentication).
## Enable the "huggingface" Data Source
PySpark 4 comes with a new Data Source API which allows using datasets from custom sources.

If `pyspark_huggingface` is installed, PySpark auto-imports it and enables the "huggingface" Data Source.

The library also backports the Data Source API for the "huggingface" Data Source to PySpark 3.5, 3.4 and 3.3.

However, in this case `pyspark_huggingface` should be imported explicitly to activate the backport and enable the "huggingface" Data Source:
```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```
## Read
The "huggingface" Data Source allows to read datasets from Hugging Face, using `pyarrow` under the hood to stream Arrow data.
47
+
This is compatible withall the dataset in [supported format](https://huggingface.co/docs/hub/datasets-adding#file-formats) on Hugging Face, like Parquet datasets.
For example, here is how to load the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset:
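A minimal sketch of what this load can look like (the Spark session setup is shown for completeness):

```python
import pyspark_huggingface  # required explicitly only on PySpark 3.x (see above)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hf-datasets").getOrCreate()

# Load the dataset with the "huggingface" Data Source
df = spark.read.format("huggingface").load("stanfordnlp/imdb")
df.show()
```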
In the next example, which uses a gated dataset, we use the `.format()` function to select the "huggingface" Data Source and `.load()` to load the dataset (more precisely, the config or subset named "7M" containing 7M samples). Then we compute the number of dialogues per language and filter the dataset.
After logging in to access the gated repository, we can run:
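The gated repository is not named in this excerpt, so the sketch below uses a placeholder repository id and assumes the "7M" subset is selected with a `config` option (reusing the `spark` session from the previous sketch):

```python
# "user/gated-dataset" is a placeholder; the real gated repository is not named here.
df_gated = (
    spark.read.format("huggingface")
    .option("config", "7M")  # assumption: the subset/config is selected via a "config" option
    .load("user/gated-dataset")
)
```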
This loads the dataset in a streaming fashion, and the output DataFrame has one partition per data file in the dataset to enable efficient distributed processing.

To compute the number of dialogues per language, we run this code that uses the `columns` option and a `groupBy()` operation.
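A sketch of what that can look like, reusing the placeholder repository id from above; the `language` column name and the JSON-encoded value of the `columns` option are assumptions:

```python
# Load only the column needed for the aggregation, then count dialogues per language.
df_language = (
    spark.read.format("huggingface")
    .option("config", "7M")
    .option("columns", '["language"]')  # assumption: a JSON-encoded list of column names
    .load("user/gated-dataset")
)
df_language.groupBy("language").count().show()
```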
The `columns` option is useful to load only the data we need, since PySpark doesn't enable predicate push-down with the Data Source API.

There is also a `filters` option to only load data with values within a certain range.

It is also possible to apply filters or remove columns on the loaded DataFrame, but it is more efficient to do it while loading, especially on Parquet datasets.

Indeed, Parquet contains metadata at the file and row group level, which allows skipping entire parts of the dataset that don't contain samples that satisfy the criteria. Columns in Parquet can also be loaded independently, which makes it possible to skip the excluded columns and avoid loading unnecessary data.
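For instance, a hedged sketch of a filtered load; the predicate format, column name, and value are assumptions:

```python
# Keep only rows matching a predicate at load time so that Parquet row groups
# that cannot match are skipped entirely.
df_filtered = (
    spark.read.format("huggingface")
    .option("config", "7M")
    .option("filters", '[("language", "=", "en")]')  # assumption: (column, op, value) tuples
    .load("user/gated-dataset")
)
```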
### Run SQL queries
Once you have your PySpark DataFrame ready, you can run SQL queries using `spark.sql`:
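The query itself is not included in this excerpt; here is a minimal sketch using the imdb DataFrame loaded earlier and assuming its standard `label` column:

```python
# Register the DataFrame as a temporary view so it can be referenced in SQL.
df.createOrReplaceTempView("imdb")

# Count rows per label (the "label" column name is an assumption).
spark.sql("SELECT label, COUNT(*) AS n FROM imdb GROUP BY label").show()
```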