## Write

You can write a PySpark DataFrame to Hugging Face with the "huggingface" Data Source.

It uploads Parquet files in parallel in a distributed manner, and only commits the files once they're all uploaded.

It works like this:
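```python
# A minimal sketch of the write path described above. The toy DataFrame and
# the "username/my_dataset" repository name are placeholders.
from pyspark.sql import SparkSession
import pyspark_huggingface  # importing registers the "huggingface" Data Source

spark = SparkSession.builder.appName("hf-write-example").getOrCreate()

# Build a small DataFrame for illustration, then push it as Parquet files
df = spark.createDataFrame([("hello",), ("world",)], ["text"])
df.write.format("huggingface").save("username/my_dataset")
```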
Here is how we can use it to write the filtered version of the [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) dataset back to Hugging Face.

First you need to [create a dataset repository](https://huggingface.co/new-dataset), e.g. `username/Infinity-Instruct-Chinese-Only` (you can set it to private if you want).
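If you prefer to create the repository from code, here is a sketch using `huggingface_hub` (it assumes you are already authenticated, e.g. with `huggingface-cli login`):

```python
from huggingface_hub import create_repo

# Create the dataset repository; exist_ok=True makes the call idempotent
create_repo(
    "username/Infinity-Instruct-Chinese-Only",
    repo_type="dataset",
    private=True,  # optional
    exist_ok=True,
)
```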
Then, once you are authenticated, you can use the "huggingface" Data Source, set the `mode` to "overwrite" (or "append" if you want to extend an existing dataset), and push to Hugging Face with `.save()`:
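```python
import pyspark_huggingface  # registers the "huggingface" Data Source

# Sketch of the call described above: `df_chinese_only` is assumed to be the
# filtered DataFrame from the read example earlier on this page, and the
# repository name should match the one you created.
df_chinese_only.write.format("huggingface").mode("overwrite").save("username/Infinity-Instruct-Chinese-Only")
```

Since the Parquet files are only committed once they have all been uploaded, an interrupted job leaves no partial dataset behind.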
You can launch the [Spark Notebooks](https://huggingface.co/spaces/Dataset-Tools/Spark-Notebooks) in Spaces to get Notebooks with PySpark and `pyspark_huggingface` pre-installed.

Click on "Launch Spark Notebooks", choose a name for your Space, select your hardware, and you are ready!