# Daft

[Daft](https://daft.ai/) is a high-performance data engine that provides simple and reliable data processing for any modality and at any scale. Daft has native support for reading from and writing to Hugging Face datasets.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/daft_hf.png"/>
</div>

## Getting Started

To get started, install `daft` with the `huggingface` extra:

```bash
pip install 'daft[huggingface]'
```
## Read

Daft can read datasets directly from the Hugging Face Hub using the [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface) function or via the `hf://datasets/` protocol.

### Reading an Entire Dataset

Using [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface), you can easily load a dataset:

```python
import daft

df = daft.read_huggingface("username/dataset_name")
```

This reads the entire dataset into a DataFrame.
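Daft evaluates DataFrames lazily, so `read_huggingface` returns quickly and rows are only fetched when an action runs. A minimal sketch of working with the result (the repo name and the `score` column are placeholders, not a real dataset):

```python
import daft

# "username/dataset_name" is a placeholder for any Hub dataset repo
df = daft.read_huggingface("username/dataset_name")

# nothing is materialized until an action such as show() runs
df.show()

# transformations compose lazily on top of the read
df.where(df["score"] > 0.5).show()
```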
### Reading Specific Files

Not only can you read entire datasets, but you can also read individual files from a dataset repository. Using a read function that takes in a path (such as [`daft.read_parquet()`](https://docs.daft.ai/en/stable/api/io/#daft.read_parquet), [`daft.read_csv()`](https://docs.daft.ai/en/stable/api/io/#daft.read_csv), or [`daft.read_json()`](https://docs.daft.ai/en/stable/api/io/#daft.read_json)), specify a Hugging Face dataset path via the `hf://datasets/` prefix:

```python
import daft

# read a specific Parquet file
df = daft.read_parquet("hf://datasets/username/dataset_name/file_name.parquet")

# or a CSV file
df = daft.read_csv("hf://datasets/username/dataset_name/file_name.csv")

# or a set of Parquet files using a glob pattern
df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet")
```
## Write

Daft can write Parquet files to a Hugging Face dataset repository using [`daft.DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface). Daft supports [Content-Defined Chunking](https://huggingface.co/blog/parquet-cdc) and [Xet](https://huggingface.co/blog/xet-on-the-hub) for faster, deduplicated writes.

Basic usage:

```python
import daft

df: daft.DataFrame = ...

df.write_huggingface("username/dataset_name")
```

See the [`DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface) API page for more info.
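For a runnable end-to-end sketch, here is a small DataFrame built with `daft.from_pydict` and written to the Hub. The repo name is a placeholder, and this assumes the target dataset repository already exists and your credentials allow writes to it:

```python
import daft

# build a small in-memory DataFrame
df = daft.from_pydict({
    "id": [1, 2, 3],
    "text": ["foo", "bar", "baz"],
})

# "username/dataset_name" is a placeholder; assumes the dataset repo
# already exists and your Hugging Face token has write access
df.write_huggingface("username/dataset_name")
```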
## Authentication

The `token` parameter in [`daft.io.HuggingFaceConfig`](https://docs.daft.ai/en/stable/api/config/#daft.io.HuggingFaceConfig) can be used to specify a Hugging Face access token for requests that require authentication (e.g. reading private dataset repositories or writing to a dataset repository).

Example of loading a dataset with a specified token:

```python
import daft
from daft.io import IOConfig, HuggingFaceConfig

io_config = IOConfig(hf=HuggingFaceConfig(token="your_token"))
df = daft.read_parquet("hf://datasets/username/dataset_name", io_config=io_config)
```
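Rather than threading `io_config` through every call, you can register it as a session-wide default. This sketch assumes the token is stored in an `HF_TOKEN` environment variable (an illustrative choice, not something Daft requires) and uses `daft.set_planning_config` to set the default IO config:

```python
import os

import daft
from daft.io import IOConfig, HuggingFaceConfig

# register the token once as the session-wide default IO config
io_config = IOConfig(hf=HuggingFaceConfig(token=os.environ["HF_TOKEN"]))
daft.set_planning_config(default_io_config=io_config)

# subsequent reads and writes pick up the default config automatically
df = daft.read_huggingface("username/private_dataset")
```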