
Commit a7bc6b4

Authored by kevinzwang, with davanstrien and pcuenca

Add Daft to list of integrated libraries for datasets (#1892)

* Add Daft to list of integrated libraries for datasets
* Update docs/hub/datasets-daft.md
* fix typo

Co-authored-by: Daniel van Strien <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
1 parent 5a75b73 commit a7bc6b4

File tree

4 files changed: +84 −2 lines changed

docs/hub/_toctree.yml

Lines changed: 2 additions & 0 deletions

@@ -175,6 +175,8 @@
 sections:
 - local: datasets-argilla
   title: Argilla
+- local: datasets-daft
+  title: Daft
 - local: datasets-dask
   title: Dask
 - local: datasets-usage

docs/hub/datasets-adding.md

Lines changed: 1 addition & 1 deletion

@@ -67,7 +67,7 @@ The rich features set in the `huggingface_hub` library allows you to manage repo
 ## Using other libraries
 
-Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub.
+Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/), [DuckDB](https://duckdb.org/), or [Daft](https://daft.ai/) can upload files to the Hub.
 See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.
 
 ## Using Git

docs/hub/datasets-daft.md (new file)

Lines changed: 79 additions & 0 deletions
# Daft

[Daft](https://daft.ai/) is a high-performance data engine providing simple and reliable data processing for any modality and scale. Daft has native support for reading from and writing to Hugging Face datasets.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/daft_hf.png"/>
</div>
## Getting Started

To get started, pip install `daft` with the `huggingface` feature:

```bash
pip install 'daft[huggingface]'
```
## Read

Daft is able to read datasets directly from the Hugging Face Hub using the [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface) function or via the `hf://datasets/` protocol.

### Reading an Entire Dataset

Using [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface), you can easily load a dataset:

```python
import daft

df = daft.read_huggingface("username/dataset_name")
```

This will read the entire dataset into a DataFrame.
### Reading Specific Files

Not only can you read entire datasets, but you can also read individual files from a dataset repository. Using a read function that takes in a path (such as [`daft.read_parquet()`](https://docs.daft.ai/en/stable/api/io/#daft.read_parquet), [`daft.read_csv()`](https://docs.daft.ai/en/stable/api/io/#daft.read_csv), or [`daft.read_json()`](https://docs.daft.ai/en/stable/api/io/#daft.read_json)), specify a Hugging Face dataset path via the `hf://datasets/` prefix:

```python
import daft

# read a specific Parquet file
df = daft.read_parquet("hf://datasets/username/dataset_name/file_name.parquet")

# or a CSV file
df = daft.read_csv("hf://datasets/username/dataset_name/file_name.csv")

# or a set of Parquet files using a glob pattern
df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet")
```
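All of these paths follow the same layout: `hf://datasets/{namespace}/{repo_name}/{path_within_repo}`. As a small illustration (the `hf_dataset_path` helper below is hypothetical, not part of Daft's API), the layout can be sketched as:

```python
# Hypothetical helper illustrating the hf://datasets/ path layout:
# hf://datasets/{namespace}/{repo_name}/{path_within_repo}
def hf_dataset_path(namespace: str, repo_name: str, path: str = "") -> str:
    base = f"hf://datasets/{namespace}/{repo_name}"
    return f"{base}/{path}" if path else base

# A glob pattern selecting every Parquet file in the repository:
print(hf_dataset_path("username", "dataset_name", "**/*.parquet"))
# hf://datasets/username/dataset_name/**/*.parquet
```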
## Write

Daft is able to write Parquet files to a Hugging Face dataset repository using [`daft.DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface). Daft supports [Content-Defined Chunking](https://huggingface.co/blog/parquet-cdc) and [Xet](https://huggingface.co/blog/xet-on-the-hub) for faster, deduplicated writes.

Basic usage:

```python
import daft

df: daft.DataFrame = ...

df.write_huggingface("username/dataset_name")
```
See the [`DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface) API page for more info.
## Authentication

The `token` parameter in [`daft.io.HuggingFaceConfig`](https://docs.daft.ai/en/stable/api/config/#daft.io.HuggingFaceConfig) can be used to specify a Hugging Face access token for requests that require authentication (e.g. reading private dataset repositories or writing to a dataset repository).

Example of loading a dataset with a specified token:

```python
import daft
from daft.io import IOConfig, HuggingFaceConfig

io_config = IOConfig(hf=HuggingFaceConfig(token="your_token"))
df = daft.read_parquet("hf://datasets/username/dataset_name", io_config=io_config)
```
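Rather than hard-coding the token string in source, you may prefer to read it from the environment; `HF_TOKEN` is the variable name conventionally used across the Hugging Face ecosystem. A minimal sketch of that pattern (the resulting value would be passed to `HuggingFaceConfig(token=...)` as above):

```python
import os

# Read the Hugging Face access token from the environment rather than
# embedding it in source code. Returns None when the variable is unset,
# in which case requests proceed unauthenticated.
token = os.environ.get("HF_TOKEN")

if token is None:
    print("No HF_TOKEN set; private repositories will not be accessible.")
```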

docs/hub/datasets-libraries.md

Lines changed: 2 additions & 1 deletion

@@ -9,6 +9,7 @@ The table below summarizes the supported libraries and their level of integration
 | Library | Description | Download from Hub | Push to Hub |
 | ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------- | ----------- |
 | [Argilla](./datasets-argilla) | Collaboration tool for AI engineers and domain experts that value high quality data. | | |
+| [Daft](./datasets-daft) | Data engine for large scale, multimodal data processing with a Python-native interface. | | |
 | [Dask](./datasets-dask) | Parallel and distributed computing library that scales the existing Python and PyData ecosystem. | | |
 | [Datasets](./datasets-usage) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). | | |
 | [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. | | |

@@ -87,7 +88,7 @@ Examples of this kind of integration:
 
 #### Rely on an existing library's integration with the Hub
 
-Polars, Pandas, Dask, Spark and DuckDB all can write to a Hugging Face Hub repository. See [datasets libraries](https://huggingface.co/docs/hub/datasets-libraries) for more details.
+Polars, Pandas, Dask, Spark, DuckDB, and Daft can all write to a Hugging Face Hub repository. See [datasets libraries](https://huggingface.co/docs/hub/datasets-libraries) for more details.
 
 If you are already using one of these libraries in your code, adding the ability to push to the Hub is straightforward. For example, if you have a synthetic data generation library that can return a Pandas DataFrame, here is the code you would need to write to the Hub:
