2 changes: 1 addition & 1 deletion docs/hub/datasets-adding.md
@@ -67,7 +67,7 @@ The rich features set in the `huggingface_hub` library allows you to manage repo

## Using other libraries

Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub.
Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/), [DuckDB](https://duckdb.org/), or [Daft](https://daft.ai/) can upload files to the Hub.
See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.

## Using Git
79 changes: 79 additions & 0 deletions docs/hub/datasets-daft.md
@@ -0,0 +1,79 @@
# Daft

[Daft](https://daft.ai/) is a high-performance data engine providing simple and reliable data processing for any modality and scale. Daft has native support for reading from and writing to Hugging Face datasets.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/daft_hf.png"/>
</div>


## Getting Started

To get started, install `daft` with the `huggingface` extra:

```bash
pip install 'daft[huggingface]'
```

## Read

Daft can read datasets directly from Hugging Face using the [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface) function or via the `hf://datasets/` protocol.

### Reading an Entire Dataset

Using [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface), you can easily read a Hugging Face dataset.


```python
import daft

df = daft.read_huggingface("username/dataset_name")
```

This will read the entire dataset into a DataFrame.
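
Daft DataFrames are lazily evaluated, so `read_huggingface` only builds a query plan; calling an action such as `show()` materializes a preview. A minimal sketch, with `username/dataset_name` as a placeholder repository:

```python
import daft

df = daft.read_huggingface("username/dataset_name")

# show() executes just enough of the lazy plan to display a preview
df.show(5)
```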

### Reading Specific Files

Not only can you read entire datasets, but you can also read individual files from a dataset. Using a read function that takes in a path (such as [`daft.read_parquet()`](https://docs.daft.ai/en/stable/api/io/#daft.read_parquet), [`daft.read_csv()`](https://docs.daft.ai/en/stable/api/io/#daft.read_csv), or [`daft.read_json()`](https://docs.daft.ai/en/stable/api/io/#daft.read_json)), specify a Hugging Face dataset path via the `hf://datasets/` prefix:

```python
import daft

# read a specific Parquet file
df = daft.read_parquet("hf://datasets/username/dataset_name/file_name.parquet")

# or a csv file
df = daft.read_csv("hf://datasets/username/dataset_name/file_name.csv")

# or a set of Parquet files using a glob pattern
df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet")
```
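
Because reads are lazy, you can chain projections and filters before materializing, and Daft only fetches the data the plan needs. A sketch assuming hypothetical `id` and `score` columns:

```python
import daft

df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet")

# Only the referenced columns (and matching rows) are materialized when
# the plan executes; "id" and "score" are placeholder column names
result = df.where(daft.col("score") > 0.5).select("id", "score").collect()
```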

## Write

Daft can write Parquet files to Hugging Face datasets using [`daft.DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface). Daft supports [Content-Defined Chunking](https://huggingface.co/blog/parquet-cdc) and [Xet](https://huggingface.co/blog/xet-on-the-hub) for faster, deduplicated writes.

Basic usage:

```python
import daft

df: daft.DataFrame = ...

df.write_huggingface("username/dataset_name")
```

See the [`DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface) API page for more info.
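
For a self-contained example, here is a sketch that builds a small DataFrame in memory and pushes it to a placeholder repo:

```python
import daft

# Toy in-memory data; "username/dataset_name" is a placeholder repo
df = daft.from_pydict({"id": [1, 2, 3], "text": ["foo", "bar", "baz"]})

df.write_huggingface("username/dataset_name")
```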

## Authentication

The `token` parameter in [`daft.io.HuggingFaceConfig`](https://docs.daft.ai/en/stable/api/config/#daft.io.HuggingFaceConfig) can be used to specify a Hugging Face access token for requests that require authentication (e.g. reading private datasets or writing to a dataset).

Example of reading a dataset with a specified token:

```python
import daft
from daft.io import IOConfig, HuggingFaceConfig

io_config = IOConfig(hf=HuggingFaceConfig(token="your_token"))
df = daft.read_parquet("hf://datasets/username/dataset_name", io_config=io_config)
```
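
The same configuration can be passed when writing. A sketch, assuming `write_huggingface` accepts an `io_config` argument like Daft's other I/O functions (see the API page above to confirm):

```python
import daft
from daft.io import IOConfig, HuggingFaceConfig

io_config = IOConfig(hf=HuggingFaceConfig(token="your_token"))

df = daft.from_pydict({"id": [1, 2, 3]})
# Assumes write_huggingface forwards io_config like the read APIs do
df.write_huggingface("username/private_dataset", io_config=io_config)
```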
3 changes: 2 additions & 1 deletion docs/hub/datasets-libraries.md
@@ -9,6 +9,7 @@ The table below summarizes the supported libraries and their level of integratio
| Library | Description | Download from Hub | Push to Hub |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------- | ----------- |
| [Argilla](./datasets-argilla) | Collaboration tool for AI engineers and domain experts that value high quality data. | ✅ | ✅ |
| [Daft](./datasets-daft) | Data engine for large scale, multimodal data processing with a Python-native interface. | ✅ | ✅ |
| [Dask](./datasets-dask) | Parallel and distributed computing library that scales the existing Python and PyData ecosystem. | ✅ | ✅ |
| [Datasets](./datasets-usage) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). | ✅ | ✅ |
| [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. | ✅ | ✅ |
@@ -87,7 +88,7 @@ Examples of this kind of integration:

#### Rely on an existing libraries integration with the Hub

Polars, Pandas, Dask, Spark and DuckDB all can write to a Hugging Face Hub repository. See [datasets libraries](https://huggingface.co/docs/hub/datasets-libraries) for more details.
Polars, Pandas, Dask, Spark, DuckDB, and Daft can all write to a Hugging Face Hub repository. See [datasets libraries](https://huggingface.co/docs/hub/datasets-libraries) for more details.

If you are already using one of these libraries in your code, adding the ability to push to the Hub is straightforward. For example, if you have a synthetic data generation library that can return a Pandas DataFrame, here is the code you would need to write to the Hub:
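
A minimal sketch of that pattern, assuming the target repo `username/my_dataset` already exists and `huggingface_hub` is installed so pandas can resolve `hf://` paths:

```python
import pandas as pd

# Stand-in for the DataFrame your generation library returns
df = pd.DataFrame({"prompt": ["Hello"], "completion": ["World"]})

# pandas writes through the hf:// filesystem provided by huggingface_hub
df.to_parquet("hf://datasets/username/my_dataset/data.parquet")
```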
