
Commit a7bc6b4

Authored by kevinzwang, with davanstrien and pcuenca

Add Daft to list of integrated libraries for datasets (#1892)

* Add Daft to list of integrated libraries for datasets
* Update docs/hub/datasets-daft.md
* fix typo

Co-authored-by: Daniel van Strien <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
1 parent 5a75b73 commit a7bc6b4

File tree

4 files changed: +84 −2 lines changed

docs/hub/_toctree.yml

Lines changed: 2 additions & 0 deletions

@@ -175,6 +175,8 @@
 sections:
 - local: datasets-argilla
   title: Argilla
+- local: datasets-daft
+  title: Daft
 - local: datasets-dask
   title: Dask
 - local: datasets-usage

docs/hub/datasets-adding.md

Lines changed: 1 addition & 1 deletion

@@ -67,7 +67,7 @@ The rich features set in the `huggingface_hub` library allows you to manage repo
 ## Using other libraries
 
-Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub.
+Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/), [DuckDB](https://duckdb.org/), or [Daft](https://daft.ai/) can upload files to the Hub.
 See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.
 
 ## Using Git

docs/hub/datasets-daft.md (new file)

Lines changed: 79 additions & 0 deletions
# Daft

[Daft](https://daft.ai/) is a high-performance data engine providing simple and reliable data processing for any modality and scale. Daft has native support for reading from and writing to Hugging Face datasets.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/daft_hf.png"/>
</div>
## Getting Started

To get started, pip install `daft` with the `huggingface` feature:

```bash
pip install 'daft[huggingface]'
```
## Read

Daft is able to read datasets directly from the Hugging Face Hub using the [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface) function or via the `hf://datasets/` protocol.

### Reading an Entire Dataset

Using [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface), you can easily load a dataset:

```python
import daft

df = daft.read_huggingface("username/dataset_name")
```

This will read the entire dataset into a DataFrame.
### Reading Specific Files

Not only can you read entire datasets, but you can also read individual files from a dataset repository. Using a read function that takes in a path (such as [`daft.read_parquet()`](https://docs.daft.ai/en/stable/api/io/#daft.read_parquet), [`daft.read_csv()`](https://docs.daft.ai/en/stable/api/io/#daft.read_csv), or [`daft.read_json()`](https://docs.daft.ai/en/stable/api/io/#daft.read_json)), specify a Hugging Face dataset path via the `hf://datasets/` prefix:

```python
import daft

# read a specific Parquet file
df = daft.read_parquet("hf://datasets/username/dataset_name/file_name.parquet")

# or a CSV file
df = daft.read_csv("hf://datasets/username/dataset_name/file_name.csv")

# or a set of Parquet files using a glob pattern
df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet")
```
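All of these paths follow the same layout: `hf://datasets/{namespace}/{repo_name}/{path_within_repo}`. As a small illustration (the `hf_dataset_path` helper below is hypothetical, not part of Daft's API), the layout can be sketched as:

```python
# Hypothetical helper illustrating the hf://datasets/ path layout:
# hf://datasets/{namespace}/{repo_name}/{path_within_repo}
def hf_dataset_path(namespace: str, repo_name: str, path: str = "") -> str:
    base = f"hf://datasets/{namespace}/{repo_name}"
    return f"{base}/{path}" if path else base

# A glob pattern selecting every Parquet file in the repository:
print(hf_dataset_path("username", "dataset_name", "**/*.parquet"))
# hf://datasets/username/dataset_name/**/*.parquet
```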
## Write

Daft is able to write Parquet files to a Hugging Face dataset repository using [`daft.DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface). Daft supports [Content-Defined Chunking](https://huggingface.co/blog/parquet-cdc) and [Xet](https://huggingface.co/blog/xet-on-the-hub) for faster, deduplicated writes.

Basic usage:

```python
import daft

df: daft.DataFrame = ...

df.write_huggingface("username/dataset_name")
```
See the [`DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface) API page for more info.
## Authentication

The `token` parameter in [`daft.io.HuggingFaceConfig`](https://docs.daft.ai/en/stable/api/config/#daft.io.HuggingFaceConfig) can be used to specify a Hugging Face access token for requests that require authentication (e.g. reading private dataset repositories or writing to a dataset repository).

Example of loading a dataset with a specified token:

```python
import daft
from daft.io import IOConfig, HuggingFaceConfig

io_config = IOConfig(hf=HuggingFaceConfig(token="your_token"))
df = daft.read_parquet("hf://datasets/username/dataset_name", io_config=io_config)
```
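Rather than hard-coding the token string in source, you may prefer to read it from the environment; `HF_TOKEN` is the variable name conventionally used across the Hugging Face ecosystem. A minimal sketch of that pattern (the resulting value would be passed to `HuggingFaceConfig(token=...)` as above):

```python
import os

# Read the Hugging Face access token from the environment rather than
# embedding it in source code. Returns None when the variable is unset,
# in which case requests proceed unauthenticated.
token = os.environ.get("HF_TOKEN")

if token is None:
    print("No HF_TOKEN set; private repositories will not be accessible.")
```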

docs/hub/datasets-libraries.md

Lines changed: 2 additions & 1 deletion

@@ -9,6 +9,7 @@ The table below summarizes the supported libraries and their level of integration
 | Library | Description | Download from Hub | Push to Hub |
 | ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------- | ----------- |
 | [Argilla](./datasets-argilla) | Collaboration tool for AI engineers and domain experts that value high quality data. | | |
+| [Daft](./datasets-daft) | Data engine for large scale, multimodal data processing with a Python-native interface. | | |
 | [Dask](./datasets-dask) | Parallel and distributed computing library that scales the existing Python and PyData ecosystem. | | |
 | [Datasets](./datasets-usage) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). | | |
 | [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. | | |

@@ -87,7 +88,7 @@ Examples of this kind of integration:
 
 #### Rely on an existing library's integration with the Hub
 
-Polars, Pandas, Dask, Spark and DuckDB all can write to a Hugging Face Hub repository. See [datasets libraries](https://huggingface.co/docs/hub/datasets-libraries) for more details.
+Polars, Pandas, Dask, Spark, DuckDB, and Daft can all write to a Hugging Face Hub repository. See [datasets libraries](https://huggingface.co/docs/hub/datasets-libraries) for more details.
 
 If you are already using one of these libraries in your code, adding the ability to push to the Hub is straightforward. For example, if you have a synthetic data generation library that can return a Pandas DataFrame, here is the code you would need to write to the Hub:
