6 changes: 4 additions & 2 deletions docs/SUMMARY.md
@@ -14,8 +14,10 @@
 * [Images](modalities/images.md)
 * [Audio](modalities/audio.md)
 * [Videos](modalities/videos.md)
+* [Documents](modalities/documents.md)
 * [JSON and Nested Data](modalities/json.md)
-* [URLs and Files](modalities/urls.md)
+* [Files and URLs](modalities/files.md)
+* [Embeddings](modalities/embeddings.md)
 * [Custom Modalities](modalities/custom.md)
 * Scale Custom Python Code
 * [New UDF Overview](custom-code/index.md)
@@ -90,7 +92,7 @@
 * [User-Defined Functions](api/udf.md)
 * Data Types
 * [DataType](api/datatypes/all_datatypes.md)
-* [daft.File Types](api/datatypes/daft_file_types.md)
+* [File Types](api/datatypes/file_types.md)
 * [Type Conversions](api/datatypes/type_conversions.md)
 * [Casting](api/datatypes/casting.md)
 * [Window](api/window.md)
13 changes: 0 additions & 13 deletions docs/api/datatypes/daft_file_types.md

This file was deleted.

13 changes: 13 additions & 0 deletions docs/api/datatypes/file_types.md
@@ -0,0 +1,13 @@
The `File` DataType provides first-class support for handling file data across local and remote storage, enabling seamless file operations in distributed environments.

::: daft.file.File
    options:
        filters: ["!^_"]

::: daft.file.AudioFile
    options:
        filters: ["!^_"]

::: daft.file.VideoFile
    options:
        filters: ["!^_"]
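
As a quick illustration, here is a minimal sketch of wrapping a column of path strings into `File` values with the `file` function (the paths below are hypothetical):

```python
import daft
from daft import col
from daft.functions import file

# Wrap a string column of local or remote paths into daft.File values,
# which downstream functions (e.g. prompt) can consume directly.
df = daft.from_pydict({"path": ["s3://my-bucket/clip.mp4", "data/podcast.mp3"]})
df = df.with_column("f", file(col("path")))
```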
18 changes: 9 additions & 9 deletions docs/examples/document-processing.md
@@ -803,9 +803,9 @@ print(df.schema())

 #### Explaining Structure Access Expressions
 
-Note that we're using [`.struct`](../api/expressions.md#daft.expressions.struct) to construct an expression that allows Daft to extract individual field values from our complex document structure.
+Note that we're using `col("indexed_texts")["text"]` to construct an expression that allows Daft to extract individual field values from our complex document structure.
 
-When write `col("text_blocks").struct.get("bounding_box")`, we're telling Daft that we want to access the `bounding_box` field of each element from the `text_blocks` column. From this, we can provide additional field-selecting logic (e.g. `["x"]` to get the value for field `x` on the `bounding_box` value from each structure in `text_blocks`).
+When we write `col("text_blocks").struct.get("bounding_box")`, we're telling Daft that we want to access the `bounding_box` field of each element from the `text_blocks` column. From this, we can provide additional field-selecting logic (e.g. `["x"]` to get the value for field `x` on the `bounding_box` value from each structure in `text_blocks`).
 
 The last part of our text box processing step is to extract the text and bounding box coordinates into their own columns. We also want to preserve the reading order index as its own column too.

@@ -815,14 +815,14 @@ This format makes it easier to form follow up queries on our data, such as:

 ```python
 df = (
-    df.with_column("text_blocks", col("indexed_texts").struct.get("text"))
-    .with_column("reading_order_index", col("indexed_texts").struct.get("index"))
+    df.with_column("text_blocks", col("indexed_texts")["text"])
+    .with_column("reading_order_index", col("indexed_texts")["index"])
     .exclude("indexed_texts")
-    .with_column("text", col("text_blocks").struct.get("text"))
-    .with_column("x", col("text_blocks").struct.get("bounding_box")["x"])
-    .with_column("y", col("text_blocks").struct.get("bounding_box")["y"])
-    .with_column("h", col("text_blocks").struct.get("bounding_box")["h"])
-    .with_column("w", col("text_blocks").struct.get("bounding_box")["w"])
+    .with_column("text", col("text_blocks")["text"])
+    .with_column("x", col("text_blocks")["bounding_box"]["x"])
+    .with_column("y", col("text_blocks")["bounding_box"]["y"])
+    .with_column("h", col("text_blocks")["bounding_box"]["h"])
+    .with_column("w", col("text_blocks")["bounding_box"]["w"])
     .exclude("text_blocks")
 )
 print(df.schema())
87 changes: 64 additions & 23 deletions docs/modalities/audio.md

Large diffs are not rendered by default.

282 changes: 282 additions & 0 deletions docs/modalities/documents.md

Large diffs are not rendered by default.

158 changes: 158 additions & 0 deletions docs/modalities/embeddings.md
@@ -0,0 +1,158 @@
# Working with Embeddings

Embeddings transform text, images, and other data into dense vector representations that capture semantic meaning—enabling similarity search, retrieval-augmented generation (RAG), and AI-powered discovery. Daft makes it easy to generate, store, and query embeddings at scale.

With the native [`daft.DataType.embedding`](../api/datatypes/embedding.md) type and [`embed_text`](../api/functions/embed_text.md) function, you can:

- **Generate embeddings** from any text column using providers like OpenAI, Cohere, or local models
- **Compute similarity** with built-in distance functions like `cosine_distance`
- **Build search pipelines** that scale from local development to distributed clusters
- **Write to vector databases** like Turbopuffer, Pinecone, or LanceDB

## Semantic Search Example

The following example creates a simple semantic search pipeline—embedding documents, comparing them to a query, and ranking by similarity:

```python
import daft
from daft.functions import embed_text, cosine_distance

# Create a knowledge base with documents
documents = daft.from_pydict(
    {
        "text": [
            "Python is a high-level programming language",
            "Machine learning models require training data",
            "Daft is a distributed dataframe library",
            "Embeddings capture semantic meaning of text",
        ],
    }
)

# Embed all documents
documents = documents.with_column(
    "embedding",
    embed_text(
        daft.col("text"),
        provider="openai",
        model="text-embedding-3-small",
    ),
)

# Create a query
query = daft.from_pydict({"query_text": ["What is Daft?"]})

# Embed the query
query = query.with_column(
    "query_embedding",
    embed_text(
        daft.col("query_text"),
        provider="openai",
        model="text-embedding-3-small",
    ),
)

# Cross join to compare query against all documents
results = query.join(documents, how="cross")

# Calculate cosine distance (lower is more similar)
results = results.with_column(
    "distance", cosine_distance(daft.col("query_embedding"), daft.col("embedding"))
)

# Sort by distance and show top results
results = results.sort("distance").select("query_text", "text", "distance", "embedding")
results.show()
```

```{title="Output"}
╭───────────────┬────────────────────────────────┬────────────────────┬──────────────────────────╮
│ query_text ┆ text ┆ distance ┆ embedding │
│ --- ┆ --- ┆ --- ┆ --- │
│ String ┆ String ┆ Float64 ┆ Embedding[Float32; 1536] │
╞═══════════════╪════════════════════════════════╪════════════════════╪══════════════════════════╡
│ What is Daft? ┆ Daft is a distributed datafra… ┆ 0.3621492191359764 ┆ ▄▇▆▅▄▄█▆▄▄▃▂▄▃▃▃▁▄▃▃▄▄▃▂ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ What is Daft? ┆ Python is a high-level progra… ┆ 0.9163975397319742 ┆ ▇▆▅▇▅▆█▇▃▄▆▄▄▁▅▄▅▃▁▃▃▂▅▃ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ What is Daft? ┆ Embeddings capture semantic m… ┆ 0.9374004015203741 ┆ ▄█▅▄▅▅▅▇▄▃▂▁▃▄▄▁▃▃▂▂▂▂▁▃ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ What is Daft? ┆ Machine learning models requi… ┆ 0.9696998373223874 ┆ ▇▇▆▃▄▆▅█▆▂▄▃▄▄▂▄▂▁▂▂▁▃▂▁ │
╰───────────────┴────────────────────────────────┴────────────────────┴──────────────────────────╯

(Showing first 4 of 4 rows)
```
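
For intuition: cosine distance is 1 minus the cosine similarity of the two vectors, so 0 means identical direction and values near 1 mean unrelated. A minimal NumPy sketch for checking a score by hand (an illustration, not Daft's internal implementation):

```python
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    # 1 minus the cosine of the angle between the two vectors
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))
```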

## Building a Document Search Pipeline

For production use cases, you'll typically combine embeddings with LLM-powered metadata extraction and write the results to a vector database.

This example shows an end-to-end pipeline that:

1. Loads PDF documents from cloud storage
2. Extracts structured metadata using an LLM
3. Generates vector embeddings from the abstracts
4. Writes everything to Turbopuffer for semantic search

```python
# /// script
# description = "This example shows how, using LLMs and embedding models, Daft chunks documents, extracts metadata, generates vectors, and writes them to any vector database..."
# dependencies = ["daft[openai, turbopuffer]", "pymupdf"]
# ///
import os
import daft
from daft import col, lit
from daft.functions import embed_text, prompt, file, unnest, monotonically_increasing_id
from pydantic import BaseModel

class Classifier(BaseModel):
    title: str
    author: str
    year: int
    keywords: list[str]
    abstract: str

daft.set_execution_config(enable_dynamic_batching=True)
daft.set_provider("openai", api_key=os.environ.get("OPENAI_API_KEY"))

# Load documents and generate vector embeddings
df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/papers/*.pdf").limit(10)
    .with_column(
        "metadata",
        prompt(
            messages=file(col("path")),
            system_message="Read the paper and extract the classifier metadata.",
            return_format=Classifier,
            model="gpt-5-mini",
        )
    )
    .with_column(
        "abstract_embedding",
        embed_text(
            daft.col("metadata")["abstract"],
            model="text-embedding-3-large"
        )
    )
    .with_column("id", monotonically_increasing_id())
    .select("id", "path", unnest(col("metadata")), "abstract_embedding")
)

# Write to Turbopuffer
df.write_turbopuffer(
    namespace="ai_papers",
    api_key=os.environ.get("TURBOPUFFER_API_KEY"),
    distance_metric="cosine_distance",
    region="us-west-2",
    schema={
        "id": "int64",
        "path": "string",
        "title": "string",
        "author": "string",
        "year": "int",
        "keywords": "list[string]",
        "abstract": "string",
        "abstract_embedding": "vector",
    }
)
```
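
To search the namespace afterwards, embed the query text with the same model and pass the vector to Turbopuffer. The Daft side of the sketch below follows the pipeline above; the commented-out client call is an assumption, so verify the exact query API against the Turbopuffer docs:

```python
import os
import daft
from daft.functions import embed_text

# Embed a search query with the same model used for the abstracts
# (assumes the OpenAI provider was configured as in the pipeline above).
q = daft.from_pydict({"q": ["distributed dataframe engines"]})
q = q.with_column("vec", embed_text(daft.col("q"), model="text-embedding-3-large"))
query_vector = list(q.to_pydict()["vec"][0])

# Hypothetical Turbopuffer lookup (check the client docs for the exact signature):
# import turbopuffer as tpuf
# ns = tpuf.Namespace("ai_papers", api_key=os.environ.get("TURBOPUFFER_API_KEY"))
# results = ns.query(vector=query_vector, top_k=5, distance_metric="cosine_distance")
```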