6 changes: 4 additions & 2 deletions docs/SUMMARY.md
@@ -14,8 +14,10 @@
 * [Images](modalities/images.md)
 * [Audio](modalities/audio.md)
 * [Videos](modalities/videos.md)
+* [Documents](modalities/documents.md)
 * [JSON and Nested Data](modalities/json.md)
-* [URLs and Files](modalities/urls.md)
+* [Files and URLs](modalities/files.md)
+* [Embeddings](modalities/embeddings.md)
 * [Custom Modalities](modalities/custom.md)
 * Scale Custom Python Code
 * [New UDF Overview](custom-code/index.md)
@@ -90,7 +92,7 @@
 * [User-Defined Functions](api/udf.md)
 * Data Types
 * [DataType](api/datatypes/all_datatypes.md)
-* [daft.File Types](api/datatypes/daft_file_types.md)
+* [File Types](api/datatypes/file_types.md)
 * [Type Conversions](api/datatypes/type_conversions.md)
 * [Casting](api/datatypes/casting.md)
 * [Window](api/window.md)
13 changes: 0 additions & 13 deletions docs/api/datatypes/daft_file_types.md

This file was deleted.

13 changes: 13 additions & 0 deletions docs/api/datatypes/file_types.md
@@ -0,0 +1,13 @@
The `File` DataType provides first-class support for handling file data across local and remote storage, enabling seamless file operations in distributed environments.

::: daft.file.File
    options:
        filters: ["!^_"]

::: daft.file.AudioFile
    options:
        filters: ["!^_"]

::: daft.file.VideoFile
    options:
        filters: ["!^_"]
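
As a quick illustration, here is a minimal sketch of wrapping a column of path strings into `File` values with the `file` function (the paths below are hypothetical):

```python
import daft
from daft import col
from daft.functions import file

# Wrap a string column of local or remote paths into daft.File values,
# which downstream functions (e.g. prompt) can consume directly.
df = daft.from_pydict({"path": ["s3://my-bucket/clip.mp4", "data/podcast.mp3"]})
df = df.with_column("f", file(col("path")))
```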
18 changes: 9 additions & 9 deletions docs/examples/document-processing.md
@@ -803,9 +803,9 @@ print(df.schema())

 #### Explaining Structure Access Expressions
 
-Note that we're using [`.struct`](../api/expressions.md#daft.expressions.struct) to construct an expression that allows Daft to extract individual field values from our complex document structure.
+Note that we're using `col("indexed_texts")["text"]` to construct an expression that allows Daft to extract individual field values from our complex document structure.
 
-When write `col("text_blocks").struct.get("bounding_box")`, we're telling Daft that we want to access the `bounding_box` field of each element from the `text_blocks` column. From this, we can provide additional field-selecting logic (e.g. `["x"]` to get the value for field `x` on the `bounding_box` value from each structure in `text_blocks`).
+When we write `col("text_blocks").struct.get("bounding_box")`, we're telling Daft that we want to access the `bounding_box` field of each element from the `text_blocks` column. From this, we can provide additional field-selecting logic (e.g. `["x"]` to get the value for field `x` on the `bounding_box` value from each structure in `text_blocks`).
 
 The last part of our text box processing step is to extract the text and bounding box coordinates into their own columns. We also want to preserve the reading order index as its own column too.

@@ -815,14 +815,14 @@ This format makes it easier to form follow up queries on our data, such as:

 ```python
 df = (
-    df.with_column("text_blocks", col("indexed_texts").struct.get("text"))
-    .with_column("reading_order_index", col("indexed_texts").struct.get("index"))
+    df.with_column("text_blocks", col("indexed_texts")["text"])
+    .with_column("reading_order_index", col("indexed_texts")["index"])
     .exclude("indexed_texts")
-    .with_column("text", col("text_blocks").struct.get("text"))
-    .with_column("x", col("text_blocks").struct.get("bounding_box")["x"])
-    .with_column("y", col("text_blocks").struct.get("bounding_box")["y"])
-    .with_column("h", col("text_blocks").struct.get("bounding_box")["h"])
-    .with_column("w", col("text_blocks").struct.get("bounding_box")["w"])
+    .with_column("text", col("text_blocks")["text"])
+    .with_column("x", col("text_blocks")["bounding_box"]["x"])
+    .with_column("y", col("text_blocks")["bounding_box"]["y"])
+    .with_column("h", col("text_blocks")["bounding_box"]["h"])
+    .with_column("w", col("text_blocks")["bounding_box"]["w"])
     .exclude("text_blocks")
 )
 print(df.schema())
87 changes: 64 additions & 23 deletions docs/modalities/audio.md

Large diffs are not rendered by default.

282 changes: 282 additions & 0 deletions docs/modalities/documents.md

Large diffs are not rendered by default.

158 changes: 158 additions & 0 deletions docs/modalities/embeddings.md
@@ -0,0 +1,158 @@
# Working with Embeddings

Embeddings transform text, images, and other data into dense vector representations that capture semantic meaning—enabling similarity search, retrieval-augmented generation (RAG), and AI-powered discovery. Daft makes it easy to generate, store, and query embeddings at scale.

With the native [`daft.DataType.embedding`](../api/datatypes/embedding.md) type and [`embed_text`](../api/functions/embed_text.md) function, you can:

- **Generate embeddings** from any text column using providers like OpenAI, Cohere, or local models
- **Compute similarity** with built-in distance functions like `cosine_distance`
- **Build search pipelines** that scale from local development to distributed clusters
- **Write to vector databases** like Turbopuffer, Pinecone, or LanceDB

## Semantic Search Example

The following example creates a simple semantic search pipeline—embedding documents, comparing them to a query, and ranking by similarity:

```python
import daft
from daft.functions import embed_text, cosine_distance

# Create a knowledge base with documents
documents = daft.from_pydict(
    {
        "text": [
            "Python is a high-level programming language",
            "Machine learning models require training data",
            "Daft is a distributed dataframe library",
            "Embeddings capture semantic meaning of text",
        ],
    }
)

# Embed all documents
documents = documents.with_column(
    "embedding",
    embed_text(
        daft.col("text"),
        provider="openai",
        model="text-embedding-3-small",
    ),
)

# Create a query
query = daft.from_pydict({"query_text": ["What is Daft?"]})

# Embed the query
query = query.with_column(
    "query_embedding",
    embed_text(
        daft.col("query_text"),
        provider="openai",
        model="text-embedding-3-small",
    ),
)

# Cross join to compare query against all documents
results = query.join(documents, how="cross")

# Calculate cosine distance (lower is more similar)
results = results.with_column(
    "distance", cosine_distance(daft.col("query_embedding"), daft.col("embedding"))
)

# Sort by distance and show top results
results = results.sort("distance").select("query_text", "text", "distance", "embedding")
results.show()
```

```{title="Output"}
╭───────────────┬────────────────────────────────┬────────────────────┬──────────────────────────╮
│ query_text ┆ text ┆ distance ┆ embedding │
│ --- ┆ --- ┆ --- ┆ --- │
│ String ┆ String ┆ Float64 ┆ Embedding[Float32; 1536] │
╞═══════════════╪════════════════════════════════╪════════════════════╪══════════════════════════╡
│ What is Daft? ┆ Daft is a distributed datafra… ┆ 0.3621492191359764 ┆ ▄▇▆▅▄▄█▆▄▄▃▂▄▃▃▃▁▄▃▃▄▄▃▂ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ What is Daft? ┆ Python is a high-level progra… ┆ 0.9163975397319742 ┆ ▇▆▅▇▅▆█▇▃▄▆▄▄▁▅▄▅▃▁▃▃▂▅▃ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ What is Daft? ┆ Embeddings capture semantic m… ┆ 0.9374004015203741 ┆ ▄█▅▄▅▅▅▇▄▃▂▁▃▄▄▁▃▃▂▂▂▂▁▃ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ What is Daft? ┆ Machine learning models requi… ┆ 0.9696998373223874 ┆ ▇▇▆▃▄▆▅█▆▂▄▃▄▄▂▄▂▁▂▂▁▃▂▁ │
╰───────────────┴────────────────────────────────┴────────────────────┴──────────────────────────╯

(Showing first 4 of 4 rows)
```
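
For intuition: cosine distance is 1 minus the cosine similarity of the two vectors, so 0 means identical direction and values near 1 mean unrelated. A minimal NumPy sketch for checking a score by hand (an illustration, not Daft's internal implementation):

```python
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    # 1 minus the cosine of the angle between the two vectors
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))
```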

## Building a Document Search Pipeline

For production use cases, you'll typically combine embeddings with LLM-powered metadata extraction and write the results to a vector database.

This example shows an end-to-end pipeline that:

1. Loads PDF documents from cloud storage
2. Extracts structured metadata using an LLM
3. Generates vector embeddings from the abstracts
4. Writes everything to Turbopuffer for semantic search

```python
# /// script
# description = "This example shows how, using LLMs and embedding models, Daft chunks documents, extracts metadata, generates vectors, and writes them to any vector database..."
# dependencies = ["daft[openai, turbopuffer]", "pymupdf"]
# ///
import os
import daft
from daft import col, lit
from daft.functions import embed_text, prompt, file, unnest, monotonically_increasing_id
from pydantic import BaseModel

class Classifier(BaseModel):
    title: str
    author: str
    year: int
    keywords: list[str]
    abstract: str

daft.set_execution_config(enable_dynamic_batching=True)
daft.set_provider("openai", api_key=os.environ.get("OPENAI_API_KEY"))

# Load documents and generate vector embeddings
df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/papers/*.pdf").limit(10)
    .with_column(
        "metadata",
        prompt(
            messages=file(col("path")),
            system_message="Read the paper and extract the classifier metadata.",
            return_format=Classifier,
            model="gpt-5-mini",
        )
    )
    .with_column(
        "abstract_embedding",
        embed_text(
            daft.col("metadata")["abstract"],
            model="text-embedding-3-large"
        )
    )
    .with_column("id", monotonically_increasing_id())
    .select("id", "path", unnest(col("metadata")), "abstract_embedding")
)

# Write to Turbopuffer
df.write_turbopuffer(
    namespace="ai_papers",
    api_key=os.environ.get("TURBOPUFFER_API_KEY"),
    distance_metric="cosine_distance",
    region="us-west-2",
    schema={
        "id": "int64",
        "path": "string",
        "title": "string",
        "author": "string",
        "year": "int",
        "keywords": "list[string]",
        "abstract": "string",
        "abstract_embedding": "vector",
    }
)
```
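
To search the namespace afterwards, embed the query text with the same model and pass the vector to Turbopuffer. The Daft side of the sketch below follows the pipeline above; the commented-out client call is an assumption, so verify the exact query API against the Turbopuffer docs:

```python
import os
import daft
from daft.functions import embed_text

# Embed a search query with the same model used for the abstracts
# (assumes the OpenAI provider was configured as in the pipeline above).
q = daft.from_pydict({"q": ["distributed dataframe engines"]})
q = q.with_column("vec", embed_text(daft.col("q"), model="text-embedding-3-large"))
query_vector = list(q.to_pydict()["vec"][0])

# Hypothetical Turbopuffer lookup (check the client docs for the exact signature):
# import turbopuffer as tpuf
# ns = tpuf.Namespace("ai_papers", api_key=os.environ.get("TURBOPUFFER_API_KEY"))
# results = ns.query(vector=query_vector, top_k=5, distance_metric="cosine_distance")
```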