docs: Add daft.File usage throughout modalities #6074

everettVT · 2026-01-21T19:56:06Z

This pull request updates and expands the documentation for working with audio and file data types in Daft. The changes reorganize and improve the guides for handling audio files, clarify schema usage, and update references to file types for consistency. The most important changes are grouped below.

Audio Modality Documentation Improvements:

Rewrote and expanded the docs/modalities/audio.md guide to focus on using daft.AudioFile and daft.File, including new sections on indexing, preprocessing, reading, writing, and transcribing audio files with Faster Whisper. Added practical code examples and clarified best practices for memory-efficient audio processing. [1] [2]
Updated the transcription section to use Faster Whisper, improved explanations of the transcription schema, and provided a more concise summary of advanced use cases and next steps. [1] [2] [3]
Updated example output to use a generic username for privacy and clarity.

File Type Documentation Updates:

Renamed and consolidated the file types documentation: removed the old docs/api/datatypes/daft_file_types.md and replaced it with docs/api/datatypes/file_types.md, updating references and clarifying the role of File, AudioFile, and VideoFile data types. [1] [2] [3]

General Documentation Organization:

Updated docs/SUMMARY.md to add new sections for documents and embeddings, renamed and reordered "Files and URLs", and updated references to the new file types documentation for clarity and consistency. [1] [2]

Code Example Improvements:

Updated document-processing examples to use more idiomatic and concise column access patterns, improving clarity and consistency with Daft's current API. [1] [2]

## Changes Made

…umentation

- Updated error messages in `AudioFile` and related functions to provide clearer instructions for installing required modules (`soundfile` and `librosa`).
- Enhanced docstrings to include detailed metadata descriptions for audio files.
- Added new documentation files for `daft.File` types and updated existing documentation links for better navigation.
- Improved test cases to ensure informative error handling when dependencies are missing.

This change aims to improve user experience by providing clearer guidance on required dependencies and enhancing the overall documentation structure.

Removed links to Documents and Code modalities from the summary.

Removed duplicate entry for 'Functions' in SUMMARY.md

Removed unnecessary comments from the audio file tests.

Removed unnecessary import checks for 'soundfile' in audio tests.

Added missing 'soundfile' imports in multiple test functions within test_audio.py to ensure proper functionality and avoid import errors.

- Added new sections for working with documents and embeddings, detailing their usage and examples.
- Updated the file handling documentation to include a new `Files and URLs` section, replacing the previous `URLs and Files` section.
- Removed the outdated `urls.md` file and consolidated its content into the new `files.md`.
- Improved the overview of modalities to reflect recent changes and additions, ensuring clarity on supported data types.
- Updated examples across various modalities, including audio, video, and document processing, to enhance usability and understanding.

greptile-apps · 2026-01-21T19:58:13Z

Greptile Summary

This PR significantly expands and improves documentation for working with different data modalities in Daft, with a particular focus on the new daft.File, daft.AudioFile, and daft.VideoFile types.

Major additions:

Three new comprehensive guides: Documents, Embeddings, and Files/URLs modalities
Extensive rewrite of the Audio guide with practical examples for indexing, preprocessing, and transcription using Faster Whisper
Enhanced Video guide covering both read_video_frames and VideoFile class
Restructured modalities overview with modern card-based layout

Key improvements:

Consolidated file types documentation from daft_file_types.md to file_types.md for clarity
Updated code examples to use more idiomatic bracket notation (col("field")["subfield"]) instead of verbose .struct.get() calls
Anonymized username in audio transcription output examples for privacy
Updated navigation structure in SUMMARY.md to include new sections

All changes are documentation-only with no code modifications. The examples appear well-structured with clear explanations and practical use cases.

Confidence Score: 5/5

This PR is safe to merge - it contains only documentation improvements with no code changes
All changes are documentation-only, improving clarity and adding new guides. No functional code is modified, eliminating any risk of bugs or regressions.
No files require special attention

Important Files Changed

Filename	Overview
docs/modalities/audio.md	Comprehensive rewrite focusing on `daft.AudioFile` and `daft.File`, added indexing and preprocessing section, updated transcription to use Faster Whisper, anonymized output examples
docs/examples/document-processing.md	Improved struct field access patterns from verbose `.struct.get()` calls to more idiomatic bracket notation
docs/modalities/documents.md	New comprehensive guide for working with documents using `daft.File`, including LLM prompting and PDF extraction examples
docs/modalities/embeddings.md	New guide covering embeddings, semantic search, and vector database integration
docs/modalities/files.md	New comprehensive guide for working with files and URLs using `daft.File` type, replacing the old urls.md
docs/modalities/overview.md	Restructured with modern card-based layout, updated descriptions to emphasize AI-native data processing, added new modality sections

Sequence Diagram

sequenceDiagram
    participant User
    participant Daft
    participant Storage as Remote Storage
    participant AudioFile as daft.AudioFile
    participant File as daft.File
    participant Whisper as Faster Whisper

    User->>Daft: from_glob_path("*.mp3")
    Daft->>Storage: Discover audio files
    Storage-->>Daft: File metadata (paths, sizes)
    
    User->>Daft: with_column(audio_file(path))
    Daft->>AudioFile: Convert path to AudioFile reference
    AudioFile-->>Daft: AudioFile instances
    
    User->>Daft: with_column(audio_metadata(file))
    Daft->>AudioFile: Extract metadata
    AudioFile-->>Daft: Sample rate, channels, frames
    
    User->>Daft: with_column(resample(file, 16000))
    Daft->>AudioFile: Resample audio
    AudioFile->>Storage: Stream audio data
    Storage-->>AudioFile: Audio chunks
    AudioFile-->>Daft: Resampled tensor
    
    User->>Daft: Apply transcription UDF
    Daft->>File: Open audio file
    File->>Storage: Read audio data
    Storage-->>File: Audio bytes
    File-->>Whisper: Audio data
    Whisper-->>Daft: Transcription with timestamps
    
    Daft-->>User: DataFrame with processed audio data

Copilot

Pull request overview

This pull request significantly enhances Daft's documentation for working with multimodal data by consolidating file handling guides, expanding audio processing documentation, and adding new guides for embeddings and document processing.

Changes:

Rewrote the audio modality guide to focus on daft.AudioFile and daft.File with practical examples for indexing, preprocessing, and transcription
Created new comprehensive guides for working with generic files/URLs, embeddings, and document processing (PDFs, Markdown)
Reorganized the modalities documentation structure and consolidated file type references into a single location

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

File	Description
docs/modalities/files.md	New comprehensive guide for working with files and URLs, replacing urls.md with enhanced content on `daft.File` usage
docs/modalities/embeddings.md	New guide covering semantic search and vector embeddings with examples for RAG pipelines
docs/modalities/documents.md	New guide for document processing including PDF extraction and Markdown parsing
docs/modalities/audio.md	Major rewrite focusing on `daft.AudioFile`, audio metadata extraction, and Faster Whisper transcription
docs/modalities/videos.md	Expanded with new `daft.VideoFile` examples and metadata extraction
docs/modalities/overview.md	Reorganized with card-based layout and updated descriptions
docs/examples/document-processing.md	Updated to use more idiomatic column access patterns
docs/api/datatypes/file_types.md	New consolidated API reference for File, AudioFile, and VideoFile types
docs/SUMMARY.md	Updated navigation with new document sections and renamed file references
docs/use-case/batch-inference.md	Updated link reference to files.md

Copilot · 2026-01-22T16:02:50Z

docs/modalities/files.md

+df = daft.from_pydict({"path": ["path/to/file.txt"]})
+df = df.select(read_file(daft.col("path")))


The example shows df.select(read_file(daft.col("path"))) but the function read_file expects a daft.File parameter while daft.col("path") likely returns a string path. The example should either convert the path to a File first using file() function, or the example output is incorrect. The correct approach would be: df = df.with_column("file", file(daft.col("path"))).select(read_file(daft.col("file")))

Copilot · 2026-01-22T16:02:50Z

docs/modalities/files.md

+                "name": daft.DataType.string(),
+                "signature": daft.DataType.string(),
+                "docstring": daft.DataType.string(),
+                "start_line": daft.DataType.int64(),


The return_dtype declaration is missing the "end_line" field that is being returned in the function body (line 217). The struct definition should include "end_line": daft.DataType.int64() to match the actual return value.
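The mismatch described in this comment (a returned record carrying a field the declared struct type omits) can be caught with a small consistency check. The following is a minimal standard-library sketch, not the code from the docs under review: `extract_functions`, `check_schema`, and `DECLARED_FIELDS` are hypothetical names, and the five field names are assumed from the diff and the suggestion in this comment.

```python
import ast

# Field names the struct return type is assumed to declare (taken from the
# review comment; the real docs example may differ).
DECLARED_FIELDS = {"name", "signature", "docstring", "start_line", "end_line"}


def extract_functions(source: str) -> list[dict]:
    """Return one record per function definition found in `source`."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append(
                {
                    "name": node.name,
                    "signature": f"{node.name}({ast.unparse(node.args)})",
                    "docstring": ast.get_docstring(node) or "",
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                }
            )
    return records


def check_schema(records: list[dict], declared: set[str]) -> None:
    """Raise ValueError if any record's keys differ from the declared fields."""
    for record in records:
        mismatch = set(record) ^ declared
        if mismatch:
            raise ValueError(f"schema mismatch on fields: {sorted(mismatch)}")


sample = (
    'def greet(name, punctuation="!"):\n'
    '    """Say hello."""\n'
    '    return "Hello, " + name + punctuation\n'
)
records = extract_functions(sample)
check_schema(records, DECLARED_FIELDS)  # passes: all five fields are declared
```

Running `check_schema` with `end_line` removed from the declared set raises a `ValueError`, which is the kind of drift between a declared return type and the returned value that this comment points out.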

Suggested change

-                "start_line": daft.DataType.int64(),
+                "start_line": daft.DataType.int64(),
+                "end_line": daft.DataType.int64(),

Copilot · 2026-01-22T16:02:51Z

docs/modalities/embeddings.md

+            messages=file(col("path")),
+            system_message="Read the paper and extract the classifier metadata.",
+            return_format=Classifier,
+            model="gpt-5-mini",


The model name "gpt-5-mini" does not exist. This should likely be "gpt-4o-mini" or another valid OpenAI model. GPT-5 has not been released as of the knowledge cutoff.

Suggested change

-            model="gpt-5-mini",
+            model="gpt-4o-mini",

Co-authored-by: Copilot <[email protected]>

everettVT and others added 16 commits December 24, 2025 18:11

Merge branch 'main' into everettVT/daft_file_docs

1131743

merge origin/main into daft_file_docs

697dc6b

fix style and doctests

241935b

ignore doctest.

b0f2932

Remove Documents and Code from SUMMARY.md

9fcff6a

Removed links to Documents and Code modalities from the summary.

Reorganize API documentation structure

5326194

Update SUMMARY.md to reflect API changes

c85835c

Update SUMMARY.md to include Window and Schema sections

b5abad7

Add Series section to SUMMARY.md

5d51d25

Remove duplicate 'Functions' entry from SUMMARY.md

4e50a94

Removed duplicate entry for 'Functions' in SUMMARY.md

Clean up comments in test_audio.py

4de55f4

Removed unnecessary comments from the audio file tests.

Clean up soundfile import checks in tests

7754a15

Removed unnecessary import checks for 'soundfile' in audio tests.

Enhance audio tests with soundfile imports

5b03786

Added missing 'soundfile' imports in multiple test functions within test_audio.py to ensure proper functionality and avoid import errors.

Merge remote-tracking branch 'origin/main' into everettVT/daft_file_docs

435fb1e

github-actions bot added the docs label Jan 21, 2026

everettVT requested review from Copilot, universalmind303 and ykdojo and removed request for universalmind303 and ykdojo January 21, 2026 20:20

Copilot started reviewing on behalf of everettVT January 22, 2026 15:57

Copilot AI reviewed Jan 22, 2026

Update files.md

2b10e24

Co-authored-by: Copilot <[email protected]>

2 participants
