@everettVT

This pull request updates and expands the documentation for working with audio and file data types in Daft. The changes reorganize and improve the guides for handling audio files, clarify schema usage, and update references to file types for consistency. The most important changes are grouped below.

Audio Modality Documentation Improvements:

  • Rewrote and expanded the docs/modalities/audio.md guide to focus on using daft.AudioFile and daft.File, including new sections on indexing, preprocessing, reading, writing, and transcribing audio files with Faster Whisper. Added practical code examples and clarified best practices for memory-efficient audio processing. [1] [2]
  • Updated the transcription section to use Faster Whisper, improved explanations of the transcription schema, and provided a more concise summary of advanced use cases and next steps. [1] [2] [3]
  • Updated example output to use a generic username for privacy and clarity.

File Type Documentation Updates:

  • Renamed and consolidated the file types documentation: removed the old docs/api/datatypes/daft_file_types.md and replaced it with docs/api/datatypes/file_types.md, updating references and clarifying the role of File, AudioFile, and VideoFile data types. [1] [2] [3]

General Documentation Organization:

  • Updated docs/SUMMARY.md to add new sections for documents and embeddings, renamed and reordered "Files and URLs", and updated references to the new file types documentation for clarity and consistency. [1] [2]

Code Example Improvements:

  • Updated document-processing examples to use more idiomatic and concise column access patterns, improving clarity and consistency with Daft's current API. [1] [2]

## Changes Made

everettVT and others added 16 commits December 24, 2025 18:11
…umentation

- Updated error messages in `AudioFile` and related functions to provide clearer instructions for installing required modules (`soundfile` and `librosa`).
- Enhanced docstrings to include detailed metadata descriptions for audio files.
- Added new documentation files for `daft.File` types and updated existing documentation links for better navigation.
- Improved test cases to ensure informative error handling when dependencies are missing.

This change aims to improve user experience by providing clearer guidance on required dependencies and enhancing the overall documentation structure.
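
The dependency-error behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not Daft's actual implementation: the helper names `require_module` and `describe_missing` and the wording of the message are assumptions, and only the module names `soundfile` and `librosa` come from the commit message.

```python
import importlib
from types import ModuleType


def require_module(name: str, feature: str) -> ModuleType:
    """Import `name`, or raise an ImportError that says how to install it.

    Hypothetical helper for illustration; not part of Daft's API.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Re-raise with an actionable message, keeping the original cause.
        raise ImportError(
            f"{feature} requires the '{name}' module. "
            f"Install it with: pip install {name}"
        ) from exc


def describe_missing(name: str, feature: str) -> str:
    """Return the error text a user would see, or '' if the module imports."""
    try:
        require_module(name, feature)
    except ImportError as exc:
        return str(exc)
    return ""
```

A test for this kind of helper would call `describe_missing("soundfile", "AudioFile metadata")` in an environment without the dependency and check that the message names the module and the install command.
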
Removed links to Documents and Code modalities from the summary.
Removed duplicate entry for 'Functions' in SUMMARY.md
Removed unnecessary comments from the audio file tests.
Removed unnecessary import checks for 'soundfile' in audio tests.
Added missing 'soundfile' imports in multiple test functions within test_audio.py to ensure proper functionality and avoid import errors.
- Added new sections for working with documents and embeddings, detailing their usage and examples.
- Updated the file handling documentation to include a new `Files and URLs` section, replacing the previous `URLs and Files` section.
- Removed the outdated `urls.md` file and consolidated its content into the new `files.md`.
- Improved the overview of modalities to reflect recent changes and additions, ensuring clarity on supported data types.
- Updated examples across various modalities, including audio, video, and document processing, to enhance usability and understanding.
@github-actions github-actions bot added the docs label Jan 21, 2026
@greptile-apps

greptile-apps bot commented Jan 21, 2026

Greptile Summary

This PR significantly expands and improves documentation for working with different data modalities in Daft, with a particular focus on the new daft.File, daft.AudioFile, and daft.VideoFile types.

Major additions:

  • Three new comprehensive guides: Documents, Embeddings, and Files/URLs modalities
  • Extensive rewrite of the Audio guide with practical examples for indexing, preprocessing, and transcription using Faster Whisper
  • Enhanced Video guide covering both read_video_frames and VideoFile class
  • Restructured modalities overview with modern card-based layout

Key improvements:

  • Consolidated file types documentation from daft_file_types.md to file_types.md for clarity
  • Updated code examples to use more idiomatic bracket notation (col("field")["subfield"]) instead of verbose .struct.get() calls
  • Anonymized username in audio transcription output examples for privacy
  • Updated navigation structure in SUMMARY.md to include new sections

All changes are documentation-only with no code modifications. The examples appear well-structured with clear explanations and practical use cases.

Confidence Score: 5/5

  • This PR is safe to merge - it contains only documentation improvements with no code changes
  • All changes are documentation-only, improving clarity and adding new guides. No functional code is modified, eliminating any risk of bugs or regressions.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| docs/modalities/audio.md | Comprehensive rewrite focusing on daft.AudioFile and daft.File, added indexing and preprocessing section, updated transcription to use Faster Whisper, anonymized output examples |
| docs/examples/document-processing.md | Improved struct field access patterns from verbose .struct.get() calls to more idiomatic bracket notation |
| docs/modalities/documents.md | New comprehensive guide for working with documents using daft.File, including LLM prompting and PDF extraction examples |
| docs/modalities/embeddings.md | New guide covering embeddings, semantic search, and vector database integration |
| docs/modalities/files.md | New comprehensive guide for working with files and URLs using daft.File type, replacing the old urls.md |
| docs/modalities/overview.md | Restructured with modern card-based layout, updated descriptions to emphasize AI-native data processing, added new modality sections |

Sequence Diagram

sequenceDiagram
    participant User
    participant Daft
    participant Storage as Remote Storage
    participant AudioFile as daft.AudioFile
    participant File as daft.File
    participant Whisper as Faster Whisper

    User->>Daft: from_glob_path("*.mp3")
    Daft->>Storage: Discover audio files
    Storage-->>Daft: File metadata (paths, sizes)
    
    User->>Daft: with_column(audio_file(path))
    Daft->>AudioFile: Convert path to AudioFile reference
    AudioFile-->>Daft: AudioFile instances
    
    User->>Daft: with_column(audio_metadata(file))
    Daft->>AudioFile: Extract metadata
    AudioFile-->>Daft: Sample rate, channels, frames
    
    User->>Daft: with_column(resample(file, 16000))
    Daft->>AudioFile: Resample audio
    AudioFile->>Storage: Stream audio data
    Storage-->>AudioFile: Audio chunks
    AudioFile-->>Daft: Resampled tensor
    
    User->>Daft: Apply transcription UDF
    Daft->>File: Open audio file
    File->>Storage: Read audio data
    Storage-->>File: Audio bytes
    File-->>Whisper: Audio data
    Whisper-->>Daft: Transcription with timestamps
    
    Daft-->>User: DataFrame with processed audio data

Copilot AI left a comment

Pull request overview

This pull request significantly enhances Daft's documentation for working with multimodal data by consolidating file handling guides, expanding audio processing documentation, and adding new guides for embeddings and document processing.

Changes:

  • Rewrote the audio modality guide to focus on daft.AudioFile and daft.File with practical examples for indexing, preprocessing, and transcription
  • Created new comprehensive guides for working with generic files/URLs, embeddings, and document processing (PDFs, Markdown)
  • Reorganized the modalities documentation structure and consolidated file type references into a single location

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| docs/modalities/files.md | New comprehensive guide for working with files and URLs, replacing urls.md with enhanced content on daft.File usage |
| docs/modalities/embeddings.md | New guide covering semantic search and vector embeddings with examples for RAG pipelines |
| docs/modalities/documents.md | New guide for document processing including PDF extraction and Markdown parsing |
| docs/modalities/audio.md | Major rewrite focusing on daft.AudioFile, audio metadata extraction, and Faster Whisper transcription |
| docs/modalities/videos.md | Expanded with new daft.VideoFile examples and metadata extraction |
| docs/modalities/overview.md | Reorganized with card-based layout and updated descriptions |
| docs/examples/document-processing.md | Updated to use more idiomatic column access patterns |
| docs/api/datatypes/file_types.md | New consolidated API reference for File, AudioFile, and VideoFile types |
| docs/SUMMARY.md | Updated navigation with new document sections and renamed file references |
| docs/use-case/batch-inference.md | Updated link reference to files.md |

Comment on lines +162 to +163
df = daft.from_pydict({"path": ["path/to/file.txt"]})
df = df.select(read_file(daft.col("path")))
Copilot AI Jan 22, 2026

The example calls df.select(read_file(daft.col("path"))), but read_file expects a daft.File argument, while daft.col("path") here refers to a column of string paths. Either convert the path to a File first with the file() function, or correct the example output. One way to do the conversion: df = df.with_column("file", file(daft.col("path"))).select(read_file(daft.col("file")))

"name": daft.DataType.string(),
"signature": daft.DataType.string(),
"docstring": daft.DataType.string(),
"start_line": daft.DataType.int64(),
Copilot AI Jan 22, 2026

The return_dtype declaration is missing the "end_line" field that is being returned in the function body (line 217). The struct definition should include "end_line": daft.DataType.int64() to match the actual return value.
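
A mismatch like the one described above can be caught mechanically by comparing the declared field names with the keys the function returns. The following is a minimal sketch in plain Python, assuming the declared struct is available as a dict keyed by field name; the helper `undeclared_fields` and the sample row are hypothetical and do not use Daft's API.

```python
def undeclared_fields(declared: dict, returned: dict) -> list:
    """Return keys present in `returned` but absent from `declared`, sorted."""
    return sorted(set(returned) - set(declared))


# Field names from the review comment. The type names are plain strings
# here, standing in for daft.DataType values.
declared = {
    "name": "string",
    "signature": "string",
    "docstring": "string",
    "start_line": "int64",
}

# Hypothetical row shaped like the function's return value.
returned = {
    "name": "parse",
    "signature": "def parse(text)",
    "docstring": "Parse text.",
    "start_line": 10,
    "end_line": 42,
}

# Collects every returned key that the declaration does not cover.
missing = undeclared_fields(declared, returned)
```

Running a check of this kind in the documentation's test suite would flag the missing field before the example is published.
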

Suggested change
- "start_line": daft.DataType.int64(),
+ "start_line": daft.DataType.int64(),
+ "end_line": daft.DataType.int64(),

messages=file(col("path")),
system_message="Read the paper and extract the classifier metadata.",
return_format=Classifier,
model="gpt-5-mini",
Copilot AI Jan 22, 2026

The model name "gpt-5-mini" does not exist. This should likely be "gpt-4o-mini" or another valid OpenAI model. GPT-5 has not been released as of the knowledge cutoff.

Suggested change
- model="gpt-5-mini",
+ model="gpt-4o-mini",

Co-authored-by: Copilot <[email protected]>