docs: Add daft.File usage throughout modalities #6074
base: main
…umentation
- Updated error messages in `AudioFile` and related functions to provide clearer instructions for installing required modules (`soundfile` and `librosa`).
- Enhanced docstrings to include detailed metadata descriptions for audio files.
- Added new documentation files for `daft.File` types and updated existing documentation links for better navigation.
- Improved test cases to ensure informative error handling when dependencies are missing.
This change aims to improve user experience by providing clearer guidance on required dependencies and enhancing the overall documentation structure.
Removed links to Documents and Code modalities from the summary.
Removed duplicate entry for 'Functions' in SUMMARY.md
Removed unnecessary comments from the audio file tests.
Removed unnecessary import checks for 'soundfile' in audio tests.
Added missing 'soundfile' imports in multiple test functions within test_audio.py to ensure proper functionality and avoid import errors.
- Added new sections for working with documents and embeddings, detailing their usage and examples.
- Updated the file handling documentation to include a new `Files and URLs` section, replacing the previous `URLs and Files` section.
- Removed the outdated `urls.md` file and consolidated its content into the new `files.md`.
- Improved the overview of modalities to reflect recent changes and additions, ensuring clarity on supported data types.
- Updated examples across various modalities, including audio, video, and document processing, to enhance usability and understanding.
Greptile Summary
This PR significantly expands and improves documentation for working with different data modalities in Daft, with a particular focus on the new
Major additions:
Key improvements:
All changes are documentation-only with no code modifications. The examples appear well-structured with clear explanations and practical use cases.
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Daft
participant Storage as Remote Storage
participant AudioFile as daft.AudioFile
participant File as daft.File
participant Whisper as Faster Whisper
User->>Daft: from_glob_path("*.mp3")
Daft->>Storage: Discover audio files
Storage-->>Daft: File metadata (paths, sizes)
User->>Daft: with_column(audio_file(path))
Daft->>AudioFile: Convert path to AudioFile reference
AudioFile-->>Daft: AudioFile instances
User->>Daft: with_column(audio_metadata(file))
Daft->>AudioFile: Extract metadata
AudioFile-->>Daft: Sample rate, channels, frames
User->>Daft: with_column(resample(file, 16000))
Daft->>AudioFile: Resample audio
AudioFile->>Storage: Stream audio data
Storage-->>AudioFile: Audio chunks
AudioFile-->>Daft: Resampled tensor
User->>Daft: Apply transcription UDF
Daft->>File: Open audio file
File->>Storage: Read audio data
Storage-->>File: Audio bytes
File-->>Whisper: Audio data
Whisper-->>Daft: Transcription with timestamps
Daft-->>User: DataFrame with processed audio data
Pull request overview
This pull request significantly enhances Daft's documentation for working with multimodal data by consolidating file handling guides, expanding audio processing documentation, and adding new guides for embeddings and document processing.
Changes:
- Rewrote the audio modality guide to focus on `daft.AudioFile` and `daft.File` with practical examples for indexing, preprocessing, and transcription
- Created new comprehensive guides for working with generic files/URLs, embeddings, and document processing (PDFs, Markdown)
- Reorganized the modalities documentation structure and consolidated file type references into a single location
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/modalities/files.md | New comprehensive guide for working with files and URLs, replacing urls.md with enhanced content on daft.File usage |
| docs/modalities/embeddings.md | New guide covering semantic search and vector embeddings with examples for RAG pipelines |
| docs/modalities/documents.md | New guide for document processing including PDF extraction and Markdown parsing |
| docs/modalities/audio.md | Major rewrite focusing on daft.AudioFile, audio metadata extraction, and Faster Whisper transcription |
| docs/modalities/videos.md | Expanded with new daft.VideoFile examples and metadata extraction |
| docs/modalities/overview.md | Reorganized with card-based layout and updated descriptions |
| docs/examples/document-processing.md | Updated to use more idiomatic column access patterns |
| docs/api/datatypes/file_types.md | New consolidated API reference for File, AudioFile, and VideoFile types |
| docs/SUMMARY.md | Updated navigation with new document sections and renamed file references |
| docs/use-case/batch-inference.md | Updated link reference to files.md |
df = daft.from_pydict({"path": ["path/to/file.txt"]})
df = df.select(read_file(daft.col("path")))
Copilot
AI
Jan 22, 2026
The example shows `df.select(read_file(daft.col("path")))`, but the function `read_file` expects a `daft.File` parameter while `daft.col("path")` likely returns a string path. The example should either convert the path to a File first using the `file()` function, or the example output is incorrect. The correct approach would be: `df = df.with_column("file", file(daft.col("path"))).select(read_file(daft.col("file")))`
"name": daft.DataType.string(),
"signature": daft.DataType.string(),
"docstring": daft.DataType.string(),
"start_line": daft.DataType.int64(),
Copilot
AI
Jan 22, 2026
The `return_dtype` declaration is missing the `"end_line"` field that is being returned in the function body (line 217). The struct definition should include `"end_line": daft.DataType.int64()` to match the actual return value.
Suggested change:
-"start_line": daft.DataType.int64(),
+"start_line": daft.DataType.int64(),
+"end_line": daft.DataType.int64(),
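The kind of mismatch flagged in this comment (a field returned by the function body but absent from the declared struct) can be checked mechanically. The sketch below is a standard-library illustration only, not the code from the PR: it assumes the documented function extracts per-function metadata with Python's `ast` module, and `DECLARED_FIELDS`, `extract_functions`, and `undeclared_fields` are hypothetical names introduced here.

```python
import ast

# Hypothetical mirror of the struct fields under review, with the
# suggested "end_line" field included.
DECLARED_FIELDS = {"name", "signature", "docstring", "start_line", "end_line"}


def extract_functions(source: str) -> list[dict]:
    """Return one metadata dict per function definition found in `source`."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            rows.append(
                {
                    "name": node.name,
                    # ast.unparse renders the argument list back to source text.
                    "signature": f"{node.name}({ast.unparse(node.args)})",
                    "docstring": ast.get_docstring(node) or "",
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                }
            )
    return rows


def undeclared_fields(rows: list[dict]) -> set[str]:
    """Keys present in the returned rows but missing from the declared schema."""
    returned = set().union(*(row.keys() for row in rows))
    return returned - DECLARED_FIELDS


sample = 'def add(a, b=1):\n    """Add two numbers."""\n    return a + b\n'
rows = extract_functions(sample)
missing = undeclared_fields(rows)
```

With `"end_line"` removed from `DECLARED_FIELDS`, `undeclared_fields(rows)` would report it, which is the discrepancy the reviewer describes between the declared return type and the returned value.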
messages=file(col("path")),
system_message="Read the paper and extract the classifier metadata.",
return_format=Classifier,
model="gpt-5-mini",
Copilot
AI
Jan 22, 2026
The model name "gpt-5-mini" does not exist. This should likely be "gpt-4o-mini" or another valid OpenAI model. GPT-5 has not been released as of the knowledge cutoff.
Suggested change:
-model="gpt-5-mini",
+model="gpt-4o-mini",
Co-authored-by: Copilot <[email protected]>
This pull request updates and expands the documentation for working with audio and file data types in Daft. The changes reorganize and improve the guides for handling audio files, clarify schema usage, and update references to file types for consistency. The most important changes are grouped below.
Audio Modality Documentation Improvements:
- Rewrote the `docs/modalities/audio.md` guide to focus on using `daft.AudioFile` and `daft.File`, including new sections on indexing, preprocessing, reading, writing, and transcribing audio files with Faster Whisper. Added practical code examples and clarified best practices for memory-efficient audio processing. [1] [2]
File Type Documentation Updates:
- Removed `docs/api/datatypes/daft_file_types.md` and replaced it with `docs/api/datatypes/file_types.md`, updating references and clarifying the role of the `File`, `AudioFile`, and `VideoFile` data types. [1] [2] [3]
General Documentation Organization:
- Updated `docs/SUMMARY.md` to add new sections for documents and embeddings, renamed and reordered "Files and URLs", and updated references to the new file types documentation for clarity and consistency. [1] [2]
Code Example Improvements: