diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index a53b2bfea1..9c9fd2e2a4 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -14,8 +14,10 @@ * [Images](modalities/images.md) * [Audio](modalities/audio.md) * [Videos](modalities/videos.md) + * [Documents](modalities/documents.md) * [JSON and Nested Data](modalities/json.md) - * [URLs and Files](modalities/urls.md) + * [Files and URLs](modalities/files.md) + * [Embeddings](modalities/embeddings.md) * [Custom Modalities](modalities/custom.md) * Scale Custom Python Code * [New UDF Overview](custom-code/index.md) @@ -90,7 +92,7 @@ * [User-Defined Functions](api/udf.md) * Data Types * [DataType](api/datatypes/all_datatypes.md) - * [daft.File Types](api/datatypes/daft_file_types.md) + * [File Types](api/datatypes/file_types.md) * [Type Conversions](api/datatypes/type_conversions.md) * [Casting](api/datatypes/casting.md) * [Window](api/window.md) diff --git a/docs/api/datatypes/daft_file_types.md b/docs/api/datatypes/daft_file_types.md deleted file mode 100644 index 91a9121814..0000000000 --- a/docs/api/datatypes/daft_file_types.md +++ /dev/null @@ -1,13 +0,0 @@ -The daft.File DataType provides first-class support for handling file data across local and remote storage, enabling seamless file operations in distributed environments. (See the [daft.File Guide](../../modalities/files.md)) - -::: daft.file.File - options: - filters: ["!^_"] - -::: daft.file.AudioFile - options: - filters: ["!^_"] - -::: daft.file.VideoFile - options: - filters: ["!^_"] diff --git a/docs/api/datatypes/file_types.md b/docs/api/datatypes/file_types.md new file mode 100644 index 0000000000..7badb2149f --- /dev/null +++ b/docs/api/datatypes/file_types.md @@ -0,0 +1,13 @@ +The `File` DataType provides first-class support for handling file data across local and remote storage, enabling seamless file operations in distributed environments. + +::: daft.file.File + options: + filters: ["!^_"] + +::: daft.file.AudioFile + options: + filters: ["!^_"] + +::: daft.file.VideoFile + options: + filters: ["!^_"] diff --git a/docs/examples/document-processing.md b/docs/examples/document-processing.md index 70279228e4..101f06208d 100644 --- a/docs/examples/document-processing.md +++ b/docs/examples/document-processing.md @@ -803,9 +803,9 @@ print(df.schema()) #### Explaining Structure Access Expressions -Note that we're using [`.struct`](../api/expressions.md#daft.expressions.struct) to construct an expression that allows Daft to extract individual field values from our complex document structure. +Note that we're using `col("indexed_texts")["text"]` to construct an expression that allows Daft to extract individual field values from our complex document structure. -When write `col("text_blocks").struct.get("bounding_box")`, we're telling Daft that we want to access the `bounding_box` field of each element from the `text_blocks` column. From this, we can provide additional field-selecting logic (e.g. `["x"]` to get the value for field `x` on the `bounding_box` value from each structure in `text_blocks`). +When we write `col("text_blocks").struct.get("bounding_box")`, we're telling Daft that we want to access the `bounding_box` field of each element from the `text_blocks` column. From this, we can provide additional field-selecting logic (e.g. `["x"]` to get the value for field `x` on the `bounding_box` value from each structure in `text_blocks`). The last part of our text box processing step is to extract the text and bounding box coordinates into their own columns. 
We also want to preserve the reading order index as its own column too.
@@ -815,14 +815,14 @@ This format makes it easier to form follow up queries on our data, such as:

```python
df = (
-    df.with_column("text_blocks", col("indexed_texts").struct.get("text"))
-    .with_column("reading_order_index", col("indexed_texts").struct.get("index"))
+    df.with_column("text_blocks", col("indexed_texts")["text"])
+    .with_column("reading_order_index", col("indexed_texts")["index"])
    .exclude("indexed_texts")
-    .with_column("text", col("text_blocks").struct.get("text"))
-    .with_column("x", col("text_blocks").struct.get("bounding_box")["x"])
-    .with_column("y", col("text_blocks").struct.get("bounding_box")["y"])
-    .with_column("h", col("text_blocks").struct.get("bounding_box")["h"])
-    .with_column("w", col("text_blocks").struct.get("bounding_box")["w"])
+    .with_column("text", col("text_blocks")["text"])
+    .with_column("x", col("text_blocks")["bounding_box"]["x"])
+    .with_column("y", col("text_blocks")["bounding_box"]["y"])
+    .with_column("h", col("text_blocks")["bounding_box"]["h"])
+    .with_column("w", col("text_blocks")["bounding_box"]["w"])
    .exclude("text_blocks")
)
print(df.schema())
diff --git a/docs/modalities/audio.md b/docs/modalities/audio.md
index 6204b55906..cb973e5165 100644
--- a/docs/modalities/audio.md
+++ b/docs/modalities/audio.md
@@ -1,22 +1,73 @@
# Working with Audio

-Audio isn't just a collection of bytes or waveforms—it's speech, music, ambient sound with meaning you can extract, transcribe, and analyze. Daft is built to handle audio data at scale, making it easy to process recordings, transcribe speech, and transform audio in distributed pipelines.
+Daft supports working with audio natively via the new `daft.AudioFile` type and its parent type `daft.File`.

-This guide shows you how to accomplish common audio processing tasks with Daft:
+Audio data is usually stored in file formats such as MP3, WAV, or OGG. Because audio is a continuous waveform, a file's size grows in direct proportion to its length, and long recordings can consume large amounts of RAM if decoded all at once. A common practice for reducing memory overhead is to stream audio in buffered chunks. As with images, teams typically index audio files in tables rather than storing the raw audio bytes directly.

-- [Read and write audio files](#reading-and-writing-audio-files)
-- [Transcribe audio with Voice Activity Detection](#transcription-with-voice-activity-detection-plus-segment-and-word-timestamps)
-- [Extract segment and word-level timestamps](#transcription-with-voice-activity-detection-plus-segment-and-word-timestamps)
+!!! note "Contribute to `daft.AudioFile`"
+    If you'd like to contribute new features to `daft.AudioFile`, please open an issue on [GitHub](https://github.com/Eventual-Inc/Daft/issues) or join our [Daft Slack Community](https://join.slack.com/t/dist-data/shared_invite/zt-2e77olvxw-uyZcPPV1SRchhi8ah6ZCtg) and send us a message in #daft-dev. Daft team members will be happy to assign an issue to you and provide guidance if needed. There are also dedicated discussion threads for `daft.AudioFile` in the [Discussions](https://github.com/Eventual-Inc/Daft/discussions/categories/audio-file).
+
+In this guide, we'll cover how to use `daft.AudioFile` and `daft.File` for common audio use cases:
+
+1. [Indexing and preprocessing Audio Files](#indexing-and-preprocessing-audio-files) - Discover and extract metadata from audio files in remote storage.
+2. 
[Reading and writing Audio Files](#reading-and-writing-audio-files) - Build custom read/write pipelines with `daft.File`. +3. [Transcription with Faster Whisper](#transcription-with-faster-whisper) - Run a Whisper-based pipeline with timestamps. + +## Indexing and Preprocessing Audio Files + +The following example demonstrates how to use [`audio_file`](../api/functions/audio_file.md) to read an audio file and extract the metadata and resample it to 16000 Hz with the [`audio_metadata`](../api/functions/audio_metadata.md) and [`resample`](../api/functions/resample.md) functions. + +- The [`audio_file`](../api/functions/audio_file.md) function converts a string containing a file reference to a [`daft.AudioFile`](../api/datatypes/file_types.md) reference. +- The [`audio_metadata`](../api/functions/audio_metadata.md) function extracts the sample rate, channels, frames, format, and subtype. +- The [`resample`](../api/functions/resample.md) function resamples an audio file to a given sample rate and returns a tensor of floats. + +```python +import daft +from daft.functions import audio_file, audio_metadata, resample + +df = ( + daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/audio/*.mp3") + .with_column("file", audio_file(daft.col("path"))) + .with_column("metadata", audio_metadata(daft.col("file"))) + .with_column("resampled", resample(daft.col("file"), sample_rate=16000)) + .select("path", "file", "size", "metadata", "resampled") +) + +df.show(3) +``` + +``` {title="Output"} +╭────────────────────────────────┬────────────────────────────────┬─────────┬─────────────────────────────────────────────┬──────────────────────────╮ +│ path ┆ file ┆ size ┆ metadata ┆ resampled │ +│ --- ┆ --- ┆ --- ┆ --- ┆ --- │ +│ String ┆ File[Audio] ┆ Int64 ┆ Struct[sample_rate: Int64, channels: Int64, ┆ Tensor[Float64] │ +│ ┆ ┆ ┆ frames: Float64, format: String, subtype: ┆ │ +│ ┆ ┆ ┆ String] ┆ │ +╞════════════════════════════════╪════════════════════════════════╪═════════╪═════════════════════════════════════════════╪══════════════════════════╡ +│ hf://datasets/Eventual-Inc/sa… ┆ Audio(path: hf://datasets/Eve… ┆ 822924 ┆ {sample_rate: 16000, ┆ │ +│ ┆ ┆ ┆ channels… ┆ │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ hf://datasets/Eventual-Inc/sa… ┆ Audio(path: hf://datasets/Eve… ┆ 618408 ┆ {sample_rate: 16000, ┆ │ +│ ┆ ┆ ┆ channels… ┆ │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ hf://datasets/Eventual-Inc/sa… ┆ Audio(path: hf://datasets/Eve… ┆ 1190736 ┆ {sample_rate: 16000, ┆ │ +│ ┆ ┆ ┆ channels… ┆ │ +╰────────────────────────────────┴────────────────────────────────┴─────────┴─────────────────────────────────────────────┴──────────────────────────╯ + +(Showing first 3 rows) +``` ## Reading and Writing Audio Files -Audio files come in various formats and sample rates. With `daft.File`, you can read audio data into numpy arrays, resample to a target sampling rate, and write back to disk—all in parallel across your dataset. +The `daft.AudioFile` type is new, and is still in development. For custom audio use-cases like format conversion, file writing, or other audio-specific operations, use `daft.File` inside a `daft.func` or `daft.method` UDF. + +For example, the following code demonstrates how to read an audio file, resample it to 16000 Hz, and save it as an MP3 file. 
With `daft.File`, you can read audio data into numpy arrays, resample to a target sampling rate, and write back to disk. ```python # /// script # description = "Read audio, resample, and save as mp3" # requires-python = ">=3.10, <3.13" -# dependencies = ["daft", "soundfile", "numpy", "scipy"] +# dependencies = ["daft[audio]"] # /// import pathlib @@ -175,12 +226,10 @@ This example demonstrates several key patterns: The `soundfile` library supports many audio formats including WAV, FLAC, OGG, and MP3. You can easily adapt the code above to work with your preferred format by changing the `format` and `subtype` parameters in `sf.write()`. -## Transcription with Voice Activity Detection plus Segment and Word Timestamps +## Transcription with Faster Whisper Transcription is one of the most powerful use cases for audio processing in AI pipelines. Whether you're building voice assistants, generating subtitles, or analyzing customer calls, accurate transcription with timestamps is essential. -### How to transcribe audio files - This example shows how to transcribe audio files using [faster-whisper](https://github.com/SYSTRAN/faster-whisper) with Voice Activity Detection (VAD) to filter out silence, and extract both segment-level and word-level timestamps. ```python @@ -251,16 +300,16 @@ df_transcript.show(3, format="fancy", max_width=40) ╭────────────────────────────────────────┬─────────────────────────┬────────────────────────────────────────┬────┬───────┬────────┬────────┬────────────────────────────────────────┬────────────────────────────────────────┬───────────────────────┬────────────────────┬──────────────────────┬──────────────────┬─────────────╮ │ path ┆ info ┆ transcript ┆ id ┆ seek ┆ start ┆ end ┆ text ┆ tokens ┆ avg_logprob ┆ compression_ratio ┆ no_speech_prob ┆ words ┆ temperature │ ╞════════════════════════════════════════╪═════════════════════════╪════════════════════════════════════════╪════╪═══════╪════════╪════════╪════════════════════════════════════════╪════════════════════════════════════════╪═══════════════════════╪════════════════════╪══════════════════════╪══════════════════╪═════════════╡ -│ file:///Users/everettkleven/Desktop/N… ┆ {language: en, ┆ Okay, so I have a cluster running wi… ┆ 1 ┆ 0 ┆ 0 ┆ 29.46 ┆ Okay, so I have a cluster running wi… ┆ [1033, 11, 370, 286, 362, 257, 13630,… ┆ -0.06497006558853646 ┆ 1.619672131147541 ┆ 0.01067733857780695 ┆ [{start: 0, ┆ 0 │ +│ file:///Users/myusername007/Desktop/N… ┆ {language: en, ┆ Okay, so I have a cluster running wi… ┆ 1 ┆ 0 ┆ 0 ┆ 29.46 ┆ Okay, so I have a cluster running wi… ┆ [1033, 11, 370, 286, 362, 257, 13630,… ┆ -0.06497006558853646 ┆ 1.619672131147541 ┆ 0.01067733857780695 ┆ [{start: 0, ┆ 0 │ │ ┆ language_probability: … ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ end: 0.48, ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ word: Okay,, ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ file:///Users/everettkleven/Desktop/N… ┆ {language: en, ┆ Okay, so I have a cluster running wi… ┆ 2 ┆ 2940 ┆ 30.04 ┆ 56.41 ┆ Usually they want to do this to put … ┆ [11419, 436, 528, 281, 360, 341, 281,… ┆ -0.03574741696222471 ┆ 1.6338028169014085 ┆ 0.009307598695158958 ┆ [{start: 30.04, ┆ 0 │ +│ file:///Users/myusername007/Desktop/N… ┆ {language: en, ┆ Okay, so I have a 
cluster running wi… ┆ 2 ┆ 2940 ┆ 30.04 ┆ 56.41 ┆ Usually they want to do this to put … ┆ [11419, 436, 528, 281, 360, 341, 281,… ┆ -0.03574741696222471 ┆ 1.6338028169014085 ┆ 0.009307598695158958 ┆ [{start: 30.04, ┆ 0 │ │ ┆ language_probability: … ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ end: 30.48, ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ word: Us… ┆ │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ file:///Users/everettkleven/Desktop/N… ┆ {language: en, ┆ Okay, so I have a cluster running wi… ┆ 3 ┆ 5622 ┆ 56.83 ┆ 79.56 ┆ And go number two, I want this script ┆ [400, 352, 1230, 732, 11, 286, 528, 3… ┆ -0.07241094582492397 ┆ 1.5330396475770924 ┆ 0.005606058984994888 ┆ [{start: 56.83, ┆ 0 │ +│ file:///Users/myusername007/Desktop/N… ┆ {language: en, ┆ Okay, so I have a cluster running wi… ┆ 3 ┆ 5622 ┆ 56.83 ┆ 79.56 ┆ And go number two, I want this script ┆ [400, 352, 1230, 732, 11, 286, 528, 3… ┆ -0.07241094582492397 ┆ 1.5330396475770924 ┆ 0.005606058984994888 ┆ [{start: 56.83, ┆ 0 │ │ ┆ language_probability: … ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ end: 57.27, ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ word: An… ┆ │ ╰────────────────────────────────────────┴─────────────────────────┴────────────────────────────────────────┴────┴───────┴────────┴────────┴────────────────────────────────────────┴────────────────────────────────────────┴───────────────────────┴────────────────────┴──────────────────────┴──────────────────┴─────────────╯ @@ -279,7 +328,7 @@ The output shows rich transcription data including: VAD filtering automatically removes silent portions of audio before transcription, which improves accuracy and reduces processing time. Adjust the `min_silence_duration_ms` parameter to control how much silence is required before a segment break. -### Understanding the transcription schema +**Understanding the transcription schema** The transcription result is a structured object with nested data. Here's the schema definition using Daft's type system. You can save this as `transcription_schema.py` and import it in your scripts: @@ -383,8 +432,6 @@ TranscriptionResult = DataType.struct( Using strongly-typed schemas ensures that your data pipeline is robust and catches errors early. Daft's support for complex nested structures makes it easy to work with rich transcription data without flattening everything into primitive types. 
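 For example, a transcription UDF can declare this schema as its return type. Here is a minimal sketch, where `transcribe_with_whisper` stands in for your Faster Whisper transcription logic:
+
```python
import daft

from transcription_schema import TranscriptionResult


@daft.func(return_dtype=TranscriptionResult)
def transcribe_with_whisper(file: daft.File) -> dict:
    # Run faster-whisper on the file here and return a dict
    # that matches the TranscriptionResult struct.
    ...
```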
-## Working with transcription results
-
Once you have transcriptions with timestamps, you can:

- **Generate subtitles**: Use the word-level timestamps to create precise subtitle files (SRT, VTT)
@@ -392,10 +439,4 @@ Once you have transcriptions with timestamps, you can:
- **Analyze speech patterns**: Calculate speaking rates, pause durations, or word frequencies
- **Join with metadata**: Combine transcription data with speaker information, recording metadata, or other datasets

-## More examples
-
-For more advanced audio processing workflows, check out:
-
-- [Custom User-Defined Functions (UDFs)](../custom-code/udfs.md) for building your own audio processing functions
-- [Working with Files and URLs](urls.md) for discovering and downloading audio from remote storage
-- [Distributed Processing](../distributed/index.md) for scaling audio processing across multiple machines
+For a more advanced audio processing workflow that combines Faster Whisper with text embeddings, check out [Voice AI Analytics with Faster Whisper and embed_text](../examples/voice-ai-analytics.md).
diff --git a/docs/modalities/documents.md b/docs/modalities/documents.md
new file mode 100644
index 0000000000..ae8658cd33
--- /dev/null
+++ b/docs/modalities/documents.md
@@ -0,0 +1,282 @@
+# Working with Documents
+
+Documents come in many formats, from PDF to Markdown, HTML, and CSV. The `daft.File` type gives you a single interface for reading and processing all of them in a distributed manner.
+
+## Prompting LLMs with Text Documents as Context
+
+The `prompt` function supports multiple file input methods depending on the provider and file type:
+
+- **PDF files**: Passed directly as file inputs (native OpenAI support)
+- **Text files** (Markdown, HTML, CSV, etc.): Content is automatically extracted and injected into prompts
+- **Images**: Supported via `daft.DataType.Image()` or file paths
+
+For example, the following code uses `prompt` with a `daft.File` attachment to read a PDF paper and search the web for closely related papers.
+
```python
import daft
from daft.functions import prompt, file
from pydantic import BaseModel, Field


class Citation(BaseModel):
    url: str = Field(description="The URL of the source")
    title: str = Field(description="The title of the source")
    snippet: str = Field(description="A snippet of the source text")


class SearchResults(BaseModel):
    summary: str = Field(description="A summary of the search results")
    citations: list[Citation] = Field(description="A list of citations")


df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/papers/*.pdf").limit(1)
    .with_column(
        "results",
        prompt(
            messages=[
                daft.lit("Find 5 closely related papers to the one attached"),
                file(daft.col("path")),
            ],
            model="gpt-4-turbo",
            tools=[{"type": "web_search"}],
            return_format=SearchResults,
            provider="openai",
            unnest=True,
        ),
    )
)
results = df.to_pydict()
print(results)
```

``` {title="Output"}
{
    'path': ['hf://datasets/Eventual-Inc/sample-files/papers/2102.04074v1.pdf'],
    'summary': [
        'Here are 5 closely related papers on scaling laws and learning-curve theory that complement Hutter (2021):

        - Deep Learning Scaling is Predictable, Empirically (Hestness et al., 2017). Early large-scale empirical study showing power-law error decreases with data/model/compute across multiple domains—motivating theory like Hutter's. 
([arxiv.org](https://arxiv.org/abs/1712.00409?utm_source=openai))
        - Scaling Laws for Neural Language Models (Kaplan et al., 2020). Establishes power-law scaling of loss with parameters, dataset size, and compute; provides simple formulas for compute-optimal tradeoffs. ([arxiv.org](https://arxiv.org/abs/2001.08361?utm_source=openai))
        - Scaling Laws for Autoregressive Generative Modeling (Henighan et al., 2020). Extends empirical scaling laws beyond text to images, video, and multimodal settings, reinforcing near-universality of power-law behavior. ([arxiv.org](https://arxiv.org/abs/2010.14701?utm_source=openai))
        - Explaining Neural Scaling Laws (Bahri, Dyer, Kaplan, Lee, Sharma, 2021). Provides a theoretical framework (variance‑limited vs. resolution‑limited regimes) that explains when and why power-law scaling with data/model size emerges—conceptually close to Hutter's theory focus. ([arxiv.org](https://arxiv.org/abs/2102.06701?utm_source=openai))
        - Scaling Laws from the Data Manifold Dimension (Sharma & Kaplan, JMLR 2022). Theoretical account linking scaling exponents to intrinsic data‑manifold dimension; offers explicit predictions for exponents observed empirically. ([jmlr.org](https://jmlr.org/papers/v23/20-1111.html?utm_source=openai))'
    ],
    'citations': [
        [
            {
                'url': 'https://arxiv.org/abs/1712.00409',
                'title': 'Deep Learning Scaling is Predictable, Empirically',
                'snippet': 'Empirical study showing power-law generalization error scaling across data, model, and compute.'
            }, {
                'url': 'https://arxiv.org/abs/2001.08361',
                'title': 'Scaling Laws for Neural Language Models',
                'snippet': 'Power-law scaling of cross-entropy loss with parameters, data, and compute; compute-optimal tradeoffs.'
            }, {
                'url': 'https://openai.com/research/scaling-laws-for-neural-language-models',
                'title': 'Scaling laws for neural language models | OpenAI',
                'snippet': 'Project page summarizing results and implications.'
            }, {
                'url': 'https://arxiv.org/abs/2010.14701',
                'title': 'Scaling Laws for Autoregressive Generative Modeling',
                'snippet': 'Empirical scaling across images, video, multimodal, and math domains.'
            }
        ]
    ]
}
```

+## Extracting Text and Page Images from PDFs
+
+The following example uses `daft.File` with PyMuPDF to read a PDF and extract the text and a rendered image of each page.
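 Because PyMuPDF reads from a local filesystem path, the example materializes each remote file with `file.to_tempfile()`, pulls text with `page.get_text("text")`, and renders each page to PNG bytes with `page.get_pixmap().tobytes()`. You'll need the `pymupdf` package installed to run it.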
+ +```python +import daft +import pymupdf + +@daft.func( + return_dtype=daft.DataType.list( + daft.DataType.struct( + { + "page_number": daft.DataType.uint8(), + "page_text": daft.DataType.string(), + "page_image_bytes": daft.DataType.binary(), + } + ) + ) +) +def extract_pdf(file: daft.File): + """Extracts the content of a PDF file.""" + pymupdf.TOOLS.mupdf_display_errors(False) # Suppress non-fatal MuPDF warnings + content = [] + with file.to_tempfile() as tmp: + doc = pymupdf.Document(filename=str(tmp.name), filetype="pdf") + for pno, page in enumerate(doc): + row = { + "page_number": pno, + "page_text": page.get_text("text"), + "page_image_bytes": page.get_pixmap().tobytes(), + } + content.append(row) + return content + +if __name__ == "__main__": + # Discover and download pdfs + df = ( + daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/papers/*.pdf") + .with_column("pdf_file", daft.functions.file(daft.col("path"))) + .with_column("pages", extract_pdf(daft.col("pdf_file"))) + .explode("pages") + .select(daft.col("path"), daft.functions.unnest(daft.col("pages"))) + ) + df.show(3) +``` + +``` {title="Output"} +╭────────────────────────────────┬─────────────┬────────────────────────────────┬────────────────────────────────╮ +│ path ┆ page_number ┆ page_text ┆ page_image_bytes │ +│ --- ┆ --- ┆ --- ┆ --- │ +│ String ┆ UInt8 ┆ String ┆ Binary │ +╞════════════════════════════════╪═════════════╪════════════════════════════════╪════════════════════════════════╡ +│ hf://datasets/Eventual-Inc/sa… ┆ 0 ┆ Learning Curve Theory ┆ b"\x89PNG\r\n\x1a\n\x00\x00\x… │ +│ ┆ ┆ Marcus … ┆ │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ hf://datasets/Eventual-Inc/sa… ┆ 1 ┆ 1 ┆ b"\x89PNG\r\n\x1a\n\x00\x00\x… │ +│ ┆ ┆ Introduction ┆ │ +│ ┆ ┆ Power laws in … ┆ │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ hf://datasets/Eventual-Inc/sa… ┆ 2 ┆ Theory: Scaling with data si… ┆ b"\x89PNG\r\n\x1a\n\x00\x00\x… │ +╰────────────────────────────────┴─────────────┴────────────────────────────────┴────────────────────────────────╯ + +(Showing first 3 rows) +``` + +## Extracting Structure and Content from Markdown + +The following example demonstrates how to use `daft.File` to extract the structure and content from a Markdown file. 
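 The sketch is regex-based: it takes the first `#` heading as the title, records every heading with its level and line number, captures fenced code blocks with their language tags, and collects inline `[text](url)` links.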
+ +```python +import daft +from daft import DataType +import re + + +@daft.func( + return_dtype=DataType.struct( + { + "title": DataType.string(), + "content": DataType.string(), + "headings": DataType.list( + DataType.struct( + { + "level": DataType.int64(), + "text": DataType.string(), + "line": DataType.int64(), + } + ) + ), + "code_blocks": DataType.list( + DataType.struct( + { + "language": DataType.string(), + "code": DataType.string(), + } + ) + ), + "links": DataType.list( + DataType.struct( + { + "text": DataType.string(), + "url": DataType.string(), + } + ) + ), + } + ) +) +def extract_markdown(file: daft.File): + """Extract structure and content from a Markdown file.""" + with file.open() as f: + content = f.read().decode("utf-8") + + # Extract title (first h1 heading) + title_match = re.search(r"^#\s+(.+)$", content, re.MULTILINE) + title = title_match.group(1).strip() if title_match else None + + # Extract all headings with their levels and line numbers + headings = [] + for i, line in enumerate(content.split("\n"), start=1): + heading_match = re.match(r"^(#{1,6})\s+(.+)$", line) + if heading_match: + headings.append({ + "level": len(heading_match.group(1)), + "text": heading_match.group(2).strip(), + "line": i, + }) + + # Extract code blocks with language + code_blocks = [] + code_pattern = re.compile(r"```(\w*)\n(.*?)```", re.DOTALL) + for match in code_pattern.finditer(content): + code_blocks.append({ + "language": match.group(1) or "text", + "code": match.group(2).strip(), + }) + + # Extract links [text](url) + links = [] + link_pattern = re.compile(r"\[([^\]]+)\]\(([^)]+)\)") + for match in link_pattern.finditer(content): + links.append({ + "text": match.group(1), + "url": match.group(2), + }) + + return { + "title": title, + "content": content, + "headings": headings, + "code_blocks": code_blocks, + "links": links, + } + + +if __name__ == "__main__": + from daft import col + from daft.functions import file, unnest + + # Discover Markdown files + df = ( + daft.from_glob_path("~/git/Daft/**/*.md") + .with_column("file", file(col("path"))) + .with_column("markdown", extract_markdown(col("file"))) + .select(col("path"), unnest(col("markdown"))) + ) + + df.show(3) +``` + +``` {title="Output"} +╭─────────────────────────┬─────────────────────────┬────────────────────────┬────────────────────────┬────────────────────────┬────────────────────────╮ +│ path ┆ title ┆ content ┆ headings ┆ code_blocks ┆ links │ +│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ +│ String ┆ String ┆ String ┆ List[Struct[level: ┆ List[Struct[language: ┆ List[Struct[text: │ +│ ┆ ┆ ┆ Int64, text: String, ┆ String, code: String]] ┆ String, url: String]] │ +│ ┆ ┆ ┆ line: Int64]] ┆ ┆ │ +╞═════════════════════════╪═════════════════════════╪════════════════════════╪════════════════════════╪════════════════════════╪════════════════════════╡ +│ file:///Users/everettkl ┆ Contributor Covenant ┆ # Contributor Covenant ┆ [{level: 1, ┆ [] ┆ [{text: Mozilla's code │ +│ even/g… ┆ Code of … ┆ Code o… ┆ text: Contributor… ┆ ┆ of con… │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ file:///Users/everettkl ┆ Contributing to Daft ┆ # Contributing to Daft ┆ [{level: 1, ┆ [] ┆ [{text: Report it │ +│ even/g… ┆ ┆ ┆ text: Contrib… ┆ ┆ here, │ +│ ┆ ┆ Daft … ┆ ┆ ┆ url: … │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 
file:///Users/everettkl ┆ Resources ┆ # Resources ┆ [{level: 1, ┆ [] ┆ [{text: Testing │ +│ even/g… ┆ ┆ ┆ text: Resources, ┆ ┆ Details, │ +│ ┆ ┆ - https://docs.d… ┆ … ┆ ┆ url:… │ +╰─────────────────────────┴─────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────╯ + +(Showing first 3 rows) +``` diff --git a/docs/modalities/embeddings.md b/docs/modalities/embeddings.md new file mode 100644 index 0000000000..ce274edd60 --- /dev/null +++ b/docs/modalities/embeddings.md @@ -0,0 +1,158 @@ +# Working with Embeddings + +Embeddings transform text, images, and other data into dense vector representations that capture semantic meaning—enabling similarity search, retrieval-augmented generation (RAG), and AI-powered discovery. Daft makes it easy to generate, store, and query embeddings at scale. + +With the native [`daft.DataType.embedding`](../api/datatypes/embedding.md) type and [`embed_text`](../api/functions/embed_text.md) function, you can: + +- **Generate embeddings** from any text column using providers like OpenAI, Cohere, or local models +- **Compute similarity** with built-in distance functions like `cosine_distance` +- **Build search pipelines** that scale from local development to distributed clusters +- **Write to vector databases** like Turbopuffer, Pinecone, or LanceDB + +## Semantic Search Example + +The following example creates a simple semantic search pipeline—embedding documents, comparing them to a query, and ranking by similarity: + +```python +import daft +from daft.functions import embed_text, cosine_distance + +# Create a knowledge base with documents +documents = daft.from_pydict( + { + "text": [ + "Python is a high-level programming language", + "Machine learning models require training data", + "Daft is a distributed dataframe library", + "Embeddings capture semantic meaning of text", + ], + } +) + +# Embed all documents +documents = documents.with_column( + "embedding", + embed_text( + daft.col("text"), + provider="openai", + model="text-embedding-3-small", + ), +) + +# Create a query +query = daft.from_pydict({"query_text": ["What is Daft?"]}) + +# Embed the query +query = query.with_column( + "query_embedding", + embed_text( + daft.col("query_text"), + provider="openai", + model="text-embedding-3-small", + ), +) + +# Cross join to compare query against all documents +results = query.join(documents, how="cross") + +# Calculate cosine distance (lower is more similar) +results = results.with_column( + "distance", cosine_distance(daft.col("query_embedding"), daft.col("embedding")) +) + +# Sort by distance and show top results +results = results.sort("distance").select("query_text", "text", "distance", "embedding") +results.show() +``` + +```{title="Output"} +╭───────────────┬────────────────────────────────┬────────────────────┬──────────────────────────╮ +│ query_text ┆ text ┆ distance ┆ embedding │ +│ --- ┆ --- ┆ --- ┆ --- │ +│ String ┆ String ┆ Float64 ┆ Embedding[Float32; 1536] │ +╞═══════════════╪════════════════════════════════╪════════════════════╪══════════════════════════╡ +│ What is Daft? ┆ Daft is a distributed datafra… ┆ 0.3621492191359764 ┆ ▄▇▆▅▄▄█▆▄▄▃▂▄▃▃▃▁▄▃▃▄▄▃▂ │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ What is Daft? ┆ Python is a high-level progra… ┆ 0.9163975397319742 ┆ ▇▆▅▇▅▆█▇▃▄▆▄▄▁▅▄▅▃▁▃▃▂▅▃ │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ What is Daft? 
┆ Embeddings capture semantic m… ┆ 0.9374004015203741 ┆ ▄█▅▄▅▅▅▇▄▃▂▁▃▄▄▁▃▃▂▂▂▂▁▃ │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ What is Daft? ┆ Machine learning models requi… ┆ 0.9696998373223874 ┆ ▇▇▆▃▄▆▅█▆▂▄▃▄▄▂▄▂▁▂▂▁▃▂▁ │ +╰───────────────┴────────────────────────────────┴────────────────────┴──────────────────────────╯ + +(Showing first 4 of 4 rows) +``` + +## Building a Document Search Pipeline + +For production use cases, you'll typically combine embeddings with LLM-powered metadata extraction and write the results to a vector database. + +This example shows an end-to-end pipeline that: + +1. Loads PDF documents from cloud storage +2. Extracts structured metadata using an LLM +3. Generates vector embeddings from the abstracts +4. Writes everything to Turbopuffer for semantic search + +```python +# /// script +# description = "This example shows how using LLMs and embedding models, Daft chunks documents, extracts metadata, generates vectors, and writes them to any vector database..." +# dependencies = ["daft[openai, turbopuffer]", "pymupdf"] +# /// +import os +import daft +from daft import col, lit +from daft.functions import embed_text, prompt, file, unnest, monotonically_increasing_id +from pydantic import BaseModel + +class Classifier(BaseModel): + title: str + author: str + year: int + keywords: list[str] + abstract: str + +daft.set_execution_config(enable_dynamic_batching=True) +daft.set_provider("openai", api_key=os.environ.get("OPENAI_API_KEY")) + +# Load documents and generate vector embeddings +df = ( + daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/papers/*.pdf").limit(10) + .with_column( + "metadata", + prompt( + messages=file(col("path")), + system_message="Read the paper and extract the classifier metadata.", + return_format=Classifier, + model="gpt-5-mini", + ) + ) + .with_column( + "abstract_embedding", + embed_text( + daft.col("metadata")["abstract"], + model="text-embedding-3-large" + ) + ) + .with_column("id", monotonically_increasing_id()) + .select("id", "path", unnest(col("metadata")), "abstract_embedding") +) + +# Write to Turbopuffer +df.write_turbopuffer( + namespace="ai_papers", + api_key=os.environ.get("TURBOPUFFER_API_KEY"), + distance_metric="cosine_distance", + region='us-west-2', + schema={ + "id": "int64", + "path": "string", + "title": "string", + "author": "string", + "year": "int", + "keywords": "list[string]", + "abstract": "string", + "abstract_embedding": "vector", + } +) +``` diff --git a/docs/modalities/files.md b/docs/modalities/files.md new file mode 100644 index 0000000000..776de2bb8f --- /dev/null +++ b/docs/modalities/files.md @@ -0,0 +1,252 @@ +# Working with Files and URLs in Daft + +Daft provides powerful capabilities for working with URLs, file paths, and remote storage systems. + +Whether you're loading data from local files, cloud storage, or the web, Daft's URL and file handling makes it seamless to work with distributed data sources. 
Daft supports working with: + +- **Local file paths**: `file:///path/to/file`, `/path/to/file` +- **S3**: `s3://bucket/path`, `s3a://bucket/path`, `s3n://bucket/path` +- **GCS**: `gs://bucket/path` +- **Azure**: `az://container/path`, `abfs://container/path`, `abfss://container/path` +- **HTTP/HTTPS URLs**: `http://example.com/path`, `https://example.com/path` +- **Hugging Face datasets**: `hf://dataset/name` +- **Unity Catalog volumes**: `vol+dbfs:/Volumes/unity/path` + +## Using file discovery with optimized distributed reads + +[`daft.from_glob_path`](../api/io/file_path.md) helps discover and size files, accepting wildcards and lists of paths. When paired with [`daft.functions.download`](../api/functions/download.md), the two functions enable optimized distributed reads of binary data from storage. This is ideal when your data will fit into memory or when you need the entire file content at once. + +=== "🐍 Python" + ``` python + df = daft.from_pydict({ + "urls": [ + "https://www.google.com", + "s3://daft-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg", + ], + }) + df = df.with_column("data", df["urls"].download()) + df.collect() + ``` + +=== "⚙️ SQL" + ```python + df = daft.from_pydict({ + "urls": [ + "https://www.google.com", + "s3://daft-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg", + ], + }) + df = daft.sql(""" + SELECT + urls, + url_download(urls) AS data + FROM df + """) + df.collect() + ``` + +``` {title="Output"} + +╭────────────────────────────────┬────────────────────────────────╮ +│ urls ┆ data │ +│ --- ┆ --- │ +│ Utf8 ┆ Binary │ +╞════════════════════════════════╪════════════════════════════════╡ +│ https://www.google.com ┆ b" str: + # Read just the first 12 bytes to identify file type + with file.open() as f: + header = f.read(12) + + # Common file signatures (magic numbers) + if header.startswith(b"\xff\xd8\xff"): + return "JPEG" + elif header.startswith(b"\x89PNG\r\n\x1a\n"): + return "PNG" + elif header.startswith(b"GIF87a") or header.startswith(b"GIF89a"): + return "GIF" + elif header.startswith(b" {ast.unparse(node.returns)}" + + results.append({ + "name": node.name, + "signature": signature, + "docstring": ast.get_docstring(node), + "start_line": node.lineno, + "end_line": node.end_lineno, + }) + + return results + + +if __name__ == "__main__": + from daft.functions import file, unnest + + # Discover Python files + df = ( + daft.from_glob_path("~/git/Daft/daft/functions/**/*.py") # Add your own path here + .with_column("file", file(daft.col("path"))) + .with_column("functions", extract_functions(daft.col("file"))) + .explode("functions") + .select(daft.col("path"), unnest(daft.col("functions"))) + ) + + df.show(3) # Show the first 3 rows of the dataframe +``` + +``` {title="Output"} +╭────────────────────────────────┬─────────────────────────────┬────────────────────────────────┬────────────────────────────────┬────────────╮ +│ path ┆ name ┆ signature ┆ docstring ┆ start_line │ +│ --- ┆ --- ┆ --- ┆ --- ┆ --- │ +│ String ┆ String ┆ String ┆ String ┆ Int64 │ +╞════════════════════════════════╪═════════════════════════════╪════════════════════════════════╪════════════════════════════════╪════════════╡ +│ file:///Users/myusername007/g… ┆ monotonically_increasing_id ┆ def monotonically_increasing_… ┆ Generates a column of monoton… ┆ 14 │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ file:///Users/myusername007/g… ┆ eq_null_safe ┆ def 
eq_null_safe(left: Expre… ┆ Performs a null-safe equality… ┆ 52 │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ file:///Users/myusername007/g… ┆ cast ┆ def cast(expr: Expression, dt… ┆ Casts an expression to the gi… ┆ 68 │ +╰────────────────────────────────┴─────────────────────────────┴────────────────────────────────┴────────────────────────────────┴────────────╯ + +(Showing first 3 rows) +``` diff --git a/docs/modalities/images.md b/docs/modalities/images.md index a80da73dd1..461b40073d 100644 --- a/docs/modalities/images.md +++ b/docs/modalities/images.md @@ -1,6 +1,5 @@ # Working with Images - Daft is built to work comfortably with images. This guide shows you how to accomplish common image processing tasks with Daft: - [Downloading and decoding images](#quickstart) @@ -409,6 +408,7 @@ Now you're ready to call this function on the `urls` column and store the output !!! note "Note" Execute in notebook to see properly rendered images. + ### Zero Shot Classification For zero shot classification, you can use our built in `classify_image` function to classify images diff --git a/docs/modalities/overview.md b/docs/modalities/overview.md index 0929b099ff..978a38a4dd 100644 --- a/docs/modalities/overview.md +++ b/docs/modalities/overview.md @@ -1,20 +1,46 @@ -# Daft is built to work with **any modality**. +# Modalities Overview -In modern AI systems, data isn’t just numbers in tables anymore - it shows up as text, images, audio, PDFs, embeddings, and beyond. Handling diverse modalities unlocks more value from more sources. And by using a single engine, you can process them all—seamlessly and efficiently—in a single pipeline. Easy to develop, and even easier to run at scale. +**Daft is designed to work with any modality.** -Some of Daft's supported modalities include: +Artificial Intelligence now natively understands text, images, audio, video, and documents, but legacy engines were never designed to feed these formats to large models. Daft closes that gap, giving you one distributed engine that processes any modality, respects memory limits, and keeps GPUs fed so you can build the pipeline once and scale it anywhere. -- **[Text](text.md)**: Summarize, Embed, -- **[Images](images.md)**: Work with visual data and image processing. -- **[Audio](audio.md)**: Transcribe audio files speech with ease -- **[Videos](videos.md)**: Working with videos. -- **[PDFs](../examples/document-processing.md)**: Extract text and image data from PDF documents. -- **[JSON and Nested Data](json.md)**: Parse, query, and manipulate semi-structured and hierarchical data. -- **[Paths, URLs, & Files](urls.md)**: Discover, download, and read files from URLs and paths from remote resources. -- **Embeddings** (User Guide Coming Soon): Generate vector representations for similarity search and machine learning. -- **Tensors and Sparse Tensors** (User Guide Coming Soon): Multi-dimensional numerical data for deep learning workflows. +
-## **[Custom Modalities](custom.md)** +- 🔠 [**Text**](text.md) -The most important modality might be one we haven’t explored yet. Daft makes it easy to define your own modality with [custom connectors](../connectors/custom.md) to read and write any kind of data, and use [custom Python code](../custom-code/index.md) to process it efficiently and reliably, even at scale. + Normalize, chunk, dedupe, prompt, and embed text data. + +- 🌄 [**Images**](images.md) + + Work with visual data and image processing. + +- 🔉 [**Audio**](audio.md) + + Read, extract metadata, resample audio files. + +- 🎥 [**Video**](videos.md) + + Working with video files and metadata. + +- 📄 [**Documents**](documents.md) + + Extract text and image data from PDF documents. + +- {} [**JSON and Nested Data**](json.md) + + Parse, query, and manipulate semi-structured and hierarchical data. + +- ⊹ [**Embeddings**](embeddings.md) + + Generate vector representations for RAG and AI search. + +- 📁 [**Generic Files and URLs**](files.md) + + Take advantage of Daft's built-in URL functions and `daft.File` types + +
+ +### **[Custom Modalities](custom.md)** + +The most important modality might be one we haven't explored yet. Daft makes it easy to define your own modality with [custom connectors](../connectors/custom.md) to read and write any kind of data, and use [user-defined functions](../custom-code/index.md) to process custom Python code efficiently and reliably at scale. diff --git a/docs/modalities/urls.md b/docs/modalities/urls.md deleted file mode 100644 index dd2b3d3f65..0000000000 --- a/docs/modalities/urls.md +++ /dev/null @@ -1,129 +0,0 @@ -# Working with URLs and Files - -Daft provides powerful capabilities for working with URLs, file paths, and remote resources. Whether you're loading data from local files, cloud storage, or web URLs, Daft's URL and file handling makes it seamless to work with distributed data sources. - -Daft supports working with: - -- **Local file paths**: `file:///path/to/file`, `/path/to/file` -- **S3**: `s3://bucket/path`, `s3a://bucket/path`, `s3n://bucket/path` -- **GCS**: `gs://bucket/path` -- **Azure**: `az://container/path`, `abfs://container/path`, `abfss://container/path` -- **HTTP/HTTPS URLs**: `http://example.com/path`, `https://example.com/path` -- **Hugging Face datasets**: `hf://dataset/name` -- **Unity Catalog volumes**: `vol+dbfs:/Volumes/unity/path` - -## Two Ways to Work with Files in Daft - -### 1. URL Functions - -URL functions are ideal when your data will fit into memory or when you need the entire file content at once. Daft provides methods for working with URL strings. For example, to download data from URLs: - -=== "🐍 Python" - ``` python - df = daft.from_pydict({ - "urls": [ - "https://www.google.com", - "s3://daft-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg", - ], - }) - df = df.with_column("data", df["urls"].download()) - df.collect() - ``` - -=== "⚙️ SQL" - ```python - df = daft.from_pydict({ - "urls": [ - "https://www.google.com", - "s3://daft-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg", - ], - }) - df = daft.sql(""" - SELECT - urls, - url_download(urls) AS data - FROM df - """) - df.collect() - ``` - -``` {title="Output"} - -╭────────────────────────────────┬────────────────────────────────╮ -│ urls ┆ data │ -│ --- ┆ --- │ -│ Utf8 ┆ Binary │ -╞════════════════════════════════╪════════════════════════════════╡ -│ https://www.google.com ┆ b" str: - # Read just the first 12 bytes to identify file type - with file.open() as f: - header = f.read(12) - - # Common file signatures (magic numbers) - if header.startswith(b"\xff\xd8\xff"): - return "JPEG" - elif header.startswith(b"\x89PNG\r\n\x1a\n"): - return "PNG" - elif header.startswith(b"GIF87a") or header.startswith(b"GIF89a"): - return "GIF" - elif header.startswith(b" +`daft.VideoFile` is a subclass of `daft.File` that provides a specialized interface for video-specific operations. -### Reading Video Frames +- [daft.read_video_frames](../api/functions/read_video_frames.md) for reading video frames into a DataFrame +- [daft.VideoFile](../api/datatypes/file_types.md) for working with video files + - [daft.functions.video_file](../api/functions/video_file.md) for working with video files + - [daft.functions.video_metadata](../api/functions/video_metadata.md) for working with video metadata + - [daft.functions.video_keyframes](../api/functions/video_keyframes.md) for working with video keyframes -This example shows reading a video's frames into a DataFrame. 
+### Reading Video Frames with `daft.read_video_frames` + +This example shows reading a video's frames into a DataFrame using the `daft.read_video_frames` function. === "🐍 Python" @@ -101,3 +109,21 @@ This example shows reading the key frames of a youtube video, you can also pass (Showing first 8 rows) ``` + +### Working with daft.VideoFile +The following example demonstrates how to use `daft.VideoFile` to read a video file and extract metadata. + +```python +import daft +from daft.functions import video_file, video_metadata, video_keyframes + +df = ( + daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/videos/*.mp4") + .with_column("file", video_file(daft.col("path"))) + .with_column("metadata", video_metadata(daft.col("file"))) + .with_column("keyframes", video_keyframes(daft.col("file"))) + .select("path", "file", "size", "metadata", "keyframes") +) + +df.show(3) +``` diff --git a/docs/use-case/batch-inference.md b/docs/use-case/batch-inference.md index 0926c61d6a..f23a6fae92 100644 --- a/docs/use-case/batch-inference.md +++ b/docs/use-case/batch-inference.md @@ -5,7 +5,7 @@ Run prompts, embeddings, and model scoring over large datasets, then stream the ## When to use Daft for batch inference - **You need to run models over your data:** Express inference on a column (e.g., [`prompt`](#example-prompt-gpt-5-with-openai), [`embed_text`](../ai-functions/embed.md#text-embeddings), [`embed_image`](../ai-functions/embed.md#image-embeddings)) and let Daft handle batching, concurrency, and backpressure. -- **You have data consisting of large objects in cloud storage:** Daft has [record-setting](https://www.daft.ai/blog/announcing-daft-02) performance when reading from and writing to S3, and provides flexible APIs for working with [URLs and Files](../modalities/urls.md). +- **You have data consisting of large objects in cloud storage:** Daft has [record-setting](https://www.daft.ai/blog/announcing-daft-02) performance when reading from and writing to S3, and provides flexible APIs for working with [URLs and Files](../modalities/files.md). - **You're working with multimodal data:** Daft supports datatypes like [images](../modalities/images.md) and [videos](../modalities/videos.md), and supports the ability to define [custom data sources and sinks](../connectors/custom.md) and [custom functions over this data](../custom-code/udfs.md). - **You want end-to-end pipelines where data sizes expand and shrink:** For example, downloading images from URLs, decoding them, then embedding them; [Daft streams across stages to keep memory well-behaved](https://www.daft.ai/blog/processing-300k-images-without-oom).
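
+A minimal sketch of that expand-then-shrink pattern using the functions mentioned above (the embedding model and `transformers` provider below are illustrative assumptions; see the linked guides for supported providers and options):
+
```python
import daft
from daft.functions import embed_image

df = daft.from_pydict({
    "urls": ["s3://daft-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg"],
})

# Download raw bytes (rows expand), decode into images, then embed (rows shrink)
df = df.with_column("data", df["urls"].download())
df = df.with_column("image", df["data"].image.decode())
df = df.with_column(
    "embedding",
    # Illustrative model/provider choice; adjust to your environment
    embed_image(df["image"], provider="transformers", model="openai/clip-vit-base-patch32"),
)
df.collect()  # Daft streams between stages, so peak memory stays bounded
```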