| 1 | +--- |
| 2 | +title: Built-in LLM and Embedding UDFs |
| 3 | +sidebarTitle: Overview |
| 4 | +icon: sparkles |
| 5 | +--- |
| 6 | + |
| 7 | +Geneva ships pre-built UDFs for common LLM providers so you don't have to write custom classes |
| 8 | +for everyday embedding and generation tasks. |
| 9 | + |
| 10 | +| Provider | Embeddings | Generation | Runs locally | Install extra | |
| 11 | +|----------|:----------:|:----------:|:------------:|---------------| |
| 12 | +| [OpenAI](/geneva/udfs/providers/openai) | ✓ | ✓ | — | `geneva[udf-text-openai]` | |
| 13 | +| [Gemini](/geneva/udfs/providers/gemini) | ✓ | ✓ | — | `geneva[udf-text-gemini]` | |
| 14 | +| [Sentence Transformers](/geneva/udfs/providers/sentence-transformers) | ✓ | — | ✓ | `geneva[udf-text-sentence-transformers]` | |
| 15 | + |
| 16 | +OpenAI and Gemini UDFs make remote API calls that incur per-token costs. |
| 17 | +Sentence Transformers run locally on your workers with no API costs — see |
| 18 | +[GPU acceleration](/geneva/udfs/providers/sentence-transformers#gpu-acceleration) |
| 19 | +for performance tips. |
| 20 | + |
| 21 | +## Comparing models and prompts |
| 22 | + |
| 23 | +Because `add_columns` accepts a dictionary, you can evaluate multiple models, parameter |
| 24 | +settings, or prompts in a single pass over your data. Each entry produces its own column, |
| 25 | +so results sit side by side in the same table for easy comparison. |
| 26 | + |
| 27 | +```python |
| 28 | +from geneva.udfs import openai_udf, gemini_udf, openai_embedding_udf |
| 29 | + |
| 30 | +table.add_columns({ |
| 31 | + # Compare two embedding models |
| 32 | + "emb_small": openai_embedding_udf(column="body", model="text-embedding-3-small"), |
| 33 | + "emb_large": openai_embedding_udf(column="body", model="text-embedding-3-large"), |
| 34 | + |
| 35 | + # Compare the same task across providers |
| 36 | + "summary_openai": openai_udf( |
| 37 | + column="body", |
| 38 | + prompt="Summarize in one sentence", |
| 39 | + model="gpt-5-mini", |
| 40 | + ), |
| 41 | + "summary_gemini": gemini_udf( |
| 42 | + column="body", |
| 43 | + prompt="Summarize in one sentence", |
| 44 | + model="gemini-2.5-flash", |
| 45 | + ), |
| 46 | +}) |
| 47 | +``` |
| 48 | + |
| 49 | +This works for any combination — different models from the same provider, different providers, |
| 50 | +different prompts with the same model, or different dimensionality settings. All columns are |
| 51 | +computed in parallel during the same backfill job. |
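When comparing several prompts against one model, the dictionary can be built programmatically instead of written out by hand. The sketch below is illustrative only: `prompt_variants` and `fake_udf` are hypothetical helpers that are not part of Geneva, and the only assumption about the real factories is that they accept the `column`, `prompt`, and `model` keyword arguments shown in the example above.

```python
from typing import Any, Callable, Dict


def prompt_variants(
    make_udf: Callable[..., Any],
    prompts: Dict[str, str],
    column: str = "body",
    model: str = "gpt-5-mini",
) -> Dict[str, Any]:
    """Build one add_columns entry per prompt, all using the same model."""
    return {
        f"summary_{name}": make_udf(column=column, prompt=prompt, model=model)
        for name, prompt in prompts.items()
    }


# Hypothetical stand-in factory so the sketch runs without Geneva or an API key.
# It simply echoes back the keyword arguments it was given.
def fake_udf(**kwargs: Any) -> Dict[str, Any]:
    return kwargs


columns = prompt_variants(
    fake_udf,
    {
        "short": "Summarize in one sentence",
        "bullets": "Summarize as three bullet points",
    },
)
print(sorted(columns))  # ['summary_bullets', 'summary_short']
```

In real use you would presumably pass `openai_udf` (or `gemini_udf`) in place of `fake_udf` and hand the resulting dictionary to `table.add_columns`.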
| 52 | + |
| 53 | +To recompute columns later (e.g., after altering a UDF or adding new rows), use `backfill`: |
| 54 | + |
| 55 | +```python |
| 56 | +# Backfill a single column |
| 57 | +table.backfill("emb_small") |
| 58 | + |
| 59 | +# Backfill only rows missing a value |
| 60 | +table.backfill("summary_openai", where="summary_openai is null") |
| 61 | +``` |
| 62 | + |
| 63 | +## What's included |
| 64 | + |
| 65 | +All built-in UDFs share these capabilities: |
| 66 | + |
| 67 | +- **API key handling** — Keys are captured from your local environment at UDF creation time and serialized with the UDF. No cluster-level environment configuration required. |
| 68 | +- **Retry with backoff** — Transient API errors (rate limits, timeouts, server errors) are automatically retried with exponential backoff. |
| 69 | +- **Batch processing** — Embedding UDFs batch multiple rows per API call for better throughput. |
| 70 | +- **L2 normalization** — Embedding UDFs support optional L2 normalization via the `normalize` parameter (disabled by default since both providers return pre-normalized vectors). |
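Two of the capabilities above, retry with exponential backoff and L2 normalization, can be illustrated in plain Python. This is a minimal sketch of the general techniques, not Geneva's actual implementation: `TransientError`, `retry_with_backoff`, and `l2_normalize` are hypothetical names, and the attempt count and base delay are assumed values chosen for the example.

```python
import math
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a rate limit, timeout, or server error."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, doubling the wait after each transient failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; zero vectors are returned as-is."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Simulate a call that fails twice before succeeding.
calls = {"count": 0}


def flaky_call() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated rate limit")
    return "ok"


delays: List[float] = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

Normalizing a vector that already has unit length leaves it essentially unchanged, which is consistent with `normalize` being off by default when the provider returns pre-normalized vectors.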
| 71 | + |
| 72 | +## See also |
| 73 | + |
| 74 | +- [Working with UDFs](/geneva/udfs/index) — Write custom scalar, batched, and stateful UDFs |
| 75 | +- [Error handling](/geneva/udfs/error_handling) — Fine-grained retry and skip policies |
| 76 | +- [Working with blobs](/geneva/udfs/blobs) — Process binary data (images, audio, video) |