feat: external media references — store large media by typed payload URI (#115) #121
Merged
dcfocus merged 2 commits on Jun 28, 2026
…URI (lance-format#115)

Add a typed external-reference payload so large media (images/audio/video) can live as objects in the configured object store and be referenced from a ContextRecord by `payload_uri` (plus optional `payload_size`/`payload_checksum`), instead of inlining bytes in `binary_payload`. `list`/`search` return the reference without materializing the bytes; an opt-in fetch resolves them on demand using the context's `storage_options`.

- core: add `payload_uri`/`payload_size`/`payload_checksum` to `ContextRecord` and `RecordPatch`; new backward-compatible schema columns gated by `include_external_reference` (old datasets read them as `None`); `ContextStore::{fetch_payload,put_payload}` resolve/offload bytes through the context's object store (works for gs://, s3://, local).
- api/server/client/python: thread the fields through the DTOs; add `GET /contexts/{name}/records/{id}/payload`, client `fetch_record_payload`, and Python `Context.fetch_payload`/`put_payload`.
- tests: Rust roundtrip + missing-record/missing-reference cases, server route test, api serde round-trip, and a Python end-to-end suite.

Also realign the shared add/upsert/update test doubles in test_search.py/test_embeddings.py with the current binding signature (they were already out of sync with tenant/source/run_id/created_at) and extend them for the new payload params.

Signed-URL resolution is deferred (TODO left in `fetch_payload`).

Closes lance-format#115

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`add_many` and single `upsert` forwarded `payload_uri`/`payload_size`/`payload_checksum`, but the bulk `upsert_many` path dropped them from the normalized record dict, so external media references were silently lost on batch insert-or-replace. Include the three fields (mirroring `add_many`) and cover it with a test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
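The failure mode in the second commit (a bulk path that rebuilds each record dict and forgets the new keys) can be illustrated with a small sketch. Everything here is hypothetical: `normalize_record`, the toy `upsert_many`, and the `PAYLOAD_FIELDS` tuple are illustrative names, not the binding's actual code. One way to make such a regression harder to reintroduce is to copy the reference fields from a single shared tuple.

```python
from typing import Any, Dict, List

# The three external-reference fields named in the commit message.
PAYLOAD_FIELDS = ("payload_uri", "payload_size", "payload_checksum")


def normalize_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical normalizer shared by the single and bulk write paths."""
    normalized: Dict[str, Any] = {
        "id": raw["id"],
        "text": raw.get("text"),
    }
    # Copy the reference fields explicitly; leaving this loop out would
    # reproduce the silent loss described in the commit message.
    for name in PAYLOAD_FIELDS:
        normalized[name] = raw.get(name)
    return normalized


def upsert_many(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Toy bulk path: returns the dicts that would be handed to the store."""
    return [normalize_record(r) for r in records]


batch = upsert_many([
    {"id": "a", "text": "clip", "payload_uri": "s3://bucket/clip.mp4",
     "payload_size": 1024, "payload_checksum": "abc123"},
    {"id": "b", "text": "no media"},
])
```

A regression test in this style only needs to assert that every normalized dict still carries the three keys, whether or not the caller supplied them.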
Summary
Adds a typed external-reference payload so large media (images/audio/video) can live as objects in the configured object store and be referenced from a `ContextRecord` by `payload_uri` (plus optional `payload_size`/`payload_checksum`), instead of inlining bytes in `binary_payload`.

- `list`/`search` return the reference without materializing the bytes.
- An opt-in fetch resolves the bytes on demand using the context's `storage_options`; this works for `gs://`, `s3://`, and local paths through the same object-store path the dataset itself uses.
- `binary_payload` stays the unchanged small-payload path.

Closes #115
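The reference-instead-of-bytes flow summarized above can be modeled with a short, self-contained sketch. This is a toy in-memory model, not the lance-context API: `Record`, `ToyStore`, and the `memory://` URI are hypothetical. The SHA-256 checksum and the integrity check inside `fetch_payload` are assumptions of the sketch, since the PR does not state a checksum format or whether fetch verifies it. The `None` result for a missing record and the error for a missing reference follow the cases listed in this PR.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    """Toy stand-in for a ContextRecord; only payload fields are modeled."""
    id: str
    binary_payload: Optional[bytes] = None   # inline path for small payloads
    payload_uri: Optional[str] = None        # external reference for large media
    payload_size: Optional[int] = None
    payload_checksum: Optional[str] = None


class ToyStore:
    """A dict plays the object store; a list plays the dataset."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self._records: List[Record] = []

    def put_payload(self, uri: str, data: bytes) -> None:
        # Offload caller bytes to the pretend object store.
        self._objects[uri] = data

    def add(self, record: Record) -> None:
        self._records.append(record)

    def list(self) -> List[Record]:
        # Returns records as stored: references only, no object reads.
        return list(self._records)

    def fetch_payload(self, record_id: str) -> Optional[bytes]:
        record = next((r for r in self._records if r.id == record_id), None)
        if record is None:
            return None  # missing record
        if record.payload_uri is None:
            raise ValueError(f"record {record_id} has no payload reference")
        data = self._objects[record.payload_uri]
        # Integrity check: an assumption of this sketch, not confirmed by the PR.
        if record.payload_checksum is not None:
            if hashlib.sha256(data).hexdigest() != record.payload_checksum:
                raise ValueError("payload checksum mismatch")
        return data


store = ToyStore()
video = b"\x00" * 1024
uri = "memory://media/clip.mp4"  # hypothetical URI
store.put_payload(uri, video)
store.add(Record(id="r1", payload_uri=uri, payload_size=len(video),
                 payload_checksum=hashlib.sha256(video).hexdigest()))
listed = store.list()[0]             # carries the reference, not the bytes
fetched = store.fetch_payload("r1")  # bytes are resolved only on request
```

The design point the sketch tries to show is that listing touches only the record rows, so the cost of a large object is paid only by callers that explicitly fetch it.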
What changed
core (`lance-context-core`)

- Adds `payload_uri`/`payload_size`/`payload_checksum` on `ContextRecord` and `RecordPatch`.
- New schema columns gated by an `include_external_reference` flag (mirrors the existing `external_id`/`metadata`/`lifecycle` gates), so the change is backward compatible: old datasets read them as `None`; writing a reference on a dataset created without the columns errors clearly.
- `ContextStore::fetch_payload(id)` resolves `payload_uri` → bytes, and `ContextStore::put_payload(uri, bytes)` offloads caller bytes, both via `ObjectStore::from_uri_and_params` threading the context's `storage_options`.

api / server / client / python
- Fields threaded through `AddRecordRequest`/`RecordDto`/`RecordPatchDto` (all `Option`, `skip_serializing_if` → serde back-compat).
- `GET /api/v1/contexts/{name}/records/{id}/payload` (200 octet-stream / `content_type`, 404 missing record, 400 record with no reference).
- Client `fetch_record_payload`; Python `Context.fetch_payload`/`put_payload`.

Tests
- Rust: add a record with a reference (`binary_payload` is `None`) → `fetch_payload` roundtrip; missing-record (`None`) and missing-reference (error) cases; `put_payload` roundtrip; server route test (bytes / 404 / 400); api serde round-trip + back-compat decode.
- Python end-to-end: `put_payload` → `add(payload_uri=…)` → `list`/`get` (no bytes) → `fetch_payload`; update-attaches-reference-later; missing-record/missing-reference.

Notes for the reviewer
- `fetch_payload` (and the server handler) use the list-backed `get_by_id` rather than the raw filtered `get(id)`, because the latter does not see freshly written MemWAL-buffered rows in the same process.
- The shared `add`/`upsert`/`update` doubles in `test_search.py`/`test_embeddings.py` were already out of sync with the current binding signature (missing `tenant`/`source`/`run_id`/`created_at`) and failing on `main`. Since this PR extends those exact signatures, the doubles are brought back in sync and extended for the new payload params. (`test_persistence.py::test_s3_*` remain env-dependent and are unaffected.)
- Signed-URL resolution is deferred; a `TODO` marks where a presigned-URL branch would go. No transcoding/thumbnailing, no media embedding, inline `binary_payload` unchanged.

🤖 Generated with Claude Code