
feat: external media references — store large media by typed payload URI (#115) #121

Merged: dcfocus merged 2 commits into lance-format:main from dcfocus:feat/issue-115-external-media-refs on Jun 28, 2026

dcfocus commented Jun 28, 2026

Collaborator

Summary

Adds a typed external-reference payload so large media (images/audio/video) can live as objects in the configured object store and be referenced from a ContextRecord by payload_uri (plus optional payload_size / payload_checksum), instead of inlining bytes in binary_payload.

  • list / search return the reference without materializing the bytes.
  • An opt-in fetch resolves the bytes on demand using the context's storage_options — works for gs://, s3://, and local paths through the same object-store path the dataset itself uses.
  • Inline binary_payload stays the unchanged small-payload path.
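
The split between a stored reference and the bytes it points at can be sketched as a toy in-memory model. This is not the lance-context API: `ToyStore`, `list_records`, the `objects` dict, the example URI, and the use of a hex SHA-256 digest for `payload_checksum` are all assumptions made for illustration. Only the field names and the `put_payload` / `fetch_payload` method names come from this PR.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Record:
    id: str
    binary_payload: Optional[bytes] = None   # inline path for small payloads
    payload_uri: Optional[str] = None        # external reference for large media
    payload_size: Optional[int] = None
    payload_checksum: Optional[str] = None   # assumed here: hex SHA-256


@dataclass
class ToyStore:
    """Hypothetical stand-in for a context plus its object store."""
    objects: Dict[str, bytes] = field(default_factory=dict)
    records: Dict[str, Record] = field(default_factory=dict)

    def put_payload(self, uri: str, data: bytes) -> None:
        # Offload caller bytes to the (toy) object store.
        self.objects[uri] = data

    def add(self, record: Record) -> None:
        self.records[record.id] = record

    def list_records(self) -> List[Record]:
        # Only rows are returned; referenced objects are never read here.
        return list(self.records.values())

    def fetch_payload(self, record_id: str) -> Optional[bytes]:
        record = self.records.get(record_id)
        if record is None:
            return None  # missing record
        if record.payload_uri is None:
            raise ValueError(f"record {record_id!r} has no payload_uri")
        return self.objects[record.payload_uri]


store = ToyStore()
video = b"\x00" * 1024
uri = "s3://bucket/media/clip.mp4"  # illustrative URI, not from the PR
store.put_payload(uri, video)
store.add(Record(id="r1", payload_uri=uri, payload_size=len(video),
                 payload_checksum=hashlib.sha256(video).hexdigest()))

listed = store.list_records()[0]     # carries the reference, no bytes
fetched = store.fetch_payload("r1")  # opt-in resolution of the bytes
```

In this model, listing touches only `records`, so the cost of a listing does not depend on how large the referenced objects are.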

Closes #115

What changed

core (lance-context-core)

  • payload_uri / payload_size / payload_checksum on ContextRecord and RecordPatch.
  • New schema columns gated by an include_external_reference flag (mirrors the existing external_id / metadata / lifecycle gates) → backward compatible: old datasets read them as None; writing a reference on a dataset created without the columns errors clearly.
  • ContextStore::fetch_payload(id) resolves payload_uri → bytes, and ContextStore::put_payload(uri, bytes) offloads caller bytes, both via ObjectStore::from_uri_and_params threading the context's storage_options.
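
The gating behaviour described for the new columns might look roughly like the following. It is a simplified sketch in Python rather than the Rust core: `build_columns`, `read_row`, `write_row`, and the two base columns are hypothetical, and the real schema has more columns and gates than shown.

```python
from typing import Any, Dict, List

BASE_COLUMNS = ["id", "binary_payload"]  # hypothetical subset of the schema
REFERENCE_COLUMNS = ["payload_uri", "payload_size", "payload_checksum"]


def build_columns(include_external_reference: bool) -> List[str]:
    """Return the column list, adding reference columns only when gated on."""
    columns = list(BASE_COLUMNS)
    if include_external_reference:
        columns.extend(REFERENCE_COLUMNS)
    return columns


def read_row(columns: List[str], stored: Dict[str, Any]) -> Dict[str, Any]:
    """Read a row; reference columns absent from the dataset come back as None."""
    row = {name: stored.get(name) for name in BASE_COLUMNS}
    for name in REFERENCE_COLUMNS:
        row[name] = stored.get(name) if name in columns else None
    return row


def write_row(columns: List[str], row: Dict[str, Any]) -> Dict[str, Any]:
    """Write a row; fail loudly if it carries a reference the dataset cannot hold."""
    missing = [name for name in REFERENCE_COLUMNS
               if row.get(name) is not None and name not in columns]
    if missing:
        raise ValueError(
            "dataset was created without external-reference columns: "
            + ", ".join(missing)
        )
    return {name: row.get(name) for name in columns}


old_dataset = build_columns(include_external_reference=False)
new_dataset = build_columns(include_external_reference=True)
```

Raising on write rather than silently dropping the fields keeps a reference from being lost without the caller noticing.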

api / server / client / python

  • Fields threaded through AddRecordRequest / RecordDto / RecordPatchDto (all Option, skip_serializing_if → serde back-compat).
  • GET /api/v1/contexts/{name}/records/{id}/payload (200 octet-stream / content_type, 404 missing record, 400 record with no reference).
  • Rust client fetch_record_payload; Python Context.fetch_payload / put_payload.
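
The status codes of the payload route can be summarized as a small decision function. This is a sketch of the mapping only, not the server handler: `payload_response`, the dict-based record shape, and the response bodies are hypothetical, and falling back to `application/octet-stream` when a record has no `content_type` is an assumed reading of "200 octet-stream / content_type".

```python
from typing import Any, Dict, Tuple

Response = Tuple[int, Dict[str, str], bytes]


def payload_response(records: Dict[str, Dict[str, Any]],
                     objects: Dict[str, bytes],
                     record_id: str) -> Response:
    """Map a payload request onto (status, headers, body)."""
    record = records.get(record_id)
    if record is None:
        return 404, {}, b"record not found"
    uri = record.get("payload_uri")
    if uri is None:
        return 400, {}, b"record has no payload reference"
    # Assumption: use the record's content_type when set, else octet-stream.
    content_type = record.get("content_type") or "application/octet-stream"
    return 200, {"Content-Type": content_type}, objects[uri]


records = {
    "img": {"payload_uri": "s3://bucket/a.png", "content_type": "image/png"},
    "raw": {"payload_uri": "s3://bucket/b.bin"},
    "inline": {"binary_payload": b"small"},
}
objects = {"s3://bucket/a.png": b"png-bytes", "s3://bucket/b.bin": b"raw-bytes"}
```

Returning 400 for a record that exists but has no reference lets a client tell "wrong id" apart from "this record stores its payload inline".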

Tests

  • Rust: add→list (ref present, binary_payload None)→fetch_payload roundtrip; missing-record (None) and missing-reference (error) cases; put_payload roundtrip; server route test (bytes / 404 / 400); api serde round-trip + back-compat decode.
  • Python: end-to-end put_payload → add(payload_uri=…) → list/get (no bytes) → fetch_payload; update-attaches-reference-later; missing-record/missing-reference.
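
The back-compat decode case is easiest to see with a Python analogue of the serde behaviour. The sketch below assumes a trimmed-down, hypothetical `RecordDto` with only four fields; `encode` drops `None` fields the way `skip_serializing_if` would, and `decode` treats absent fields as `None`, so a payload written before these fields existed still parses.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class RecordDto:
    # Hypothetical, trimmed-down DTO; the real one has more fields.
    id: str
    payload_uri: Optional[str] = None
    payload_size: Optional[int] = None
    payload_checksum: Optional[str] = None


def encode(dto: RecordDto) -> str:
    """Serialize, omitting fields that are None."""
    return json.dumps({k: v for k, v in asdict(dto).items() if v is not None})


def decode(text: str) -> RecordDto:
    """Deserialize; absent fields default to None, unknown keys are ignored."""
    known = {f.name for f in fields(RecordDto)}
    data = json.loads(text)
    return RecordDto(**{k: v for k, v in data.items() if k in known})


old_wire = '{"id": "r1"}'  # shape an older client or server would send
with_ref = RecordDto(id="r2", payload_uri="gs://bucket/clip.mp4", payload_size=1024)
```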

Notes for the reviewer

  • MemWAL visibility: fetch_payload (and the server handler) use the list-backed get_by_id rather than the raw filtered get(id), because the latter does not see freshly-written MemWAL-buffered rows in the same process.
  • Pre-existing test doubles realigned: the shared add/upsert/update doubles in test_search.py / test_embeddings.py were already out of sync with the current binding signature (missing tenant/source/run_id/created_at) and failing on main. Since this PR extends those exact signatures, the doubles are brought back in sync and extended for the new payload params. (test_persistence.py::test_s3_* remain env-dependent and are unaffected.)
  • Out of scope: signed-URL resolution is deferred — a TODO marks where a presigned-URL branch would go. No transcoding/thumbnailing, no media embedding, inline binary_payload unchanged.

🤖 Generated with Claude Code

dcfocus force-pushed the feat/issue-115-external-media-refs branch from 579aa69 to 6edd85b on June 28, 2026 02:15
dcfocus and others added 2 commits June 28, 2026 09:54
…URI (lance-format#115)

Add a typed external-reference payload so large media (images/audio/video) can
live as objects in the configured object store and be referenced from a
ContextRecord by `payload_uri` (plus optional `payload_size`/`payload_checksum`),
instead of inlining bytes in `binary_payload`. `list`/`search` return the
reference without materializing the bytes; an opt-in fetch resolves them on
demand using the context's `storage_options`.

- core: add `payload_uri`/`payload_size`/`payload_checksum` to `ContextRecord`
  and `RecordPatch`; new backward-compatible schema columns gated by
  `include_external_reference` (old datasets read them as `None`);
  `ContextStore::{fetch_payload,put_payload}` resolve/offload bytes through the
  context's object store (works for gs://, s3://, local).
- api/server/client/python: thread the fields through the DTOs; add
  `GET /contexts/{name}/records/{id}/payload`, client `fetch_record_payload`, and
  Python `Context.fetch_payload`/`put_payload`.
- tests: Rust roundtrip + missing-record/missing-reference cases, server route
  test, api serde round-trip, and a Python end-to-end suite. Also realign the
  shared add/upsert/update test doubles in test_search.py/test_embeddings.py with
  the current binding signature (they were already out of sync with
  tenant/source/run_id/created_at) and extend them for the new payload params.

Signed-URL resolution is deferred (TODO left in `fetch_payload`).

Closes lance-format#115

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`add_many` and single `upsert` forwarded `payload_uri`/`payload_size`/
`payload_checksum`, but the bulk `upsert_many` path dropped them from the
normalized record dict, so external media references were silently lost on
batch insert-or-replace. Include the three fields (mirroring `add_many`) and
cover it with a test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
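
The kind of bug fixed in the second commit can be illustrated with a normalizer that builds a fresh dict per record: any field it does not copy is silently dropped. The sketch below is hypothetical throughout (`normalize_record`, `normalize_many`, and the `id` / `text` keys are not taken from the binding code) and only shows the shape of the fix, forwarding the three payload fields explicitly.

```python
from typing import Any, Dict, Iterable, List

PAYLOAD_FIELDS = ("payload_uri", "payload_size", "payload_checksum")


def normalize_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Build the normalized dict for one record (hypothetical shape)."""
    normalized: Dict[str, Any] = {"id": raw["id"], "text": raw.get("text")}
    # Forward the external-reference fields; omitting this loop reproduces
    # the silent loss on the bulk path.
    for name in PAYLOAD_FIELDS:
        normalized[name] = raw.get(name)
    return normalized


def normalize_many(raws: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    return [normalize_record(raw) for raw in raws]


batch = normalize_many([
    {"id": "a", "text": "caption", "payload_uri": "s3://bucket/a.jpg",
     "payload_size": 2048},
    {"id": "b", "text": "no media"},
])
```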
dcfocus force-pushed the feat/issue-115-external-media-refs branch from 6edd85b to cbc7cd0 on June 28, 2026 17:04
dcfocus merged commit 65b5ba1 into lance-format:main on Jun 28, 2026
9 checks passed

Development

Successfully merging this pull request may close these issues.

Multi-modal: external media references — store large media in object storage (GCS/S3) by typed URI
