| license | configs | task_categories | language | tags | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
cc-by-4.0 |
|
|
|
|
This guide shows how to run LanceDB queries directly against the dataset hosted on the Hub. You can run simple filtered search queries, or vector search/FTS queries on an index, because a Lance dataset on the Hugging Face packages not only the data and multimodal assets, but also the indexes in one place.
Lance datasets are also versioned, so you can time-travel to earlier versions of the dataset as needed.
The characters table uses this Lance schema:
id: int32 not null
image: binary not null
name: string not null
role: string not null
description: string not null
stats: struct<strength: int8, courage: int8, magic: int8, wisdom: int8> not null
image_path: string not null
image_vector: fixed_size_list<float>[512]
text_for_embedding: string not null
text_vector: fixed_size_list<float>[1536]
category: string
Before running the examples below, export a Hugging Face token:
export HF_TOKEN=hf_...For text query embeddings, also export:
export OPENAI_API_KEY=...For this small dataset, rate limits are usually not an issue, but using HF_TOKEN is still recommended.
import lancedb
from lancedb.embeddings import get_registry
db = lancedb.connect("hf://datasets/lancedb/magical_kingdom")
table = db.open_table("characters")
registry = get_registry()
text_embedding = registry.get("openai").create(name="text-embedding-3-small")
image_embedding = registry.get("open-clip").create(
name="ViT-B-32",
pretrained="laion2b_s34b_b79k",
)Run simple queries with filters followed by sorting as follows:
(
table.search()
.where("category = 'knight'")
.select(["name", "role", "stats.strength"])
.limit(4)
.to_polars()
.sort("stats.strength", descending=True)
.head(1)
)The character belonging to the knight category with the greatest stats.strength
is Sir Lancelot!
┌──────────────┬───────────────────────────┬────────────────┐
│ name ┆ role ┆ stats.strength │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i8 │
╞══════════════╪═══════════════════════════╪════════════════╡
│ Sir Lancelot ┆ Knight of the Round Table ┆ 5 │
└──────────────┴───────────────────────────┴────────────────┘
The stats column is a nested field (i.e., a struct in PyArrow). LanceDB can directly
query the struct's values with minimum IOPS.
(
table.search()
.where("stats.magic >= 4")
.select(["name", "role", "stats.magic", "stats.strength"])
.limit(2)
.to_polars()
)The two characters with the most magical power are the Merlin 🧙 and Morgan Le Fay 🧙🏼.
┌───────────────┬────────────────────┬─────────────┬────────────────┐
│ name ┆ role ┆ stats.magic ┆ stats.strength │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i8 ┆ i8 │
╞═══════════════╪════════════════════╪═════════════╪════════════════╡
│ Merlin ┆ Wizard and Advisor ┆ 5 ┆ 2 │
│ Morgan le Fay ┆ Sorceress ┆ 5 ┆ 2 │
└───────────────┴────────────────────┴─────────────┴────────────────┘
You can directly query the vector index on the Hub using LanceDB without needing to download the data locally. The example below shows how to query the text emnbedding for "a brave knight with magical prowess".
query = "a disloyal knight"
query_vector = embed_texts([query])[0]
(
table.search(query_vector, vector_column_name="text_vector")
.select(["name", "role"])
.limit(1)
.to_polars()
)Mordred is the "treacherous knight who ultimately rebels against Arthur, leading to Camelot's downfall."
┌─────────┬────────────────┬───────────┐
│ name ┆ role ┆ _distance │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f32 │
╞═════════╪════════════════╪═══════════╡
│ Mordred ┆ Traitor Knight ┆ 0.865597 │
└─────────┴────────────────┴───────────┘
The dataset also comes with pre-computed image embeddings of the characters, and the image bytest are stored in this dataset alongside the embeddings themselves.
The query below shows that we want a magical character who carries a staff and has a long beard (ruling out Morgan Le Fay, a female mage).
query = "a powerful mage with a staff and a long beard"
query_vector = image_embedding.compute_query_embeddings(query)[0]
(
table.search(query_vector, vector_column_name="image_vector")
.select(["name", "role"])
.limit(2)
.to_polars()
)The most similar result based on the image embedding is, indeed, Merlin!
┌────────────┬───────────────────────────┬───────────┐
│ name ┆ role ┆ _distance │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f32 │
╞════════════╪═══════════════════════════╪═══════════╡
│ Merlin ┆ Wizard and Advisor ┆ 1.432212 │
│ Sir Gawain ┆ Knight of the Round Table ┆ 1.559731 │
└────────────┴───────────────────────────┴───────────┘
Run full-text search by keywords on the dataset just as easily as you run a vector search query.
The query below shows the result when searching for the keyqword
keyword = "treacherous"
(
table.search(keyword, query_type="fts")
.select(["name", "role", "description"])
.limit(5)
.to_polars()
)We get the Lady of the Lake as the result, who is a "mysterious supernatural figure associated with Avalon, known for giving Arthur the sword Excalibur."
┌──────────────────────┬───────────────────┬─────────────────────────────────┬─────────┐
│ name ┆ role ┆ description ┆ _score │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f32 │
╞══════════════════════╪═══════════════════╪═════════════════════════════════╪═════════╡
│ The Lady of the Lake ┆ Mystical Guardian ┆ A mysterious supernatural figu… ┆ 2.17613 │
└──────────────────────┴───────────────────┴─────────────────────────────────┴─────────┘
Use this template to distribute and share other interesting multimodal Lance datasets to the Hub, and query them using LanceDB!
The source code to reproduce the data and the steps to upload the datasets to the Hub are in this repo.