LanceDB Hugging Face Update Demo

This project demonstrates a simple staged workflow to manage your Lance datasets on Hugging Face Hub.

To create a repo like this, first create an initial table using LanceDB on a local machine, and then upload it to the Hub via a CLI command.

As the dataset evolves, you can apply a one-time schema + data update locally, and upload the updated version of the data back to the Hub. Only the new data is uploaded, so each update stays small.

Setup

Use uv, and export both OPENAI_API_KEY and HF_TOKEN.

uv sync
export OPENAI_API_KEY=...
export HF_TOKEN=hf_...
hf auth login --token "$HF_TOKEN"

The scripts look for a local file named .env, so before running any of them, copy .env.example to a new file named .env and fill in the environment variables there.
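As a quick sanity check before running the scripts, the sketch below parses the text of a .env file with the standard library only and reports which of the two variables named above are missing. The helper names `parse_env_text` and `missing_keys` are hypothetical and not part of this project; the scripts themselves may load .env differently.

```python
from pathlib import Path

REQUIRED_KEYS = ("OPENAI_API_KEY", "HF_TOKEN")  # the two variables named in Setup


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [k for k in required if not values.get(k)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    print("missing:", missing_keys(values) or "none")
```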

Raw data layout

Source JSON and generated portraits live under raw_data/:

raw_data/magical_kingdom.json
raw_data/img/
raw_data/generate_images.py

See raw_data/README.md for details on regenerating those assets.

Optional: Regenerate character portraits

If you want to regenerate the source images using an OpenAI model, run:

uv run python raw_data/generate_images.py

Step 1: Create the initial Lance table

Start clean, then build the Lance table locally. This creates the characters table, computes embeddings in batches, and creates an FTS index.

rm -rf magical_kingdom
uv run python create_dataset.py
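The three things this step does (create the table, embed in batches, build an FTS index) could be sketched roughly as follows. This is not the contents of create_dataset.py: the `embed_batch` callable, the `description` text field, and the batch size are assumptions made for illustration, and FTS index options can vary between LanceDB versions.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def build_table(records, embed_batch, db_path="magical_kingdom", batch_size=32):
    """Create the `characters` table with one vector per record and an FTS index.

    `embed_batch` is a hypothetical callable: list[str] -> list[list[float]].
    The `description` field is an assumption about the source JSON.
    """
    import lancedb  # imported lazily so `batched` stays dependency-free

    rows = []
    for chunk in batched(records, batch_size):
        # One embedding call per batch instead of one per record.
        vectors = embed_batch([r["description"] for r in chunk])
        rows.extend({**r, "vector": v} for r, v in zip(chunk, vectors))

    db = lancedb.connect(db_path)
    table = db.create_table("characters", data=rows, mode="overwrite")
    table.create_fts_index("description")  # full-text index on the text column
    return table
```

Batching keeps the number of calls to the embedding model low, which matters when each call is a network request.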

Step 2: Upload the Initial Snapshot to the Hub

Upload the full magical_kingdom directory to datasets/lancedb/magical_kingdom.

hf upload-large-folder lancedb/magical_kingdom magical_kingdom \
  --repo-type dataset \
  --revision main

hf upload-large-folder uses a resumable multi-commit flow, which is more flexible and error-tolerant than hf upload, but it does not support custom commit messages.

Step 3: Update the dataset locally

Imagine a scenario where you want to add a new category column and backfill its values with a single merge_insert operation into your existing table.

This is both a schema update and a data update, which Lance handles well. Lance supports incremental data evolution: it can add, remove, and alter columns without rewriting the existing data files, which makes updates to large tables very I/O-efficient.

uv run python update_dataset.py
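A minimal sketch of such an update is shown below, under several assumptions: the `categorize_role` rule is hypothetical, `name` is assumed to identify a row uniquely, and the NULL cast used to create the empty column is an assumed SQL expression. The real update_dataset.py may use a different key and a different way of deriving categories.

```python
def categorize_role(role: str) -> str:
    """Hypothetical rule: derive a category from the free-text role."""
    return "knight" if "knight" in role.lower() else "other"


def add_and_backfill_category(table, key="name"):
    """Add a nullable `category` column, then backfill it with one merge_insert.

    Assumes `key` uniquely identifies a row and that `role` is never null.
    """
    import pyarrow as pa  # lazy import keeps categorize_role dependency-free

    # Schema update: add_columns takes {new_column: SQL expression}.
    # The NULL cast below is an assumption about the accepted SQL syntax.
    table.add_columns({"category": "cast(NULL as string)"})

    # Data update: read the rows, fill in `category`, and upsert them by key.
    data = table.to_arrow()
    categories = pa.array(
        [categorize_role(r) for r in data["role"].to_pylist()], pa.string()
    )
    idx = data.schema.get_field_index("category")
    data = data.set_column(idx, "category", categories)
    table.merge_insert(key).when_matched_update_all().execute(data)
```

Note that this sketch reads the whole table and upserts every matched row, which is fine for a small demo table; a production script would likely be more selective about what it writes.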

Over time, you can run a compaction job that calls table.optimize() to manage the number of manifests that are recorded in the history.
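A compaction job of that kind might look like the sketch below. The function name `compact` and the seven-day retention window are illustrative choices, not something this project prescribes.

```python
from datetime import timedelta


def compact(table, keep_days=7):
    """Compact the table and prune versions older than `keep_days`.

    Returns the number of versions that remain in the history.
    """
    table.optimize(cleanup_older_than=timedelta(days=keep_days))
    return len(table.list_versions())


if __name__ == "__main__":
    import lancedb

    tbl = lancedb.connect("magical_kingdom").open_table("characters")
    print("versions kept:", compact(tbl))
```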

Step 4: Upload the updated version to the Hub

Upload the same local directory again (now a new version of the dataset).

hf upload-large-folder lancedb/magical_kingdom magical_kingdom \
  --repo-type dataset \
  --revision main

Step 5: Inspect versions and query on the Hub

inspect_dataset.py reads from hf://datasets/lancedb/magical_kingdom and prints table versions.

uv run python inspect_dataset.py
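For reference, listing versions from the Hub could be sketched as below. This is not the contents of inspect_dataset.py; the `summarize_versions` helper is hypothetical, and the sketch relies on `list_versions()` returning dictionaries with `version` and `timestamp` keys.

```python
def summarize_versions(versions):
    """Return (version, timestamp) pairs sorted by version number."""
    return sorted((v["version"], v["timestamp"]) for v in versions)


def print_versions(uri="hf://datasets/lancedb/magical_kingdom"):
    """Open the `characters` table at `uri` and print its version history."""
    import lancedb  # lazy import keeps summarize_versions dependency-free

    table = lancedb.connect(uri).open_table("characters")
    for version, timestamp in summarize_versions(table.list_versions()):
        print(version, timestamp)


if __name__ == "__main__":
    print_versions()
```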

If you run update_dataset.py again without resetting, it will fail at add_columns because the category column already exists. If you want to upsert the column's data, comment out the line that adds the category column.
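As an alternative to commenting out the line, the column creation can be guarded so that a second run skips it. The `ensure_column` helper below is a hypothetical sketch and is not part of update_dataset.py.

```python
def ensure_column(table, name, sql_expr):
    """Add column `name` only if the table schema does not already have it.

    Returns True when the column was added, False when it already existed.
    """
    if name in table.schema.names:
        return False
    table.add_columns({name: sql_expr})
    return True
```

With a guard like this, rerunning the update would go straight to the upsert when the category column is already present.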

query.py also reads from the Hub and runs all five example queries.

uv run python query.py

Example:

import lancedb

# Scan data directly from the Hugging Face Hub
# (No need to download the dataset locally)
db = lancedb.connect("hf://datasets/lancedb/magical_kingdom")
table = db.open_table("characters")

r = table.search() \
    .where("category = 'knight'") \
    .select(["name", "role", "stats.strength"]) \
    .limit(4) \
    .to_polars() \
    .sort("stats.strength", descending=True) \
    .head(1)
print(r)

The character belonging to the knight category with the greatest strength is Sir Lancelot! 🗡️

┌──────────────┬───────────────────────────┬────────────────┐
│ name         ┆ role                      ┆ stats.strength │
│ ---          ┆ ---                       ┆ ---            │
│ str          ┆ str                       ┆ i8             │
╞══════════════╪═══════════════════════════╪════════════════╡
│ Sir Lancelot ┆ Knight of the Round Table ┆ 5              │
└──────────────┴───────────────────────────┴────────────────┘

Update the Dataset Card

The Hub dataset card communicates the dataset's schema and usage to other developers. It lives at the repo's root on the Hub in a file named README.md. This project keeps the source text of the card in HF_DATASET_CARD.md, so you can edit the card there and publish it as README.md with the command below. This uses a regular hf upload because it is a single-file upload to a specific target path, and it lets you add a custom commit message.

hf upload lancedb/magical_kingdom HF_DATASET_CARD.md README.md \
  --repo-type dataset \
  --commit-message "Update dataset card"
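If you would rather publish the card from Python, the same upload can be sketched with the huggingface_hub client. The helper names here are hypothetical; the arguments mirror the CLI command above.

```python
def card_upload_kwargs(repo_id="lancedb/magical_kingdom"):
    """Arguments for uploading the local card file as README.md on the Hub."""
    return dict(
        path_or_fileobj="HF_DATASET_CARD.md",  # local source file
        path_in_repo="README.md",              # target path in the repo
        repo_id=repo_id,
        repo_type="dataset",
        commit_message="Update dataset card",
    )


def upload_card():
    """Upload the dataset card; uses the token from your hf login or HF_TOKEN."""
    from huggingface_hub import HfApi  # lazy import; only needed for the upload

    return HfApi().upload_file(**card_upload_kwargs())


if __name__ == "__main__":
    upload_card()
```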

Optional: Reset the Hub Repo

If you want to reproduce the full demo from scratch on the Hub, delete the existing repo and recreate it:

hf repos delete lancedb/magical_kingdom --repo-type dataset
hf repos create lancedb/magical_kingdom --repo-type dataset

Then, work through the steps described above. Have fun uploading your Lance datasets on Hugging Face!
