This project demonstrates a simple staged workflow to manage your Lance datasets on Hugging Face Hub.
To create a repo like this, first build an initial table with LanceDB on a local machine, then upload it to the Hub via a CLI command. As the dataset evolves, you can apply a one-time schema + data update and upload the new version of the data back to the Hub. Only the new data is transferred, keeping things clean.
Use uv, and export both OPENAI_API_KEY and HF_TOKEN.
```bash
uv sync
export OPENAI_API_KEY=...
export HF_TOKEN=hf_...
hf auth login --token "$HF_TOKEN"
```

The scripts look for a local file named `.env`, so to run any of them, you'll need to copy `.env.example` to a new file named `.env` and update the respective env variables there.
Source JSON and generated portraits live under raw_data/:
- `raw_data/magical_kingdom.json`
- `raw_data/img/`
- `raw_data/generate_images.py`
See raw_data/README.md for details on regenerating those assets.
If you want to regenerate the source images using an OpenAI model, run:
```bash
uv run python raw_data/generate_images.py
```

Start clean, then build the Lance table locally.
This creates the characters table, computes embeddings in batches, and creates an FTS index.
```bash
rm -rf magical_kingdom
uv run python create_dataset.py
```

Upload the full magical_kingdom directory to datasets/lancedb/magical_kingdom.
```bash
hf upload-large-folder lancedb/magical_kingdom magical_kingdom \
  --repo-type dataset \
  --revision main
```

hf upload-large-folder uses a resumable multi-commit flow, which is more flexible and error-tolerant than hf upload, but it does not support custom commit messages.
Imagine a scenario where you want to add a new category column and backfill its values with a single merge_insert operation into your existing table.
This is both a schema update and a data update, which Lance excels at: Lance supports incremental schema evolution, so it can add, remove, and alter columns without rewriting the existing data files, making it very I/O-efficient when updating large tables.
```bash
uv run python update_dataset.py
```

Over time, you can run a compaction job that calls table.optimize() to manage the number of manifests that are recorded in the history.
Upload the same local directory again (now a new version of the dataset).
```bash
hf upload-large-folder lancedb/magical_kingdom magical_kingdom \
  --repo-type dataset \
  --revision main
```

inspect_dataset.py reads from hf://datasets/lancedb/magical_kingdom and prints table versions.
```bash
uv run python inspect_dataset.py
```

If you run update_dataset.py again without resetting, it will fail at add_columns because the category column already exists. If you want to upsert the column's data, comment out the line that adds the category column.
query.py also reads from the Hub and runs all five example queries.
```bash
uv run python query.py
```

Example:
```python
import lancedb

# Scan data directly from the Hugging Face Hub
# (No need to download the dataset locally)
db = lancedb.connect("hf://datasets/lancedb/magical_kingdom")
table = db.open_table("characters")

r = table.search() \
    .where("category = 'knight'") \
    .select(["name", "role", "stats.strength"]) \
    .limit(4) \
    .to_polars() \
    .sort("stats.strength", descending=True) \
    .head(1)
print(r)
```

The character belonging to the knight category with the greatest strength is Sir Lancelot! 🗡️
```
┌──────────────┬───────────────────────────┬────────────────┐
│ name         ┆ role                      ┆ stats.strength │
│ ---          ┆ ---                       ┆ ---            │
│ str          ┆ str                       ┆ i8             │
╞══════════════╪═══════════════════════════╪════════════════╡
│ Sir Lancelot ┆ Knight of the Round Table ┆ 5              │
└──────────────┴───────────────────────────┴────────────────┘
```
The Hub dataset card allows you to communicate the schema and usage of the dataset to other developers.
It sits at the repo’s root in a file named README.md on the Hub.
This project keeps the source card text in HF_DATASET_CARD.md, so you can edit the card locally and upload it as README.md with the HF CLI. This requires a regular hf upload because it is a single-file upload to a specific target path, and a custom commit message can be added.
```bash
hf upload lancedb/magical_kingdom HF_DATASET_CARD.md README.md \
  --repo-type dataset \
  --commit-message "Update dataset card"
```

If you want to reproduce the full demo from scratch on the Hub, delete the existing repo and recreate it:
```bash
hf repos delete lancedb/magical_kingdom --repo-type dataset
hf repos create lancedb/magical_kingdom --repo-type dataset
```

Then, work through the steps described above. Have fun uploading your Lance datasets on Hugging Face!