Skip to content

Latest commit

 

History

History
329 lines (256 loc) · 9.63 KB

File metadata and controls

329 lines (256 loc) · 9.63 KB
title summary aliases
Hugging Face Embeddings
Learn how to use Hugging Face embedding models in TiDB Cloud.
/tidbcloud/vector-search-auto-embedding-huggingface/

Hugging Face Embeddings

This document describes how to use Hugging Face embedding models with Auto Embedding in TiDB Cloud to perform semantic searches with text queries.

Note:

Auto Embedding is only available on {{{ .starter }}} clusters hosted on AWS.

Available models

Hugging Face models are available for use with the huggingface/ prefix if you bring your own Hugging Face Inference API key (BYOK).

For your convenience, the following sections use several popular models as examples. For a full list of available models, see Hugging Face models. Note that not all models are available through Hugging Face Inference API or reliably work.

multilingual-e5-large

Example:

SET @@GLOBAL.TIDB_EXP_EMBED_HUGGINGFACE_API_KEY = 'your-huggingface-api-key-here';

CREATE TABLE sample (
  `id`        INT,
  `content`   TEXT,
  `embedding` VECTOR(1024) GENERATED ALWAYS AS (EMBED_TEXT(
                "huggingface/intfloat/multilingual-e5-large",
                `content`
              )) STORED
);


INSERT INTO sample
    (`id`, `content`)
VALUES
    (1, "Java: Object-oriented language for cross-platform development."),
    (2, "Java coffee: Bold Indonesian beans with low acidity."),
    (3, "Java island: Densely populated, home to Jakarta."),
    (4, "Java's syntax is used in Android apps."),
    (5, "Dark roast Java beans enhance espresso blends.");


SELECT `id`, `content` FROM sample
ORDER BY
  VEC_EMBED_COSINE_DISTANCE(
    embedding,
    "How to start learning Java programming?"
  )
LIMIT 2;

bge-m3

  • Name: huggingface/BAAI/bge-m3
  • Dimensions: 1024
  • Distance metric: Cosine, L2
  • Price: Charged by Hugging Face
  • Hosted by TiDB Cloud: ❌
  • Bring Your Own Key: ✅
  • Project home: https://huggingface.co/BAAI/bge-m3
SET @@GLOBAL.TIDB_EXP_EMBED_HUGGINGFACE_API_KEY = 'your-huggingface-api-key-here';

CREATE TABLE sample (
  `id`        INT,
  `content`   TEXT,
  `embedding` VECTOR(1024) GENERATED ALWAYS AS (EMBED_TEXT(
                "huggingface/BAAI/bge-m3",
                `content`
              )) STORED
);


INSERT INTO sample
    (`id`, `content`)
VALUES
    (1, "Java: Object-oriented language for cross-platform development."),
    (2, "Java coffee: Bold Indonesian beans with low acidity."),
    (3, "Java island: Densely populated, home to Jakarta."),
    (4, "Java's syntax is used in Android apps."),
    (5, "Dark roast Java beans enhance espresso blends.");


SELECT `id`, `content` FROM sample
ORDER BY
  VEC_EMBED_COSINE_DISTANCE(
    embedding,
    "How to start learning Java programming?"
  )
LIMIT 2;

all-MiniLM-L6-v2

Example:

SET @@GLOBAL.TIDB_EXP_EMBED_HUGGINGFACE_API_KEY = 'your-huggingface-api-key-here';

CREATE TABLE sample (
  `id`        INT,
  `content`   TEXT,
  `embedding` VECTOR(384) GENERATED ALWAYS AS (EMBED_TEXT(
                "huggingface/sentence-transformers/all-MiniLM-L6-v2",
                `content`
              )) STORED
);


INSERT INTO sample
    (`id`, `content`)
VALUES
    (1, "Java: Object-oriented language for cross-platform development."),
    (2, "Java coffee: Bold Indonesian beans with low acidity."),
    (3, "Java island: Densely populated, home to Jakarta."),
    (4, "Java's syntax is used in Android apps."),
    (5, "Dark roast Java beans enhance espresso blends.");


SELECT `id`, `content` FROM sample
ORDER BY
  VEC_EMBED_COSINE_DISTANCE(
    embedding,
    "How to start learning Java programming?"
  )
LIMIT 2;

all-mpnet-base-v2

SET @@GLOBAL.TIDB_EXP_EMBED_HUGGINGFACE_API_KEY = 'your-huggingface-api-key-here';

CREATE TABLE sample (
  `id`        INT,
  `content`   TEXT,
  `embedding` VECTOR(768) GENERATED ALWAYS AS (EMBED_TEXT(
                "huggingface/sentence-transformers/all-mpnet-base-v2",
                `content`
              )) STORED
);


INSERT INTO sample
    (`id`, `content`)
VALUES
    (1, "Java: Object-oriented language for cross-platform development."),
    (2, "Java coffee: Bold Indonesian beans with low acidity."),
    (3, "Java island: Densely populated, home to Jakarta."),
    (4, "Java's syntax is used in Android apps."),
    (5, "Dark roast Java beans enhance espresso blends.");


SELECT `id`, `content` FROM sample
ORDER BY
  VEC_EMBED_COSINE_DISTANCE(
    embedding,
    "How to start learning Java programming?"
  )
LIMIT 2;

Qwen3-Embedding-0.6B

Note:

Hugging Face Inference API might be unstable for this model.

  • Name: huggingface/Qwen/Qwen3-Embedding-0.6B
  • Dimensions: 1024
  • Distance metric: Cosine, L2
  • Maximum input text tokens: 512
  • Price: Charged by Hugging Face
  • Hosted by TiDB Cloud: ❌
  • Bring Your Own Key: ✅
  • Project home: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
SET @@GLOBAL.TIDB_EXP_EMBED_HUGGINGFACE_API_KEY = 'your-huggingface-api-key-here';

CREATE TABLE sample (
  `id`        INT,
  `content`   TEXT,
  `embedding` VECTOR(1024) GENERATED ALWAYS AS (EMBED_TEXT(
                "huggingface/Qwen/Qwen3-Embedding-0.6B",
                `content`
              )) STORED
);


INSERT INTO sample
    (`id`, `content`)
VALUES
    (1, "Java: Object-oriented language for cross-platform development."),
    (2, "Java coffee: Bold Indonesian beans with low acidity."),
    (3, "Java island: Densely populated, home to Jakarta."),
    (4, "Java's syntax is used in Android apps."),
    (5, "Dark roast Java beans enhance espresso blends.");


SELECT `id`, `content` FROM sample
ORDER BY
  VEC_EMBED_COSINE_DISTANCE(
    embedding,
    "How to start learning Java programming?"
  )
LIMIT 2;

Python usage example

This example shows how to create a vector table, insert documents, and run similarity search using Hugging Face embedding models.

Step 1: Connect to the database

from pytidb import TiDBClient

tidb_client = TiDBClient.connect(
    host="{gateway-region}.prod.aws.tidbcloud.com",
    port=4000,
    username="{prefix}.root",
    password="{password}",
    database="{database}",
    ensure_db=True,
)

Step 2: Configure the API key

If you're using a private model or need higher rate limits, you can configure your Hugging Face API token. You can create your token from the Hugging Face Token Settings page:

Configure the API token for Hugging Face models using the TiDB Client:

tidb_client.configure_embedding_provider(
    provider="huggingface",
    api_key="{your-huggingface-token}",
)

Step 3: Create a vector table

Create a table with a vector field that uses a Hugging Face model to generate embeddings:

from pytidb.schema import TableModel, Field
from pytidb.embeddings import EmbeddingFunction
from pytidb.datatype import TEXT

class Document(TableModel):
    __tablename__ = "sample_documents"
    id: int = Field(primary_key=True)
    content: str = Field(sa_type=TEXT)
    embedding: list[float] = EmbeddingFunction(
        model_name="huggingface/sentence-transformers/all-MiniLM-L6-v2"
    ).VectorField(source_field="content")

table = tidb_client.create_table(schema=Document, if_exists="overwrite")

Tip:

The vector dimensions depend on the model you choose. For example, huggingface/sentence-transformers/all-MiniLM-L6-v2 produces 384-dimensional vectors, while huggingface/sentence-transformers/all-mpnet-base-v2 produces 768-dimensional vectors.

Step 4: Insert data into the table

Use the table.insert() or table.bulk_insert() API to add data:

documents = [
    Document(id=1, content="Machine learning algorithms can identify patterns in data."),
    Document(id=2, content="Deep learning uses neural networks with multiple layers."),
    Document(id=3, content="Natural language processing helps computers understand text."),
    Document(id=4, content="Computer vision enables machines to interpret images."),
    Document(id=5, content="Reinforcement learning learns through trial and error."),
]
table.bulk_insert(documents)

Step 5: Search for similar documents

Use the table.search() API to perform vector search:

results = table.search("How do neural networks work?") \
    .limit(3) \
    .to_list()

for doc in results:
    print(f"ID: {doc.id}, Content: {doc.content}")

See also