Version = {PENDING}
This specification outlines a defined way of storing data in CrateDB for hybrid-search (vector + bm25).
By following the specification, the user gets pre-made full-text search, vector-search and hybrid-search queries to work out of the box, as well as to be able to reuse components, libraries and tooling based on the specification.
- Implementor: The developer(s) that uses this specification to create a hybrid search application with CrateDB.
- Specification: This document.
- vector-search: Search based on vectors, using the
KNN_MATCHstatement. - full-text search: Search based on bm25, using
MATCHstatement. - hybrid-search: vector-search + full-text search with re-ranking.
Data has to be conformed in a table that looks like this:
| column_name | type | nullable | note |
|---|---|---|---|
| id | |||
| hierarchy | TEXT | True | optional |
| title_fs | TEXT | False | has full-text index |
| content_fs | TEXT | False | has full-text index |
| content_pretty | TEXT | True | optional |
| ref | TEXT | True | optional |
| xs | FLOAT_VECTOR(SIZE) | False | max size is 2048 |
| metadata | OBJECT | False | optional, empty by default |
The DDL statement for the table is:
CREATE TABLE IF NOT EXISTS "doc"."search"
(
"id" INTEGER
"hierarchy" TEXT,
"title_fs" TEXT INDEX using fulltext NOT NULL,
"content_fs" TEXT INDEX using fulltext NOT NULL,
"content_pretty" TEXT,
"ref" TEXT,
"xs" FLOAT_VECTOR(2048) NOT NULL,
"metadata" OBJECT
)The position of the content within the context of all the data. It will depend on the data.
Examples:
- A file 'myfile.parquet', hierarchy=
/data/production/products/sample/myfile.parquet - A html section 'PHONE 16GB RAM - 400€', hierarchy=
/shop/offers/phones/phone-400 - A district 'Leopoldstadt', hierarchy=
Europe-Austria-Vienna
It is composed of elements joined by a character, it is recommended to use either / or -, but it
is left to the implementor, as the end client can take care of any character by splitting the string.
The title of the content. It has a full-text index, the type of full-text index will depend on the implementor use case.
The full-text search part of the hybrid search might be done in two different fields (title_fs and content_fs) to give more weight to title matches if it makes sense for the implementor use case.
Examples:
- A file 'summer_products_2025.parquet', title=
Summer Products 2025 - A html section 'PHONE 16GB RAM - 400€', title=
PHONE 16GB RAM - 400€ - A district 'Leopoldstadt', title=
District of LeopoldstadtorLeopoldstadt
The main content. It has a full-text index. It is the main data point to perform
full-text search against, the full-text part of the hybrid-search might be done
with this table and another one, like title_fs.
The main content prettified for the end application. If the user is never meant to see the content, it must be null.
Examples:
- Code that is formatted, where in
content_fsis not formatted properly. - HTML, where in
content_fsis pure text, without html tags. - A link to the image of a movie search.
The reference to point the content within the search page, typically an html ref id.
The vector representation of the content, using a text embedding.
An object containing all needed metadata attributes or attributes that exist in the dataset that do not fit the columns.
Example: In a movie dataset: Genres, Director, Actors, publishing date...
If more columns are needed, first consider using the metadata object if it is not enough, e.g.
new column with a full-text index is needed, consider opening an issue to improve the specification;
perhaps the use case was not taken into account. If the use case is too unique to be generalized,
it is ok to extend the specification.
The specification supports three different queries: Keyword search, Semantic Search, and Hybrid Search.
See https://cratedb.com/blog/hybrid-search-explained for more indepth explanation on Hybrid Search queries.
The query for doing hybrid search over this specification uses RRF as Re-ranking method.
where:
- XS
- SEARCH_TERM: the search input by the user
WITH bm25 as (
SELECT
_score,
RANK() OVER (
ORDER BY
_score DESC
) as rank,
title
FROM
fs_search
WHERE
MATCH("content", 'knn search')
ORDER BY
_score DESC
),
vector as (
SELECT
_score,
RANK() OVER (
ORDER BY
_score DESC
) as rank,
title
FROM
fs_search
WHERE
KNN_MATCH(
xs,
[vector...],
15
)
)
SELECT
TRUNC((1.0 / (bm25.rank + 60)) + (1.0 / (vector.rank + 60)), 6) as final_rank,
bm25.rank as bm25_rank,
vector.rank as vector_rank,
bm25.title
FROM
bm25,
vector
WHERE
bm25.title = vector.title
ORDER BY final_rank DESCNOTES: Make hierarchy array(text)?