Sifts – Simple Full Text & Semantic Search

🔎 Sifts is a simple but powerful Python package for managing and querying document collections with support for both SQLite and PostgreSQL databases.

It is designed to efficiently handle full-text search and vector search, making it ideal for applications that involve large-scale text data retrieval.

Features

Dual Database Support: Sifts works with both SQLite and PostgreSQL, offering the simplicity of SQLite for lightweight applications and the scalability of PostgreSQL for larger, production environments.
Full-Text Search (FTS): Perform advanced text search queries with full-text search support.
Vector Search: Integrate with embedding models to perform vector-based similarity searches, perfect for applications involving natural language processing.
Flexible Querying: Supports complex queries with filtering, ordering, and pagination.

Background

The main idea of Sifts is to leverage the built-in full-text search capabilities in SQLite and PostgreSQL and to make them available via a unified, Pythonic API. You can use SQLite for small projects or development and trivially switch to PostgreSQL to scale your application.
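To illustrate the kind of built-in capability being wrapped, here is a minimal sketch that uses SQLite's FTS5 module directly through Python's standard `sqlite3` module. It is not Sifts' actual schema or code: the table name `docs` and its columns are assumptions made for the example, and it requires an SQLite build with FTS5 enabled.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(id UNINDEXED, content)")
conn.executemany(
    "INSERT INTO docs (id, content) VALUES (?, ?)",
    [("doc1", "Lorem ipsum dolor"), ("doc2", "sit amet")],
)

# MATCH runs the full-text query; ORDER BY rank sorts by relevance.
rows = conn.execute(
    "SELECT id FROM docs WHERE docs MATCH ? ORDER BY rank",
    ("lorem",),
).fetchall()
print(rows)
```

A library like Sifts hides this SQL behind calls such as `collection.add(...)` and `collection.query(...)`, so the same application code can run against either backend.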

For vector search, cosine similarity is computed inside the database via the pgvector extension when using PostgreSQL, whereas with SQLite it is calculated in memory.
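As a rough illustration of what an in-memory cosine similarity ranking involves, the following sketch uses only the standard library. It is not Sifts' implementation: the helper names `cosine_similarity` and `rank_by_similarity` and the toy two-dimensional vectors are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, limit=2):
    """Score every stored vector against the query and keep the best matches."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy embeddings; real ones come from an embedding model.
doc_vecs = {"doc1": [1.0, 0.0], "doc2": [0.0, 1.0], "doc3": [1.0, 1.0]}
top = rank_by_similarity([1.0, 0.1], doc_vecs)
print(top)
```

Scoring every stored vector like this is linear in the number of documents, which is generally acceptable for small collections and is one reason to consider PostgreSQL with pgvector for larger ones.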

Sifts does not come with a server mode, as it is meant to be a library imported by other applications. The original motivation for its development was to replace Whoosh as the search backend in Gramps Web, which is based on Flask.

Installation

You can install Sifts via pip:

pip install sifts

Usage

Full-text search

import sifts

# by default, creates a new SQLite database in the working directory
collection = sifts.Collection(name="my_collection")

# Add docs to the index. Can also update and delete.
collection.add(
    documents=["Lorem ipsum dolor", "sit amet"],
    metadatas=[{"foo": "bar"}, {"foo": "baz"}],  # optional, can filter on these
    ids=["doc1", "doc2"], # unique for each doc. Uses UUIDs if omitted
)

results = collection.query(
    "Lorem",
    # limit=2,  # optionally limit the number of results
    # where={"foo": "bar"},  # optional filter
    # order_by="foo",  # sort by metadata key (rather than rank)
)

The API is inspired by chroma.

Full-text search syntax

Sifts supports the following search syntax:

Search for individual words
Search for multiple words (will match documents where all words are present)
and operator
or operator
* wildcard (in SQLite, supported anywhere in the search term; in PostgreSQL, only at the end of the search term)

The search syntax is the same regardless of backend.

Special Character Handling

Sifts automatically handles special characters that have meaning in SQLite FTS5 and PostgreSQL tsquery syntax, preventing syntax errors when searching for terms containing these characters.

Automatically Quoted Characters (SQLite)

The following characters are automatically quoted when found in search terms:

Parentheses () - used for grouping in FTS5
Brackets [] - used for column filters in FTS5
Curly braces {} - used for advanced FTS5 syntax
Colon : - used for column-specific searches
Comma , - used as token separator in FTS5
Quote " - used for phrase searches
Hyphen - - in hyphenated words like "test-word"
Apostrophe ' - in contractions like "it's"

Examples

# Search for city with comma - automatically quoted
collection.query("Bydgoszcz, Poland")
# Internal query: "Bydgoszcz," Poland

# Search for time with colon - automatically quoted
collection.query("time:12:00")
# Internal query: "time:12:00"

# Search with parentheses - automatically quoted
collection.query("test (example)")
# Internal query: test "(example)"

# Wildcards work with special characters
collection.query("city:*")
# Internal query: "city:"*
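
The transformations shown in these examples can be approximated with a short function. This is a simplified sketch, not Sifts' actual code: the names `SPECIAL_CHARS`, `quote_term`, and `quote_query` are hypothetical, and the sketch splits on whitespace only, so it does not handle explicit phrase quotes supplied by the user.

```python
# Characters from the list above that trigger quoting (assumed set).
SPECIAL_CHARS = frozenset("()[]{}:,\"-'")


def quote_term(term: str) -> str:
    """Quote one whitespace-free term if it contains a special character."""
    # Keep a trailing * outside the quotes so it still acts as a wildcard.
    wildcard = "*" if term.endswith("*") and len(term) > 1 else ""
    core = term[:-1] if wildcard else term
    if not any(ch in SPECIAL_CHARS for ch in core):
        return term
    # FTS5 escapes a literal double quote by doubling it.
    escaped = core.replace('"', '""')
    return f'"{escaped}"{wildcard}'


def quote_query(query: str) -> str:
    """Apply quote_term to every whitespace-separated term."""
    return " ".join(quote_term(term) for term in query.split())


print(quote_query("Bydgoszcz, Poland"))
print(quote_query("time:12:00"))
print(quote_query("test (example)"))
print(quote_query("city:*"))
```

Under these assumptions, the four calls reproduce the internal queries listed in the examples above.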

Manual Quoting

You can still use explicit quotes for exact phrase matching:

# Exact phrase match
collection.query('"exact phrase"')

To include a literal quote in your search, use FTS5's double-quote escape:

# Search for: test"value
collection.query('"test""value"')

PostgreSQL Differences

PostgreSQL tsquery handles special characters differently than SQLite FTS5:

SQLite: Quotes special characters to avoid FTS5 syntax errors
PostgreSQL: Replaces special characters with spaces to preserve term boundaries

Example:

# SQLite
collection.query("time:12:00")
# Internal: "time:12:00" (quoted to avoid syntax error)

# PostgreSQL
collection.query("time:12:00")
# Internal: time & 12 & 00 (split into separate words)

Both approaches work because full-text search tokenization typically strips punctuation during indexing, so documents containing these characters can still be found by searching for the words they contain.
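
The PostgreSQL-style handling can be sketched in a few lines. This is an illustrative approximation, not Sifts' actual code: the function name `to_and_query` is hypothetical, and the sketch ignores the `or` operator and the `*` wildcard for brevity.

```python
import re


def to_and_query(query: str) -> str:
    """Replace punctuation with spaces, then join the words with &."""
    # Anything that is neither a word character nor whitespace becomes a space,
    # which keeps the boundaries between the surrounding words.
    cleaned = re.sub(r"[^\w\s]", " ", query)
    return " & ".join(cleaned.split())


print(to_and_query("time:12:00"))
print(to_and_query("Bydgoszcz, Poland"))
```

Replacing rather than deleting the special characters matters here: deleting the colons would fuse `time:12:00` into the single term `time1200`, which would no longer match the separately indexed words.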

Vector search (semantic search)

Sifts can also be used as a vector store, for example for semantic search engines or for retrieval-augmented generation (RAG) with large language models (LLMs).

To enable vector storage, pass an embedding_function when creating the Collection, then set vector_search=True in the query method. For instance, using the Sentence Transformers library:

import sifts
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

def embedding_function(queries: list[str]):
    return model.encode(queries)

collection = sifts.Collection(
    db_url="sqlite:///vector_store.db",
    name="my_vector_store",
    embedding_function=embedding_function
)

# Adding vector data to the collection
collection.add(["This is a test sentence.", "Another example query."])

# Querying the collection with semantic search
results = collection.query("Find similar sentences.", vector_search=True)

PostgreSQL collections require installing and enabling the pgvector extension.

Updating and Deleting Documents

Documents can be updated or deleted using their IDs.

# Update a document
collection.update(ids=["document_id"], contents=["Updated content"])

# Delete a document
collection.delete(ids=["document_id"])

Contributing

Contributions are welcome! Feel free to create an issue if you encounter problems or have a suggestion for improvement. Even better, submit a PR along with it!

License

Sifts is licensed under the MIT License. See the LICENSE file for details.

Happy Sifting! 🚀
