[Feature]: RFC: Biological embeddings support #597

@nleroy917

Description

What feature would you like to request?

RFC: Biological Sequence Embeddings for Fastembed

Abstract

This RFC proposes adding support for biological sequence embeddings (DNA, RNA, protein, small molecules) to fastembed. The implementation would follow existing architectural patterns, require no new core dependencies, and provide a consistent API for embedding biological data using ONNX-converted foundation models.

Motivation

Biological foundation models have emerged as powerful tools for computational biology:

  • DNA/RNA models (DNABERT-2, Nucleotide Transformer) enable genome analysis, variant effect prediction, and regulatory element identification
  • Protein models (ESM-2, ProtTrans) power structure prediction, function annotation, and drug discovery
  • Molecular models (ChemBERTa) support drug screening and molecular property prediction

Currently, using these models requires:

  1. Installing PyTorch/TensorFlow
  2. Managing model-specific dependencies
  3. Writing custom inference code

Fastembed could provide the same lightweight, ONNX-based experience it offers for text/image embeddings, making biological AI accessible to a broader audience.

Use Cases

  1. Bioinformatics pipelines: Embed DNA sequences for similarity search without heavy ML dependencies
  2. Drug discovery: Embed protein targets and small molecules for screening
  3. Genomics applications: Vector search over genomic databases
  4. Research tools: Quick prototyping without PyTorch setup

Detailed Design

API

from fastembed.bio import DNAEmbedding, ProteinEmbedding, MoleculeEmbedding

# DNA sequences
dna = DNAEmbedding("DNABERT-2-117M")
embeddings = list(dna.embed(["ACGTACGTACGT", "GCTAGCTAGCTA"]))
# Returns: list of np.ndarray, shape (768,)

# protein sequences
protein = ProteinEmbedding("ESM2-35M")
embeddings = list(protein.embed(["MKTVRQERLKS", "GKGDPKKPRGKM"]))
# Returns: list of np.ndarray, shape (480,)

# small molecules (SMILES notation)
mol = MoleculeEmbedding("ChemBERTa-zinc-base-v1")
embeddings = list(mol.embed(["CCO", "CC(=O)O"]))  # ethanol, acetic acid
# Returns: list of np.ndarray, shape (768,)

# standard fastembed patterns work
DNAEmbedding.list_supported_models()
dna.embed(sequences, batch_size=32, parallel=4)
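As a usage sketch, the returned vectors plug directly into similarity search; the snippet below uses stand-in vectors, since DNAEmbedding does not exist yet:

import numpy as np

# Hypothetical downstream use of the proposed API:
#   dna = DNAEmbedding("DNABERT-2-117M")
#   query, target = list(dna.embed(["ACGTACGTACGT", "GCTAGCTAGCTA"]))
# Stand-in vectors with the proposed dimensionality (768):
query, target = np.random.rand(768), np.random.rand(768)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(query, target))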

Module Structure

fastembed/
└── bio/
    ├── __init__.py              # Public exports
    ├── base.py                  # BiologicalEmbeddingBase
    ├── dna_embedding.py         # DNAEmbedding, OnnxDNAEmbedding
    ├── protein_embedding.py     # ProteinEmbedding, OnnxProteinEmbedding
    └── molecule_embedding.py    # MoleculeEmbedding, OnnxMoleculeEmbedding

Class Hierarchy

ModelManagement[DenseModelDescription]
└── BiologicalEmbeddingBase
    ├── DNAEmbedding
    │   └── OnnxDNAEmbedding
    ├── ProteinEmbedding
    │   └── OnnxProteinEmbedding
    └── MoleculeEmbedding
        └── OnnxMoleculeEmbedding

This mirrors the existing TextEmbedding → OnnxTextEmbedding pattern.
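For illustration, dispatch could follow the same registry pattern: the user-facing class delegates to the first backend that supports the requested model. All class and attribute names below are a sketch, not actual fastembed internals:

from typing import Any, Iterable
import numpy as np

class BiologicalEmbeddingBase:
    @classmethod
    def supports(cls, model_name: str) -> bool:
        raise NotImplementedError

class OnnxDNAEmbedding(BiologicalEmbeddingBase):
    SUPPORTED_MODELS = {"DNABERT-2-117M"}

    def __init__(self, model_name: str, **kwargs: Any):
        self.model_name = model_name  # ONNX session setup would live here

    @classmethod
    def supports(cls, model_name: str) -> bool:
        return model_name in cls.SUPPORTED_MODELS

    def embed(self, sequences: Iterable[str], **kwargs: Any) -> Iterable[np.ndarray]:
        raise NotImplementedError  # tokenization + ONNX inference

class DNAEmbedding(BiologicalEmbeddingBase):
    # User-facing class walks the registry and delegates to a backend.
    EMBEDDINGS_REGISTRY = [OnnxDNAEmbedding]

    def __init__(self, model_name: str, **kwargs: Any):
        for backend in self.EMBEDDINGS_REGISTRY:
            if backend.supports(model_name):
                self.model = backend(model_name, **kwargs)
                return
        raise ValueError(f"Model {model_name} is not supported by DNAEmbedding")

    def embed(self, sequences: Iterable[str], **kwargs: Any) -> Iterable[np.ndarray]:
        return self.model.embed(sequences, **kwargs)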

Supported Models

Phase 1: Core Models

Model                   Type      Parameters  Dim  License     Priority
DNABERT-2-117M          DNA/RNA   117M        768  MIT         P0
ESM2-8M                 Protein   8M          320  MIT         P0
ESM2-35M                Protein   35M         480  MIT         P0
ChemBERTa-zinc-base-v1  Molecule  46M         768  Apache 2.0  P1

Phase 2: Extended Models

Model                       Type     Parameters  Dim   License      Priority
ESM2-150M                   Protein  150M        640   MIT          P1
ESM2-650M                   Protein  650M        1280  MIT          P2
Nucleotide-Transformer-50M  DNA      50M         512   CC BY-NC-SA  P2

Tokenization

Biological sequences use simpler vocabularies than natural language:

Type     Alphabet              Tokenization Strategy
DNA      A, C, G, T (4 chars)  BPE or k-mer
Protein  20 amino acids        Single character
SMILES   ~40 characters        BPE

Implementation approach: Use model-provided tokenizer.json files loaded via the existing tokenizers library (already a fastembed dependency).
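
Loading a model-provided tokenizer is then a one-liner; the file path below is illustrative:

from tokenizers import Tokenizer

# Load a model-provided tokenizer.json and encode a protein sequence;
# the resulting ids feed straight into the ONNX model.
tokenizer = Tokenizer.from_file("esm2-35m-onnx/tokenizer.json")
encoding = tokenizer.encode("MKTVRQERLKS")
print(encoding.ids)      # token ids
print(encoding.tokens)   # per-residue tokens for character-level vocabularies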

For models without standard tokenizers, implement lightweight alternatives:

from itertools import product


class KmerTokenizer:
    """Simple k-mer tokenizer for DNA sequences."""

    UNK_ID = 0  # reserved for k-mers outside the vocabulary (e.g. containing "N")

    def __init__(self, k: int = 6, stride: int = 1, vocab: dict[str, int] | None = None):
        self.k = k
        self.stride = stride
        self.vocab = vocab or self._build_vocab()

    def _build_vocab(self) -> dict[str, int]:
        # Enumerate all 4^k k-mers over the DNA alphabet; ids start at 1
        # so that 0 stays reserved for unknown k-mers.
        bases = "ACGT"
        kmers = ["".join(p) for p in product(bases, repeat=self.k)]
        return {kmer: i + 1 for i, kmer in enumerate(kmers)}

    def encode(self, sequence: str) -> list[int]:
        # Slide a window of width k across the sequence with the given stride.
        return [
            self.vocab.get(sequence[i:i + self.k], self.UNK_ID)
            for i in range(0, len(sequence) - self.k + 1, self.stride)
        ]
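
With the defaults (k=6, stride=1), a 12 bp sequence yields 7 overlapping windows:

tokenizer = KmerTokenizer(k=6)
ids = tokenizer.encode("ACGTACGTACGT")
print(len(ids))  # 7: windows starting at positions 0 through 6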

ONNX Model Preparation

Models will be exported using HuggingFace Optimum and hosted on HuggingFace Hub:

# Export DNABERT-2
optimum-cli export onnx \
  --model zhihan1996/DNABERT-2-117M \
  --trust-remote-code \
  --task feature-extraction \
  dnabert2-117m-onnx/

# Export ESM-2
optimum-cli export onnx \
  --model facebook/esm2_t12_35M_UR50D \
  --task feature-extraction \
  esm2-35m-onnx/

Exported models will be hosted at qdrant/DNABERT-2-117M-onnx, qdrant/ESM2-35M-onnx, etc.
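
Each export should be sanity-checked before hosting. A possible check, where the input/output names follow Optimum's feature-extraction defaults and should be verified per model:

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Paths and tensor names are assumptions based on Optimum's defaults.
tokenizer = Tokenizer.from_file("esm2-35m-onnx/tokenizer.json")
session = ort.InferenceSession("esm2-35m-onnx/model.onnx")

encoding = tokenizer.encode("MKTVRQERLKS")
inputs = {
    "input_ids": np.array([encoding.ids], dtype=np.int64),
    "attention_mask": np.array([encoding.attention_mask], dtype=np.int64),
}
(last_hidden_state,) = session.run(["last_hidden_state"], inputs)

# Mean-pool token states into one vector per sequence.
embedding = last_hidden_state.mean(axis=1)[0]
print(embedding.shape)  # expected: (480,) for ESM2-35M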

Sequence Length Handling

Biological sequences vary widely in length:

Type           Typical Length  Model Max Length
DNA fragments  100-10,000 bp   512-2048 tokens
Proteins       100-1000 aa     1024 tokens
SMILES         20-200 chars    512 tokens

Strategy:

  1. Default: Truncate to model's max length with warning
  2. Optional: Chunking with aggregation for long sequences, exposed through the embed signature below:

def embed(
    self,
    sequences: Iterable[str],
    batch_size: int = 32,
    long_sequence_strategy: Literal["truncate", "chunk"] = "truncate",
    chunk_overlap: int = 64,
) -> Iterable[np.ndarray]:
    ...
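
Under the "chunk" strategy, the helper logic could look like the following sketch (function names hypothetical): split long sequences into overlapping windows, embed each window, then mean-aggregate into a single vector.

import numpy as np

def chunk_sequence(sequence: str, max_length: int, overlap: int = 64) -> list[str]:
    """Split a sequence into overlapping windows of at most max_length characters."""
    step = max_length - overlap
    return [sequence[i:i + max_length] for i in range(0, max(1, len(sequence) - overlap), step)]

def aggregate_chunks(chunk_embeddings: list[np.ndarray]) -> np.ndarray:
    """Mean-pool per-chunk embeddings into one vector for the full sequence."""
    return np.mean(chunk_embeddings, axis=0)

chunks = chunk_sequence("ACGT" * 1000, max_length=512)  # 4000 bp -> 9 overlapping chunks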

Model Description

Extend DenseModelDescription or create a biology-specific variant:

@dataclass(frozen=True)
class BiologicalModelDescription(DenseModelDescription):
    sequence_type: Literal["dna", "rna", "protein", "molecule"]
    max_sequence_length: int
    alphabet: str | None = None  # e.g., "ACGT" for DNA
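
A registry entry built from this dataclass might look as follows; the inherited field names (model, dim) are assumptions about DenseModelDescription, not confirmed API:

# Hypothetical registry entry for the description above; the inherited
# fields shown (model, dim) are assumed, not confirmed fastembed API.
DNABERT_2_117M = BiologicalModelDescription(
    model="qdrant/DNABERT-2-117M-onnx",
    dim=768,
    sequence_type="dna",
    max_sequence_length=2048,
    alphabet="ACGT",
)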

Dependencies

No new required dependencies. The implementation uses:

Dependency       Status    Usage
tokenizers       Existing  BPE/WordPiece tokenization
numpy            Existing  Array operations
onnxruntime      Existing  Model inference
huggingface-hub  Existing  Model download

Optional dependencies for extended functionality:

[project.optional-dependencies]
bio-molecules = ["rdkit>=2023.0.0"]  # Molecular fingerprints, SMILES validation
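
With rdkit installed, SMILES inputs can be validated before embedding, for example:

from rdkit import Chem

def is_valid_smiles(smiles: str) -> bool:
    """Return True if rdkit can parse the SMILES string."""
    return Chem.MolFromSmiles(smiles) is not None

print(is_valid_smiles("CC(=O)O"))   # True: acetic acid
print(is_valid_smiles("C(=O)(=O"))  # False: unbalanced parenthesis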

Drawbacks

1. Maintenance Burden

Adding a new modality increases the surface area for bugs and compatibility issues. Biological models may have less stable ONNX export support than mainstream NLP models.

Mitigation: Start with well-tested models (ESM-2 is in HuggingFace Transformers) and add others incrementally.

2. Model Size

Some biological models are large:

  • ESM2-650M: ~2.5 GB
  • DNABERT-2: ~470 MB

Mitigation: Prioritize smaller models (ESM2-8M: ~32 MB) and document memory requirements.

3. Niche Use Case

Biological embeddings serve a smaller audience than text/image embeddings.

Mitigation: The implementation cost is low (reuses existing infrastructure), and the bioinformatics community is underserved by lightweight embedding tools.

4. ONNX Export Compatibility

Some models use custom architectures that may not export cleanly:

  • DNABERT-2 uses ALiBi positional encoding
  • Some models require trust_remote_code=True

Mitigation: Validate exports thoroughly before adding models. Start with models known to export well.

Alternatives Considered

Alternative 1: Don't Add Biological Support

Users could convert models to ONNX themselves and use TextEmbedding.add_custom_model().

Rejected because:

  • High barrier to entry for non-ML users
  • No standardized API for biological sequences
  • Tokenization handling left to users

Alternative 2: Separate Package (fastembed-bio)

Create a standalone package that depends on fastembed.

Rejected because:

  • Fragments the ecosystem
  • Complicates installation
  • The implementation fits naturally in the main package

Alternative 3: Support Single-Cell Models

Include models like Tahoe-x1, scGPT, and Geneformer.

Rejected because:

  • Heavy dependencies (scanpy, anndata, FlashAttention)
  • Non-standard input formats (h5ad sparse matrices)
  • GPU requirements (A100+ for FlashAttention)
  • Conflicts with fastembed's lightweight philosophy

Users needing single-cell embeddings should use native model interfaces.

Is there any additional information you would like to provide?

No response
