What feature would you like to request?
RFC: Biological Sequence Embeddings for Fastembed
Abstract
This RFC proposes adding support for biological sequence embeddings (DNA, RNA, protein, small molecules) to fastembed. The implementation would follow existing architectural patterns, require no new core dependencies, and provide a consistent API for embedding biological data using ONNX-converted foundation models.
Motivation
Biological foundation models have emerged as powerful tools for computational biology:
- DNA/RNA models (DNABERT-2, Nucleotide Transformer) enable genome analysis, variant effect prediction, and regulatory element identification
- Protein models (ESM-2, ProtTrans) power structure prediction, function annotation, and drug discovery
- Molecular models (ChemBERTa) support drug screening and molecular property prediction
Currently, using these models requires:
- Installing PyTorch/TensorFlow
- Managing model-specific dependencies
- Writing custom inference code
Fastembed could provide the same lightweight, ONNX-based experience it offers for text/image embeddings, making biological AI accessible to a broader audience.
Use Cases
- Bioinformatics pipelines: Embed DNA sequences for similarity search without heavy ML dependencies
- Drug discovery: Embed protein targets and small molecules for screening
- Genomics applications: Vector search over genomic databases
- Research tools: Quick prototyping without PyTorch setup
Detailed Design
API
```python
from fastembed.bio import DNAEmbedding, ProteinEmbedding, MoleculeEmbedding

# DNA sequences
dna = DNAEmbedding("DNABERT-2-117M")
embeddings = list(dna.embed(["ACGTACGTACGT", "GCTAGCTAGCTA"]))
# Returns: list of np.ndarray, shape (768,)

# Protein sequences
protein = ProteinEmbedding("ESM2-35M")
embeddings = list(protein.embed(["MKTVRQERLKS", "GKGDPKKPRGKM"]))
# Returns: list of np.ndarray, shape (480,)

# Small molecules (SMILES notation)
mol = MoleculeEmbedding("ChemBERTa-zinc-base-v1")
embeddings = list(mol.embed(["CCO", "CC(=O)O"]))  # ethanol, acetic acid
# Returns: list of np.ndarray, shape (768,)

# Standard fastembed patterns work
DNAEmbedding.list_supported_models()
dna.embed(sequences, batch_size=32, parallel=4)
```
Module Structure
```
fastembed/
└── bio/
    ├── __init__.py            # Public exports
    ├── base.py                # BiologicalEmbeddingBase
    ├── dna_embedding.py       # DNAEmbedding, OnnxDNAEmbedding
    ├── protein_embedding.py   # ProteinEmbedding, OnnxProteinEmbedding
    └── molecule_embedding.py  # MoleculeEmbedding, OnnxMoleculeEmbedding
```
Class Hierarchy
```
ModelManagement[DenseModelDescription]
└── BiologicalEmbeddingBase
    ├── DNAEmbedding
    │   └── OnnxDNAEmbedding
    ├── ProteinEmbedding
    │   └── OnnxProteinEmbedding
    └── MoleculeEmbedding
        └── OnnxMoleculeEmbedding
```
This mirrors the existing TextEmbedding → OnnxTextEmbedding pattern.
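To make the hierarchy concrete, a minimal illustrative sketch of the base class is shown below. The method names are modeled on the public API in the example above; `ABC` stands in for the `ModelManagement` base, so this is not fastembed's actual internals:

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable

import numpy as np


class BiologicalEmbeddingBase(ABC):
    """Shared interface for DNA, protein, and molecule embedding models (sketch)."""

    @classmethod
    @abstractmethod
    def list_supported_models(cls) -> list[dict[str, Any]]:
        """Return descriptions of the models this class can load."""

    @abstractmethod
    def embed(
        self,
        sequences: Iterable[str],
        batch_size: int = 32,
        parallel: int | None = None,
        **kwargs: Any,
    ) -> Iterable[np.ndarray]:
        """Tokenize sequences, run ONNX inference, and yield one vector per input."""
```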
Supported Models
Phase 1: Core Models
| Model | Type | Parameters | Dim | License | Priority |
|---|---|---|---|---|---|
| DNABERT-2-117M | DNA/RNA | 117M | 768 | MIT | P0 |
| ESM2-8M | Protein | 8M | 320 | MIT | P0 |
| ESM2-35M | Protein | 35M | 480 | MIT | P0 |
| ChemBERTa-zinc-base-v1 | Molecule | 46M | 768 | Apache 2.0 | P1 |
Phase 2: Extended Models
| Model | Type | Parameters | Dim | License | Priority |
|---|---|---|---|---|---|
| ESM2-150M | Protein | 150M | 640 | MIT | P1 |
| ESM2-650M | Protein | 650M | 1280 | MIT | P2 |
| Nucleotide-Transformer-50M | DNA | 50M | 512 | CC BY-NC-SA | P2 |
Tokenization
Biological sequences use simpler vocabularies than natural language:
| Type | Alphabet | Tokenization Strategy |
|---|---|---|
| DNA | A, C, G, T (4 chars) | BPE or k-mer |
| Protein | 20 amino acids | Single character |
| SMILES | ~40 characters | BPE |
Implementation approach: Use model-provided tokenizer.json files loaded via the existing tokenizers library (already a fastembed dependency).
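For models that ship a standard tokenizer, loading is a one-liner with the tokenizers library; the file path below is illustrative:

```python
from tokenizers import Tokenizer

# Load the tokenizer shipped alongside the ONNX export (path is illustrative)
tokenizer = Tokenizer.from_file("dnabert2-117m-onnx/tokenizer.json")

encoding = tokenizer.encode("ACGTACGTACGT")
print(encoding.ids)  # token ids fed to the ONNX model
```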
For models without standard tokenizers, implement lightweight alternatives:
```python
from itertools import product


class KmerTokenizer:
    """Simple k-mer tokenizer for DNA sequences.

    Id 0 is reserved for unknown k-mers (e.g. windows containing "N"),
    so known k-mers are numbered from 1.
    """

    UNK_ID = 0

    def __init__(self, k: int = 6, stride: int = 1, vocab: dict[str, int] | None = None):
        self.k = k
        self.stride = stride
        self.vocab = vocab or self._build_vocab()

    def _build_vocab(self) -> dict[str, int]:
        # Enumerate all 4^k k-mers over the DNA alphabet, starting ids at 1
        # so that 0 stays free for the unknown token.
        kmers = ["".join(p) for p in product("ACGT", repeat=self.k)]
        return {kmer: i + 1 for i, kmer in enumerate(kmers)}

    def encode(self, sequence: str) -> list[int]:
        # Slide a window of length k across the sequence with the given stride.
        return [
            self.vocab.get(sequence[i : i + self.k], self.UNK_ID)
            for i in range(0, len(sequence) - self.k + 1, self.stride)
        ]
```
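A quick usage check (id 0 unambiguously marks unknown k-mers):

```python
tok = KmerTokenizer(k=3)
print(tok.encode("ACGTACGT"))  # six overlapping 3-mers -> six ids
print(tok.encode("ACGNACGT"))  # windows containing "N" map to UNK_ID (0)
```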
ONNX Model Preparation
Models will be exported using HuggingFace Optimum and hosted on HuggingFace Hub:
```bash
# Export DNABERT-2
optimum-cli export onnx \
  --model zhihan1996/DNABERT-2-117M \
  --trust-remote-code \
  --task feature-extraction \
  dnabert2-117m-onnx/

# Export ESM-2
optimum-cli export onnx \
  --model facebook/esm2_t12_35M_UR50D \
  --task feature-extraction \
  esm2-35m-onnx/
```
Exported models will be hosted at qdrant/DNABERT-2-117M-onnx, qdrant/ESM2-35M-onnx, etc.
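Each export can be smoke-tested with onnxruntime before publishing. A minimal sketch, assuming the standard transformer input names (input_ids, attention_mask) and an illustrative model path:

```python
import numpy as np
import onnxruntime as ort

# Load the exported graph (path is illustrative)
session = ort.InferenceSession("esm2-35m-onnx/model.onnx")

# Inspect the expected inputs; most transformer exports take
# input_ids and attention_mask
for inp in session.get_inputs():
    print(inp.name, inp.shape)

# Run a dummy batch to confirm the graph executes end to end
dummy = {
    "input_ids": np.ones((1, 16), dtype=np.int64),
    "attention_mask": np.ones((1, 16), dtype=np.int64),
}
outputs = session.run(None, dummy)
print(outputs[0].shape)  # last hidden state, e.g. (1, 16, 480) for ESM2-35M
```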
Sequence Length Handling
Biological sequences vary widely in length:
| Type | Typical Length | Model Max Length |
|---|---|---|
| DNA fragments | 100-10,000 bp | 512-2048 tokens |
| Proteins | 100-1000 aa | 1024 tokens |
| SMILES | 20-200 chars | 512 tokens |
Strategy:
- Default: Truncate to the model's max length with a warning
- Optional: Chunking with aggregation for long sequences (sketched below the signature)
```python
def embed(
    self,
    sequences: Iterable[str],
    batch_size: int = 32,
    long_sequence_strategy: Literal["truncate", "chunk"] = "truncate",
    chunk_overlap: int = 64,
) -> Iterable[np.ndarray]:
    ...
```
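One way the "chunk" strategy could work: split long sequences into overlapping windows, embed each window, and mean-pool the results. This is a sketch under those assumptions; `_embed_one` is a hypothetical per-chunk helper, and other aggregations (max-pooling, first-chunk CLS) would slot in the same way:

```python
import numpy as np


def _chunk_sequence(sequence: str, max_len: int, overlap: int) -> list[str]:
    """Split a sequence into windows of at most max_len characters with the given overlap."""
    step = max_len - overlap
    return [sequence[i : i + max_len] for i in range(0, max(len(sequence) - overlap, 1), step)]


def _embed_long(sequence: str, max_len: int, overlap: int) -> np.ndarray:
    # Embed each chunk independently, then mean-pool the chunk vectors
    # into one fixed-size embedding for the whole sequence.
    chunks = _chunk_sequence(sequence, max_len, overlap)
    vectors = np.stack([_embed_one(chunk) for chunk in chunks])  # _embed_one: hypothetical
    return vectors.mean(axis=0)
```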
Model Description
Extend DenseModelDescription or create a biological-specific variant:
```python
@dataclass(frozen=True)
class BiologicalModelDescription(DenseModelDescription):
    sequence_type: Literal["dna", "rna", "protein", "molecule"]
    max_sequence_length: int
    alphabet: str | None = None  # e.g., "ACGT" for DNA
```
Dependencies
No new required dependencies. The implementation uses:
| Dependency | Status | Usage |
|---|---|---|
| `tokenizers` | Existing | BPE/WordPiece tokenization |
| `numpy` | Existing | Array operations |
| `onnxruntime` | Existing | Model inference |
| `huggingface-hub` | Existing | Model download |
Optional dependencies for extended functionality:
```toml
[project.optional-dependencies]
bio-molecules = ["rdkit>=2023.0.0"]  # Molecular fingerprints, SMILES validation
```
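For example, the extra would allow rejecting malformed SMILES before inference; a minimal sketch (the helper name is illustrative):

```python
from rdkit import Chem


def is_valid_smiles(smiles: str) -> bool:
    """Return True if RDKit can parse the SMILES string into a molecule."""
    return Chem.MolFromSmiles(smiles) is not None


assert is_valid_smiles("CCO")      # ethanol
assert not is_valid_smiles("C(C")  # unbalanced parenthesis -> parse failure
```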
Drawbacks
1. Maintenance Burden
Adding a new modality increases the surface area for bugs and compatibility issues. Biological models may have less stable ONNX export support than mainstream NLP models.
Mitigation: Start with well-tested models (ESM-2 is in HuggingFace Transformers) and add others incrementally.
2. Model Size
Some biological models are large:
- ESM2-650M: ~2.5 GB
- DNABERT-2: ~470 MB
Mitigation: Prioritize smaller models (ESM2-8M: ~32MB) and document memory requirements.
3. Niche Use Case
Biological embeddings serve a smaller audience than text/image embeddings.
Mitigation: The implementation cost is low (reuses existing infrastructure), and the bioinformatics community is underserved by lightweight embedding tools.
4. ONNX Export Compatibility
Some models use custom architectures that may not export cleanly:
- DNABERT-2 uses ALiBi positional encoding
- Some models require trust_remote_code=True
Mitigation: Validate exports thoroughly before adding models. Start with models known to export well.
Alternatives Considered
Alternative 1: Don't Add Biological Support
Users could convert models to ONNX themselves and use TextEmbedding.add_custom_model().
Rejected because:
- High barrier to entry for non-ML users
- No standardized API for biological sequences
- Tokenization handling left to users
Alternative 2: Separate Package (fastembed-bio)
Create a standalone package that depends on fastembed.
Rejected because:
- Fragments the ecosystem
- Complicates installation
- The implementation fits naturally in the main package
Alternative 3: Support Single-Cell Models
Include models like Tahoe-x1, scGPT, and Geneformer.
Rejected because:
- Heavy dependencies (scanpy, anndata, FlashAttention)
- Non-standard input formats (h5ad sparse matrices)
- GPU requirements (A100+ for FlashAttention)
- Conflicts with fastembed's lightweight philosophy
Users needing single-cell embeddings should use native model interfaces.
Is there any additional information you would like to provide?
No response