[Feature]: RFC: Biological embeddings support #597

@nleroy917

Description

What feature would you like to request?

RFC: Biological Sequence Embeddings for Fastembed

Abstract

This RFC proposes adding support for biological sequence embeddings (DNA, RNA, protein, small molecules) to fastembed. The implementation would follow existing architectural patterns, require no new core dependencies, and provide a consistent API for embedding biological data using ONNX-converted foundation models.

Motivation

Biological foundation models have emerged as powerful tools for computational biology:

  • DNA/RNA models (DNABERT-2, Nucleotide Transformer) enable genome analysis, variant effect prediction, and regulatory element identification
  • Protein models (ESM-2, ProtTrans) power structure prediction, function annotation, and drug discovery
  • Molecular models (ChemBERTa) support drug screening and molecular property prediction

Currently, using these models requires:

  1. Installing PyTorch/TensorFlow
  2. Managing model-specific dependencies
  3. Writing custom inference code

Fastembed could provide the same lightweight, ONNX-based experience it offers for text/image embeddings, making biological AI accessible to a broader audience.

Use Cases

  1. Bioinformatics pipelines: Embed DNA sequences for similarity search without heavy ML dependencies
  2. Drug discovery: Embed protein targets and small molecules for screening
  3. Genomics applications: Vector search over genomic databases
  4. Research tools: Quick prototyping without PyTorch setup

Detailed Design

API

from fastembed.bio import DNAEmbedding, ProteinEmbedding, MoleculeEmbedding

# DNA sequences
dna = DNAEmbedding("DNABERT-2-117M")
embeddings = list(dna.embed(["ACGTACGTACGT", "GCTAGCTAGCTA"]))
# Returns: list of np.ndarray, shape (768,)

# protein sequences
protein = ProteinEmbedding("ESM2-35M")
embeddings = list(protein.embed(["MKTVRQERLKS", "GKGDPKKPRGKM"]))
# Returns: list of np.ndarray, shape (480,)

# small molecules (SMILES notation)
mol = MoleculeEmbedding("ChemBERTa-zinc-base-v1")
embeddings = list(mol.embed(["CCO", "CC(=O)O"]))  # ethanol, acetic acid
# Returns: list of np.ndarray, shape (768,)

# standard fastembed patterns work
DNAEmbedding.list_supported_models()
dna.embed(sequences, batch_size=32, parallel=4)
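As a usage sketch, the returned vectors plug directly into similarity search; the snippet below uses stand-in vectors, since DNAEmbedding does not exist yet:

import numpy as np

# Hypothetical downstream use of the proposed API:
#   dna = DNAEmbedding("DNABERT-2-117M")
#   query, target = list(dna.embed(["ACGTACGTACGT", "GCTAGCTAGCTA"]))
# Stand-in vectors with the proposed dimensionality (768):
query, target = np.random.rand(768), np.random.rand(768)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(query, target))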

Module Structure

fastembed/
└── bio/
    ├── __init__.py              # Public exports
    ├── base.py                  # BiologicalEmbeddingBase
    ├── dna_embedding.py         # DNAEmbedding, OnnxDNAEmbedding
    ├── protein_embedding.py     # ProteinEmbedding, OnnxProteinEmbedding
    └── molecule_embedding.py    # MoleculeEmbedding, OnnxMoleculeEmbedding

Class Hierarchy

ModelManagement[DenseModelDescription]
└── BiologicalEmbeddingBase
    ├── DNAEmbedding
    │   └── OnnxDNAEmbedding
    ├── ProteinEmbedding
    │   └── OnnxProteinEmbedding
    └── MoleculeEmbedding
        └── OnnxMoleculeEmbedding

This mirrors the existing TextEmbedding → OnnxTextEmbedding pattern.
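For illustration, dispatch could follow the same registry pattern: the user-facing class delegates to the first backend that supports the requested model. All class and attribute names below are a sketch, not actual fastembed internals:

from typing import Any, Iterable
import numpy as np

class BiologicalEmbeddingBase:
    @classmethod
    def supports(cls, model_name: str) -> bool:
        raise NotImplementedError

class OnnxDNAEmbedding(BiologicalEmbeddingBase):
    SUPPORTED_MODELS = {"DNABERT-2-117M"}

    def __init__(self, model_name: str, **kwargs: Any):
        self.model_name = model_name  # ONNX session setup would live here

    @classmethod
    def supports(cls, model_name: str) -> bool:
        return model_name in cls.SUPPORTED_MODELS

    def embed(self, sequences: Iterable[str], **kwargs: Any) -> Iterable[np.ndarray]:
        raise NotImplementedError  # tokenization + ONNX inference

class DNAEmbedding(BiologicalEmbeddingBase):
    # User-facing class walks the registry and delegates to a backend.
    EMBEDDINGS_REGISTRY = [OnnxDNAEmbedding]

    def __init__(self, model_name: str, **kwargs: Any):
        for backend in self.EMBEDDINGS_REGISTRY:
            if backend.supports(model_name):
                self.model = backend(model_name, **kwargs)
                return
        raise ValueError(f"Model {model_name} is not supported by DNAEmbedding")

    def embed(self, sequences: Iterable[str], **kwargs: Any) -> Iterable[np.ndarray]:
        return self.model.embed(sequences, **kwargs)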

Supported Models

Phase 1: Core Models

Model                   Type      Parameters  Dim  License     Priority
DNABERT-2-117M          DNA/RNA   117M        768  MIT         P0
ESM2-8M                 Protein   8M          320  MIT         P0
ESM2-35M                Protein   35M         480  MIT         P0
ChemBERTa-zinc-base-v1  Molecule  46M         768  Apache 2.0  P1

Phase 2: Extended Models

Model                       Type     Parameters  Dim   License      Priority
ESM2-150M                   Protein  150M        640   MIT          P1
ESM2-650M                   Protein  650M        1280  MIT          P2
Nucleotide-Transformer-50M  DNA      50M         512   CC BY-NC-SA  P2

Tokenization

Biological sequences use simpler vocabularies than natural language:

Type     Alphabet              Tokenization Strategy
DNA      A, C, G, T (4 chars)  BPE or k-mer
Protein  20 amino acids        Single character
SMILES   ~40 characters        BPE

Implementation approach: Use model-provided tokenizer.json files loaded via the existing tokenizers library (already a fastembed dependency).
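
Loading a model-provided tokenizer is then a one-liner; the file path below is illustrative:

from tokenizers import Tokenizer

# Load a model-provided tokenizer.json and encode a protein sequence;
# the resulting ids feed straight into the ONNX model.
tokenizer = Tokenizer.from_file("esm2-35m-onnx/tokenizer.json")
encoding = tokenizer.encode("MKTVRQERLKS")
print(encoding.ids)      # token ids
print(encoding.tokens)   # per-residue tokens for character-level vocabularies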

For models without standard tokenizers, implement lightweight alternatives:

from itertools import product


class KmerTokenizer:
    """Simple k-mer tokenizer for DNA sequences."""

    UNK_ID = 0  # reserved for k-mers outside the vocabulary (e.g. containing "N")

    def __init__(self, k: int = 6, stride: int = 1, vocab: dict[str, int] | None = None):
        self.k = k
        self.stride = stride
        self.vocab = vocab or self._build_vocab()

    def _build_vocab(self) -> dict[str, int]:
        # Enumerate all 4^k k-mers over the DNA alphabet; ids start at 1
        # so that 0 stays reserved for unknown k-mers.
        bases = "ACGT"
        kmers = ["".join(p) for p in product(bases, repeat=self.k)]
        return {kmer: i + 1 for i, kmer in enumerate(kmers)}

    def encode(self, sequence: str) -> list[int]:
        # Slide a window of width k across the sequence with the given stride.
        return [
            self.vocab.get(sequence[i:i + self.k], self.UNK_ID)
            for i in range(0, len(sequence) - self.k + 1, self.stride)
        ]
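
With the defaults (k=6, stride=1), a 12 bp sequence yields 7 overlapping windows:

tokenizer = KmerTokenizer(k=6)
ids = tokenizer.encode("ACGTACGTACGT")
print(len(ids))  # 7: windows starting at positions 0 through 6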

ONNX Model Preparation

Models will be exported using HuggingFace Optimum and hosted on HuggingFace Hub:

# Export DNABERT-2
optimum-cli export onnx \
  --model zhihan1996/DNABERT-2-117M \
  --trust-remote-code \
  --task feature-extraction \
  dnabert2-117m-onnx/

# Export ESM-2
optimum-cli export onnx \
  --model facebook/esm2_t12_35M_UR50D \
  --task feature-extraction \
  esm2-35m-onnx/

Exported models will be hosted at qdrant/DNABERT-2-117M-onnx, qdrant/ESM2-35M-onnx, etc.
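
Each export should be sanity-checked before hosting. A possible check, where the input/output names follow Optimum's feature-extraction defaults and should be verified per model:

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Paths and tensor names are assumptions based on Optimum's defaults.
tokenizer = Tokenizer.from_file("esm2-35m-onnx/tokenizer.json")
session = ort.InferenceSession("esm2-35m-onnx/model.onnx")

encoding = tokenizer.encode("MKTVRQERLKS")
inputs = {
    "input_ids": np.array([encoding.ids], dtype=np.int64),
    "attention_mask": np.array([encoding.attention_mask], dtype=np.int64),
}
(last_hidden_state,) = session.run(["last_hidden_state"], inputs)

# Mean-pool token states into one vector per sequence.
embedding = last_hidden_state.mean(axis=1)[0]
print(embedding.shape)  # expected: (480,) for ESM2-35M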

Sequence Length Handling

Biological sequences vary widely in length:

Type           Typical Length  Model Max Length
DNA fragments  100-10,000 bp   512-2048 tokens
Proteins       100-1000 aa     1024 tokens
SMILES         20-200 chars    512 tokens

Strategy:

  1. Default: Truncate to model's max length with warning
  2. Optional: Chunking with aggregation for long sequences, exposed through the embed signature below:

def embed(
    self,
    sequences: Iterable[str],
    batch_size: int = 32,
    long_sequence_strategy: Literal["truncate", "chunk"] = "truncate",
    chunk_overlap: int = 64,
) -> Iterable[np.ndarray]:
    ...
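
Under the "chunk" strategy, the helper logic could look like the following sketch (function names hypothetical): split long sequences into overlapping windows, embed each window, then mean-aggregate into a single vector.

import numpy as np

def chunk_sequence(sequence: str, max_length: int, overlap: int = 64) -> list[str]:
    """Split a sequence into overlapping windows of at most max_length characters."""
    step = max_length - overlap
    return [sequence[i:i + max_length] for i in range(0, max(1, len(sequence) - overlap), step)]

def aggregate_chunks(chunk_embeddings: list[np.ndarray]) -> np.ndarray:
    """Mean-pool per-chunk embeddings into one vector for the full sequence."""
    return np.mean(chunk_embeddings, axis=0)

chunks = chunk_sequence("ACGT" * 1000, max_length=512)  # 4000 bp -> 9 overlapping chunks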

Model Description

Extend DenseModelDescription or create a biology-specific variant:

@dataclass(frozen=True)
class BiologicalModelDescription(DenseModelDescription):
    sequence_type: Literal["dna", "rna", "protein", "molecule"]
    max_sequence_length: int
    alphabet: str | None = None  # e.g., "ACGT" for DNA
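
A registry entry built from this dataclass might look as follows; the inherited field names (model, dim) are assumptions about DenseModelDescription, not confirmed API:

# Hypothetical registry entry for the description above; the inherited
# fields shown (model, dim) are assumed, not confirmed fastembed API.
DNABERT_2_117M = BiologicalModelDescription(
    model="qdrant/DNABERT-2-117M-onnx",
    dim=768,
    sequence_type="dna",
    max_sequence_length=2048,
    alphabet="ACGT",
)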

Dependencies

No new required dependencies. The implementation uses:

Dependency       Status    Usage
tokenizers       Existing  BPE/WordPiece tokenization
numpy            Existing  Array operations
onnxruntime      Existing  Model inference
huggingface-hub  Existing  Model download

Optional dependencies for extended functionality:

[project.optional-dependencies]
bio-molecules = ["rdkit>=2023.0.0"]  # Molecular fingerprints, SMILES validation
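
With rdkit installed, SMILES inputs can be validated before embedding, for example:

from rdkit import Chem

def is_valid_smiles(smiles: str) -> bool:
    """Return True if rdkit can parse the SMILES string."""
    return Chem.MolFromSmiles(smiles) is not None

print(is_valid_smiles("CC(=O)O"))   # True: acetic acid
print(is_valid_smiles("C(=O)(=O"))  # False: unbalanced parenthesis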

Drawbacks

1. Maintenance Burden

Adding a new modality increases the surface area for bugs and compatibility issues. Biological models may have less stable ONNX export support than mainstream NLP models.

Mitigation: Start with well-tested models (ESM-2 is in HuggingFace Transformers) and add others incrementally.

2. Model Size

Some biological models are large:

  • ESM2-650M: ~2.5 GB
  • DNABERT-2: ~470 MB

Mitigation: Prioritize smaller models (ESM2-8M: ~32 MB) and document memory requirements.

3. Niche Use Case

Biological embeddings serve a smaller audience than text/image embeddings.

Mitigation: The implementation cost is low (reuses existing infrastructure), and the bioinformatics community is underserved by lightweight embedding tools.

4. ONNX Export Compatibility

Some models use custom architectures that may not export cleanly:

  • DNABERT-2 uses ALiBi positional encoding
  • Some models require trust_remote_code=True

Mitigation: Validate exports thoroughly before adding models. Start with models known to export well.

Alternatives Considered

Alternative 1: Don't Add Biological Support

Users could convert models to ONNX themselves and use TextEmbedding.add_custom_model().

Rejected because:

  • High barrier to entry for non-ML users
  • No standardized API for biological sequences
  • Tokenization handling left to users

Alternative 2: Separate Package (fastembed-bio)

Create a standalone package that depends on fastembed.

Rejected because:

  • Fragments the ecosystem
  • Complicates installation
  • The implementation fits naturally in the main package

Alternative 3: Support Single-Cell Models

Include models like Tahoe-x1, scGPT, and Geneformer.

Rejected because:

  • Heavy dependencies (scanpy, anndata, FlashAttention)
  • Non-standard input formats (h5ad sparse matrices)
  • GPU requirements (A100+ for FlashAttention)
  • Conflicts with fastembed's lightweight philosophy

Users needing single-cell embeddings should use native model interfaces.

Is there any additional information you would like to provide?

No response
