# Multimodal Protein Language Model
This documentation provides an overview, installation instructions, usage examples, and an API reference for the `multimodal_protein_language_model` repository by ayyucedemirbas. The repository supports sequence-to-structure/function prediction using a transformer-based encoder-decoder architecture with mixture-of-experts routing and optional structural image input.

The `MultimodalProteinModel` integrates:
- **Protein Sequence Encoder**: transformer layers with mixture-of-experts routing.
- **Protein Structure/Function Decoder**: generates structural tokens.
- **Image Encoder**: encodes optional 2D structural data for multimodal fusion.
- **Custom learning rate scheduler**: follows the "Attention Is All You Need" warmup strategy.
Use cases include predicting protein secondary/tertiary structures, binding sites, or functional motifs, optionally guided by structural images.
## Repository Structure

```text
multimodal_protein_language_model/
├── README.md         # Minimal original readme
├── LICENSE           # License file
├── encoder.py        # Transformer encoder with MoE layers
├── decoder.py        # Transformer decoder with MoE layers
├── layers.py         # Core MultiheadAttention, MixtureOfExperts, positional encoding
├── model.py          # Complete MultimodalProteinModel class
├── preprocessing.py  # Sequence and structure tokenization utilities
└── training.py       # High-level training routine and entry point
```
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/ayyucedemirbas/multimodal_protein_language_model.git
   cd multimodal_protein_language_model
   ```

2. Create a virtual environment (recommended):

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install tensorflow numpy
   ```
## Preprocessing

`preprocessing.py` provides two helper functions:

- `preprocess_protein_sequence(sequence: str, max_length: int, vocab: dict) -> tf.Tensor`
  converts an amino acid sequence to integer tokens and pads/truncates to `max_length`.
- `preprocess_structure_data(structure_data: List[str], max_length: int, vocab: dict) -> tf.Tensor`
  converts structure tokens (e.g., secondary structure labels) to integers, adds start/end tokens, and pads/truncates.
Example:

```python
from preprocessing import preprocess_protein_sequence, preprocess_structure_data

# Sample vocab: special tokens take ids 0-3, amino acids start at 4
# (starting the amino acids at 3 would collide with <UNK>, and 24 total
# ids matches amino_acid_vocab_size=24 used by the encoder below)
aa_vocab = {aa: i + 4 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
aa_vocab.update({"<PAD>": 0, "<START>": 1, "<END>": 2, "<UNK>": 3})

seq_tensor = preprocess_protein_sequence("ACDIPK", max_length=10, vocab=aa_vocab)
```
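The structure side works the same way. The vocabulary below uses an eight-letter DSSP-style alphabet purely for illustration; the repository does not prescribe a particular label set:

```python
# Hypothetical structure vocabulary (DSSP-style classes, assumed for illustration)
struct_vocab = {s: i + 3 for i, s in enumerate("HBEGITSC")}
struct_vocab.update({"<PAD>": 0, "<START>": 1, "<END>": 2})

# Adds start/end tokens, then pads/truncates to max_length
struct_tensor = preprocess_structure_data(
    ["H", "E", "C", "C"], max_length=10, vocab=struct_vocab
)
```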
## Encoder (`encoder.py`)

- Layers: embedding, positional encoding, and `num_layers` stacked `EncoderLayer`s.
- `EncoderLayer`: multi-head self-attention (with dropout and layer norm) plus a mixture-of-experts feed-forward block.
```python
from encoder import ProteinEncoder

encoder = ProteinEncoder(
    num_layers=6, d_model=512, num_heads=8,
    d_ff=2048, num_experts=8, k=2,
    amino_acid_vocab_size=24, max_position=1024,
    dropout_rate=0.1,
)
enc_output = encoder(input_seq_tensor)
```
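`input_seq_tensor` is left undefined above. Assuming the encoder takes a `(batch, seq_len)` tensor of token ids (conventional, but not verified against the repository), it can be built from the preprocessing example:

```python
import tensorflow as tf

# Batch the single preprocessed sequence; the (batch, seq_len) input
# shape and the (batch, seq_len, d_model) output shape are assumptions.
input_seq_tensor = tf.expand_dims(seq_tensor, 0)  # shape: (1, 10)
enc_output = encoder(input_seq_tensor)
print(enc_output.shape)  # expected: (1, 10, 512)
```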
## Decoder (`decoder.py`)

- Layers: embedding, positional encoding, and `num_layers` stacked `DecoderLayer`s.
- `DecoderLayer`: masked self-attention, encoder-decoder cross-attention, and a mixture-of-experts feed-forward block.
```python
from decoder import ProteinDecoder

decoder = ProteinDecoder(
    num_layers=6, d_model=512, num_heads=8,
    d_ff=2048, num_experts=8, k=2,
    target_vocab_size=structure_vocab_size,  # e.g. len(struct_vocab)
    max_position=1024,
)
logits, attn_weights = decoder(target_tokens, enc_output)
```
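Both `structure_vocab_size` and `target_tokens` are left undefined above. A hedged way to wire them up from the earlier examples, run before the decoder snippet; the teacher-forcing convention and `(batch, tgt_len)` shape are assumptions, not verified against the repository:

```python
import tensorflow as tf

structure_vocab_size = len(struct_vocab)  # from the preprocessing example

# Teacher forcing: feed the target sequence shifted right (drop the final
# token); during real training, the model's train_step handles this.
target_tokens = tf.expand_dims(struct_tensor[:-1], 0)  # shape: (1, 9)
```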
## Image Encoder and Fusion (`model.py`)

- Image Encoder: three Conv2D + MaxPool blocks, then Flatten and a Dense projection to `d_model`.
- Fusion: concatenate sequence features with the image features (repeated along the sequence axis), then project back via `Dense(d_model)`; see the sketch below.
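An independent sketch of that fusion pattern, not the repository's actual code; the filter counts, kernel sizes, and repeat-then-concat wiring are assumptions consistent with the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers

d_model = 512

# Three Conv2D + MaxPool blocks, then Flatten and Dense to d_model
# (filter counts are illustrative, not taken from model.py)
image_encoder = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPool2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPool2D(),
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.MaxPool2D(),
    layers.Flatten(),
    layers.Dense(d_model),
])

fusion_proj = layers.Dense(d_model)

def fuse(seq_features, image):
    # seq_features: (batch, seq_len, d_model); image: (batch, H, W, C)
    img_feat = image_encoder(image)                       # (batch, d_model)
    seq_len = tf.shape(seq_features)[1]
    img_feat = tf.repeat(img_feat[:, tf.newaxis, :], seq_len, axis=1)
    fused = tf.concat([seq_features, img_feat], axis=-1)  # (batch, seq_len, 2*d_model)
    return fusion_proj(fused)                             # back to (batch, seq_len, d_model)
```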
## Learning Rate Scheduler

```python
import tensorflow as tf
from model import CustomLearningRateScheduler

lr_schedule = CustomLearningRateScheduler(d_model=512, warmup_steps=4000)
optimizer = tf.keras.optimizers.Adam(lr_schedule)
```
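The "Attention Is All You Need" schedule referenced above computes `lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)`. A minimal sketch of what `CustomLearningRateScheduler` plausibly implements; the repository's class may differ in detail:

```python
import tensorflow as tf

class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Transformer warmup schedule from "Attention Is All You Need"."""

    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # Linear warmup for warmup_steps steps, then inverse-sqrt decay
        return tf.math.rsqrt(self.d_model) * tf.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5
        )
```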
## Training (`training.py`)

`train_multimodal_protein_model(...)` orchestrates preprocessing, dataset creation, model compilation, and training. Its arguments include:

- `protein_seqs`: list of strings (amino acid sequences).
- `structure_data`: list of lists/strings of structure labels.
- `structural_images`: optional array of image tensors.
- `batch_size`, `epochs`, model hyperparameters, and `checkpoint_path`.
Example usage:

```python
from training import train_multimodal_protein_model

# Dummy data
protein_seqs = ["ACDEFGHIKLMNPQRS"]
structure_data = [["H", "E", "C", "C"]]

# Train
model, history, aa_vocab, struct_vocab = train_multimodal_protein_model(
    protein_seqs, structure_data, epochs=5, batch_size=2
)
```
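To exercise the multimodal path, the documented `structural_images` and `checkpoint_path` arguments can be supplied as well; the image shape below is a placeholder assumption, not a requirement taken from `training.py`:

```python
import numpy as np

# Hypothetical 64x64 single-channel structural image (shape is assumed)
structural_images = np.random.rand(1, 64, 64, 1).astype("float32")

model, history, aa_vocab, struct_vocab = train_multimodal_protein_model(
    protein_seqs, structure_data,
    structural_images=structural_images,
    epochs=5, batch_size=2,
    checkpoint_path="checkpoints/multimodal.ckpt",  # hypothetical path
)
```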
## API Reference

### `layers.py`

- `MultiheadAttention`: `call([q, k, v], mask=None, training=None)` → `(output, attn_weights)`
- `ExpertLayer`: feed-forward sub-layer.
- `MixtureOfExperts`: `call(x, training=None)` → gated MoE output.
- `positional_encoding(position, d_model)` → tensor of shape `(1, position, d_model)`

### `model.py` and `training.py`

- `MultimodalProteinModel`:
  - `call((protein_seq, structure_targets, structural_image), training)` → `(logits, attention_weights)`
  - `train_step(data)` → dict with `'loss'` and `'accuracy'`
  - `create_masks(inp, tar)` → `(enc_padding_mask, combined_mask, dec_padding_mask)`; see the sketch after this list
  - `metrics` property → `[loss_tracker, accuracy_metric]`
- `train_multimodal_protein_model(...)` → `(model, history, amino_acid_vocab, structure_vocab)`
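For orientation, a sketch of the standard transformer masking that `create_masks` plausibly implements; padding id 0 and the usual TensorFlow broadcasting conventions are assumptions:

```python
import tensorflow as tf

def padding_mask(seq):
    # 1.0 where the token is padding (id 0); broadcastable over attention logits
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]  # (batch, 1, 1, seq_len)

def look_ahead_mask(size):
    # Strictly upper-triangular mask that blocks attention to future positions
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

def create_masks(inp, tar):
    enc_padding_mask = padding_mask(inp)
    dec_padding_mask = padding_mask(inp)  # masks encoder output in cross-attention
    combined_mask = tf.maximum(
        look_ahead_mask(tf.shape(tar)[1]), padding_mask(tar)
    )
    return enc_padding_mask, combined_mask, dec_padding_mask
```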
## License

This project is licensed under the GNU General Public License, Version 3. Feel free to use and modify it.