Julien Guinot*,1,2, Alain Riou*,3, Elio Quinton2, György Fazekas1
1 Centre for Digital Music, Queen Mary University of London, U.K.
2 Music & Audio Machine Learning Lab, Universal Music Group, London, U.K.
3 LTCI, Télécom-Paris, Institut Polytechnique de Paris, France
*Equal contribution, correspondence to [email protected]
Joint embedding spaces have significantly advanced music understanding and generation by linking text and audio through multimodal contrastive learning. However, these approaches rely on large batch sizes to provide enough negative samples, which makes them memory-intensive to train. They also suffer from a modality gap, wherein embeddings from different modalities lie on distinct manifolds of the embedding space.
To address these challenges, we propose Siamese Language-Audio Pretraining (SLAP), a novel multimodal pretraining framework that learns powerful representations without negative samples. SLAP adapts the Bootstrap Your Own Latent (BYOL) paradigm to multimodal audio-text training, making joint embedding spaces more scalable to train. SLAP outperforms CLAP on text-music retrieval and zero-shot classification, is robust to small batch sizes, and exhibits a reduced modality gap.
- No negatives needed: BYOL-style training for audio-text, removing the need for large batch sizes and negative pairs.
- Scalable: Enables large-scale training on a single GPU via gradient accumulation.
- Reduced modality gap: Embeddings from different modalities are better aligned.
- Strong performance: Outperforms CLAP on text-music retrieval and zero-shot classification.
- PyTorch Lightning + Hydra: Modular, reproducible, and easy to configure.
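For intuition, the sketch below shows a BYOL-style audio-text objective in miniature. It is a conceptual illustration only (dummy linear encoders, toy features, EMA target updates omitted), not the repository's training code, which lives in `src/models`.

```python
# Conceptual sketch of a BYOL-style audio-text objective, NOT the actual training code.
# Encoders are dummy linear layers; in SLAP they would be real audio/text towers
# with EMA-updated target copies.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def byol_loss(p, z):
    """Negative-cosine-style loss between a prediction p and a stop-gradient target z."""
    p = F.normalize(p, dim=-1)
    z = F.normalize(z.detach(), dim=-1)  # stop-gradient: no negatives, no contrastive term
    return 2 - 2 * (p * z).sum(dim=-1).mean()

# Dummy online encoders and predictors (placeholders for real audio/text encoders)
audio_online, text_online = nn.Linear(128, 64), nn.Linear(300, 64)
audio_pred, text_pred = nn.Linear(64, 64), nn.Linear(64, 64)
# Target encoders are EMA copies of the online encoders (EMA update omitted here)
audio_target, text_target = copy.deepcopy(audio_online), copy.deepcopy(text_online)

audio_feats, text_feats = torch.randn(8, 128), torch.randn(8, 300)  # toy paired batch

# Each online branch predicts the *other* modality's target embedding (symmetric loss)
pa = audio_pred(audio_online(audio_feats))
pt = text_pred(text_online(text_feats))
with torch.no_grad():
    zt_tgt = text_target(text_feats)
    za_tgt = audio_target(audio_feats)
loss = byol_loss(pa, zt_tgt) + byol_loss(pt, za_tgt)
```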
To train a model, define your configuration in a YAML file (see `configs/` for examples) and run:
python src/train.py data=<your_data_config> model=<your_model_config> model/audio_encoder=<your_audio_encoder> model/text_encoder=<your_text_encoder> logger=<your_logger>
You can override any config option from the command line, e.g.:
python src/train.py model/audio_encoder=htsat_slap
See the Hydra docs for more info on configuration overrides.
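Because SLAP does not rely on in-batch negatives, a large effective batch size can be emulated on a single GPU via gradient accumulation. As a hypothetical example, assuming the trainer config exposes Lightning's standard `accumulate_grad_batches` option and the data config exposes a `batch_size` field (check `configs/` for the exact keys):
python src/train.py trainer.accumulate_grad_batches=16 data.batch_size=32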
This repo is implemented with PyTorch Lightning and uses Hydra for configuration. The structure follows this lightning-hydra-template.
This repo is divided into the following subpackages:
- `configs` contains the recursively defined Hydra configs as YAML files.
- `src` contains the main code:
  - `callbacks` contains home-made callbacks, e.g. for custom logging and visualization.
  - `data` contains data-related code; it is divided into `datamodules` (for loading data) and `transforms` (for pre-processing, data augmentations, etc.).
  - `models` contains the main structures, implemented as `LightningModule`s. Training procedures should be implemented here. A `BaseModule` is already implemented to handle the basics.
  - `networks` contains the neural network architectures, provided as standard PyTorch `nn.Module`s. In practice, these networks shouldn't be used directly but passed as arguments to a `LightningModule` implemented in `models`.
  - `utils` contains almost everything else. In particular, losses and colorful loggers are defined here.
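To illustrate the `models`/`networks` split, here is a hedged sketch (the class and argument names are hypothetical, not the repository's actual API):

```python
import torch.nn as nn
import pytorch_lightning as pl  # or `lightning.pytorch`, depending on your install

class MyAudioEncoder(nn.Module):
    """A plain network: this kind of class would live in src/networks/."""
    def forward(self, x):
        ...

class MyModel(pl.LightningModule):
    """A training procedure: this kind of class would live in src/models/."""
    def __init__(self, audio_encoder: nn.Module):
        super().__init__()
        # The network is injected (Hydra instantiates it from the config),
        # so the same LightningModule can wrap different encoders.
        self.audio_encoder = audio_encoder

model = MyModel(audio_encoder=MyAudioEncoder())
```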
- Train locally:
python src/train.py data=<your_data_config> model=<your_model_config> model/audio_encoder=<your_audio_encoder> model/text_encoder=<your_text_encoder> logger=<your_logger>
- Config customization:
Edit YAML files in `configs/` or override from the command line, e.g. `model/audio_encoder=htsat_slap`.
- Evaluate on test set:
python src/train.py test=true ckpt_path=path/to/checkpoint.ckpt
- Resume training:
python src/train.py resume=true ckpt_path=path/to/checkpoint.ckpt
- All experiment settings are controlled via Hydra YAML configs in `configs/`.
- See `configs/train.yaml` for the main entry point.
- Paths, model, data, and trainer settings are modular and can be overridden from the command line.
You can use the model to extract features or perform zero-shot similarity as follows:
Extract audio/text features:
# Assume 'model' is a loaded SLAP or MusCALL model
# audio: torch.Tensor of shape [batch, ...], text: list of strings
_, audio_emb = model.encode_audio(audio)
_, text_emb = model.encode_text(text)
Zero-shot similarity (e.g., retrieval or probing):
# Compute pairwise cosine similarity between matched audio/text embeddings
import torch.nn.functional as F
sim = F.cosine_similarity(audio_emb, text_emb)
# For a full audio-text similarity matrix (e.g. for retrieval), normalize first:
# sim_matrix = F.normalize(audio_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
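For a fuller picture, here is a hedged zero-shot classification example. The `SLAP` class name, its import path, the checkpoint path, and the prompt template are placeholders; loading relies on Lightning's standard `load_from_checkpoint`, and `audio` is assumed to be a preprocessed batch tensor in whatever format your audio encoder expects.

```python
import torch
import torch.nn.functional as F

# Hypothetical import path and class name: adapt to where the LightningModule lives.
from src.models.slap import SLAP

# Standard Lightning checkpoint loading; the path is a placeholder.
model = SLAP.load_from_checkpoint("path/to/checkpoint.ckpt")
model.eval()

labels = ["rock", "jazz", "classical"]
prompts = [f"a {label} music track" for label in labels]  # example prompt template

# audio: a preprocessed batch tensor in whatever format the audio encoder expects
with torch.no_grad():
    _, audio_emb = model.encode_audio(audio)
    _, text_emb = model.encode_text(prompts)
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    scores = audio_emb @ text_emb.T                     # [batch, num_labels]
    predictions = [labels[i] for i in scores.argmax(dim=-1)]
```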
SLAP is designed to be flexible: you can use the provided `AudioCSVDataset` (see `src/data/audio_csv.py`) or define your own PyTorch `Dataset` or Lightning `DataModule` for custom data formats.
Default CSV-based dataset:
- Prepare a CSV file with columns `npy_path`, `caption`, and `set` (where `set` is one of `train`, `val`, or `test`).
- Audio or spectrogram files should be referenced by `npy_path` (relative to your data root).
Example directory structure:
data/
├── audio/
│ ├── track_1.wav
│ ├── track_2.wav
│ └── ...
├── captions.csv # or .json, with columns: npy_path, caption, set
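As an illustration, a tiny script that writes a CSV in this format (the paths and captions are made up; point `npy_path` at your audio files or precomputed spectrograms, as described above):

```python
import pandas as pd

# Toy example of the expected CSV layout: one row per (audio, caption) pair.
rows = [
    {"npy_path": "audio/track_1.wav", "caption": "an upbeat electronic track with a driving beat", "set": "train"},
    {"npy_path": "audio/track_2.wav", "caption": "a slow acoustic ballad with soft vocals", "set": "val"},
]
pd.DataFrame(rows).to_csv("data/captions.csv", index=False)
```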
- Update `configs/paths/example.yaml` with your data paths.
- See `configs/data/default.yaml` for the expected config fields and how to point to your CSV.
- To use a custom dataset, implement your own `LightningDataModule` and update the `_target_` in your data config (see the sketch below).
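A minimal sketch of such a custom `LightningDataModule` (class names, batch format, and the `_target_` path are assumptions; return whatever your model config expects, typically an audio tensor and a list of caption strings):

```python
# Hypothetical custom DataModule; class names and batch format are placeholders.
import pytorch_lightning as pl  # or `lightning.pytorch`, depending on your install
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    """Should return (audio_tensor, caption_string) pairs, or whatever your model expects."""
    def __init__(self, split: str):
        self.split = split  # load/index your files here

    def __len__(self):
        return 0  # number of examples in this split

    def __getitem__(self, idx):
        raise NotImplementedError  # return one (audio, caption) pair

class MyDataModule(pl.LightningDataModule):
    def __init__(self, batch_size: int = 32, num_workers: int = 4):
        super().__init__()
        self.batch_size = batch_size
        self.num_workers = num_workers

    def setup(self, stage=None):
        self.train_set = MyDataset("train")
        self.val_set = MyDataset("val")

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size,
                          num_workers=self.num_workers, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size,
                          num_workers=self.num_workers)

# In your Hydra data config, point _target_ at this class, e.g.:
# _target_: src.data.my_datamodule.MyDataModule
```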
- Text-Music Retrieval: SLAP outperforms CLAP on standard retrieval benchmarks.
- Zero-Shot Classification: SLAP achieves competitive results on genre classification, instrument classification, and auto-tagging.
- Modality Gap: SLAP reduces the modality gap compared to contrastive approaches.
- Robustness: SLAP is robust to small batch sizes and enables large-scale training on a single GPU.
Model | Pretrained | Dataset | Direction | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
---|---|---|---|---|---|---|---|---|
SLAP | Yes | Song Describer | A → T | 5.7 | 18.1 | 26.6 | 3.2 | 8.9 |
SLAP | Yes | Song Describer | T → A | 6.0 | 18.1 | 26.4 | 3.6 | 10.5 |
CLAP | Yes | Song Describer | A → T | 5.3 | 14.9 | 22.2 | 4.3 | 9.2 |
CLAP | Yes | Song Describer | T → A | 5.7 | 16.8 | 24.1 | 4.3 | 9.7 |
SLAP | Yes | MusicCaps | A → T | 3.1 | 10.1 | 15.4 | 1.9 | 7.7 |
SLAP | Yes | MusicCaps | T → A | 3.0 | 9.6 | 15.4 | 1.9 | 7.7 |
CLAP | Yes | MusicCaps | A → T | 2.8 | 8.3 | 10.4 | 3.0 | 10.0 |
CLAP | Yes | MusicCaps | T → A | 2.8 | 8.7 | 14.0 | 2.2 | 7.6 |
Metrics: R@k = Recall@k (%); Median/Mean Rank (lower is better). All results are from the paper and use pretrained models.
The following UMAP visualizations illustrate how SLAP better aligns audio and text embeddings in the joint space, reducing the modality gap compared to contrastive approaches.
If you use this code, please cite:
@inproceedings{SLAP,
author = {Guinot, Julien and Riou, Alain and Quinton, Elio and Fazekas, György},
booktitle = {Proceedings of the 26th International Society for Music Information Retrieval Conference, ISMIR 2025},
publisher = {International Society for Music Information Retrieval},
title = {SLAP: Siamese Language-Audio Pretraining Without Negative Samples for Music Understanding},
year = {2025}
}
This repository is released under the MIT License. See LICENSE for details.
- Some code adapted from CLAP and MusCALL.
- Built with PyTorch Lightning and Hydra, following this template.
For questions, contact: [email protected]