SonicVerse is a multi-task music captioning model that combines caption generation with auxiliary music feature detection tasks such as key detection and vocals detection. Its projection-based architecture transforms audio input into natural language captions while dedicated auxiliary heads detect music features, so the resulting captions capture both low-level acoustic details and high-level musical attributes. For extended pieces, SonicVerse chains the captions of short segments through a large language model, producing temporally informed long captions that trace the evolving musical narrative.
🔥 Live demo available on Hugging Face
- Multi-Task Learning: Combines caption generation with music feature detection (key detection, vocals detection, etc.)
- Projection-Based Architecture: Transforms audio input into language tokens while maintaining feature detection capabilities
- Enhanced Captioning: Produces rich, descriptive captions that incorporate detected music features
- Long-Form Description: Enables detailed, time-informed descriptions of longer music pieces through LLM chaining (see the sketch after this list)
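Since the chaining step is the least obvious of these, here is a minimal sketch of the idea, assuming per-segment captions have already been generated. The function name, the segment length, and the prompt wording are illustrative, not taken from the repo:

```python
def chain_captions(segment_captions, segment_seconds, llm):
    """Illustrative sketch: merge per-segment SonicVerse captions into one
    time-aware long caption by prompting an LLM. `llm` is any callable that
    maps a prompt string to generated text; the prompt below is a guess at
    the style of instruction, not the repo's actual prompt."""
    timed = [
        f"[{i * segment_seconds}s-{(i + 1) * segment_seconds}s] {caption}"
        for i, caption in enumerate(segment_captions)
    ]
    prompt = (
        "Combine these time-stamped music captions into one coherent, "
        "temporally informed description of the full piece:\n"
        + "\n".join(timed)
    )
    return llm(prompt)

# Example with a placeholder LLM callable:
description = chain_captions(
    ["Upbeat piano intro in C major", "Female vocals enter over drums"],
    segment_seconds=10,
    llm=lambda p: p,  # replace with a real LLM call
)
```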
```bash
git clone https://github.com/AMAAI-Lab/SonicVerse.git
cd SonicVerse
pip install -r requirements.txt
pip install -e .
```

Launch the demo locally:

```bash
python scripts/app.py
```
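Assuming `scripts/app.py` serves the same Gradio interface as the hosted demo, it can also be queried programmatically with `gradio_client`. The endpoint name and argument layout below are assumptions, so check `client.view_api()` for the app's real signature:

```python
from gradio_client import Client, handle_file

# Assumption: the demo serves Gradio on its default local port. The
# api_name and single audio-file argument are guesses; run
# client.view_api() to see the endpoints this app actually exposes.
client = Client("http://127.0.0.1:7860")
result = client.predict(handle_file("path/to/clip.wav"), api_name="/predict")
print(result)
```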
```bash
# Run the training
deepspeed scripts/train_model.py \
    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \
    --model_cls MistralLMMForCausalLM \
    --modality_builder audio_descript \
    --train_dataset_path {path to dataset train split} \
    --evaluation_dataset_path {path to dataset val split} \
    --output_dir {path to output checkpoints directory} \
    --pretrain_projectors \
    --lora_enable True \
    --bf16 True \
    --tf32 True \
    --num_train_epochs 3 \
    --gradient_checkpointing True \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --model_max_length 2048 \
    --evaluation_strategy "steps" \
    --eval_steps 1300 \
    --save_strategy "steps" \
    --save_steps 450 \
    --save_total_limit 3 \
    --learning_rate 1e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --dataloader_num_workers 2 \
    --logging_steps 1 \
    --use_multi_task 2 \
    --tasks_config configs/tasks.json \
    --report_to none \
    --deepspeed ./configs/zero2.json
```
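The final flag points at the repo's ZeRO stage-2 DeepSpeed config. For orientation, a minimal HF-Trainer-compatible ZeRO-2 config usually looks like the sketch below; these values are generic DeepSpeed settings, not the contents of the repo's `configs/zero2.json`:

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
```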
To fine-tune only the projector, switch `--tasks_config` to `configs/tasks_ft.json`.
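The exact schema of these task config files is defined by the repo. Purely as an illustrative guess based on the features listed above, an entry might look something like this (not the actual file contents):

```json
{
  "tasks": [
    { "name": "key_detection", "weight": 1.0 },
    { "name": "vocals_detection", "weight": 1.0 }
  ]
}
```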
If you use SonicVerse in your work, please cite our paper:
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning
Anuradha Chopra, Abhinaba Roy, Dorien Herremans
Accepted to AIMC 2025
```bibtex
@article{chopra2025sonicverse,
  title={SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning},
  author={Chopra, Anuradha and Roy, Abhinaba and Herremans, Dorien},
  journal={Proceedings of the 6th Conference on AI Music Creativity (AIMC 2025)},
  year={2025},
  address={Brussels, Belgium},
  month={September},
  url={https://arxiv.org/abs/2506.15154},
}
```
Read the paper here: [arXiv:2506.15154](https://arxiv.org/abs/2506.15154) (DOI: [10.48550/arXiv.2506.15154](https://doi.org/10.48550/arXiv.2506.15154))