SonicVerse is a multi-task music captioning model that combines caption generation with auxiliary music feature detection tasks such as key detection and vocals detection. Its projection-based architecture transforms audio input into natural language captions while dedicated auxiliary heads detect music features, so the resulting captions capture both low-level acoustic details and high-level musical attributes. For extended pieces, SonicVerse chains the captions of short segments through a large language model, producing temporally informed long captions that trace the evolving musical narrative.
🔥 Live demo available on Hugging Face
- Multi-Task Learning: Combines caption generation with music feature detection (key detection, vocals detection, etc.)
- Projection-Based Architecture: Transforms audio input into language tokens while maintaining feature detection capabilities
- Enhanced Captioning: Produces rich, descriptive captions that incorporate detected music features
- Long-Form Description: Enables detailed, time-informed descriptions of longer music pieces through LLM chaining (see the sketch after this list)
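Since the chaining step is the least obvious of these, here is a minimal sketch of the idea, assuming per-segment captions have already been generated. The function name, the segment length, and the prompt wording are illustrative, not taken from the repo:

```python
def chain_captions(segment_captions, segment_seconds, llm):
    """Illustrative sketch: merge per-segment SonicVerse captions into one
    time-aware long caption by prompting an LLM. `llm` is any callable that
    maps a prompt string to generated text; the prompt below is a guess at
    the style of instruction, not the repo's actual prompt."""
    timed = [
        f"[{i * segment_seconds}s-{(i + 1) * segment_seconds}s] {caption}"
        for i, caption in enumerate(segment_captions)
    ]
    prompt = (
        "Combine these time-stamped music captions into one coherent, "
        "temporally informed description of the full piece:\n"
        + "\n".join(timed)
    )
    return llm(prompt)

# Example with a placeholder LLM callable:
description = chain_captions(
    ["Upbeat piano intro in C major", "Female vocals enter over drums"],
    segment_seconds=10,
    llm=lambda p: p,  # replace with a real LLM call
)
```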
```bash
git clone https://github.com/AMAAI-Lab/SonicVerse.git
cd SonicVerse
pip install -r requirements.txt
pip install -e .
```

Launch the demo locally:

```bash
python scripts/app.py
```
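Assuming `scripts/app.py` serves the same Gradio interface as the hosted demo, it can also be queried programmatically with `gradio_client`. The endpoint name and argument layout below are assumptions, so check `client.view_api()` for the app's real signature:

```python
from gradio_client import Client, handle_file

# Assumption: the demo serves Gradio on its default local port. The
# api_name and single audio-file argument are guesses; run
# client.view_api() to see the endpoints this app actually exposes.
client = Client("http://127.0.0.1:7860")
result = client.predict(handle_file("path/to/clip.wav"), api_name="/predict")
print(result)
```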
```bash
# Run the training
deepspeed scripts/train_model.py \
    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \
    --model_cls MistralLMMForCausalLM \
    --modality_builder audio_descript \
    --train_dataset_path {path to dataset train split} \
    --evaluation_dataset_path {path to dataset val split} \
    --output_dir {path to output checkpoints directory} \
    --pretrain_projectors \
    --lora_enable True \
    --bf16 True \
    --tf32 True \
    --num_train_epochs 3 \
    --gradient_checkpointing True \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --model_max_length 2048 \
    --evaluation_strategy "steps" \
    --eval_steps 1300 \
    --save_strategy "steps" \
    --save_steps 450 \
    --save_total_limit 3 \
    --learning_rate 1e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --dataloader_num_workers 2 \
    --logging_steps 1 \
    --use_multi_task 2 \
    --tasks_config configs/tasks.json \
    --report_to none \
    --deepspeed ./configs/zero2.json
```
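The final flag points at the repo's ZeRO stage-2 DeepSpeed config. For orientation, a minimal HF-Trainer-compatible ZeRO-2 config usually looks like the sketch below; these values are generic DeepSpeed settings, not the contents of the repo's `configs/zero2.json`:

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
```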
To fine-tune only the projector, switch `--tasks_config` to `configs/tasks_ft.json`.
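The exact schema of these task config files is defined by the repo. Purely as an illustrative guess based on the features listed above, an entry might look something like this (not the actual file contents):

```json
{
  "tasks": [
    { "name": "key_detection", "weight": 1.0 },
    { "name": "vocals_detection", "weight": 1.0 }
  ]
}
```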
If you use SonicVerse in your work, please cite our paper:
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning
Anuradha Chopra, Abhinaba Roy, Dorien Herremans
Accepted to AIMC 2025
```bibtex
@article{chopra2025sonicverse,
  title={SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning},
  author={Chopra, Anuradha and Roy, Abhinaba and Herremans, Dorien},
  journal={Proceedings of the 6th Conference on AI Music Creativity (AIMC 2025)},
  year={2025},
  address={Brussels, Belgium},
  month={September},
  url={https://arxiv.org/abs/2506.15154},
}
```
Read the paper here: [arXiv:2506.15154](https://arxiv.org/abs/2506.15154) (DOI: [10.48550/arXiv.2506.15154](https://doi.org/10.48550/arXiv.2506.15154))