
Commit 36bc2d1

Merge pull request #61 from deezer/ismir25
Ismir25 + Waspaa + ACL
2 parents a7afbaf + 557b1c8 commit 36bc2d1

File tree

7 files changed: +116 -0 lines

_posts/2025-06-27-acl-mfrohmann

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
---
layout: post
title: "Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion"
date: 2025-06-27 18:00:00 +0200
category: Publication
author: eepure
readtime: 1
domains:
- MIR
people:
- eepure
- gmeseguerbrocal
- rhennequin
publication_type: conference
publication_title: "Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion"
publication_year: 2025
publication_authors: Markus Frohmann, Elena V Epure, Gabriel Meseguer-Brocal, Markus Schedl, Romain Hennequin
publication_conference: ACL
publication_code: "https://github.com/deezer/robust-AI-lyrics-detection"
publication_preprint: "https://arxiv.org/abs/2506.15981"
---
The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios.
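To picture the late-fusion design, here is a minimal, hedged sketch (not the authors' implementation: the class name, feature dimensions, and the concatenation-plus-linear head are illustrative assumptions). Each view, an embedding of ASR-transcribed lyrics and a speech-feature vector from the audio, gets its own small projection head, and the two are combined only at the classifier.

```python
# Minimal sketch of a multimodal late-fusion detector (illustrative only).
# Assumes two precomputed "views" per track: an embedding of ASR-transcribed
# lyrics and a speech/voice feature vector extracted from the audio.
import torch
import torch.nn as nn

class LateFusionDetector(nn.Module):
    def __init__(self, lyrics_dim=1024, speech_dim=512, hidden=256):
        super().__init__()
        # One projection head per view; the views are fused only at the end ("late" fusion).
        self.lyrics_head = nn.Sequential(nn.Linear(lyrics_dim, hidden), nn.ReLU())
        self.speech_head = nn.Sequential(nn.Linear(speech_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden, 2)  # human-written vs. AI-generated

    def forward(self, lyrics_emb, speech_feat):
        fused = torch.cat([self.lyrics_head(lyrics_emb),
                           self.speech_head(speech_feat)], dim=-1)
        return self.classifier(fused)

# Dummy batch standing in for real lyrics embeddings and speech features.
detector = LateFusionDetector()
logits = detector(torch.randn(8, 1024), torch.randn(8, 512))
print(logits.shape)  # torch.Size([8, 2])
```

Because fusion happens only at the classifier, either branch can be retrained or swapped without touching the other, which is what makes the pipeline modular.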

_posts/2025-09-25-ismir-dafchar

Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
---
layout: post
title: "A Fourier Explanation of AI-music Artifacts"
date: 2025-09-25 18:00:00 +0200
category: Publication
author: dafchar
readtime: 1
domains:
- MIR
people:
- dafchar
- gmeseguerbrocal
- kakesbi
- rhennequin
publication_type: conference
publication_title: "A Fourier Explanation of AI-music Artifacts"
publication_year: 2025
publication_authors: Darius Afchar, Gabriel Meseguer-Brocal, Kamil Akesbi, Romain Hennequin
publication_conference: ISMIR
publication_code: "https://github.com/deezer/ismir25-ai-music-detector"
publication_preprint: "https://arxiv.org/abs/2506.19108"
---
The rapid rise of generative AI has transformed music creation, with millions of users engaging in AI-generated music. Despite its popularity, concerns regarding copyright infringement, job displacement, and ethical implications have led to growing scrutiny and legal challenges. In parallel, AI-detection services have emerged, yet these systems remain largely opaque and privately controlled, mirroring the very issues they aim to address. This paper explores the fundamental properties of synthetic content and how it can be detected. Specifically, we analyze deconvolution modules commonly used in generative models and mathematically prove that their outputs exhibit systematic frequency artifacts -- manifesting as small yet distinctive spectral peaks. This phenomenon, related to the well-known checkerboard artifact, is shown to be inherent to a chosen model architecture rather than a consequence of training data or model weights. We validate our theoretical findings through extensive experiments on open-source models, as well as commercial AI-music generators such as Suno and Udio. We use these insights to propose a simple and interpretable detection criterion for AI-generated music. Despite its simplicity, our method achieves detection accuracy on par with deep learning-based approaches, surpassing 99% accuracy on several scenarios.
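The flavor of such a peak-based criterion can be conveyed with a hedged numpy sketch (the candidate artifact frequencies, the peak window, and the peak-to-neighborhood ratio below are assumptions for illustration, not the paper's exact criterion): the magnitude spectrum is inspected for narrow peaks at frequencies tied to the upsampling factor of a deconvolution layer.

```python
# Illustrative sketch of the peak-based detection idea: transposed-convolution
# (deconvolution) upsampling tends to leave narrow spectral peaks at frequencies
# tied to the upsampling factor. Candidate frequencies, the peak window, and the
# peak-to-neighborhood ratio are assumptions, not the paper's exact criterion.
import numpy as np

def artifact_score(audio, sr, up_factors=(2, 4, 8), half_width_hz=20.0):
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    scores = []
    for u in up_factors:
        f0 = sr / (2 * u)  # candidate artifact frequency for upsampling factor u
        near = np.abs(freqs - f0) <= half_width_hz
        ring = (~near) & (np.abs(freqs - f0) <= 10 * half_width_hz)
        scores.append(spectrum[near].max() / (spectrum[ring].mean() + 1e-12))
    return max(scores)  # a large ratio flags a suspicious narrow peak

# Toy check: white noise vs. the same noise with an injected tone mimicking an artifact.
sr = 44100
t = np.arange(sr) / sr
clean = np.random.randn(sr)
artifacted = clean + 0.05 * np.sin(2 * np.pi * (sr / 4) * t)
print(artifact_score(clean, sr), artifact_score(artifacted, sr))
```

The appeal of this style of criterion is interpretability: the decision reduces to measuring energy at a handful of architecture-determined frequencies rather than querying an opaque classifier.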

_posts/2025-09-25-ismir-mfrohmann

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
---
layout: post
title: "AI-Generated Song Detection via Lyrics Transcripts"
date: 2025-09-25 18:00:00 +0200
category: Publication
author: eepure
readtime: 1
domains:
- MIR
people:
- eepure
- gmeseguerbrocal
- rhennequin
publication_type: conference
publication_title: "AI-Generated Song Detection via Lyrics Transcripts"
publication_year: 2025
publication_authors: Markus Frohmann, Elena V Epure, Gabriel Meseguer-Brocal, Markus Schedl, Romain Hennequin
publication_conference: ISMIR
publication_code: "https://github.com/deezer/robust-AI-lyrics-detection"
publication_preprint: "https://arxiv.org/abs/2506.18488"
---
The recent rise in capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating accurate methods to detect such AI-generated content. This can be done using audio-based detectors; however, it has been shown that they struggle to generalize to unseen generators or when the audio is perturbed. Furthermore, recent work used accurate and cleanly formatted lyrics sourced from a lyrics provider database to detect AI-generated music. In practice, however, such perfect lyrics are not available (only the audio is), which leaves a substantial gap in applicability for real-life use cases. In this work, we propose to close this gap by transcribing songs with general automatic speech recognition (ASR) models and applying several detectors to the resulting transcripts. The results on diverse, multi-genre, and multi-lingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators.
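The two-stage transcribe-then-classify pipeline can be sketched as follows. This is a hedged illustration, not the paper's code: the `transcribe` helper is hypothetical (it assumes the openai-whisper package), a TF-IDF encoder with logistic regression stands in for the LLM2Vec embeddings and detectors used in the paper, and the transcripts and labels are dummies.

```python
# Illustrative two-stage pipeline: transcribe sung lyrics with an ASR model, then
# train a text classifier on the transcripts. TF-IDF + logistic regression stand in
# for the LLM2Vec-based detectors of the paper; data below is dummy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def transcribe(paths):
    # Hypothetical helper; requires the openai-whisper package and model weights.
    import whisper
    asr = whisper.load_model("large-v2")
    return [asr.transcribe(p)["text"] for p in paths]

# Toy transcripts and labels (1 = AI-generated, 0 = human-written) standing in
# for real ASR output from transcribe([...]).
transcripts = ["neon hearts in the midnight rain", "walked down to the river alone"]
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(transcripts, labels)
print(clf.predict(["electric dreams under plastic skies"]))
```

Since only audio is required at inference time, the same pipeline applies whether or not clean reference lyrics exist for a track.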

_posts/2025-09-25-ismir-ykong

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
---
layout: post
title: "Emergent musical properties of a transformer under contrastive self-supervised learning"
date: 2025-09-25 18:00:00 +0200
category: Publication
author: ykong
readtime: 1
domains:
- MIR
people:
- ykong
- gmeseguerbrocal
- rhennequin
publication_type: conference
publication_title: "Emergent musical properties of a transformer under contrastive self-supervised learning"
publication_year: 2025
publication_authors: Yuexuan Kong, Gabriel Meseguer-Brocal, Vincent Lostanlen, Mathieu Lagrange, Romain Hennequin
publication_conference: ISMIR
publication_code: "https://github.com/deezer/emergentmusical-properties-transformer"
publication_preprint: "https://arxiv.org/abs/2506.23873"
---
In music information retrieval (MIR), contrastive self-supervised learning for general-purpose representation models is effective for global tasks such as automatic tagging. However, for local tasks such as chord estimation, it is widely assumed that contrastively trained general-purpose self-supervised models are inadequate and that more sophisticated SSL, e.g., masked modeling, is necessary. Our paper challenges this assumption by revealing the potential of contrastive SSL paired with a transformer in local MIR tasks. We consider a lightweight vision transformer with one-dimensional patches in the time-frequency domain (ViT-1D) and train it with simple contrastive SSL through normalized temperature-scaled cross-entropy loss (NT-Xent). Although NT-Xent operates only over the class token, we observe that, potentially thanks to weight sharing, informative musical properties emerge in ViT-1D's sequence tokens. On global tasks, the temporal average of class and sequence tokens offers a performance increase compared to the class token alone, showing useful properties in the sequence tokens. On local tasks, sequence tokens perform unexpectedly well, despite not being specifically trained for them. Furthermore, high-level musical features such as onsets emerge from layer-wise attention maps, and self-similarity matrices show that different layers capture different musical dimensions. Our paper does not focus on improving performance but advances the musical interpretation of transformers and sheds light on some overlooked abilities of contrastive SSL paired with transformers for sequence modeling in MIR.
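For readers unfamiliar with NT-Xent, here is a compact PyTorch sketch of the loss as it would be applied to class-token embeddings of two augmented views of the same tracks (batch size, embedding dimension, and temperature are illustrative, and the random tensors stand in for real model outputs).

```python
# Minimal NT-Xent (normalized temperature-scaled cross-entropy) sketch, the
# contrastive objective applied to the class token; sizes and temperature are illustrative.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (N, D) class-token embeddings of two augmented views of the same tracks."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit-norm rows
    sim = z @ z.t() / temperature                         # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    # The positive of sample i is its other view: index i+N (or i-N in the second half).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```

Note that the loss only ever sees the class token; the point of the paper is that the sequence tokens nonetheless end up carrying local musical information.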

_posts/2025-10-12-waspaa-ykong

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
---
layout: post
title: "Multi-Class-Token Transformer for Multitask Self-supervised Music Information Retrieval"
date: 2025-10-12 18:00:00 +0200
category: Publication
author: ykong
readtime: 1
domains:
- MIR
people:
- ykong
- rhennequin
- gmeseguerbrocal
publication_type: conference
publication_title: "Multi-Class-Token Transformer for Multitask Self-supervised Music Information Retrieval"
publication_year: 2025
publication_authors: Yuexuan Kong, Vincent Lostanlen, Romain Hennequin, Mathieu Lagrange, Gabriel Meseguer-Brocal
publication_conference: WASPAA
publication_code: "https://github.com/deezer/mt2"
publication_preprint: "https://arxiv.org/abs/2507.12996"
---
Contrastive learning and equivariant learning are effective methods for self-supervised learning (SSL) for audio content analysis. Yet, their application to music information retrieval (MIR) faces a dilemma: the former is more effective on tagging (e.g., instrument recognition) but less effective on structured prediction (e.g., tonality estimation); the latter can match supervised methods on the specific task it is designed for, but it does not generalize well to other tasks. In this article, we adopt a best-of-both-worlds approach by training a deep neural network on both kinds of pretext tasks at once. The proposed architecture is a Vision Transformer with 1-D spectrogram patches (ViT-1D), equipped with two class tokens, which are specialized to different self-supervised pretext tasks but optimized through the same model: hence the name self-supervised multi-class-token multitask (MT2). The former class token optimizes cross-power spectral density (CPSD) for equivariant learning over the circle of fifths, while the latter optimizes normalized temperature-scaled cross-entropy (NT-Xent) for contrastive learning. MT2 combines the strengths of both pretext tasks and consistently outperforms both single-class-token ViT-1D models trained with either contrastive or equivariant learning. Averaging the two class tokens further improves performance on several tasks, highlighting the complementary nature of the representations learned by each class token. Furthermore, using the same single-linear-layer probing method on the last-layer features, MT2 outperforms MERT on all tasks except beat tracking, achieving this with 18x fewer parameters thanks to its multitasking capabilities. Our SSL benchmark demonstrates the versatility of our multi-class-token multitask learning approach for MIR applications.
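The two-class-token layout can be pictured with a short, hedged PyTorch sketch (dimensions, depth, and the generic `nn.TransformerEncoder` are illustrative stand-ins for the ViT-1D backbone, and the CPSD and NT-Xent heads are omitted): each pretext task gets its own learnable class token, both are prepended to the patch sequence, and the same encoder processes everything.

```python
# Hedged sketch of a transformer with two learnable class tokens, one per pretext
# task (equivariant CPSD vs. contrastive NT-Xent); sizes and the generic encoder
# are illustrative stand-ins for the ViT-1D used in the paper.
import torch
import torch.nn as nn

class TwoClassTokenEncoder(nn.Module):
    def __init__(self, dim=192, depth=4, heads=4):
        super().__init__()
        self.cls_equiv = nn.Parameter(torch.randn(1, 1, dim))     # token for the equivariant task
        self.cls_contrast = nn.Parameter(torch.randn(1, 1, dim))  # token for the contrastive task
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):                 # patches: (B, T, dim), e.g. 1-D spectrogram patches
        b = patches.size(0)
        tokens = torch.cat([self.cls_equiv.expand(b, -1, -1),
                            self.cls_contrast.expand(b, -1, -1),
                            patches], dim=1)
        out = self.encoder(tokens)
        return out[:, 0], out[:, 1]             # one embedding per pretext task

model = TwoClassTokenEncoder()
z_equiv, z_contrast = model(torch.randn(2, 96, 192))
# For downstream linear probing, the two class tokens can simply be averaged.
probe_features = (z_equiv + z_contrast) / 2
print(probe_features.shape)  # torch.Size([2, 192])
```

Because the backbone is shared, the extra pretext task costs only one additional token per sequence rather than a second model.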

static/images/photos/gabriel.jpg

191 KB

static/images/photos/yuexuan.jpg

-67.1 KB
