
Commit 36bc2d1

Merge pull request #61 from deezer/ismir25
Ismir25 + Waspaa + ACL
2 parents a7afbaf + 557b1c8 commit 36bc2d1

File tree

7 files changed: +116 -0 lines

_posts/2025-06-27-acl-mfrohmann

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
---
layout: post
title: "Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion"
date: 2025-06-27 18:00:00 +0200
category: Publication
author: eepure
readtime: 1
domains:
- MIR
people:
- eepure
- gmeseguerbrocal
- rhennequin
publication_type: conference
publication_title: "Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion"
publication_year: 2025
publication_authors: Markus Frohmann, Elena V Epure, Gabriel Meseguer-Brocal, Markus Schedl, Romain Hennequin
publication_conference: ACL
publication_code: "https://github.com/deezer/robust-AI-lyrics-detection"
publication_preprint: "https://arxiv.org/abs/2506.15981"
---
The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios.
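To picture the late-fusion design, here is a minimal, hedged sketch (not the authors' implementation: the class name, feature dimensions, and the concatenation-plus-linear head are illustrative assumptions). Each view, an embedding of ASR-transcribed lyrics and a speech-feature vector from the audio, gets its own small projection head, and the two are combined only at the classifier.

```python
# Minimal sketch of a multimodal late-fusion detector (illustrative only).
# Assumes two precomputed "views" per track: an embedding of ASR-transcribed
# lyrics and a speech/voice feature vector extracted from the audio.
import torch
import torch.nn as nn

class LateFusionDetector(nn.Module):
    def __init__(self, lyrics_dim=1024, speech_dim=512, hidden=256):
        super().__init__()
        # One projection head per view; the views are fused only at the end ("late" fusion).
        self.lyrics_head = nn.Sequential(nn.Linear(lyrics_dim, hidden), nn.ReLU())
        self.speech_head = nn.Sequential(nn.Linear(speech_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden, 2)  # human-written vs. AI-generated

    def forward(self, lyrics_emb, speech_feat):
        fused = torch.cat([self.lyrics_head(lyrics_emb),
                           self.speech_head(speech_feat)], dim=-1)
        return self.classifier(fused)

# Dummy batch standing in for real lyrics embeddings and speech features.
detector = LateFusionDetector()
logits = detector(torch.randn(8, 1024), torch.randn(8, 512))
print(logits.shape)  # torch.Size([8, 2])
```

Because fusion happens only at the classifier, either branch can be retrained or swapped without touching the other, which is what makes the pipeline modular.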

_posts/2025-09-25-ismir-dafchar

Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
---
layout: post
title: "A Fourier Explanation of AI-music Artifacts"
date: 2025-09-25 18:00:00 +0200
category: Publication
author: dafchar
readtime: 1
domains:
- MIR
people:
- dafchar
- gmeseguerbrocal
- kakesbi
- rhennequin
publication_type: conference
publication_title: "A Fourier Explanation of AI-music Artifacts"
publication_year: 2025
publication_authors: Darius Afchar, Gabriel Meseguer-Brocal, Kamil Akesbi, Romain Hennequin
publication_conference: ISMIR
publication_code: "https://github.com/deezer/ismir25-ai-music-detector"
publication_preprint: "https://arxiv.org/abs/2506.19108"
---
The rapid rise of generative AI has transformed music creation, with millions of users engaging in AI-generated music. Despite its popularity, concerns regarding copyright infringement, job displacement, and ethical implications have led to growing scrutiny and legal challenges. In parallel, AI-detection services have emerged, yet these systems remain largely opaque and privately controlled, mirroring the very issues they aim to address. This paper explores the fundamental properties of synthetic content and how it can be detected. Specifically, we analyze deconvolution modules commonly used in generative models and mathematically prove that their outputs exhibit systematic frequency artifacts -- manifesting as small yet distinctive spectral peaks. This phenomenon, related to the well-known checkerboard artifact, is shown to be inherent to a chosen model architecture rather than a consequence of training data or model weights. We validate our theoretical findings through extensive experiments on open-source models, as well as commercial AI-music generators such as Suno and Udio. We use these insights to propose a simple and interpretable detection criterion for AI-generated music. Despite its simplicity, our method achieves detection accuracy on par with deep learning-based approaches, surpassing 99% accuracy on several scenarios.
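The flavor of such a peak-based criterion can be conveyed with a hedged numpy sketch (the candidate artifact frequencies, the peak window, and the peak-to-neighborhood ratio below are assumptions for illustration, not the paper's exact criterion): the magnitude spectrum is inspected for narrow peaks at frequencies tied to the upsampling factor of a deconvolution layer.

```python
# Illustrative sketch of the peak-based detection idea: transposed-convolution
# (deconvolution) upsampling tends to leave narrow spectral peaks at frequencies
# tied to the upsampling factor. Candidate frequencies, the peak window, and the
# peak-to-neighborhood ratio are assumptions, not the paper's exact criterion.
import numpy as np

def artifact_score(audio, sr, up_factors=(2, 4, 8), half_width_hz=20.0):
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    scores = []
    for u in up_factors:
        f0 = sr / (2 * u)  # candidate artifact frequency for upsampling factor u
        near = np.abs(freqs - f0) <= half_width_hz
        ring = (~near) & (np.abs(freqs - f0) <= 10 * half_width_hz)
        scores.append(spectrum[near].max() / (spectrum[ring].mean() + 1e-12))
    return max(scores)  # a large ratio flags a suspicious narrow peak

# Toy check: white noise vs. the same noise with an injected tone mimicking an artifact.
sr = 44100
t = np.arange(sr) / sr
clean = np.random.randn(sr)
artifacted = clean + 0.05 * np.sin(2 * np.pi * (sr / 4) * t)
print(artifact_score(clean, sr), artifact_score(artifacted, sr))
```

The appeal of this style of criterion is interpretability: the decision reduces to measuring energy at a handful of architecture-determined frequencies rather than querying an opaque classifier.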

_posts/2025-09-25-ismir-mfrohmann

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
---
layout: post
title: "AI-Generated Song Detection via Lyrics Transcripts"
date: 2025-09-25 18:00:00 +0200
category: Publication
author: eepure
readtime: 1
domains:
- MIR
people:
- eepure
- gmeseguerbrocal
- rhennequin
publication_type: conference
publication_title: "AI-Generated Song Detection via Lyrics Transcripts"
publication_year: 2025
publication_authors: Markus Frohmann, Elena V Epure, Gabriel Meseguer-Brocal, Markus Schedl, Romain Hennequin
publication_conference: ISMIR
publication_code: "https://github.com/deezer/robust-AI-lyrics-detection"
publication_preprint: "https://arxiv.org/abs/2506.18488"
---
The recent rise in capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating accurate methods to detect such AI-generated content. This can be done using audio-based detectors; however, it has been shown that they struggle to generalize to unseen generators or when the audio is perturbed. Furthermore, recent work used accurate and cleanly formatted lyrics sourced from a lyrics provider database to detect AI-generated music. In practice, however, such perfect lyrics are not available (only the audio is), which leaves a substantial gap in applicability for real-life use cases. In this work, we propose to close this gap by transcribing songs with general automatic speech recognition (ASR) models and applying several detectors to the resulting transcripts. The results on diverse, multi-genre, and multi-lingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators.
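The two-stage transcribe-then-classify pipeline can be sketched as follows. This is a hedged illustration, not the paper's code: the `transcribe` helper is hypothetical (it assumes the openai-whisper package), a TF-IDF encoder with logistic regression stands in for the LLM2Vec embeddings and detectors used in the paper, and the transcripts and labels are dummies.

```python
# Illustrative two-stage pipeline: transcribe sung lyrics with an ASR model, then
# train a text classifier on the transcripts. TF-IDF + logistic regression stand in
# for the LLM2Vec-based detectors of the paper; data below is dummy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def transcribe(paths):
    # Hypothetical helper; requires the openai-whisper package and model weights.
    import whisper
    asr = whisper.load_model("large-v2")
    return [asr.transcribe(p)["text"] for p in paths]

# Toy transcripts and labels (1 = AI-generated, 0 = human-written) standing in
# for real ASR output from transcribe([...]).
transcripts = ["neon hearts in the midnight rain", "walked down to the river alone"]
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(transcripts, labels)
print(clf.predict(["electric dreams under plastic skies"]))
```

Since only audio is required at inference time, the same pipeline applies whether or not clean reference lyrics exist for a track.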

_posts/2025-09-25-ismir-ykong

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
---
layout: post
title: "Emergent musical properties of a transformer under contrastive self-supervised learning"
date: 2025-09-25 18:00:00 +0200
category: Publication
author: ykong
readtime: 1
domains:
- MIR
people:
- ykong
- gmeseguerbrocal
- rhennequin
publication_type: conference
publication_title: "Emergent musical properties of a transformer under contrastive self-supervised learning"
publication_year: 2025
publication_authors: Yuexuan Kong, Gabriel Meseguer-Brocal, Vincent Lostanlen, Mathieu Lagrange, Romain Hennequin
publication_conference: ISMIR
publication_code: "https://github.com/deezer/emergentmusical-properties-transformer"
publication_preprint: "https://arxiv.org/abs/2506.23873"
---
In music information retrieval (MIR), contrastive self-supervised learning for general-purpose representation models is effective for global tasks such as automatic tagging. However, for local tasks such as chord estimation, it is widely assumed that contrastively trained general-purpose self-supervised models are inadequate and that more sophisticated SSL, e.g., masked modeling, is necessary. Our paper challenges this assumption by revealing the potential of contrastive SSL paired with a transformer in local MIR tasks. We consider a lightweight vision transformer with one-dimensional patches in the time-frequency domain (ViT-1D) and train it with simple contrastive SSL through normalized temperature-scaled cross-entropy loss (NT-Xent). Although NT-Xent operates only over the class token, we observe that, potentially thanks to weight sharing, informative musical properties emerge in ViT-1D's sequence tokens. On global tasks, the temporal average of class and sequence tokens offers a performance increase compared to the class token alone, showing useful properties in the sequence tokens. On local tasks, sequence tokens perform unexpectedly well, despite not being specifically trained for them. Furthermore, high-level musical features such as onsets emerge from layer-wise attention maps, and self-similarity matrices show that different layers capture different musical dimensions. Our paper does not focus on improving performance but advances the musical interpretation of transformers and sheds light on some overlooked abilities of contrastive SSL paired with transformers for sequence modeling in MIR.
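For readers unfamiliar with NT-Xent, here is a compact PyTorch sketch of the loss as it would be applied to class-token embeddings of two augmented views of the same tracks (batch size, embedding dimension, and temperature are illustrative, and the random tensors stand in for real model outputs).

```python
# Minimal NT-Xent (normalized temperature-scaled cross-entropy) sketch, the
# contrastive objective applied to the class token; sizes and temperature are illustrative.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (N, D) class-token embeddings of two augmented views of the same tracks."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit-norm rows
    sim = z @ z.t() / temperature                         # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    # The positive of sample i is its other view: index i+N (or i-N in the second half).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```

Note that the loss only ever sees the class token; the point of the paper is that the sequence tokens nonetheless end up carrying local musical information.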

_posts/2025-10-12-waspaa-ykong

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
---
layout: post
title: "Multi-Class-Token Transformer for Multitask Self-supervised Music Information Retrieval"
date: 2025-10-12 18:00:00 +0200
category: Publication
author: ykong
readtime: 1
domains:
- MIR
people:
- ykong
- rhennequin
- gmeseguerbrocal
publication_type: conference
publication_title: "Multi-Class-Token Transformer for Multitask Self-supervised Music Information Retrieval"
publication_year: 2025
publication_authors: Yuexuan Kong, Vincent Lostanlen, Romain Hennequin, Mathieu Lagrange, Gabriel Meseguer-Brocal
publication_conference: WASPAA
publication_code: "https://github.com/deezer/mt2"
publication_preprint: "https://arxiv.org/abs/2507.12996"
---
Contrastive learning and equivariant learning are effective methods for self-supervised learning (SSL) for audio content analysis. Yet, their application to music information retrieval (MIR) faces a dilemma: the former is more effective on tagging (e.g., instrument recognition) but less effective on structured prediction (e.g., tonality estimation); the latter can match supervised methods on the specific task it is designed for, but it does not generalize well to other tasks. In this article, we adopt a best-of-both-worlds approach by training a deep neural network on both kinds of pretext tasks at once. The proposed architecture is a Vision Transformer with 1-D spectrogram patches (ViT-1D), equipped with two class tokens, which are specialized to different self-supervised pretext tasks but optimized through the same model: hence the name self-supervised multi-class-token multitask (MT2). The former class token optimizes cross-power spectral density (CPSD) for equivariant learning over the circle of fifths, while the latter optimizes normalized temperature-scaled cross-entropy (NT-Xent) for contrastive learning. MT2 combines the strengths of both pretext tasks and consistently outperforms both single-class-token ViT-1D models trained with either contrastive or equivariant learning. Averaging the two class tokens further improves performance on several tasks, highlighting the complementary nature of the representations learned by each class token. Furthermore, using the same single-linear-layer probing method on the last-layer features, MT2 outperforms MERT on all tasks except beat tracking, achieving this with 18x fewer parameters thanks to its multitasking capabilities. Our SSL benchmark demonstrates the versatility of our multi-class-token multitask learning approach for MIR applications.
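The two-class-token layout can be pictured with a short, hedged PyTorch sketch (dimensions, depth, and the generic `nn.TransformerEncoder` are illustrative stand-ins for the ViT-1D backbone, and the CPSD and NT-Xent heads are omitted): each pretext task gets its own learnable class token, both are prepended to the patch sequence, and the same encoder processes everything.

```python
# Hedged sketch of a transformer with two learnable class tokens, one per pretext
# task (equivariant CPSD vs. contrastive NT-Xent); sizes and the generic encoder
# are illustrative stand-ins for the ViT-1D used in the paper.
import torch
import torch.nn as nn

class TwoClassTokenEncoder(nn.Module):
    def __init__(self, dim=192, depth=4, heads=4):
        super().__init__()
        self.cls_equiv = nn.Parameter(torch.randn(1, 1, dim))     # token for the equivariant task
        self.cls_contrast = nn.Parameter(torch.randn(1, 1, dim))  # token for the contrastive task
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):                 # patches: (B, T, dim), e.g. 1-D spectrogram patches
        b = patches.size(0)
        tokens = torch.cat([self.cls_equiv.expand(b, -1, -1),
                            self.cls_contrast.expand(b, -1, -1),
                            patches], dim=1)
        out = self.encoder(tokens)
        return out[:, 0], out[:, 1]             # one embedding per pretext task

model = TwoClassTokenEncoder()
z_equiv, z_contrast = model(torch.randn(2, 96, 192))
# For downstream linear probing, the two class tokens can simply be averaged.
probe_features = (z_equiv + z_contrast) / 2
print(probe_features.shape)  # torch.Size([2, 192])
```

Because the backbone is shared, the extra pretext task costs only one additional token per sequence rather than a second model.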

static/images/photos/gabriel.jpg

191 KB

static/images/photos/yuexuan.jpg

-67.1 KB
