
Speech and audio papers@Top Conference (Update Regularly)

Welcome to star ⭐ this repo, discuss in Issues, or collaborate via PRs 👏. Feel free to contact 📧 me at zhangbw0102@gmail.com.

🎉 [01/23/2025] Updated ICLR 2025 conference papers!

🎉 [01/23/2025] Updated ICLR 2024 conference papers!

🎉 [01/29/2025] Updated ICML 2024 conference papers!

🎉 [01/29/2025] Updated NeurIPS 2024 conference papers!

🎉 [01/30/2025] Updated ICML 2023 conference papers!

🎉 [01/30/2025] Updated NeurIPS 2023 conference papers!

🎉 [01/30/2025] Updated ACMMM 2024 conference papers!

🎉 [01/30/2025] Updated ICLR 2023 conference papers!

🎉 [01/30/2025] Updated AAAI 2024 conference papers!

🎉 [01/31/2025] Updated ACL 2024 conference papers!

🎉 [01/31/2025] Updated EMNLP 2024 conference papers!

🎉 [03/24/2025] Updated NAACL 2025 conference papers!

🎉 [04/22/2025] Updated AAAI 2025 conference papers!

🎉 [04/22/2025] Updated IJCAI 2024 conference papers!

🎉 [05/16/2025] Updated ICML 2025 conference papers!

🎉 [01/24/2026] Updated AAAI 2026 conference papers!

Speech and audio papers@Top Conference

ICLR'25

ICLR'25 total submissions: 11,672; accepted: 3,706 (31.75%)

Speech

This list includes speech papers whose ratings are good or middling (average often above 5), whether accepted or not.

There are 100+ speech papers at ICLR'25; we select 49.

Notation: re denotes rejected; con denotes conditional on ethics review. A number such as 5668 means the individual ratings were 5, 6, 6, and 8.
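The compact rating strings can be decoded mechanically. Below is a minimal Python sketch (the helper names `parse_ratings` and `average` are our own, not from this repo) that handles both the digit-run form like 5668 or 15510 and the comma-separated form like 3,5,8; since ICLR scores are single digits except 10, a 1 immediately followed by 0 is read as the score 10:

```python
def parse_ratings(s: str) -> list[int]:
    """Decode a compact rating string ('5668', '15510', '3,5,8') into scores."""
    if "," in s:
        return [int(x) for x in s.split(",")]
    scores, i = [], 0
    while i < len(s):
        # ICLR scores are single digits except 10, so '1' then '0' means 10.
        if s[i] == "1" and i + 1 < len(s) and s[i + 1] == "0":
            scores.append(10)
            i += 2
        else:
            scores.append(int(s[i]))
            i += 1
    return scores

def average(scores: list[int]) -> float:
    """Average rating rounded to two decimals, as reported in the tables."""
    return round(sum(scores) / len(scores), 2)

print(parse_ratings("5668"), average(parse_ratings("5668")))    # [5, 6, 6, 8] 6.25
print(parse_ratings("15510"), average(parse_ratings("15510")))  # [1, 5, 5, 10] 5.25
```

For example, 35810 decodes to ratings 3, 5, 8, 10 with average 6.50, matching the WavTokenizer row below.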

| Paper | Status | Avg. rate |
| --- | --- | --- |
| TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation | con | 8.50 |
| Co$^{\mathbf{3}}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion | | 7.50 |
| Scaling Transformers for Low-Bitrate High-Quality Speech Coding | | 7.00 |
| Context-aware Dynamic Pruning for Speech Foundation Models | | 7.00 |
| Scaling Speech-Text Pre-training with Synthetic Interleaved Data | con | 7.00 |
| CR-CTC: Consistency regularization on CTC for improved speech recognition | | 6.75 |
| Sylber: Syllabic Embedding Representation of Speech from Raw Audio | | 6.75 |
| Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive Speech Recognition | | 6.75 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | | 6.75 |
| Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity | | 6.75 |
| Audio Large Language Models Can Be Descriptive Speech Quality Evaluators | | 6.75 |
| Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis | | 6.67 |
| EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation | | 6.50 |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models | | 6.50 |
| Objective Soups: Multilingual Multi-Task Acoustic Modeling for Automatic Speech Recognition | not accepted, but the rate is good | 6.50 |
| SyllableLM: Learning Coarse Semantic Units for Speech Language Models | | 6.50 |
| Improving Semantic Understanding in Speech Language Models via Brain-tuning | | 6.50 |
| SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios | | 6.50 |
| Bridging the Data Provenance Gap Across Text, Speech, and Video | | 6.50 |
| HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | | 6.40 |
| DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors | | 6.25 |
| T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis via Multitask Learning | | 6.25 |
| VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation | | 6.25 |
| GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling | | 6.00 |
| UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation | | 6.00 |
| FIRING-Net: A filtered feature recycling network for speech enhancement | | 6.00 |
| TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation | | 5.83 |
| NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data | 55568, rejected | 5.80 |
| Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis | 5666, rejected | 5.75 |
| VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication | 5666, rejected | 5.75 |
| Speech Robust Bench: A Robustness Benchmark For Speech Recognition | 5666, accepted | 5.75 |
| OTTC: A differentiable alignment approach to automatic speech recognition | 368, rejected | 5.68 |
| SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Toward Cutting-Edge Speech Generation Methods | 566, rejected | 5.67 |
| Realistic-Gesture: Co-Speech Gesture Video Generation through Semantic-aware Gesture Representation | 35668, rejected | 5.60 |
| A$^2$-Flow: Alignment-Aware Pre-training for Speech Synthesis with Flow Matching | 3568, rejected | 5.50 |
| Representing speech through autoregressive prediction of cochlear tokens | 5566, rejected | 5.50 |
| F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching | 3568, rejected, but has big influence! | 5.50 |
| ASROB: Measuring Automatic Speech Recognition from One Book | 3568, rejected | 5.50 |
| SSR: Alignment-Aware Modality Connector for Speech Language Models | 3568, rejected | 5.50 |
| A Variational Approach for Generative Speech Language Modeling | 3568, re | 5.50 |
| SPARQ: Outlier-free SpeechLM with Fast Adaptation and Robust Quantization | 5566, re | 5.50 |
| Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement | 3568, accepted | 5.50 |
| Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback | 3568, re | 5.50 |
| Time-Accurate Speech Rich Transcription with Non-Fluencies | 5566, withdrawn | 5.50 |
| dMel: Speech Tokenization Made Simple | 35568, re | 5.40 |
| Orator: LLM-Guided Multi-Shot Speech Video Generation | 35568, re | 5.40 |
| MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer | 3666, accepted, has big influence! | 5.25 |
| Strategic Filtering for Content Moderation: Free Speech or Free of Distortion? | 5556, re | 5.25 |
| ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control | 35558, withdrawn | 5.20 |
| VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers | 3368, re | 5.00 |

Audio

This list includes audio papers whose ratings are good or middling (average often above 5), whether accepted or not.

There are 70+ audio papers at ICLR'25; we select 36.

| Paper | Status | Avg. rate |
| --- | --- | --- |
| Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency | con | 8.00 |
| CyberHost: A One-stage Diffusion Framework for Audio-driven Talking Body Generation | | 7.60 |
| $\texttt{BirdSet}$: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics | | 7.50 |
| ADIFF: Explaining audio difference using natural language | | 7.50 |
| Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation | | 7.50 |
| Presto! Distilling Steps and Layers for Accelerating Music Generation | spotlight, music | 7.25 |
| FlowDec: A flow-based full-band general audio codec with high perceptual quality | | 7.00 |
| I Can Hear You: Selective Robust Training for Deepfake Audio Detection | con | 7.00 |
| SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes | | 7.00 |
| RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction | | 6.80 |
| Enhancing Deception Detection with Cognitive Load Features: An Audio-Visual Approach | | 6.75 |
| Sylber: Syllabic Embedding Representation of Speech from Raw Audio | | 6.75 |
| Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data | | 6.75 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | | 6.75 |
| Audio Large Language Models Can Be Descriptive Speech Quality Evaluators | | 6.75 |
| Fugatto 1: Foundational Generative Audio Transformer Opus 1 | | 6.75 |
| WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling | 35810 | 6.50 |
| EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation | | 6.50 |
| MuPT: A Generative Symbolic Music Pretrained Transformer | music | 6.50 |
| ViSAGe: Video-to-Spatial Audio Generation | | 6.40 |
| Aligned Better, Listen Better For Audio-Visual Large Language Models | | 6.25 |
| Contrastive Learning from Synthetic Audio Doppelgängers | | 6.25 |
| AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models | | 6.20 |
| Elucidating the Design Space of Text-to-Audio Models | 5568, re | 6.00 |
| Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation | | 6.00 |
| Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation | | 6.00 |
| Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives | | 6.00 |
| Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics | | 5.80 |
| Active Audio Cancellation with Multi-Band Mamba Network | 3668, re | 5.75 |
| The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio | 5666, re | 5.75 |
| AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models | 3388, accepted | 5.50 |
| NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics | 3,5,8, accepted | 5.33 |
| Taming Data and Transformers for Audio Generation | 3666, re | 5.25 |
| AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation | 5556, re | 5.25 |
| Segment, Associate, and Classify: Decoupled Audio-Visual Segmentation Framework | 5556, withdrawn | 5.25 |
| Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI | 3558, re | 5.25 |
| Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation | 35558, withdrawn | 5.20 |
| T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback | 3566, withdrawn | 5.00 |

Summary

Acceptance depends mainly on the ratings. Ratings in the speech/audio area are not high, noticeably lower than in areas such as CV and NLP. Rebuttals are very important!

ICLR'24

Speech

This list includes speech papers whose ratings are good or middling (average often above 5), whether accepted or not.

There are 50+ speech papers at ICLR'24; we select 26.

| Paper | Status | Avg. rate |
| --- | --- | --- |
| NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | Spot | 8.00 |
| Large Language Models are Efficient Learners of Noise-Robust Speech Recognition | Spot | 8.00 |
| Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction | Spot | 8.00 |
| Zipformer: A faster and better encoder for automatic speech recognition | Oral | 7.50 |
| RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation | | 7.50 |
| Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech | | 7.00 |
| Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM | | 6.75 |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | | 6.67 |
| It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | | 6.60 |
| Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis | | 6.50 |
| CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | | 6.40 |
| BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing | 5668, re, link: https://arxiv.org/pdf/2309.00916 | 6.25 |
| TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation | 5668, desk re, accepted by ACL 2024, https://aclanthology.org/2024.findings-acl.593.pdf | 6.25 |
| Multilingual Visual Speech Recognition with a Single Model using Visual Speech Unit | 56668, re, link: https://arxiv.org/pdf/2401.09802v1 | 6.20 |
| PromptTTS 2: Describing and Generating Voices with Text Prompt | | 6.00 |
| Separate and Diffuse: Using a Pretrained Diffusion Model for Better Source Separation | | 6.00 |
| PolyVoice: Language Models for Speech to Speech Translation | 3588, accepted | 6.00 |
| DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models | 5568, re, accepted by SIGGRAPH 2024 (Journal Track), https://arxiv.org/pdf/2310.00434 | 6.00 |
| LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading | 5568, accepted | 6.00 |
| Generative Pre-training for Speech with Flow Matching | 3668, accepted | 5.75 |
| DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation | 5666, accepted | 5.75 |
| SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models | 3668, accepted | 5.75 |
| SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding | 3568, re, accepted by Interspeech 2024, https://arxiv.org/pdf/2307.07421 | 5.75 |
| RepCodec: A Speech Representation Codec for Speech Tokenization | 5566, re, accepted by ACL 2024 (main), https://arxiv.org/pdf/2309.00169 | 5.50 |
| A Discrete and Variational Approach to Speech Representation Learning | 33588, withdrawn | 5.40 |
| Generative Pre-Trained Speech Language Model with Efficient Hierarchical Transformer | 5556, re, accepted by ACL 2024, https://arxiv.org/pdf/2406.00976 | 5.25 |

Audio

This list includes audio papers whose ratings are good or middling (average often above 5), whether accepted or not.

There are 20+ audio papers at ICLR'24; we select 17.

| Paper | Status | Avg. rate |
| --- | --- | --- |
| Masked Audio Generation using a Single Non-Autoregressive Transformer | | 7.33 |
| Listen, Think, and Understand | | 7.00 |
| Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation | | 6.67 |
| Weakly-supervised Audio Separation via Bi-modal Semantic Similarity | | 6.67 |
| CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | | 6.50 |
| Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis | | 6.00 |
| Lifelong Audio-video Masked Autoencoder with Forget-robust Localized Alignments | 55558, re | 5.60 |
| LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | 5566, re | 5.50 |
| SoundStorm: Efficient Parallel Audio Generation | 35568, re | 5.40 |
| Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues | 3666, re | 5.25 |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | 3666, re | 5.25 |
| UniAudio: An Audio Foundation Model Toward Universal Audio Generation | 15510, re, accepted by ICML 2024 | 5.25 |
| Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners | 3666, re | 5.25 |
| SMILE: Audio-Visual Speech Recognition with Siamese Masked Interaction Learning | 5555, re | 5.00 |
| Leveraging characteristics of the output distribution for identifying adversarial audio examples | 5555, re | 5.00 |
| Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition | 5555, re | 5.00 |
| WavJourney: Compositional Audio Creation with Large Language Models | 35566, re | 5.00 |

Summary

This year, the number of papers is not very large.

ICML'24

Speech

Paper Status
ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis link
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models link
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models link
InstructSpeech: Following Speech Editing Instructions via Large Language Models link
Scaling Speech Technology to 1,000+ Languages link
IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation link
Speech Self-Supervised Learning Using Diffusion Model Synthetic Data link
Proactive Detection of Voice Cloning with Localized Watermarking
SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

Audio

Paper Status
Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion link
UniAudio: Towards Universal Audio Generation with Large Language Models link
Prompt-guided Precise Audio Editing with Diffusion Models
Creative Text-to-Audio Generation via Synthesizer Programming
Fast Timing-Conditioned Latent Audio Diffusion
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Listenable Maps for Audio Classifiers
STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment
From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
AND: Audio Network Dissection for Interpreting Deep Acoustic Models
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
BAT: Learning to Reason about Spatial Sounds with Large Language Models sound
Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion music
DITTO: Diffusion Inference-Time T-Optimization for Music Generation
An Independence-promoting Loss for Music Generation with Language Models
LLark: A Multimodal Instruction-Following Language Model for Music
MusicFlow: Cascaded Flow Matching for Text Guided Music Generation
MusicRL: Aligning Music Generation to Human Preferences

NeurIPS'24

Speech

useful link: https://nips.cc/virtual/2024/papers.html?filter=titles&search=speech

Paper Status
SSDM: Scalable Speech Dysfluency Modeling
SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
A Full-duplex Speech Dialogue Scheme Based On Large Language Model
CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing
Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models
DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation
SILENCE: Protecting privacy in offloaded speech understanding on resource-constrained devices
FINALLY: fast and universal speech enhancement with studio-like quality
SpeechAlign: Aligning Speech Generation to Human Preferences
Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals
SCOREQ: Speech Quality Assessment with Contrastive Regression
RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation
Comprehensive Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for the Polish Language
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words dataset

Audio

useful link: https://nips.cc/virtual/2024/papers.html?filter=titles&search=audio

Paper Status
Vocal Call Locator Benchmark (VCL'24) for localizing rodent vocalizations from multi-channel audio
SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection
Tell What You Hear From What You See - Video to Audio Generation Through Text
Learning Spatially-Aware Language and Audio Embeddings
Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes
SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Continual Audio-Visual Sound Separation
Mixtures of Experts for Audio-Visual Learning
Listenable Maps for Zero-Shot Audio Classifiers
Aligning Audio-Visual Joint Representations with an Agentic Workflow
AV-Cloud: Spatial Audio Rendering Through Audio-Visual Cloud Splatting
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner
AudioMarkBench: Benchmarking Robustness of Audio Watermarking
Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation sound
Spike-based Neuromorphic Model for Sound Source Localization sound
Images that Sound: Composing Images and Sounds on a Single Canvas sound
The iNaturalist Sounds Dataset sound
GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks music
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence music
Algorithmic Collective Action in Recommender Systems: Promoting Songs by Reordering Playlists song

ICML'23

Speech

useful link: https://icml.cc/virtual/2023/papers.html?filter=titles&search=speech

Paper Status
Pre-training for Speech Translation: CTC Meets Optimal Transport Oral
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language Oral
Robust Speech Recognition via Large-Scale Weak Supervision
Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation
Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations
MetricGAN-OKD: Multi-Metric Optimization of MetricGAN via Online Knowledge Distillation for Speech Enhancement
Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models

Audio

useful link: https://icml.cc/virtual/2023/papers.html?filter=titles&search=audio

Paper Status
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition
BEATs: Audio Pre-Training with Acoustic Tokenizers Oral
Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection

NeurIPS'23

Speech

Paper Status
High-Fidelity Audio Compression with Improved RVQGAN Spot
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio Spot
How to Scale Your EMA Spot
Textually Pretrained Speech Language Models
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
DOSE: Diffusion Dropout with Adaptive Prior for Speech Enhancement
P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
Parts of Speech–Grounded Subspaces in Vision-Language Models
UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures
Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer
Disentangling Voice and Content with Self-Supervision for Speaker Recognition
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Unified Segment-to-Segment Framework for Simultaneous Sequence Generation
Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
Progressive Ensemble Distillation: Building Ensembles for Efficient Inference
LEACE: Perfect linear concept erasure in closed form
TART: A plug-and-play Transformer module for task-agnostic reasoning

Audio

Paper Status
Compression with Bayesian Implicit Neural Representations Spot
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Pengi: An Audio Language Model for Audio Tasks
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
MAViL: Masked Audio-Video Learners
Weakly-Supervised Audio-Visual Segmentation
Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Simple and Controllable Music Generation music
CoLLAT: On Adding Fine-grained Audio Understanding to Language Models using Token-Level Locked-Language Tuning
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
Self-Supervised Visual Acoustic Matching
Connecting Multi-modal Contrastive Representations
Kiki or Bouba? Sound Symbolism in Vision-and-Language Models sound
SoundCam: A Dataset for Finding Humans Using Room Acoustics sound
DISCO-10M: A Large-Scale Music Dataset music bench
MARBLE: Music Audio Representation Benchmark for Universal Evaluation music bench
Achieving Cross Modal Generalization with Multimodal Unified Representation
Any-to-Any Generation via Composable Diffusion
Efficient Neural Music Generation music
Training Transitive and Commutative Multimodal Transformers with LoReTTa
Latent Diffusion for Language Generation
Block-State Transformers
Learning Interpretable Low-dimensional Representation via Physical Symmetry
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Feature Dropout: Revisiting the Role of Augmentations in Contrastive Learning
Language Semantic Graph Guided Data-Efficient Learning

ACMMM'24

Speech

Paper Status
VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling Oral
UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis Oral
Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts Oral
ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations Oral
Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation Oral
Generative Expressive Conversational Speech Synthesis
SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description
CIEASR: Contextual Image-Enhanced Automatic Speech Recognition for Improved Homophone Discrimination
EGGesture: Entropy-Guided Vector Quantized Variational AutoEncoder for Co-Speech Gesture Generation
DEITalk: Speech-Driven 3D Facial Animation with Dynamic Emotional Intensity Modeling
Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation
MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation
SpeechEE: A Novel Benchmark for Speech Event Extraction
MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion
Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation
Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation
FlashSpeech: Efficient Zero-Shot Speech Synthesis

Audio

Paper Status
OpenAVE: Moving towards Open Set Audio-Visual Event Localization Oral
Unveiling and Mitigating Bias in Audio Visual Segmentation Oral
AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset Oral
Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization Oral
Towards Trustworthy MetaShopping: Studying Manipulative Audiovisual Designs in Virtual-Physical Commercial Platforms Oral
Open-Vocabulary Audio-Visual Semantic Segmentation Oral
Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training Oral
Toward Explainable Physical Audiovisual Commonsense Reasoning Oral
TiVA: Time-Aligned Video-to-Audio Generation Oral
Coarse-to-Fine Proposal Refinement Framework For Audio Temporal Forgery Detection and Localization Oral
SelM: Selective Mechanism based Audio-Visual Segmentation Oral
Dissecting Temporal Understanding in Text-to-Audio Retrieval
FRADE: Forgery-aware Audio-distilled Multimodal Learning for Deepfake Detection
AMG-Embedding: a Self-Supervised Embedding Approach for Audio Identification
MMAL: Multi-Modal Analytic Learning for Exemplar-Free Audio-Visual Class Incremental Tasks
Utilizing Speaker Profiles for Impersonation Audio Detection
CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization
CoPL: Parameter-Efficient Collaborative Prompt Learning for Audio-Visual Tasks
Time-Frequency Domain Fusion Enhancement for Audio Super-Resolution
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
Multi-grained Correspondence Learning of Audio-language Models for Few-shot Audio Recognition
Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier
AVHash: Joint Audio-Visual Hashing for Video Retrieval
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
EchoAudio: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps
Instance-Level Panoramic Audio-Visual Saliency Detection and Ranking
Audio-Driven Identity Manipulation for Face Inpainting
GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis
TAS: Personalized Text-guided Audio Spatialization
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright Protection

ICLR'23

Speech

Paper Status
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
An efficient encoder-decoder architecture with top-down attention for speech separation
Jointly Learning Visual and Auditory Speech Representations from Raw Data
Bag of Tricks for Unsupervised Text-to-Speech
In-Situ Text-Only Adaptation of Speech Models with Low-Overhead Speech Imputations
Revisiting the Entropy Semiring for Neural Speech Recognition
D4AM: A General Denoising Framework for Downstream Acoustic Models
Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation
BigVGAN: A Universal Neural Vocoder with Large-Scale Training
Continuous pseudo-labeling from the start
NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis
Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks

Audio

Paper Status
Token Merging: Your ViT But Faster Oral
Contrastive Audio-Visual Masked Autoencoder Spot
AudioGen: Textually Guided Audio Generation
Defending against Adversarial Audio via Diffusion Model
wav2tok: Deep Sequence Tokenizer for Audio Retrieval
Continual Transformers: Redundancy-Free Attention for Online Inference
GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis
Words are all you need? Language as an approximation for human similarity judgments

AAAI'24

useful link: https://aaai.org/wp-content/uploads/2024/02/AAAI-24_Main_2024-02-01.pdf

https://github.com/DmitryRyumin/AAAI-2024-Papers

Speech

Paper Status
Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation https://arxiv.org/abs/2312.10877
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding https://arxiv.org/abs/2306.07547
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation https://arxiv.org/abs/2401.03468
Visual Hallucination Elevates Speech Recognition https://ojs.aaai.org/index.php/AAAI/article/view/29926
Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales https://ojs.aaai.org/index.php/AAAI/article/view/29743
Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition https://ojs.aaai.org/index.php/AAAI/article/view/29882
MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis https://arxiv.org/abs/2312.10687
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling https://arxiv.org/abs/2312.11947
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos https://arxiv.org/abs/2308.15256
Divergence-Guided Simultaneous Speech Translation https://ojs.aaai.org/index.php/AAAI/article/view/29733
SECap: Speech Emotion Captioning with Large Language Model https://arxiv.org/abs/2312.10381
Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction https://arxiv.org/abs/2312.10305

Audio

Paper Status
AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis https://arxiv.org/abs/2312.10921
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models https://arxiv.org/abs/2308.09300
What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection https://arxiv.org/abs/2312.09651
Audio Generation with Multiple Conditional Diffusion Model https://arxiv.org/abs/2308.11940
AVSegFormer: Audio-Visual Segmentation with Transformer https://ojs.aaai.org/index.php/AAAI/article/view/29104
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation https://arxiv.org/abs/2309.16429
Sample-Constrained Black Box Optimization for Audio Personalization https://ojs.aaai.org/index.php/AAAI/article/view/28881
DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification https://ojs.aaai.org/index.php/AAAI/article/view/29716
CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments https://arxiv.org/abs/2306.04047
Learning Temporal Resolution in Spectrogram for Audio Classification https://arxiv.org/abs/2210.01719
SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network https://arxiv.org/abs/2312.16149
Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation https://arxiv.org/abs/2312.08673
Improving Audio-Visual Segmentation with Bidirectional Generation https://arxiv.org/abs/2308.08288
Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification https://ojs.aaai.org/index.php/AAAI/article/view/29015
Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering https://arxiv.org/abs/2312.12816
Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer https://arxiv.org/abs/2309.07929

ACL'24

useful link: https://2024.aclweb.org/program/main_conference_papers/#long-papers

https://2024.aclweb.org/program/finding_papers/

Speech

60 papers

Paper Authorlist Status
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, EngSiong Chng Long, link
Wav2Gloss: Generating Interlinear Glossed Text from Speech Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel Romney Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R Mortensen, Lori Levin https://aclanthology.org/2024.acl-long.34.pdf
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang https://aclanthology.org/2024.acl-long.85.pdf
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu https://aclanthology.org/2024.acl-long.97.pdf
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing? Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli https://aclanthology.org/2024.acl-long.789.pdf
StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli https://aclanthology.org/2024.acl-long.202.pdf
Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization? Roshan Sharma, Suwon Shon, Mark Lindsey, Hira Dhamyal, Bhiksha Raj https://aclanthology.org/2024.acl-long.790.pdf
LLM Knows Body Language, Too: Translating Speech Voices into Human Gestures Chenghao Xu, Guangtao Lyu, Jiexi Yan, Muli Yang, Cheng Deng https://aclanthology.org/2024.acl-long.273.pdf
RepCodec: A Speech Representation Codec for Speech Tokenization Zhichao Huang, Chutong Meng, Tom Ko https://aclanthology.org/2024.acl-long.314.pdf
Error-preserving Automatic Speech Recognition of Young English Learners’ Language Janick Michot, Manuela Hürlimann, Jan Milan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak https://aclanthology.org/2024.acl-long.348.pdf
Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng https://aclanthology.org/2024.acl-long.392.pdf
Multimodal Contextualized Semantic Parsing from Speech Jordan Voas, David Harwath, Ray Mooney https://aclanthology.org/2024.acl-long.398.pdf
SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network Kexin Wang, Jiahong Zhang, Yong Ren, Man Yao, Di Shang, Bo XU, Guoqi Li https://aclanthology.org/2024.acl-long.429.pdf
Speech Sense Disambiguation: Tackling Homophone Ambiguity in End-to-End Speech Translation Tengfei Yu, Xuebo Liu, Liang Ding, Kehai Chen, Dacheng Tao, Min Zhang https://aclanthology.org/2024.acl-long.435.pdf
Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation Keqi Deng, Phil Woodland https://aclanthology.org/2024.acl-long.448.pdf
Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t Chihiro Taguchi, David Chiang https://aclanthology.org/2024.acl-long.827.pdf
Speech language models lack important brain-relevant semantics SUBBA REDDY OOTA, Emin Çelik, Fatma Deniz, Mariya Toneva https://aclanthology.org/2024.acl-long.462.pdf
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng https://aclanthology.org/2024.acl-long.485.pdf
NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data Manuel Tonneau, Pedro Vitor Quinta de Castro, Karim Lasri, Ibrahim Sambo Farouq, Lakshmi Subramanian, Victor Orozco-Olvera, Samuel Fraiberger https://aclanthology.org/2024.acl-long.488v2.pdf
Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation Songju Lei, Xize Cheng, Mengjiao Lyu, Jianqiao Hu, Jintao Tan, Runlin Liu, Lingyu Xiong, Tao Jin, Xiandong Li, Zhou Zhao https://aclanthology.org/2024.acl-long.543.pdf
OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe https://aclanthology.org/2024.acl-long.549.pdf
Don’t Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu https://aclanthology.org/2024.acl-long.652.pdf
Structured Tree Alignment for Evaluation of (Speech) Constituency Parsing Freda Shi, Kevin Gimpel, Karen Livescu https://aclanthology.org/2024.acl-long.666.pdf
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath https://aclanthology.org/2024.acl-long.673.pdf
A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech Gaurav Verma, Rynaa Grover, Jiawei Zhou, Binny Mathew, Jordan Kraemer, Munmun De Choudhury, Srijan Kumar https://aclanthology.org/2024.acl-long.684.pdf
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang https://aclanthology.org/2024.acl-long.697.pdf
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech Shengpeng Ji, Ziyue Jiang, Wang Hanting, Jialung Zuo, Zhou Zhao https://aclanthology.org/2024.acl-long.733.pdf
The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition Enshi Zhang, Rafael Trujillo, Christian Poellabauer https://aclanthology.org/2024.acl-long.752.pdf
Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Adrien Pupier, Maximin Coavoux, Jérôme Goulian, Benjamin Lecouteux Short, link
Explainability and Hate Speech: Structured Explanations Make Social Media Moderators Faster Agostina Calabrese, Leonardo Neves, Neil Shah, Maarten W. Bos, Björn Ross, Mirella Lapata, Francesco Barbieri https://aclanthology.org/2024.acl-short.38.pdf
On the Semantic Latent Space of Diffusion-Based Text-To-Speech Models Miri Varshavsky, Roy Hirsch, Regev Cohen, Tomer Golany, Daniel Freedman, Ehud Rivlin https://aclanthology.org/2024.acl-short.24.pdf
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang voice
Robust Singing Voice Transcription Serves Synthesis Ruiqi Li, Yu Zhang, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao voice
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang, Ziyue Jiang, Xuankai Chang, Jiatong Shi, CHAO WENG, Zhou Zhao, Dong Yu voice
Codec-SUPERB: An In-Depth Analysis of Sound Codec Models Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu-Hsuan Wang, Kai-Wei Chang, Alexander H. Liu, Hung-yi Lee Findings
Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models Findings
Wav2SQL: Direct Generalizable Speech-To-SQL Parsing
Multi-Modal Retrieval For Large Language Model Based Speech Recognition
ViHateT5: Enhancing Hate Speech Detection in Vietnamese With a Unified Text-to-Text Transformer Model
Speech-based Slot Filling using Large Language Models
LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
Semantic Role Labeling from Chinese Speech via End-to-End Learning
Revisiting Interpolation Augmentation for Speech-to-Text Generation
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation
SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models
SharedCon: Implicit Hate Speech Detection using Shared Semantics
IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages
wav2vec-S: Adapting Pre-trained Speech Models for Streaming
On the Evaluation of Speech Foundation Models for Spoken Language Understanding
Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition
Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech
Pushing the Limits of Zero-shot End-to-End Speech Translation
Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
Label-aware Hard Negative Sampling Strategies with Momentum Contrastive Learning for Implicit Hate Speech Detection
Aligning Speech Segments Beyond Pure Semantics
CTC-based Non-autoregressive Textless Speech-to-Speech Translation
MELD-ST: An Emotion-aware Speech Translation Dataset
Part-of-speech Tagging for Extremely Low-resource Indian Languages

Audio

https://2024.aclweb.org/program/finding_papers/

8 papers

Paper Authorlist Status
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou Long, link
StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli https://aclanthology.org/2024.acl-long.202.pdf
M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang https://aclanthology.org/2024.acl-long.489.pdf
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang https://aclanthology.org/2024.acl-long.697.pdf
MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector Marta R. Costa-jussà, Mariano Coria Meglioli, Pierre Andrews, David Dale, Prangthip Hansanti, Elahe Kalbassi, Alexandre Mourachko, Christophe Ropers, Carleigh Wood Findings
X-ACE: Explainable and Multi-factor Audio Captioning Evaluation Qian Wang, Jia-Chen Gu, Zhen-Hua Ling
Deepfake Defense: Constructing and Evaluating a Specialized Urdu Deepfake Audio Dataset Sheza Munir, Wassay Sajjad, Mukeet Raza, Emaan Mujahid Abbas, Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
Beyond Orthography: Automatic Recovery of Short Vowels and Dialectal Sounds in Arabic Yassine El Kheir, Hamdy Mubarak, Ahmed Ali, Shammur Absar Chowdhury sound

EMNLP'24

useful link: https://2024.emnlp.org/program/accepted_main_conference/

https://2024.emnlp.org/program/accepted_findings/

Speech

58 papers

Paper Authorlist Status
When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection Xiangyu Zhang, Hexin Liu, Kaishuai Xu, Qiquan Zhang, Daijiao Liu, Beena Ahmed, Julien Epps Main, link
Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model Xiangyu Zhang, Daijiao Liu, Hexin Liu, Qiquan Zhang, Hanyu Meng, Leibny Paola Garcia Perera, EngSiong Chng, Lina Yao https://aclanthology.org/2024.emnlp-main.9.pdf
Scaling Properties of Speech Language Models Santiago Cuervo, Ricard Marxer https://aclanthology.org/2024.emnlp-main.21.pdf
EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models Maureen de Seyssel, Antony D’Avirro, Adina Williams, Emmanuel Dupoux https://aclanthology.org/2024.emnlp-main.30.pdf
Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering Helena Bonaldi, Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata, Marco Guerini https://aclanthology.org/2024.emnlp-main.201.pdf
AlignCap: Aligning Speech Emotion Captioning to Human Preferences Ziqi Liang, Haoxiang Shi, Hanhui Chen https://aclanthology.org/2024.emnlp-main.224.pdf
F$^2$RL: Factuality and Faithfulness Reinforcement Learning Framework for Claim-Guided Evidence-Supported Counterspeech Generation Haiyang Wang, Yuchen Pan, Xin Song, Xuechen Zhao, Minghao Hu, Bin Zhou https://aclanthology.org/2024.emnlp-main.255.pdf
Outcome-Constrained Large Language Models for Countering Hate Speech Lingzi Hong, Pengcheng Luo, Eduardo Blanco, Xiaoying Song https://aclanthology.org/2024.emnlp-main.260.pdf
On Mitigating Performance Disparities in Multilingual Speech Recognition Monorama Swain, Anna Katrine van Zee, Anders Søgaard https://aclanthology.org/2024.emnlp-main.323.pdf
Methods of Automatic Matrix Language Determination for Code-Switched Speech Olga Iakovenko, Thomas Hain https://aclanthology.org/2024.emnlp-main.330.pdf
EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning Ashish Seth, Ramaneswaran S, S Sakshi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha https://aclanthology.org/2024.emnlp-main.366.pdf
Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales https://aclanthology.org/2024.emnlp-main.430.pdf
Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning Ming Shan Hee, Aditi Kumaresan, Roy Ka-Wei Lee https://aclanthology.org/2024.emnlp-main.445.pdf
Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition Hsuan Su, Hua Farn, Fan-Yun Sun, Shang-Tse Chen, Hung-yi Lee https://aclanthology.org/2024.emnlp-main.503.pdf
ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers Yuzhe Gu, Enmao Diao https://aclanthology.org/2024.emnlp-main.562.pdf
Towards Robust Speech Representation Learning for Thousands of Languages William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe https://aclanthology.org/2024.emnlp-main.570.pdf
Speechworthy Instruction-tuned Language Models Hyundong Justin Cho, Nicolaas Paul Jedema, Leonardo F. R. Ribeiro, Karishma Sharma, Pedro Szekely, Alessandro Moschitti, Ruben Janssen, Jonathan May https://aclanthology.org/2024.emnlp-main.595.pdf
Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights Hao Yang, Lizhen Qu, Ehsan Shareghi, Reza Haf https://aclanthology.org/2024.emnlp-main.614.pdf
Integrating Argumentation and Hate-Speech-based Techniques for Countering Misinformation Sougata Saha, Rohini Srihari https://aclanthology.org/2024.emnlp-main.622.pdf
Unveiling the Role of Pretraining in Direct Speech Translation Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà https://aclanthology.org/2024.emnlp-main.630.pdf
Multi-Level Cross-Modal Alignment for Speech Relation Extraction Liang Zhang, Zhen Yang, Biao Fu, Ziyao Lu, Liangying Shao, Shiyu Liu, Fandong Meng, Jie Zhou, Xiaoli Wang, Jinsong Su https://aclanthology.org/2024.emnlp-main.668.pdf
Self-Powered LLM Modality Expansion for Large Speech-Text Models Tengfei Yu, Xuebo Liu, Zhiyi Hou, Liang Ding, Dacheng Tao, Min Zhang https://aclanthology.org/2024.emnlp-main.690.pdf
Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach Siqi Li, Danni Liu, Jan Niehues https://aclanthology.org/2024.emnlp-main.708.pdf
Towards an Open-Source Speech Foundation Model for EU: 950,000 Hours of Open-Source Compliant Speech Data for EU Languages Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri https://aclanthology.org/2024.emnlp-main.771.pdf
VHASR: A Multimodal Speech Recognition System With Vision Hotwords Jiliang Hu, Zuchao Li, Ping Wang, Haojun Ai, Lefei Zhang, Hai Zhao https://aclanthology.org/2024.emnlp-main.821.pdf
AudioVSR: Enhancing Video Speech Recognition with Audio Data Xiaoda Yang, Xize Cheng, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin https://aclanthology.org/2024.emnlp-main.858.pdf
Hate Personified: Investigating the role of LLMs in content moderation pipeline for hate speech Sarah Masud, Sahajpreet Singh, Viktor Hangya, Alexander Fraser, Tanmoy Chakraborty https://aclanthology.org/2024.emnlp-main.886.pdf
Please note that I’m just an AI: Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification Esra Dönmez, Thang Vu, Agnieszka Falenska https://aclanthology.org/2024.emnlp-main.1019.pdf
BLSP-Emo: Towards Empathetic Large Speech-Language Models Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang https://aclanthology.org/2024.emnlp-main.1070.pdf
Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection Camilla Casula, Sebastiano Vecellio Salto, Alan Ramponi, Sara Tonelli
Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech Guan-Ting Lin, Wei Ping Huang, Hung-yi Lee
Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding YeonJoon Jung, Jaeseong Lee, Seungtaek Choi, Dohyeon Lee, Minsoo Kim, seung-won hwang
Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang
PREDICT: Multi-Agent-based Debate Simulation for Generalized Hate Speech Detection Someen Park, Jaehoon Kim, Seungwan Jin, Sohyun Park, Kyungsik Han
TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR Shashi Kumar, Srikanth Madikeri, Juan Pablo Zuluaga Gomez, Iuliia Thorbecke, Esaú VILLATORO-TELLO, Sergio Burdisso, Petr Motlicek, Karthik Pandia D S, Aravind Ganapathiraju
Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, Dirk Hovy
Casablanca: Data and Models for Multidialectal Arabic Speech Recognition Bashar Talafha, Karima Kadaoui, Samar Mohamed Magdy, Mariem Habiboullah, Chafei Mohamed Chafei, Ahmed Oumar El-Shangiti, et al.
SpeechQE: Estimating the Quality of Direct Speech Translation HyoJung Han, Kevin Duh, Marine Carpuat
Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model Mana Makinae, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Is Child-Directed Speech Effective Training Data for Language Models? Steven Y. Feng, Noah Goodman, Michael Frank
HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models Huy Nghiem, Hal Daumé III Findings
PolyWER: A Holistic Evaluation Framework for Code-Switched Speech Recognition Karima Kadaoui, Maryam Al Ali, Hawau Olamide Toyin, Ibrahim Mohammed, Hanan Aldarmaki
STTATTS: Unified Speech-To-Text And Text-To-Speech Model Hawau Olamide Toyin, Hao Li, Hanan Aldarmaki
Contextualized Graph Representations for Generating Counter-Narrative against Hate Speech Selene Baez Santamaria, Helena Gomez Adorno, Ilia Markov
LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models Zuhair hasan shaik, Pradyoth Hegde, Prashant Bannulmath, Deepak K T
MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing Jeonghun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro
Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation G M Shahariar, Jia Chen, Jiachen Li, Yue Dong
Breaking the Boundaries: A Unified Framework for Chinese Named Entity Recognition Across Text and Speech Jinzhong Ning, Yuanyuan Sun, Bo Xu, Zhihao Yang, Ling Luo, Hongfei Lin
Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech Youngjae Kim, Yejin Jeon, Gary Lee
Modeling Gender and Dialect Bias in Automatic Speech Recognition Camille Harris, Chijioke Mgbahurike, Neha Kumar, Diyi Yang
LLM generated responses to mitigate the impact of hate speech Jakub Podolak, Szymon Łukasik, Paweł Balawender, Jan Ossowski, Jan Piotrowski, Katarzyna Bąkowicz, Piotr Sankowski
BLASER 2.0: a metric for evaluation and quality estimation of massively multilingual speech and text translation David Dale, Marta R. Costa-jussà
Textless Speech-to-Speech Translation With Limited Parallel Data Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada
Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS. Onkar Kishor Susladkar, Vishesh Tripathi, Biddwan Ahmed
Recent Advances in Online Hate Speech Moderation: Multimodality and the Role of Large Models Ming Shan Hee, Shivam Sharma, RUI CAO, Palash Nandi, Preslav Nakov, Tanmoy Chakraborty, Roy Ka-Wei Lee
WavLLM: Towards Robust and Adaptive Speech Large Language Model Shujie HU, Long Zhou, Shujie LIU, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei

Audio

22 papers

Paper Authorlist Status
IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding Pengcheng Li, Xulong Zhang, Jing Xiao, Jianzong Wang Main
Cross-Domain Audio Deepfake Detection: Dataset and Analysis Yuang Li, Min Zhang, Mengxin Ren, Xiaosong Qiao, Miaomiao Ma, Daimeng Wei, Hao Yang
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation Tanvir Mahmud, Diana Marculescu
AudioVSR: Enhancing Video Speech Recognition with Audio Data Xiaoda Yang, Xize Cheng, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin
PALM: Few-Shot Prompt Learning for Audio Language Models Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, Hanan Aldarmaki
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control Yu Zhang, Ziyue Jiang, Ruiqi Li, Changhao Pan, Jinzheng He, Rongjie Huang, Chuxin Wang, Zhou Zhao voice
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects Orevaoghene Ahia, Anuoluwapo Aremu, Diana Abagyan, Hila Gonen, David Ifeoluwa Adelani, Daud Abolade, Noah A. Smith, Yulia Tsvetkov voice
EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control Haozhe Chen, Run Chen, Julia Hirschberg voice
Voices in a Crowd: Searching for clusters of unique perspectives Nikolas Vitsakis, Amit Parekh, Ioannis Konstas voice
With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models Tyler Loakman, YUCHENG LI, Chenghua Lin sound
Adaptive Immune-based Sound-Shape Code Substitution for Adversarial Chinese Text Attacks Ao Wang, Xinghao Yang, Chen Li, Bao-di Liu, Weifeng Liu sound
A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick Nishant Balepur, Matthew Shu, Alexander Hoyle, Alison Robey, Shi Feng, Seraphina Goldfarb-Tarrant, Jordan Lee Boyd-Graber sound
A Fast and Sound Tagging Method for Discontinuous Named-Entity Recognition Caio Filippo Corro sound
Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D’Haro, Robby T. Tan, Haizhou Li Findings
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon
Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Review Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, Aman Chadha
Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech Youngjae Kim, Yejin Jeon, Gary Lee
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang
PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain Jianyi Chen, Zheqi DAI, Zhen Ye, Xu Tan, Qifeng Liu, Yike Guo, Wei Xue
Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion Giuseppe Ruggiero, Matteo Testa, Jurgen Van de Walle, Luigi Di Caro voice
HSDreport: Heart Sound Diagnosis with Echocardiography Reports Zihan Zhao, Pingjie Wang, Liudan Zhao, Yuchen Yang, Ya Zhang, Kun Sun, Xin Sun, Xin Zhou, Yu Wang, Yanfeng Wang sound

NAACL'25

useful link: https://2025.naacl.org/program/accepted_papers/

Speech

Paper Authorlist Status
Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Teh-Hwa Kao
Decoding Hate: Exploring Language Models’ Reactions to Hate Speech Paloma Piot, Javier Parapar
Multi$^3$Hate: Multimodal, Multilingual, and Multicultural Hate Speech Detection with Vision–Language Models Minh Duc Bui, Katharina von der Wense, Anne Lauscher
CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs Amey Hengle, Aswini Kumar Padhi, Anil Bandhakavi, Tanmoy Chakraborty
MAD Speech: Measures of Acoustic Diversity of Speech Matthieu Futeral, Andrea Agostinelli, Marco Tagliasacchi, Neil Zeghidour, Eugene Kharitonov
Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond Mardhiyah Sanni, Tassallah Abdullahi, Devendra Deepak Kayande, Emmanuel Ayodele, Naome A Etori, Michael Samwel Mollel, Moshood O. Yekini, Chibuzor Okocha, Lukman Enegi Ismaila, Folafunmi Omofoye, Boluwatife A. Adewale, Tobi Olatunji
Wav2Prompt: End-to-End Speech Prompt Learning and Task-based Fine-tuning for Text-based LLMs Keqi Deng, Guangzhi Sun, Phil Woodland
On the Role of Speech Data in Reducing Toxicity Detection Bias Samuel Bell, Mariano Coria Meglioli, Megan Richards, Eduardo Sánchez, Christophe Ropers, Skyler Wang, Adina Williams, Levent Sagun, Marta R. Costa-jussà
Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, David R Mortensen
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion Yinghao Aaron Li, Xilin Jiang, Cong Han, Nima Mesgarani
AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Saminu Mohammad Aliyu, Paul Röttger, Abigail Oppong, Andiswa Bukula, et al.
Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison Tsz Kin Lam, Marco Gaido, Sara Papi, Luisa Bentivogli, Barry Haddow
ProSE: Diffusion Priors for Speech Enhancement Sonal Kumar, Sreyan Ghosh, Utkarsh Tyagi, Anton Jeran Ratnarajah, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning Yifan Peng, Krishna C Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg
DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition Wonjun Lee, Solee Im, Heejin Do, Yunsu Kim, Jungseul Ok, Gary Lee
How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations Hyunji Lee, Danni Liu, Supriti Sinhamahapatra, Jan Niehues short
Developing multilingual speech synthesis system for Ojibwe, Mi’kmaq, and Maliseet Shenran Wang, Changbing Yang, Michael l parkhill, Chad Quinn, Christopher Hammerly, Jian Zhu
Cross-Lingual Transfer Learning for Speech Translation Rao Ma, Mengjie Qian, Yassir Fathullah, Siyuan Tang, Mark Gales, Kate Knill
kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech Karl El Hajal, Ajinkya Kulkarni, Enno Hermann, Mathew Magimai Doss
WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching Tianze Luo, Xingchen Miao, Wenbo Duan
DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility Yifan Liu, Yu Fang, Zhouhan Lin Findings
BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla Fabiha Haider, Fariha Tanjim Shifat, Md Farhan Ishmam, Md Sakib Ul Rahman Sourove, Deeparghya Dutta Barua, Md Fahim, Md Farhad Alam Bhuiyan
CDB: A Unified Framework for Hope Speech Detection Through Counterfactual, Desire and Belief Tulio Ferreira Leite da Silva, Gonzalo Freijedo Aduna, Farah Benamara, Alda Mari, Zongmin Li, Li Yue, Jian Su
Untangling Hate Speech Definitions: A Semantic Componential Analysis Across Cultures and Domains Katerina Korre, Arianna Muti, Federico Ruggeri, Alberto Barrón-Cedeño
Exploring Large Language Models for Hate Speech Detection in Rioplatense Spanish Juan Manuel Pérez, Paula Miguel, Viviana Cotik
Unsupervised Speech-text word-level alignment with Dynamic Programming Tianshu Yu, Zihan Gong, Minghuan Tan, Guhong Chen, Min Yang
Prompt-Guided Selective Masking Loss for Context-Aware Emotive Text-to-Speech Yejin Jeon, Youngjae Kim, Jihyun Lee, Gary Lee
Echoes of Discord: Forecasting Hater Reactions to Counterspeech Xiaoying Song, Sharon Lisseth Perez, Xinchen Yu, Eduardo Blanco, Lingzi Hong
Continuous Speech Tokenizer in Text To Speech Yixing Li, Ruobing Xie, Xingwu Sun, Yu Cheng, Zhanhui Kang
CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation Xi Xu, Wenda Xu, Siqi Ouyang, Lei Li
Gender Bias in Instruction-Guided Speech Synthesis Models Chun-Yi Kuan, Hung-yi Lee
Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection Koji Inoue, Divesh Lala, Gabriel Skantze, Tatsuya Kawahara voice
Playing with Voices: Tabletop Role-Playing Game Recordings as a Diarization Challenge Lian Remme, Kevin Tang voice

Audio

| Paper | Authorlist | Status |
| --- | --- | --- |
| Ihquin tlahtouah in Tetelahtzincocah: An annotated, multi-purpose audio and text corpus of Western Sierra Puebla Nahuatl | Robert Pugh, Cheyenne Wing, María Ximena Juárez Huerta, Angeles Márquez Hernandez, Francis M. Tyers | |
| PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification | Ashish Seth, Ramaneswaran Selvakumar, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha | |
| AudioBench: A Universal Benchmark for Audio Large Language Models | Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen | |
| Audio Is the Achilles’ Heel: Red Teaming Audio Large Multimodal Models | Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari | |
| Do Audio-Language Models Understand Linguistic Variations? | Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri, Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha | |
| Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection | Yassine El Kheir, Younes Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller | |
| Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies | Yingqiang Gao, Lukas Fischer, Alexa Lintner, Sarah Ebling | |
| Synthetic Audio Helps for Cognitive State Tasks | Adil Soubki, John Murzaku, Peter Zeng, Owen Rambow | |

AAAI'25

Speech

| Paper | Authorlist | Status |
| --- | --- | --- |
| ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering | Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen | |
| Language model can listen while speaking | Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen | |
| VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization | Tao Liu, Ziyang Ma, Qi Chen, Feilong Chen, Shuai Fan, Xie Chen, Kai Yu | |
| Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration | Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen | |
| DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech | | |
| FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles | | |
| Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation | | |
| SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models | | |
| EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion | | |
| BSDB-Net: Band-Split Dual-Branch Network with Selective State Spaces Mechanism for Monaural Speech Enhancement | | |
| Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization | | |
| MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula | | |
| Speech Watermarking with Discrete Intermediate Representations | | |
| ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis | | |
| Complex-Cycle-Consistent Diffusion Model for Monaural Speech Enhancement | | |
| StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching | | |
| DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation | | |
| Drop the beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation | | |
| Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts | | |

Audio

| Paper | Authorlist | Status |
| --- | --- | --- |
| Codec does matter: Exploring the semantic shortcoming of codec for audio language model | Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue | |
| TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching | | |
| MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition | | |
| GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions | | |
| Detecting Music Performance Errors with Transformers | | |
| SoundBrush: Sound as a Brush for Visual Scene Editing | | |
| Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control | | |
| SongGLM: Lyric-to-Melody Generation with 2D Alignment Encoding and Multi-Task Pre-Training | | |
| JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts | | |
| Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning | | |
| Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration | | |
| DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | | |
| Region-Based Optimization in Continual Learning for Audio Deepfake Detection | | |
| Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content | | |
| GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression | | |
| PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis | | |
| Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation | | |
| JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation | | |
| SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor | | |
| CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder | | |
| Read, Watch and Scream! Sound Generation from Text and Video | | |
| Mental-Perceiver: Audio-Textual Multi-Modal Learning for Estimating Mental Disorders | | |

IJCAI'24

Useful link: https://ijcai24.org/main-track-accepted-papers/index.html

The number of speech & audio papers at this conference is small.

Speech

| Paper | Authorlist | Status |
| --- | --- | --- |
| Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction | Zhaoxi Mu, Xinyu Yang | |
| Bridge to Non-Barrier Communication: Gloss-Prompted Fine-Grained Cued Speech Gesture Generation with Diffusion Model | Wentao Lei, Li Liu, Jun Wang | |
| Two-stage Semi-supervised Speaker Recognition with Gated Label Learning | Xingmei Wang, Jiaxiang Meng, Kong Aik Lee, Boquan Li, Jinghan Liu | |
| Discriminative Feature Decoupling Enhancement for Speech Forgery Detection | Yijun Bei, Xing Zhou, Erteng Liu, Yang Gao, Sen Lin, Kewei Gao, Zunlei Feng | |
| Innovative Directional Encoding in Speech Processing: Leveraging Spherical Harmonics Injection for Multi-Channel Speech Enhancement | Jiahui Pan, Pengjie Shen, Hui Zhang, Xueliang Zhang | |
| Contextualized Speech Recognition: Rethinking Second-Pass Rescoring with Generative Large Language Models | Yixuan Tang, Anthony K. H. Tung | |
| Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis | Zhoulin Ji, Chenhao Lin, Hang Wang, Chao Shen | |
| Decoupling Breaks Data Barriers: A Decoupled Pre-training Framework for Multi-intent Spoken Language Understanding | Libo Qin, Qiguang Chen, Jingxuan Zhou, Qinzheng Li, Chunlin Lu, Wanxiang Che | |
| Recent Advances in End-to-End Simultaneous Speech Translation | Xiaoqian Liu, Guoqiang Hu, Yangfan Du, Erfeng He, YingFeng Luo, Chen Xu, Tong Xiao, Jingbo Zhu | Survey |

Audio

| Paper | Authorlist | Status |
| --- | --- | --- |
| EAT: Self-Supervised Pre-Training with Efficient Audio Transformer | Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen | |
| Generating More Audios for End-to-End Spoken Language Understanding | Xuxin Cheng, Yuexian Zou | |
| BATON: Aligning Text-to-Audio Model Using Human Preference Feedback | Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Qinmei Xu, Zunnan Xu, Jingquan Liu, Jiasheng Lu, Xiu Li | |
| HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis | Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, Liang Hu | |
| InstructME: An Instruction Guided Music Edit Framework with Latent Diffusion Models | Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song | |

ICML'25

Speech

| Paper | Authorlist | Status |
| --- | --- | --- |
| MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition | | https://arxiv.org/abs/2502.10447 |
| The Brain's Bitter Lesson: Scaling Speech Decoding With Self-Supervised Learning | | https://arxiv.org/abs/2406.04328 |
| DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis | | https://arxiv.org/abs/2410.11097 |
| Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech | | https://openreview.net/forum?id=v9LjNopQ6W&noteId=B8CPk9usHO |
| BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models | | |
| Emotional Face-to-Speech | | https://arxiv.org/abs/2502.01046 |
| DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation | | https://arxiv.org/abs/2502.03930 |
| Unsupervised Blind Speech Separation with a Diffusion Prior | | https://arxiv.org/abs/2505.05657 |
| Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems | | |
| High-Fidelity Simultaneous Speech-To-Speech Translation | | https://arxiv.org/abs/2502.03382 |
| Improving Conversational Capabilities of Speech Language Models via Generative Dual-channel Spoken Dialogue Learning | | |
| Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM | | https://arxiv.org/abs/2411.00774 |
| OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models | | https://arxiv.org/abs/2502.10373 |
| Long-Form Speech Generation with Spoken Language Models | | https://arxiv.org/abs/2412.18603 |
| Aligning Spoken Dialogue Models from User Interactions | | spoken dialogue |
| A Variational Framework for Improving Naturalness in Generative Spoken Language Models | | |
| De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks | | |
| Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding | | https://arxiv.org/abs/2505.07235 |

Audio

| Paper | Authorlist | Status |
| --- | --- | --- |
| XAttnMark: Learning Robust Audio Watermarking with Cross-Attention | | https://arxiv.org/abs/2502.04230 |
| ETTA: Elucidating the Design Space of Text-to-Audio Models | | https://arxiv.org/abs/2412.19351 |
| Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities | | https://arxiv.org/abs/2503.03983 |
| Sounding that Object: Interactive Object-Aware Image to Audio Generation | | |
| ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling | | https://arxiv.org/abs/2504.10344 |
| Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction | | |
| Supervised Contrastive Learning from Weakly-Labeled Audio Segments for Musical Version Matching | | https://arxiv.org/abs/2502.16936 |
| FLAM: Frame-Wise Language-Audio Modeling | | https://arxiv.org/abs/2505.05335 |
| MATS: An Audio Language Model under Text-only Supervision | | https://arxiv.org/abs/2502.13433 |
| AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment | | https://arxiv.org/abs/2501.18314 |
| AudioSpace: Generating Spatial Audio from 360-Degree Video | | https://arxiv.org/abs/2504.14906 |
| IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling | | |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | | https://arxiv.org/abs/2502.11775 |
| Efficient Fine-Grained Guidance for Diffusion-Based Symbolic Music Generation | | music, https://arxiv.org/abs/2410.08435 |
| MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners | | |
| Gaussian Mixture Flow Matching Models | | flow matching |
| SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation | | song |

AAAI'26

Speech

| Paper | Authorlist | Status |
| --- | --- | --- |
| MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement | | |

Audio

| Paper | Authorlist | Status |
| --- | --- | --- |
| StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model | | |

Useful Survey & Awesome Link

  1. Amphion v0.2 technical report https://arxiv.org/abs/2501.15442

  2. Emilia-Large: a larger release, with more experimental results and details https://arxiv.org/abs/2501.15907

  3. AnyEnhance: a single model that handles speech enhancement, singing voice enhancement, target speaker extraction, and more https://arxiv.org/abs/2501.15417

Citation

If you find this repository helpful, please consider citing:

```bibtex
@misc{Zhang2025SpeechAudio,
  title = {Speech-and-audio-papers-Top-Conference},
  author = {Bowen Zhang},
  year = {2025},
  howpublished = {\url{https://github.com/01Zhangbw/Speech-and-audio-papers-Top-Conference}},
}
```

License

This repository is released under the MIT license.
