
Speech and audio papers@Top Conference (Update Regularly)

Welcome to star ⭐ this repo, discuss in Issues, or collaborate via PRs 👏. Feel free to contact 📧 me at zhangbw0102@gmail.com.

🎉 [01/23/2025] Updated ICLR 2025 conference papers!

🎉 [01/23/2025] Updated ICLR 2024 conference papers!

🎉 [01/29/2025] Updated ICML 2024 conference papers!

🎉 [01/29/2025] Updated NeurIPS 2024 conference papers!

🎉 [01/30/2025] Updated ICML 2023 conference papers!

🎉 [01/30/2025] Updated NeurIPS 2023 conference papers!

🎉 [01/30/2025] Updated ACMMM 2024 conference papers!

🎉 [01/30/2025] Updated ICLR 2023 conference papers!

🎉 [01/30/2025] Updated AAAI 2024 conference papers!

🎉 [01/31/2025] Updated ACL 2024 conference papers!

🎉 [01/31/2025] Updated EMNLP 2024 conference papers!

🎉 [03/24/2025] Updated NAACL 2025 conference papers!

🎉 [04/22/2025] Updated AAAI 2025 conference papers!

🎉 [04/22/2025] Updated IJCAI 2024 conference papers!

🎉 [05/16/2025] Updated ICML 2025 conference papers!

🎉 [01/24/2026] Updated AAAI 2026 conference papers!

Speech and audio papers@Top Conference

ICLR'25

ICLR'25 total submissions: 11,672; accepted: 3,706 (31.75%)

Speech

This list includes speech papers whose ratings are good or middling (average often above 5), whether accepted or not.

There are 100+ speech papers at ICLR'25; we select 49.

Notation: re denotes rejected; con denotes conditional on ethics review. A number such as 5668 means the individual ratings were 5, 6, 6, and 8.
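The compact rating strings can be decoded mechanically. Below is a minimal Python sketch (the helper names `parse_ratings` and `average` are our own, not from this repo) that handles both the digit-run form like 5668 or 15510 and the comma-separated form like 3,5,8; since ICLR scores are single digits except 10, a 1 immediately followed by 0 is read as the score 10:

```python
def parse_ratings(s: str) -> list[int]:
    """Decode a compact rating string ('5668', '15510', '3,5,8') into scores."""
    if "," in s:
        return [int(x) for x in s.split(",")]
    scores, i = [], 0
    while i < len(s):
        # ICLR scores are single digits except 10, so '1' then '0' means 10.
        if s[i] == "1" and i + 1 < len(s) and s[i + 1] == "0":
            scores.append(10)
            i += 2
        else:
            scores.append(int(s[i]))
            i += 1
    return scores

def average(scores: list[int]) -> float:
    """Average rating rounded to two decimals, as reported in the tables."""
    return round(sum(scores) / len(scores), 2)

print(parse_ratings("5668"), average(parse_ratings("5668")))    # [5, 6, 6, 8] 6.25
print(parse_ratings("15510"), average(parse_ratings("15510")))  # [1, 5, 5, 10] 5.25
```

For example, 35810 decodes to ratings 3, 5, 8, 10 with average 6.50, matching the WavTokenizer row below.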

| Paper | Status | Avg. rate |
| --- | --- | --- |
| TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation | con | 8.50 |
| Co$^{\mathbf{3}}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion | | 7.50 |
| Scaling Transformers for Low-Bitrate High-Quality Speech Coding | | 7.00 |
| Context-aware Dynamic Pruning for Speech Foundation Models | | 7.00 |
| Scaling Speech-Text Pre-training with Synthetic Interleaved Data | con | 7.00 |
| CR-CTC: Consistency regularization on CTC for improved speech recognition | | 6.75 |
| Sylber: Syllabic Embedding Representation of Speech from Raw Audio | | 6.75 |
| Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive Speech Recognition | | 6.75 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | | 6.75 |
| Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity | | 6.75 |
| Audio Large Language Models Can Be Descriptive Speech Quality Evaluators | | 6.75 |
| Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis | | 6.67 |
| EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation | | 6.50 |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models | | 6.50 |
| Objective Soups: Multilingual Multi-Task Acoustic Modeling for Automatic Speech Recognition | not accepted, but the rate is good | 6.50 |
| SyllableLM: Learning Coarse Semantic Units for Speech Language Models | | 6.50 |
| Improving Semantic Understanding in Speech Language Models via Brain-tuning | | 6.50 |
| SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios | | 6.50 |
| Bridging the Data Provenance Gap Across Text, Speech, and Video | | 6.50 |
| HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | | 6.40 |
| DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors | | 6.25 |
| T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis via Multitask Learning | | 6.25 |
| VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation | | 6.25 |
| GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling | | 6.00 |
| UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation | | 6.00 |
| FIRING-Net: A filtered feature recycling network for speech enhancement | | 6.00 |
| TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation | | 5.83 |
| NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data | 55568, rejected | 5.80 |
| Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis | 5666, rejected | 5.75 |
| VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication | 5666, rejected | 5.75 |
| Speech Robust Bench: A Robustness Benchmark For Speech Recognition | 5666, accepted | 5.75 |
| OTTC: A differentiable alignment approach to automatic speech recognition | 368, rejected | 5.68 |
| SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Toward Cutting-Edge Speech Generation Methods | 566, rejected | 5.67 |
| Realistic-Gesture: Co-Speech Gesture Video Generation through Semantic-aware Gesture Representation | 35668, rejected | 5.60 |
| A$^2$-Flow: Alignment-Aware Pre-training for Speech Synthesis with Flow Matching | 3568, rejected | 5.50 |
| Representing speech through autoregressive prediction of cochlear tokens | 5566, rejected | 5.50 |
| F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching | 3568, rejected, but has big influence! | 5.50 |
| ASROB: Measuring Automatic Speech Recognition from One Book | 3568, rejected | 5.50 |
| SSR: Alignment-Aware Modality Connector for Speech Language Models | 3568, rejected | 5.50 |
| A Variational Approach for Generative Speech Language Modeling | 3568, re | 5.50 |
| SPARQ: Outlier-free SpeechLM with Fast Adaptation and Robust Quantization | 5566, re | 5.50 |
| Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement | 3568, accepted | 5.50 |
| Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback | 3568, re | 5.50 |
| Time-Accurate Speech Rich Transcription with Non-Fluencies | 5566, withdrawn | 5.50 |
| dMel: Speech Tokenization Made Simple | 35568, re | 5.40 |
| Orator: LLM-Guided Multi-Shot Speech Video Generation | 35568, re | 5.40 |
| MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer | 3666, accepted, has big influence! | 5.25 |
| Strategic Filtering for Content Moderation: Free Speech or Free of Distortion? | 5556, re | 5.25 |
| ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control | 35558, withdrawn | 5.20 |
| VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers | 3368, re | 5.00 |

Audio

This list includes audio papers whose ratings are good or middling (average often above 5), whether accepted or not.

There are 70+ audio papers at ICLR'25; we select 36.

| Paper | Status | Avg. rate |
| --- | --- | --- |
| Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency | con | 8.00 |
| CyberHost: A One-stage Diffusion Framework for Audio-driven Talking Body Generation | | 7.60 |
| $\texttt{BirdSet}$: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics | | 7.50 |
| ADIFF: Explaining audio difference using natural language | | 7.50 |
| Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation | | 7.50 |
| Presto! Distilling Steps and Layers for Accelerating Music Generation | spotlight, music | 7.25 |
| FlowDec: A flow-based full-band general audio codec with high perceptual quality | | 7.00 |
| I Can Hear You: Selective Robust Training for Deepfake Audio Detection | con | 7.00 |
| SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes | | 7.00 |
| RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction | | 6.80 |
| Enhancing Deception Detection with Cognitive Load Features: An Audio-Visual Approach | | 6.75 |
| Sylber: Syllabic Embedding Representation of Speech from Raw Audio | | 6.75 |
| Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data | | 6.75 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | | 6.75 |
| Audio Large Language Models Can Be Descriptive Speech Quality Evaluators | | 6.75 |
| Fugatto 1: Foundational Generative Audio Transformer Opus 1 | | 6.75 |
| WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling | 35810 | 6.50 |
| EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation | | 6.50 |
| MuPT: A Generative Symbolic Music Pretrained Transformer | music | 6.50 |
| ViSAGe: Video-to-Spatial Audio Generation | | 6.40 |
| Aligned Better, Listen Better For Audio-Visual Large Language Models | | 6.25 |
| Contrastive Learning from Synthetic Audio Doppelgängers | | 6.25 |
| AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models | | 6.20 |
| Elucidating the Design Space of Text-to-Audio Models | 5568, re | 6.00 |
| Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation | | 6.00 |
| Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation | | 6.00 |
| Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives | | 6.00 |
| Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics | | 5.80 |
| Active Audio Cancellation with Multi-Band Mamba Network | 3668, re | 5.75 |
| The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio | 5666, re | 5.75 |
| AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models | 3388, accepted | 5.50 |
| NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics | 3,5,8, accepted | 5.33 |
| Taming Data and Transformers for Audio Generation | 3666, re | 5.25 |
| AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation | 5556, re | 5.25 |
| Segment, Associate, and Classify: Decoupled Audio-Visual Segmentation Framework | 5556, withdrawn | 5.25 |
| Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI | 3558, re | 5.25 |
| Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation | 35558, withdrawn | 5.20 |
| T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback | 3566, withdrawn | 5.00 |

Summary

Acceptance depends mainly on the ratings. Ratings in the speech/audio area are not high, noticeably lower than in areas such as CV and NLP. Rebuttals are very important!

ICLR'24

Speech

This list includes speech papers whose ratings are good or middling (average often above 5), whether accepted or not.

There are 50+ speech papers at ICLR'24; we select 26.

| Paper | Status | Avg. rate |
| --- | --- | --- |
| NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | Spot | 8.00 |
| Large Language Models are Efficient Learners of Noise-Robust Speech Recognition | Spot | 8.00 |
| Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction | Spot | 8.00 |
| Zipformer: A faster and better encoder for automatic speech recognition | Oral | 7.50 |
| RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation | | 7.50 |
| Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech | | 7.00 |
| Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM | | 6.75 |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | | 6.67 |
| It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | | 6.60 |
| Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis | | 6.50 |
| CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | | 6.40 |
| BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing | 5668, re, link: https://arxiv.org/pdf/2309.00916 | 6.25 |
| TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation | 5668, desk re, accepted by ACL 2024, https://aclanthology.org/2024.findings-acl.593.pdf | 6.25 |
| Multilingual Visual Speech Recognition with a Single Model using Visual Speech Unit | 56668, re, link: https://arxiv.org/pdf/2401.09802v1 | 6.20 |
| PromptTTS 2: Describing and Generating Voices with Text Prompt | | 6.00 |
| Separate and Diffuse: Using a Pretrained Diffusion Model for Better Source Separation | | 6.00 |
| PolyVoice: Language Models for Speech to Speech Translation | 3588, accepted | 6.00 |
| DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models | 5568, re, accepted by SIGGRAPH 2024 (Journal Track), https://arxiv.org/pdf/2310.00434 | 6.00 |
| LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading | 5568, accepted | 6.00 |
| Generative Pre-training for Speech with Flow Matching | 3668, accepted | 5.75 |
| DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation | 5666, accepted | 5.75 |
| SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models | 3668, accepted | 5.75 |
| SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding | 3568, re, accepted by Interspeech 2024, https://arxiv.org/pdf/2307.07421 | 5.75 |
| RepCodec: A Speech Representation Codec for Speech Tokenization | 5566, re, accepted by ACL 2024 (main), https://arxiv.org/pdf/2309.00169 | 5.50 |
| A Discrete and Variational Approach to Speech Representation Learning | 33588, withdrawn | 5.40 |
| Generative Pre-Trained Speech Language Model with Efficient Hierarchical Transformer | 5556, re, accepted by ACL 2024, https://arxiv.org/pdf/2406.00976 | 5.25 |

Audio

This list includes audio papers whose ratings are good or middling (average often above 5), whether accepted or not.

There are 20+ audio papers at ICLR'24; we select 17.

| Paper | Status | Avg. rate |
| --- | --- | --- |
| Masked Audio Generation using a Single Non-Autoregressive Transformer | | 7.33 |
| Listen, Think, and Understand | | 7.00 |
| Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation | | 6.67 |
| Weakly-supervised Audio Separation via Bi-modal Semantic Similarity | | 6.67 |
| CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | | 6.50 |
| Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis | | 6.00 |
| Lifelong Audio-video Masked Autoencoder with Forget-robust Localized Alignments | 55558, re | 5.60 |
| LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | 5566, re | 5.50 |
| SoundStorm: Efficient Parallel Audio Generation | 35568, re | 5.40 |
| Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues | 3666, re | 5.25 |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | 3666, re | 5.25 |
| UniAudio: An Audio Foundation Model Toward Universal Audio Generation | 15510, re, accepted by ICML 2024 | 5.25 |
| Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners | 3666, re | 5.25 |
| SMILE: Audio-Visual Speech Recognition with Siamese Masked Interaction Learning | 5555, re | 5.00 |
| Leveraging characteristics of the output distribution for identifying adversarial audio examples | 5555, re | 5.00 |
| Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition | 5555, re | 5.00 |
| WavJourney: Compositional Audio Creation with Large Language Models | 35566, re | 5.00 |

Summary

This year, the number of papers is not very large.

ICML'24

Speech

Paper Status
ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis link
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models link
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models link
InstructSpeech: Following Speech Editing Instructions via Large Language Models link
Scaling Speech Technology to 1,000+ Languages link
IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation link
Speech Self-Supervised Learning Using Diffusion Model Synthetic Data link
Proactive Detection of Voice Cloning with Localized Watermarking
SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

Audio

Paper Status
Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion link
UniAudio: Towards Universal Audio Generation with Large Language Models link
Prompt-guided Precise Audio Editing with Diffusion Models
Creative Text-to-Audio Generation via Synthesizer Programming
Fast Timing-Conditioned Latent Audio Diffusion
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Listenable Maps for Audio Classifiers
STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment
From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
AND: Audio Network Dissection for Interpreting Deep Acoustic Models
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
BAT: Learning to Reason about Spatial Sounds with Large Language Models sound
Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion music
DITTO: Diffusion Inference-Time T-Optimization for Music Generation
An Independence-promoting Loss for Music Generation with Language Models
LLark: A Multimodal Instruction-Following Language Model for Music
MusicFlow: Cascaded Flow Matching for Text Guided Music Generation
MusicRL: Aligning Music Generation to Human Preferences

NeurIPS'24

Speech

useful link: https://nips.cc/virtual/2024/papers.html?filter=titles&search=speech

Paper Status
SSDM: Scalable Speech Dysfluency Modeling
SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
A Full-duplex Speech Dialogue Scheme Based On Large Language Model
CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing
Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models
DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation
SILENCE: Protecting privacy in offloaded speech understanding on resource-constrained devices
FINALLY: fast and universal speech enhancement with studio-like quality
SpeechAlign: Aligning Speech Generation to Human Preferences
Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals
SCOREQ: Speech Quality Assessment with Contrastive Regression
RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation
Comprehensive Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for the Polish Language
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words dataset

Audio

useful link: https://nips.cc/virtual/2024/papers.html?filter=titles&search=audio

Paper Status
Vocal Call Locator Benchmark (VCL'24) for localizing rodent vocalizations from multi-channel audio
SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection
Tell What You Hear From What You See - Video to Audio Generation Through Text
Learning Spatially-Aware Language and Audio Embeddings
Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes
SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Continual Audio-Visual Sound Separation
Mixtures of Experts for Audio-Visual Learning
Listenable Maps for Zero-Shot Audio Classifiers
Aligning Audio-Visual Joint Representations with an Agentic Workflow
AV-Cloud: Spatial Audio Rendering Through Audio-Visual Cloud Splatting
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner
AudioMarkBench: Benchmarking Robustness of Audio Watermarking
Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation sound
Spike-based Neuromorphic Model for Sound Source Localization sound
Images that Sound: Composing Images and Sounds on a Single Canvas sound
The iNaturalist Sounds Dataset sound
GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks music
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence music
Algorithmic Collective Action in Recommender Systems: Promoting Songs by Reordering Playlists song

ICML'23

Speech

useful link: https://icml.cc/virtual/2023/papers.html?filter=titles&search=speech

Paper Status
Pre-training for Speech Translation: CTC Meets Optimal Transport Oral
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language Oral
Robust Speech Recognition via Large-Scale Weak Supervision
Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation
Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations
MetricGAN-OKD: Multi-Metric Optimization of MetricGAN via Online Knowledge Distillation for Speech Enhancement
Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models

Audio

useful link: https://icml.cc/virtual/2023/papers.html?filter=titles&search=audio

Paper Status
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition
BEATs: Audio Pre-Training with Acoustic Tokenizers Oral
Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection

NeurIPS'23

Speech

Paper Status
High-Fidelity Audio Compression with Improved RVQGAN Spot
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio Spot
How to Scale Your EMA Spot
Textually Pretrained Speech Language Models
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
DOSE: Diffusion Dropout with Adaptive Prior for Speech Enhancement
P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
Parts of Speech–Grounded Subspaces in Vision-Language Models
UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures
Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer
Disentangling Voice and Content with Self-Supervision for Speaker Recognition
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Unified Segment-to-Segment Framework for Simultaneous Sequence Generation
Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
Progressive Ensemble Distillation: Building Ensembles for Efficient Inference
LEACE: Perfect linear concept erasure in closed form
TART: A plug-and-play Transformer module for task-agnostic reasoning

Audio

Paper Status
Compression with Bayesian Implicit Neural Representations Spot
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Pengi: An Audio Language Model for Audio Tasks
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
MAViL: Masked Audio-Video Learners
Weakly-Supervised Audio-Visual Segmentation
Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Simple and Controllable Music Generation music
CoLLAT: On Adding Fine-grained Audio Understanding to Language Models using Token-Level Locked-Language Tuning
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
Self-Supervised Visual Acoustic Matching
Connecting Multi-modal Contrastive Representations
Kiki or Bouba? Sound Symbolism in Vision-and-Language Models sound
SoundCam: A Dataset for Finding Humans Using Room Acoustics sound
DISCO-10M: A Large-Scale Music Dataset music bench
MARBLE: Music Audio Representation Benchmark for Universal Evaluation music bench
Achieving Cross Modal Generalization with Multimodal Unified Representation
Any-to-Any Generation via Composable Diffusion
Efficient Neural Music Generation music
Training Transitive and Commutative Multimodal Transformers with LoReTTa
Latent Diffusion for Language Generation
Block-State Transformers
Learning Interpretable Low-dimensional Representation via Physical Symmetry
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Feature Dropout: Revisiting the Role of Augmentations in Contrastive Learning
Language Semantic Graph Guided Data-Efficient Learning

ACMMM'24

Speech

Paper Status
VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling Oral
UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis Oral
Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts Oral
ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations Oral
Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation Oral
Generative Expressive Conversational Speech Synthesis
SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description
CIEASR: Contextual Image-Enhanced Automatic Speech Recognition for Improved Homophone Discrimination
EGGesture: Entropy-Guided Vector Quantized Variational AutoEncoder for Co-Speech Gesture Generation
DEITalk: Speech-Driven 3D Facial Animation with Dynamic Emotional Intensity Modeling
Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation
MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation
SpeechEE: A Novel Benchmark for Speech Event Extraction
MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion
Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation
Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation
FlashSpeech: Efficient Zero-Shot Speech Synthesis

Audio

Paper Status
OpenAVE: Moving towards Open Set Audio-Visual Event Localization Oral
Unveiling and Mitigating Bias in Audio Visual Segmentation Oral
AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset Oral
Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization Oral
Towards Trustworthy MetaShopping: Studying Manipulative Audiovisual Designs in Virtual-Physical Commercial Platforms Oral
Open-Vocabulary Audio-Visual Semantic Segmentation Oral
Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training Oral
Toward Explainable Physical Audiovisual Commonsense Reasoning Oral
TiVA: Time-Aligned Video-to-Audio Generation Oral
Coarse-to-Fine Proposal Refinement Framework For Audio Temporal Forgery Detection and Localization Oral
SelM: Selective Mechanism based Audio-Visual Segmentation Oral
Dissecting Temporal Understanding in Text-to-Audio Retrieval
FRADE: Forgery-aware Audio-distilled Multimodal Learning for Deepfake Detection
AMG-Embedding: a Self-Supervised Embedding Approach for Audio Identification
MMAL: Multi-Modal Analytic Learning for Exemplar-Free Audio-Visual Class Incremental Tasks
Utilizing Speaker Profiles for Impersonation Audio Detection
CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization
CoPL: Parameter-Efficient Collaborative Prompt Learning for Audio-Visual Tasks
Time-Frequency Domain Fusion Enhancement for Audio Super-Resolution
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
Multi-grained Correspondence Learning of Audio-language Models for Few-shot Audio Recognition
Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier
AVHash: Joint Audio-Visual Hashing for Video Retrieval
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
EchoAudio: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps
Instance-Level Panoramic Audio-Visual Saliency Detection and Ranking
Audio-Driven Identity Manipulation for Face Inpainting
GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis
TAS: Personalized Text-guided Audio Spatialization
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright Protection

ICLR'23

Speech

Paper Status
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
An efficient encoder-decoder architecture with top-down attention for speech separation
Jointly Learning Visual and Auditory Speech Representations from Raw Data
Bag of Tricks for Unsupervised Text-to-Speech
In-Situ Text-Only Adaptation of Speech Models with Low-Overhead Speech Imputations
Revisiting the Entropy Semiring for Neural Speech Recognition
D4AM: A General Denoising Framework for Downstream Acoustic Models
Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation
BigVGAN: A Universal Neural Vocoder with Large-Scale Training
Continuous pseudo-labeling from the start
NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis
Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks

Audio

Paper Status
Token Merging: Your ViT But Faster Oral
Contrastive Audio-Visual Masked Autoencoder Spot
AudioGen: Textually Guided Audio Generation
Defending against Adversarial Audio via Diffusion Model
wav2tok: Deep Sequence Tokenizer for Audio Retrieval
Continual Transformers: Redundancy-Free Attention for Online Inference
GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis
Words are all you need? Language as an approximation for human similarity judgments

AAAI'24

useful link: https://aaai.org/wp-content/uploads/2024/02/AAAI-24_Main_2024-02-01.pdf

https://github.com/DmitryRyumin/AAAI-2024-Papers

Speech

Paper Status
Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation https://arxiv.org/abs/2312.10877
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding https://arxiv.org/abs/2306.07547
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation https://arxiv.org/abs/2401.03468
Visual Hallucination Elevates Speech Recognition https://ojs.aaai.org/index.php/AAAI/article/view/29926
Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales https://ojs.aaai.org/index.php/AAAI/article/view/29743
Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition https://ojs.aaai.org/index.php/AAAI/article/view/29882
MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis https://arxiv.org/abs/2312.10687
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling https://arxiv.org/abs/2312.11947
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos https://arxiv.org/abs/2308.15256
Divergence-Guided Simultaneous Speech Translation https://ojs.aaai.org/index.php/AAAI/article/view/29733
SECap: Speech Emotion Captioning with Large Language Model https://arxiv.org/abs/2312.10381
Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction https://arxiv.org/abs/2312.10305

Audio

Paper Status
AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis https://arxiv.org/abs/2312.10921
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models https://arxiv.org/abs/2308.09300
What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection https://arxiv.org/abs/2312.09651
Audio Generation with Multiple Conditional Diffusion Model https://arxiv.org/abs/2308.11940
AVSegFormer: Audio-Visual Segmentation with Transformer https://ojs.aaai.org/index.php/AAAI/article/view/29104
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation https://arxiv.org/abs/2309.16429
Sample-Constrained Black Box Optimization for Audio Personalization https://ojs.aaai.org/index.php/AAAI/article/view/28881
DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification https://ojs.aaai.org/index.php/AAAI/article/view/29716
CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments https://arxiv.org/abs/2306.04047
Learning Temporal Resolution in Spectrogram for Audio Classification https://arxiv.org/abs/2210.01719
SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network https://arxiv.org/abs/2312.16149
Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation https://arxiv.org/abs/2312.08673
Improving Audio-Visual Segmentation with Bidirectional Generation https://arxiv.org/abs/2308.08288
Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification https://ojs.aaai.org/index.php/AAAI/article/view/29015
Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering https://arxiv.org/abs/2312.12816
Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer https://arxiv.org/abs/2309.07929

ACL'24

useful link: https://2024.aclweb.org/program/main_conference_papers/#long-papers

https://2024.aclweb.org/program/finding_papers/

Speech

60 papers

Paper Authorlist Status
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, EngSiong Chng Long, link
Wav2Gloss: Generating Interlinear Glossed Text from Speech Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel Romney Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R Mortensen, Lori Levin https://aclanthology.org/2024.acl-long.34.pdf
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang https://aclanthology.org/2024.acl-long.85.pdf
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu https://aclanthology.org/2024.acl-long.97.pdf
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing? Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli https://aclanthology.org/2024.acl-long.789.pdf
StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli https://aclanthology.org/2024.acl-long.202.pdf
Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization? Roshan Sharma, Suwon Shon, Mark Lindsey, Hira Dhamyal, Bhiksha Raj https://aclanthology.org/2024.acl-long.790.pdf
LLM Knows Body Language, Too: Translating Speech Voices into Human Gestures Chenghao Xu, Guangtao Lyu, Jiexi Yan, Muli Yang, Cheng Deng https://aclanthology.org/2024.acl-long.273.pdf
RepCodec: A Speech Representation Codec for Speech Tokenization Zhichao Huang, Chutong Meng, Tom Ko https://aclanthology.org/2024.acl-long.314.pdf
Error-preserving Automatic Speech Recognition of Young English Learners’ Language Janick Michot, Manuela Hürlimann, Jan Milan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak https://aclanthology.org/2024.acl-long.348.pdf
Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng https://aclanthology.org/2024.acl-long.392.pdf
Multimodal Contextualized Semantic Parsing from Speech Jordan Voas, David Harwath, Ray Mooney https://aclanthology.org/2024.acl-long.398.pdf
SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network Kexin Wang, Jiahong Zhang, Yong Ren, Man Yao, Di Shang, Bo XU, Guoqi Li https://aclanthology.org/2024.acl-long.429.pdf
Speech Sense Disambiguation: Tackling Homophone Ambiguity in End-to-End Speech Translation Tengfei Yu, Xuebo Liu, Liang Ding, Kehai Chen, Dacheng Tao, Min Zhang https://aclanthology.org/2024.acl-long.435.pdf
Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation Keqi Deng, Phil Woodland https://aclanthology.org/2024.acl-long.448.pdf
Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t Chihiro Taguchi, David Chiang https://aclanthology.org/2024.acl-long.827.pdf
Speech language models lack important brain-relevant semantics SUBBA REDDY OOTA, Emin Çelik, Fatma Deniz, Mariya Toneva https://aclanthology.org/2024.acl-long.462.pdf
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng https://aclanthology.org/2024.acl-long.485.pdf
NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data Manuel Tonneau, Pedro Vitor Quinta de Castro, Karim Lasri, Ibrahim Sambo Farouq, Lakshmi Subramanian, Victor Orozco-Olvera, Samuel Fraiberger https://aclanthology.org/2024.acl-long.488v2.pdf
Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation Songju Lei, Xize Cheng, Mengjiao Lyu, Jianqiao Hu, Jintao Tan, Runlin Liu, Lingyu Xiong, Tao Jin, Xiandong Li, Zhou Zhao https://aclanthology.org/2024.acl-long.543.pdf
OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe https://aclanthology.org/2024.acl-long.549.pdf
Don’t Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu https://aclanthology.org/2024.acl-long.652.pdf
Structured Tree Alignment for Evaluation of (Speech) Constituency Parsing Freda Shi, Kevin Gimpel, Karen Livescu https://aclanthology.org/2024.acl-long.666.pdf
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath https://aclanthology.org/2024.acl-long.673.pdf
A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech Gaurav Verma, Rynaa Grover, Jiawei Zhou, Binny Mathew, Jordan Kraemer, Munmun De Choudhury, Srijan Kumar https://aclanthology.org/2024.acl-long.684.pdf
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang https://aclanthology.org/2024.acl-long.697.pdf
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech Shengpeng Ji, Ziyue Jiang, Wang Hanting, Jialung Zuo, Zhou Zhao https://aclanthology.org/2024.acl-long.733.pdf
The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition Enshi Zhang, Rafael Trujillo, Christian Poellabauer https://aclanthology.org/2024.acl-long.752.pdf
Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Adrien Pupier, Maximin Coavoux, Jérôme Goulian, Benjamin Lecouteux Short, link
Explainability and Hate Speech: Structured Explanations Make Social Media Moderators Faster Agostina Calabrese, Leonardo Neves, Neil Shah, Maarten W. Bos, Björn Ross, Mirella Lapata, Francesco Barbieri https://aclanthology.org/2024.acl-short.38.pdf
On the Semantic Latent Space of Diffusion-Based Text-To-Speech Models Miri Varshavsky, Roy Hirsch, Regev Cohen, Tomer Golany, Daniel Freedman, Ehud Rivlin https://aclanthology.org/2024.acl-short.24.pdf
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang voice
Robust Singing Voice Transcription Serves Synthesis Ruiqi Li, Yu Zhang, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao voice
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang, Ziyue Jiang, Xuankai Chang, Jiatong Shi, CHAO WENG, Zhou Zhao, Dong Yu voice
Codec-SUPERB: An In-Depth Analysis of Sound Codec Models Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu-Hsuan Wang, Kai-Wei Chang, Alexander H. Liu, Hung-yi Lee Findings
Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models Findings
Wav2SQL: Direct Generalizable Speech-To-SQL Parsing
Multi-Modal Retrieval For Large Language Model Based Speech Recognition
ViHateT5: Enhancing Hate Speech Detection in Vietnamese With a Unified Text-to-Text Transformer Model
Speech-based Slot Filling using Large Language Models
LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
Semantic Role Labeling from Chinese Speech via End-to-End Learning
Revisiting Interpolation Augmentation for Speech-to-Text Generation
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation
SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models
SharedCon: Implicit Hate Speech Detection using Shared Semantics
IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages
wav2vec-S: Adapting Pre-trained Speech Models for Streaming
On the Evaluation of Speech Foundation Models for Spoken Language Understanding
Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition
Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech
Pushing the Limits of Zero-shot End-to-End Speech Translation
Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
Label-aware Hard Negative Sampling Strategies with Momentum Contrastive Learning for Implicit Hate Speech Detection
Aligning Speech Segments Beyond Pure Semantics
CTC-based Non-autoregressive Textless Speech-to-Speech Translation
MELD-ST: An Emotion-aware Speech Translation Dataset
Part-of-speech Tagging for Extremely Low-resource Indian Languages

Audio

https://2024.aclweb.org/program/finding_papers/

8 papers

Paper Authorlist Status
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou Long, link
StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli https://aclanthology.org/2024.acl-long.202.pdf
M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang https://aclanthology.org/2024.acl-long.489.pdf
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang https://aclanthology.org/2024.acl-long.697.pdf
MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector Marta R. Costa-jussà, Mariano Coria Meglioli, Pierre Andrews, David Dale, Prangthip Hansanti, Elahe Kalbassi, Alexandre Mourachko, Christophe Ropers, Carleigh Wood Findings
X-ACE: Explainable and Multi-factor Audio Captioning Evaluation Qian Wang, Jia-Chen Gu, Zhen-Hua Ling
Deepfake Defense: Constructing and Evaluating a Specialized Urdu Deepfake Audio Dataset Sheza Munir, Wassay Sajjad, Mukeet Raza, Emaan Mujahid Abbas, Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
Beyond Orthography: Automatic Recovery of Short Vowels and Dialectal Sounds in Arabic Yassine El Kheir, Hamdy Mubarak, Ahmed Ali, Shammur Absar Chowdhury sound

EMNLP'24

useful link: https://2024.emnlp.org/program/accepted_main_conference/

https://2024.emnlp.org/program/accepted_findings/

Speech

58 papers

Paper Authorlist Status
When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection Xiangyu Zhang, Hexin Liu, Kaishuai Xu, Qiquan Zhang, Daijiao Liu, Beena Ahmed, Julien Epps Main, link
Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model Xiangyu Zhang, Daijiao Liu, Hexin Liu, Qiquan Zhang, Hanyu Meng, Leibny Paola Garcia Perera, EngSiong Chng, Lina Yao https://aclanthology.org/2024.emnlp-main.9.pdf
Scaling Properties of Speech Language Models Santiago Cuervo, Ricard Marxer https://aclanthology.org/2024.emnlp-main.21.pdf
EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models Maureen de Seyssel, Antony D’Avirro, Adina Williams, Emmanuel Dupoux https://aclanthology.org/2024.emnlp-main.30.pdf
Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering Helena Bonaldi, Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata, Marco Guerini https://aclanthology.org/2024.emnlp-main.201.pdf
AlignCap: Aligning Speech Emotion Captioning to Human Preferences Ziqi Liang, Haoxiang Shi, Hanhui Chen https://aclanthology.org/2024.emnlp-main.224.pdf
F$^2$RL: Factuality and Faithfulness Reinforcement Learning Framework for Claim-Guided Evidence-Supported Counterspeech Generation Haiyang Wang, Yuchen Pan, Xin Song, Xuechen Zhao, Minghao Hu, Bin Zhou https://aclanthology.org/2024.emnlp-main.255.pdf
Outcome-Constrained Large Language Models for Countering Hate Speech Lingzi Hong, Pengcheng Luo, Eduardo Blanco, Xiaoying Song https://aclanthology.org/2024.emnlp-main.260.pdf
On Mitigating Performance Disparities in Multilingual Speech Recognition Monorama Swain, Anna Katrine van Zee, Anders Søgaard https://aclanthology.org/2024.emnlp-main.323.pdf
Methods of Automatic Matrix Language Determination for Code-Switched Speech Olga Iakovenko, Thomas Hain https://aclanthology.org/2024.emnlp-main.330.pdf
EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning Ashish Seth, Ramaneswaran S, S Sakshi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha https://aclanthology.org/2024.emnlp-main.366.pdf
Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales https://aclanthology.org/2024.emnlp-main.430.pdf
Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning Ming Shan Hee, Aditi Kumaresan, Roy Ka-Wei Lee https://aclanthology.org/2024.emnlp-main.445.pdf
Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition Hsuan Su, Hua Farn, Fan-Yun Sun, Shang-Tse Chen, Hung-yi Lee https://aclanthology.org/2024.emnlp-main.503.pdf
ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers Yuzhe Gu, Enmao Diao https://aclanthology.org/2024.emnlp-main.562.pdf
Towards Robust Speech Representation Learning for Thousands of Languages William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe https://aclanthology.org/2024.emnlp-main.570.pdf
Speechworthy Instruction-tuned Language Models Hyundong Justin Cho, Nicolaas Paul Jedema, Leonardo F. R. Ribeiro, Karishma Sharma, Pedro Szekely, Alessandro Moschitti, Ruben Janssen, Jonathan May https://aclanthology.org/2024.emnlp-main.595.pdf
Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights Hao Yang, Lizhen Qu, Ehsan Shareghi, Reza Haf https://aclanthology.org/2024.emnlp-main.614.pdf
Integrating Argumentation and Hate-Speech-based Techniques for Countering Misinformation Sougata Saha, Rohini Srihari https://aclanthology.org/2024.emnlp-main.622.pdf
Unveiling the Role of Pretraining in Direct Speech Translation Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà https://aclanthology.org/2024.emnlp-main.630.pdf
Multi-Level Cross-Modal Alignment for Speech Relation Extraction Liang Zhang, Zhen Yang, Biao Fu, Ziyao Lu, Liangying Shao, Shiyu Liu, Fandong Meng, Jie Zhou, Xiaoli Wang, Jinsong Su https://aclanthology.org/2024.emnlp-main.668.pdf
Self-Powered LLM Modality Expansion for Large Speech-Text Models Tengfei Yu, Xuebo Liu, Zhiyi Hou, Liang Ding, Dacheng Tao, Min Zhang https://aclanthology.org/2024.emnlp-main.690.pdf
Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach Siqi Li, Danni Liu, Jan Niehues https://aclanthology.org/2024.emnlp-main.708.pdf
Towards an Open-Source Speech Foundation Model for EU: 950,000 Hours of Open-Source Compliant Speech Data for EU Languages Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri https://aclanthology.org/2024.emnlp-main.771.pdf
VHASR: A Multimodal Speech Recognition System With Vision Hotwords Jiliang Hu, Zuchao Li, Ping Wang, Haojun Ai, Lefei Zhang, Hai Zhao https://aclanthology.org/2024.emnlp-main.821.pdf
AudioVSR: Enhancing Video Speech Recognition with Audio Data Xiaoda Yang, Xize Cheng, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin https://aclanthology.org/2024.emnlp-main.858.pdf
Hate Personified: Investigating the role of LLMs in content moderation pipeline for hate speech Sarah Masud, Sahajpreet Singh, Viktor Hangya, Alexander Fraser, Tanmoy Chakraborty https://aclanthology.org/2024.emnlp-main.886.pdf
Please note that I’m just an AI: Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification Esra Dönmez, Thang Vu, Agnieszka Falenska https://aclanthology.org/2024.emnlp-main.1019.pdf
BLSP-Emo: Towards Empathetic Large Speech-Language Models Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang https://aclanthology.org/2024.emnlp-main.1070.pdf
Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection Camilla Casula, Sebastiano Vecellio Salto, Alan Ramponi, Sara Tonelli
Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech Guan-Ting Lin, Wei Ping Huang, Hung-yi Lee
Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding YeonJoon Jung, Jaeseong Lee, Seungtaek Choi, Dohyeon Lee, Minsoo Kim, seung-won hwang
Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang
PREDICT: Multi-Agent-based Debate Simulation for Generalized Hate Speech Detection Someen Park, Jaehoon Kim, Seungwan Jin, Sohyun Park, Kyungsik Han
TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR Shashi Kumar, Srikanth Madikeri, Juan Pablo Zuluaga Gomez, Iuliia Thorbecke, Esaú VILLATORO-TELLO, Sergio Burdisso, Petr Motlicek, Karthik Pandia D S, Aravind Ganapathiraju
Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, Dirk Hovy
Casablanca: Data and Models for Multidialectal Arabic Speech Recognition Bashar Talafha, Karima Kadaoui, Samar Mohamed Magdy, Mariem Habiboullah, Chafei Mohamed Chafei, Ahmed Oumar El-Shangiti, et al.
SpeechQE: Estimating the Quality of Direct Speech Translation HyoJung Han, Kevin Duh, Marine Carpuat
Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model Mana Makinae, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Is Child-Directed Speech Effective Training Data for Language Models? Steven Y. Feng, Noah Goodman, Michael Frank
HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models Huy Nghiem, Hal Daumé III Findings
PolyWER: A Holistic Evaluation Framework for Code-Switched Speech Recognition Karima Kadaoui, Maryam Al Ali, Hawau Olamide Toyin, Ibrahim Mohammed, Hanan Aldarmaki
STTATTS: Unified Speech-To-Text And Text-To-Speech Model Hawau Olamide Toyin, Hao Li, Hanan Aldarmaki
Contextualized Graph Representations for Generating Counter-Narrative against Hate Speech Selene Baez Santamaria, Helena Gomez Adorno, Ilia Markov
LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models Zuhair hasan shaik, Pradyoth Hegde, Prashant Bannulmath, Deepak K T
MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing Jeonghun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro
Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation G M Shahariar, Jia Chen, Jiachen Li, Yue Dong
Breaking the Boundaries: A Unified Framework for Chinese Named Entity Recognition Across Text and Speech Jinzhong Ning, Yuanyuan Sun, Bo Xu, Zhihao Yang, Ling Luo, Hongfei Lin
Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech Youngjae Kim, Yejin Jeon, Gary Lee
Modeling Gender and Dialect Bias in Automatic Speech Recognition Camille Harris, Chijioke Mgbahurike, Neha Kumar, Diyi Yang
LLM generated responses to mitigate the impact of hate speech Jakub Podolak, Szymon Łukasik, Paweł Balawender, Jan Ossowski, Jan Piotrowski, Katarzyna Bąkowicz, Piotr Sankowski
BLASER 2.0: a metric for evaluation and quality estimation of massively multilingual speech and text translation David Dale, Marta R. Costa-jussà
Textless Speech-to-Speech Translation With Limited Parallel Data Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada
Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS. Onkar Kishor Susladkar, Vishesh Tripathi, Biddwan Ahmed
Recent Advances in Online Hate Speech Moderation: Multimodality and the Role of Large Models Ming Shan Hee, Shivam Sharma, RUI CAO, Palash Nandi, Preslav Nakov, Tanmoy Chakraborty, Roy Ka-Wei Lee
WavLLM: Towards Robust and Adaptive Speech Large Language Model Shujie HU, Long Zhou, Shujie LIU, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei

Audio

22 papers

Paper Authorlist Status
IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding Pengcheng Li, Xulong Zhang, Jing Xiao, Jianzong Wang Main
Cross-Domain Audio Deepfake Detection: Dataset and Analysis Yuang Li, Min Zhang, Mengxin Ren, Xiaosong Qiao, Miaomiao Ma, Daimeng Wei, Hao Yang
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation Tanvir Mahmud, Diana Marculescu
AudioVSR: Enhancing Video Speech Recognition with Audio Data Xiaoda Yang, Xize Cheng, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin
PALM: Few-Shot Prompt Learning for Audio Language Models Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, Hanan Aldarmaki
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control Yu Zhang, Ziyue Jiang, Ruiqi Li, Changhao Pan, Jinzheng He, Rongjie Huang, Chuxin Wang, Zhou Zhao voice
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects Orevaoghene Ahia, Anuoluwapo Aremu, Diana Abagyan, Hila Gonen, David Ifeoluwa Adelani, Daud Abolade, Noah A. Smith, Yulia Tsvetkov voice
EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control Haozhe Chen, Run Chen, Julia Hirschberg voice
Voices in a Crowd: Searching for clusters of unique perspectives Nikolas Vitsakis, Amit Parekh, Ioannis Konstas voice
With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models Tyler Loakman, YUCHENG LI, Chenghua Lin sound
Adaptive Immune-based Sound-Shape Code Substitution for Adversarial Chinese Text Attacks Ao Wang, Xinghao Yang, Chen Li, Bao-di Liu, Weifeng Liu sound
A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick Nishant Balepur, Matthew Shu, Alexander Hoyle, Alison Robey, Shi Feng, Seraphina Goldfarb-Tarrant, Jordan Lee Boyd-Graber sound
A Fast and Sound Tagging Method for Discontinuous Named-Entity Recognition Caio Filippo Corro sound
Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D’Haro, Robby T. Tan, Haizhou Li Findings
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon
Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Review Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, Aman Chadha
Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech Youngjae Kim, Yejin Jeon, Gary Lee
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang
PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain Jianyi Chen, Zheqi DAI, Zhen Ye, Xu Tan, Qifeng Liu, Yike Guo, Wei Xue
Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion Giuseppe Ruggiero, Matteo Testa, Jurgen Van de Walle, Luigi Di Caro voice
HSDreport: Heart Sound Diagnosis with Echocardiography Reports Zihan Zhao, Pingjie Wang, Liudan Zhao, Yuchen Yang, Ya Zhang, Kun Sun, Xin Sun, Xin Zhou, Yu Wang, Yanfeng Wang sound

NAACL'25

useful link: https://2025.naacl.org/program/accepted_papers/

Speech

Paper Authorlist Status
Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Teh-Hwa Kao
Decoding Hate: Exploring Language Models’ Reactions to Hate Speech Paloma Piot, Javier Parapar
Multi$^3$Hate: Multimodal, Multilingual, and Multicultural Hate Speech Detection with Vision–Language Models Minh Duc Bui, Katharina von der Wense, Anne Lauscher
CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs Amey Hengle, Aswini Kumar Padhi, Anil Bandhakavi, Tanmoy Chakraborty
MAD Speech: Measures of Acoustic Diversity of Speech Matthieu Futeral, Andrea Agostinelli, Marco Tagliasacchi, Neil Zeghidour, Eugene Kharitonov
Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond Mardhiyah Sanni, Tassallah Abdullahi, Devendra Deepak Kayande, Emmanuel Ayodele, Naome A Etori, Michael Samwel Mollel, Moshood O. Yekini, Chibuzor Okocha, Lukman Enegi Ismaila, Folafunmi Omofoye, Boluwatife A. Adewale, Tobi Olatunji
Wav2Prompt: End-to-End Speech Prompt Learning and Task-based Fine-tuning for Text-based LLMs Keqi Deng, Guangzhi Sun, Phil Woodland
On the Role of Speech Data in Reducing Toxicity Detection Bias Samuel Bell, Mariano Coria Meglioli, Megan Richards, Eduardo Sánchez, Christophe Ropers, Skyler Wang, Adina Williams, Levent Sagun, Marta R. Costa-jussà
Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, David R Mortensen
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion Yinghao Aaron Li, Xilin Jiang, Cong Han, Nima Mesgarani
AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Saminu Mohammad Aliyu, Paul Röttger, Abigail Oppong, Andiswa Bukula, et al.
Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison Tsz Kin Lam, Marco Gaido, Sara Papi, Luisa Bentivogli, Barry Haddow
ProSE: Diffusion Priors for Speech Enhancement Sonal Kumar, Sreyan Ghosh, Utkarsh Tyagi, Anton Jeran Ratnarajah, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning Yifan Peng, Krishna C Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg
DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition Wonjun Lee, Solee Im, Heejin Do, Yunsu Kim, Jungseul Ok, Gary Lee
How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations Hyunji Lee, Danni Liu, Supriti Sinhamahapatra, Jan Niehues short
Developing multilingual speech synthesis system for Ojibwe, Mi’kmaq, and Maliseet Shenran Wang, Changbing Yang, Michael l parkhill, Chad Quinn, Christopher Hammerly, Jian Zhu
Cross-Lingual Transfer Learning for Speech Translation Rao Ma, Mengjie Qian, Yassir Fathullah, Siyuan Tang, Mark Gales, Kate Knill
kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech Karl El Hajal, Ajinkya Kulkarni, Enno Hermann, Mathew Magimai Doss
WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching Tianze Luo, Xingchen Miao, Wenbo Duan
DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility Yifan Liu, Yu Fang, Zhouhan Lin Findings
BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla Fabiha Haider, Fariha Tanjim Shifat, Md Farhan Ishmam, Md Sakib Ul Rahman Sourove, Deeparghya Dutta Barua, Md Fahim, Md Farhad Alam Bhuiyan
CDB: A Unified Framework for Hope Speech Detection Through Counterfactual, Desire and Belief Tulio Ferreira Leite da Silva, Gonzalo Freijedo Aduna, Farah Benamara, Alda Mari, Zongmin Li, Li Yue, Jian Su
Untangling Hate Speech Definitions: A Semantic Componential Analysis Across Cultures and Domains Katerina Korre, Arianna Muti, Federico Ruggeri, Alberto Barrón-Cedeño
Exploring Large Language Models for Hate Speech Detection in Rioplatense Spanish Juan Manuel Pérez, Paula Miguel, Viviana Cotik
Unsupervised Speech-text word-level alignment with Dynamic Programming Tianshu Yu, Zihan Gong, Minghuan Tan, Guhong Chen, Min Yang
Prompt-Guided Selective Masking Loss for Context-Aware Emotive Text-to-Speech Yejin Jeon, Youngjae Kim, Jihyun Lee, Gary Lee
Echoes of Discord: Forecasting Hater Reactions to Counterspeech Xiaoying Song, Sharon Lisseth Perez, Xinchen Yu, Eduardo Blanco, Lingzi Hong
Continuous Speech Tokenizer in Text To Speech Yixing Li, Ruobing Xie, Xingwu Sun, Yu Cheng, Zhanhui Kang
CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation Xi Xu, Wenda Xu, Siqi Ouyang, Lei Li
Gender Bias in Instruction-Guided Speech Synthesis Models Chun-Yi Kuan, Hung-yi Lee
Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection Koji Inoue, Divesh Lala, Gabriel Skantze, Tatsuya Kawahara voice
Playing with Voices: Tabletop Role-Playing Game Recordings as a Diarization Challenge Lian Remme, Kevin Tang voice

Audio

| Paper | Authorlist | Status |
| --- | --- | --- |
| Ihquin tlahtouah in Tetelahtzincocah: An annotated, multi-purpose audio and text corpus of Western Sierra Puebla Nahuatl | Robert Pugh, Cheyenne Wing, María Ximena Juárez Huerta, Angeles Márquez Hernandez, Francis M. Tyers | |
| PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification | Ashish Seth, Ramaneswaran Selvakumar, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha | |
| AudioBench: A Universal Benchmark for Audio Large Language Models | Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen | |
| Audio Is the Achilles’ Heel: Red Teaming Audio Large Multimodal Models | Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari | |
| Do Audio-Language Models Understand Linguistic Variations? | Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri, Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha | |
| Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection | Yassine El Kheir, Younes Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller | |
| Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies | Yingqiang Gao, Lukas Fischer, Alexa Lintner, Sarah Ebling | |
| Synthetic Audio Helps for Cognitive State Tasks | Adil Soubki, John Murzaku, Peter Zeng, Owen Rambow | |

AAAI'25

Speech

| Paper | Authorlist | Status |
| --- | --- | --- |
| ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering | Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen | |
| Language model can listen while speaking | Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen | |
| VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization | Tao Liu, Ziyang Ma, Qi Chen, Feilong Chen, Shuai Fan, Xie Chen, Kai Yu | |
| Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration | Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen | |
| DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech | | |
| FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles | | |
| Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation | | |
| SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models | | |
| EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion | | |
| BSDB-Net: Band-Split Dual-Branch Network with Selective State Spaces Mechanism for Monaural Speech Enhancement | | |
| Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization | | |
| MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula | | |
| Speech Watermarking with Discrete Intermediate Representations | | |
| ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis | | |
| Complex-Cycle-Consistent Diffusion Model for Monaural Speech Enhancement | | |
| StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching | | |
| DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation | | |
| Drop the beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation | | |
| Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts | | |

Audio

| Paper | Authorlist | Status |
| --- | --- | --- |
| Codec does matter: Exploring the semantic shortcoming of codec for audio language model | Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue | |
| TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching | | |
| MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition | | |
| GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions | | |
| Detecting Music Performance Errors with Transformers | | |
| SoundBrush: Sound as a Brush for Visual Scene Editing | | |
| Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control | | |
| SongGLM: Lyric-to-Melody Generation with 2D Alignment Encoding and Multi-Task Pre-Training | | |
| JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts | | |
| Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning | | |
| Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration | | |
| DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | | |
| Region-Based Optimization in Continual Learning for Audio Deepfake Detection | | |
| Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content | | |
| GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression | | |
| PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis | | |
| Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation | | |
| JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation | | |
| SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor | | |
| CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder | | |
| Read, Watch and Scream! Sound Generation from Text and Video | | |
| Mental-Perceiver: Audio-Textual Multi-Modal Learning for Estimating Mental Disorders | | |

IJCAI'24

Useful link: https://ijcai24.org/main-track-accepted-papers/index.html

The number of speech & audio papers at this conference is small.

Speech

| Paper | Authorlist | Status |
| --- | --- | --- |
| Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction | Zhaoxi Mu, Xinyu Yang | |
| Bridge to Non-Barrier Communication: Gloss-Prompted Fine-Grained Cued Speech Gesture Generation with Diffusion Model | Wentao Lei, Li Liu, Jun Wang | |
| Two-stage Semi-supervised Speaker Recognition with Gated Label Learning | Xingmei Wang, Jiaxiang Meng, Kong Aik Lee, Boquan Li, Jinghan Liu | |
| Discriminative Feature Decoupling Enhancement for Speech Forgery Detection | Yijun Bei, Xing Zhou, Erteng Liu, Yang Gao, Sen Lin, Kewei Gao, Zunlei Feng | |
| Innovative Directional Encoding in Speech Processing: Leveraging Spherical Harmonics Injection for Multi-Channel Speech Enhancement | Jiahui Pan, Pengjie Shen, Hui Zhang, Xueliang Zhang | |
| Contextualized Speech Recognition: Rethinking Second-Pass Rescoring with Generative Large Language Models | Yixuan Tang, Anthony K. H. Tung | |
| Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis | Zhoulin Ji, Chenhao Lin, Hang Wang, Chao Shen | |
| Decoupling Breaks Data Barriers: A Decoupled Pre-training Framework for Multi-intent Spoken Language Understanding | Libo Qin, Qiguang Chen, Jingxuan Zhou, Qinzheng Li, Chunlin Lu, Wanxiang Che | |
| Recent Advances in End-to-End Simultaneous Speech Translation | Xiaoqian Liu, Guoqiang Hu, Yangfan Du, Erfeng He, YingFeng Luo, Chen Xu, Tong Xiao, Jingbo Zhu | Survey |

Audio

| Paper | Authorlist | Status |
| --- | --- | --- |
| EAT: Self-Supervised Pre-Training with Efficient Audio Transformer | Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen | |
| Generating More Audios for End-to-End Spoken Language Understanding | Xuxin Cheng, Yuexian Zou | |
| BATON: Aligning Text-to-Audio Model Using Human Preference Feedback | Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Qinmei Xu, Zunnan Xu, Jingquan Liu, Jiasheng Lu, Xiu Li | |
| HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis | Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, Liang Hu | |
| InstructME: An Instruction Guided Music Edit Framework with Latent Diffusion Models | Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song | |

ICML'25

Speech

| Paper | Authorlist | Status |
| --- | --- | --- |
| MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition | | https://arxiv.org/abs/2502.10447 |
| The Brain's Bitter Lesson: Scaling Speech Decoding With Self-Supervised Learning | | https://arxiv.org/abs/2406.04328 |
| DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis | | https://arxiv.org/abs/2410.11097 |
| Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech | | https://openreview.net/forum?id=v9LjNopQ6W&noteId=B8CPk9usHO |
| BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models | | |
| Emotional Face-to-Speech | | https://arxiv.org/abs/2502.01046 |
| DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation | | https://arxiv.org/abs/2502.03930 |
| Unsupervised Blind Speech Separation with a Diffusion Prior | | https://arxiv.org/abs/2505.05657 |
| Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems | | |
| High-Fidelity Simultaneous Speech-To-Speech Translation | | https://arxiv.org/abs/2502.03382 |
| Improving Conversational Capabilities of Speech Language Models via Generative Dual-channel Spoken Dialogue Learning | | |
| Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM | | https://arxiv.org/abs/2411.00774 |
| OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models | | https://arxiv.org/abs/2502.10373 |
| Long-Form Speech Generation with Spoken Language Models | | https://arxiv.org/abs/2412.18603 |
| Aligning Spoken Dialogue Models from User Interactions | | spoken dialogue |
| A Variational Framework for Improving Naturalness in Generative Spoken Language Models | | |
| De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks | | |
| Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding | | https://arxiv.org/abs/2505.07235 |

Audio

| Paper | Authorlist | Status |
| --- | --- | --- |
| XAttnMark: Learning Robust Audio Watermarking with Cross-Attention | | https://arxiv.org/abs/2502.04230 |
| ETTA: Elucidating the Design Space of Text-to-Audio Models | | https://arxiv.org/abs/2412.19351 |
| Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities | | https://arxiv.org/abs/2503.03983 |
| Sounding that Object: Interactive Object-Aware Image to Audio Generation | | |
| ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling | | https://arxiv.org/abs/2504.10344 |
| Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction | | |
| Supervised Contrastive Learning from Weakly-Labeled Audio Segments for Musical Version Matching | | https://arxiv.org/abs/2502.16936 |
| FLAM: Frame-Wise Language-Audio Modeling | | https://arxiv.org/abs/2505.05335 |
| MATS: An Audio Language Model under Text-only Supervision | | https://arxiv.org/abs/2502.13433 |
| AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment | | https://arxiv.org/abs/2501.18314 |
| AudioSpace: Generating Spatial Audio from 360-Degree Video | | https://arxiv.org/abs/2504.14906 |
| IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling | | |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | | https://arxiv.org/abs/2502.11775 |
| Efficient Fine-Grained Guidance for Diffusion-Based Symbolic Music Generation | | music, https://arxiv.org/abs/2410.08435 |
| MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners | | |
| Gaussian Mixture Flow Matching Models | | flow matching |
| SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation | | song |

AAAI'26

Speech

| Paper | Authorlist | Status |
| --- | --- | --- |
| MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement | | |

Audio

| Paper | Authorlist | Status |
| --- | --- | --- |
| StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model | | |

Useful Survey & Awesome Link

  1. Amphion v0.2 technical report https://arxiv.org/abs/2501.15442

  2. Emilia-Large: a larger release, with more experimental results and details https://arxiv.org/abs/2501.15907

  3. AnyEnhance: a single model that handles speech enhancement, singing voice enhancement, target speaker extraction, and more https://arxiv.org/abs/2501.15417

Citation

If you find this repository helpful, please consider citing:

```bibtex
@misc{Zhang2025SpeechAudio,
  title = {Speech-and-audio-papers-Top-Conference},
  author = {Bowen Zhang},
  year = {2025},
  howpublished = {\url{https://github.com/01Zhangbw/Speech-and-audio-papers-Top-Conference}},
}
```

License

This repository is released under the MIT license.
