Stars are welcome ⭐ Discuss in Issues or collaborate via PRs 👏 Feel free to contact 📧 me at zhangbw0102@gmail.com.
🎉 [01/23/2025] UPDATE ICLR 2025 conference papers successfully!
🎉 [01/23/2025] UPDATE ICLR 2024 conference papers successfully!
🎉 [01/29/2025] UPDATE ICML 2024 conference papers successfully!
🎉 [01/29/2025] UPDATE NeurIPS 2024 conference papers successfully!
🎉 [01/30/2025] UPDATE ICML 2023 conference papers successfully!
🎉 [01/30/2025] UPDATE NeurIPS 2023 conference papers successfully!
🎉 [01/30/2025] UPDATE ACMMM 2024 conference papers successfully!
🎉 [01/30/2025] UPDATE ICLR 2023 conference papers successfully!
🎉 [01/30/2025] UPDATE AAAI 2024 conference papers successfully!
🎉 [01/31/2025] UPDATE ACL 2024 conference papers successfully!
🎉 [01/31/2025] UPDATE EMNLP 2024 conference papers successfully!
🎉 [03/24/2025] UPDATE NAACL 2025 conference papers successfully!
🎉 [04/22/2025] UPDATE AAAI 2025 conference papers successfully!
🎉 [04/22/2025] UPDATE IJCAI 2024 conference papers successfully!
🎉 [05/16/2025] UPDATE ICML 2025 conference papers successfully!
🎉 [01/24/2026] UPDATE AAAI 2026 conference papers successfully!
Speech and audio papers@Top Conference
- ICLR'25
- ICLR'24
- ICML'24
- NeurIPS'24
- ICML'23
- NeurIPS'23
- ACMMM'24
- ICLR'23
- AAAI'24
- ACL'24
- EMNLP'24
- NAACL'25
- AAAI'25
- IJCAI'24
- ICML'25
- AAAI'26
- Useful Survey & Awesome Link
- Citation
- License
ICLR'25 total submission: 11672; accepted: 3706 (31.75%)
This list covers speech papers with good or middling ratings (usually above 5), whether accepted or not.
ICLR'25 has 100+ speech papers in total; we select 49.
re denotes rejected. con denotes conditional on ethics review. A number such as 5668 lists the individual ratings: 5, 6, 6, 8.
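The compact rating notation can be decoded with a few lines of Python. This is only a minimal sketch: the helper name `decode_ratings` is hypothetical, and it assumes every score is a single digit, so a score of 10 cannot be expressed in this form.

```python
from statistics import mean


def decode_ratings(code: str) -> list[int]:
    """Split a compact rating string such as "5668" into individual scores.

    Assumption: each reviewer score is exactly one ASCII digit.
    """
    if not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected ASCII digits only, got {code!r}")
    return [int(ch) for ch in code]


ratings = decode_ratings("5668")
print(ratings)        # [5, 6, 6, 8]
print(mean(ratings))  # 6.25
```

The average is shown only as a convenience for comparing papers; the list itself does not state how ratings are aggregated.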
This list covers speech papers with good or middling ratings (usually above 5), whether accepted or not.
ICLR'25 has 70+ speech papers in total; we select 36.
Whether a paper is accepted depends mainly on its ratings. Ratings in the speech/audio track are not high and are much lower than in tracks such as CV and NLP. Rebuttals are very important!
This list covers speech papers with good or middling ratings (usually above 5), whether accepted or not.
ICLR'24 has 50+ speech papers in total; we select 20+.
This list covers speech papers with good or middling ratings (usually above 5), whether accepted or not.
ICLR'24 has 20+ speech papers in total; we select 17.
This year, the number of papers is not very large.
useful link: https://nips.cc/virtual/2024/papers.html?filter=titles&search=speech
useful link: https://nips.cc/virtual/2024/papers.html?filter=titles&search=audio
useful link: https://icml.cc/virtual/2023/papers.html?filter=titles&search=speech
useful link: https://icml.cc/virtual/2023/papers.html?filter=titles&search=audio
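The search links above share one query pattern, which can be rebuilt programmatically. The sketch below is an illustration only: `search_url` is a hypothetical helper, and it is an assumption that other venues or years expose the same `virtual/<year>/papers.html` page with the same `filter` and `search` parameters.

```python
from urllib.parse import urlencode


def search_url(host: str, year: int, keyword: str) -> str:
    """Build a title-search link in the same shape as the links listed above."""
    query = urlencode({"filter": "titles", "search": keyword})
    return f"https://{host}/virtual/{year}/papers.html?{query}"


# Reproduce the speech and audio links for the two venues listed above.
for host, year in [("nips.cc", 2024), ("icml.cc", 2023)]:
    for keyword in ("speech", "audio"):
        print(search_url(host, year, keyword))
```

Using `urlencode` rather than string concatenation keeps multi-word keywords safely escaped.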
useful link: https://aaai.org/wp-content/uploads/2024/02/AAAI-24_Main_2024-02-01.pdf
https://github.com/DmitryRyumin/AAAI-2024-Papers
| Paper | Link |
|---|---|
| Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation | https://arxiv.org/abs/2312.10877 |
| UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding | https://arxiv.org/abs/2306.07547 |
| Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation | https://arxiv.org/abs/2401.03468 |
| Visual Hallucination Elevates Speech Recognition | https://ojs.aaai.org/index.php/AAAI/article/view/29926 |
| Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales | https://ojs.aaai.org/index.php/AAAI/article/view/29743 |
| Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition | https://ojs.aaai.org/index.php/AAAI/article/view/29882 |
| MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis | https://arxiv.org/abs/2312.10687 |
| Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling | https://arxiv.org/abs/2312.11947 |
| Let There Be Sound: Reconstructing High Quality Speech from Silent Videos | https://arxiv.org/abs/2308.15256 |
| Divergence-Guided Simultaneous Speech Translation | https://ojs.aaai.org/index.php/AAAI/article/view/29733 |
| SECap: Speech Emotion Captioning with Large Language Model | https://arxiv.org/abs/2312.10381 |
| Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction | https://arxiv.org/abs/2312.10305 |
| Paper | Link |
|---|---|
| AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis | https://arxiv.org/abs/2312.10921 |
| V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models | https://arxiv.org/abs/2308.09300 |
| What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection | https://arxiv.org/abs/2312.09651 |
| Audio Generation with Multiple Conditional Diffusion Model | https://arxiv.org/abs/2308.11940 |
| AVSegFormer: Audio-Visual Segmentation with Transformer | https://ojs.aaai.org/index.php/AAAI/article/view/29104 |
| Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation | https://arxiv.org/abs/2309.16429 |
| Sample-Constrained Black Box Optimization for Audio Personalization | https://ojs.aaai.org/index.php/AAAI/article/view/28881 |
| DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification | https://ojs.aaai.org/index.php/AAAI/article/view/29716 |
| CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments | https://arxiv.org/abs/2306.04047 |
| Learning Temporal Resolution in Spectrogram for Audio Classification | https://arxiv.org/abs/2210.01719 |
| SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network | https://arxiv.org/abs/2312.16149 |
| Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation | https://arxiv.org/abs/2312.08673 |
| Improving Audio-Visual Segmentation with Bidirectional Generation | https://arxiv.org/abs/2308.08288 |
| Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification | https://ojs.aaai.org/index.php/AAAI/article/view/29015 |
| Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering | https://arxiv.org/abs/2312.12816 |
| Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer | https://arxiv.org/abs/2309.07929 |
useful link: https://2024.aclweb.org/program/main_conference_papers/#long-papers
https://2024.aclweb.org/program/finding_papers/
60 papers
| Paper | Authorlist | Status |
|---|---|---|
| GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators | Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, EngSiong Chng | Long, link |
| Wav2Gloss: Generating Interlinear Glossed Text from Speech | Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel Romney Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R Mortensen, Lori Levin | https://aclanthology.org/2024.acl-long.34.pdf |
| A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation | Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min zhang | https://aclanthology.org/2024.acl-long.85.pdf |
| Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer | Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu | https://aclanthology.org/2024.acl-long.97.pdf |
| Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing? | Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli | https://aclanthology.org/2024.acl-long.789.pdf |
| StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection | Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli | https://aclanthology.org/2024.acl-long.202.pdf |
| Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization? | Roshan Sharma, Suwon Shon, Mark Lindsey, Hira Dhamyal, Bhiksha Raj | https://aclanthology.org/2024.acl-long.790.pdf |
| LLM Knows Body Language, Too: Translating Speech Voices into Human Gestures | Chenghao Xu, Guangtao Lyu, Jiexi Yan, Muli Yang, Cheng Deng | https://aclanthology.org/2024.acl-long.273.pdf |
| RepCodec: A Speech Representation Codec for Speech Tokenization | Zhichao Huang, Chutong Meng, Tom Ko | https://aclanthology.org/2024.acl-long.314.pdf |
| Error-preserving Automatic Speech Recognition of Young English Learners’ Language | Janick Michot, Manuela Hürlimann, Jan Milan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak | https://aclanthology.org/2024.acl-long.348.pdf |
| Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? | Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min zhang, Yang Feng | https://aclanthology.org/2024.acl-long.392.pdf |
| Multimodal Contextualized Semantic Parsing from Speech | Jordan Voas, David Harwath, Ray Mooney | https://aclanthology.org/2024.acl-long.398.pdf |
| SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network | Kexin Wang, Jiahong Zhang, Yong Ren, Man Yao, Di Shang, Bo XU, Guoqi Li | https://aclanthology.org/2024.acl-long.429.pdf |
| Speech Sense Disambiguation: Tackling Homophone Ambiguity in End-to-End Speech Translation | Tengfei Yu, Xuebo Liu, Liang Ding, Kehai Chen, Dacheng Tao, Min Zhang | https://aclanthology.org/2024.acl-long.435.pdf |
| Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation | Keqi Deng, Phil Woodland | https://aclanthology.org/2024.acl-long.448.pdf |
| Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t | Chihiro Taguchi, David Chiang | https://aclanthology.org/2024.acl-long.827.pdf |
| Speech language models lack important brain-relevant semantics | SUBBA REDDY OOTA, Emin Çelik, Fatma Deniz, Mariya Toneva | https://aclanthology.org/2024.acl-long.462.pdf |
| StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning | Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min zhang, Yang Feng | https://aclanthology.org/2024.acl-long.485.pdf |
| NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data | Manuel Tonneau, Pedro Vitor Quinta de Castro, Karim Lasri, Ibrahim Sambo Farouq, Lakshmi Subramanian, Victor Orozco-Olvera, Samuel Fraiberger | https://aclanthology.org/2024.acl-long.488v2.pdf |
| Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation | Songju Lei, Xize Cheng, Mengjiao Lyu, Jianqiao Hu, Jintao Tan, Runlin Liu, Lingyu Xiong, Tao Jin, Xiandong Li, Zhou Zhao | https://aclanthology.org/2024.acl-long.543.pdf |
| OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification | Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe | https://aclanthology.org/2024.acl-long.549.pdf |
| Don’t Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection | Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu | https://aclanthology.org/2024.acl-long.652.pdf |
| Structured Tree Alignment for Evaluation of (Speech) Constituency Parsing | Freda Shi, Kevin Gimpel, Karen Livescu | https://aclanthology.org/2024.acl-long.666.pdf |
| VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild | Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath | https://aclanthology.org/2024.acl-long.673.pdf |
| A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech | Gaurav Verma, Rynaa Grover, Jiawei Zhou, Binny Mathew, Jordan Kraemer, Munmun De Choudhury, Srijan Kumar | https://aclanthology.org/2024.acl-long.684.pdf |
| XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception | HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang | https://aclanthology.org/2024.acl-long.697.pdf |
| MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech | Shengpeng Ji, Ziyue Jiang, Wang Hanting, Jialung Zuo, Zhou Zhao | https://aclanthology.org/2024.acl-long.733.pdf |
| The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition | Enshi Zhang, Rafael Trujillo, Christian Poellabauer | https://aclanthology.org/2024.acl-long.752.pdf |
| Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech | Adrien Pupier, Maximin Coavoux, Jérôme Goulian, Benjamin Lecouteux | Short, link |
| Explainability and Hate Speech: Structured Explanations Make Social Media Moderators Faster | Agostina Calabrese, Leonardo Neves, Neil Shah, Maarten W. Bos, Björn Ross, Mirella Lapata, Francesco Barbieri | https://aclanthology.org/2024.acl-short.38.pdf |
| On the Semantic Latent Space of Diffusion-Based Text-To-Speech Models | Miri Varshavsky, Roy Hirsch, Regev Cohen, Tomer Golany, Daniel Freedman, Ehud Rivlin | https://aclanthology.org/2024.acl-short.24.pdf |
| StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion | Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang | voice |
| Robust Singing Voice Transcription Serves Synthesis | Ruiqi Li, Yu Zhang, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao | voice |
| Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners | Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang, Ziyue Jiang, Xuankai Chang, Jiatong Shi, CHAO WENG, Zhou Zhao, Dong Yu | voice |
| Codec-SUPERB: An In-Depth Analysis of Sound Codec Models | Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu-Hsuan Wang, Kai-Wei Chang, Alexander H. Liu, Hung-yi Lee | Findings |
| Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models | | Findings |
| Wav2SQL: Direct Generalizable Speech-To-SQL Parsing | ||
| Multi-Modal Retrieval For Large Language Model Based Speech Recognition | ||
| ViHateT5: Enhancing Hate Speech Detection in Vietnamese With a Unified Text-to-Text Transformer Model | ||
| Speech-based Slot Filling using Large Language Models | ||
| LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models | ||
| Semantic Role Labeling from Chinese Speech via End-to-End Learning | ||
| Revisiting Interpolation Augmentation for Speech-to-Text Generation | ||
| Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion | ||
| TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation | ||
| SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models | ||
| SharedCon: Implicit Hate Speech Detection using Shared Semantics | ||
| IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages | ||
| wav2vec-S: Adapting Pre-trained Speech Models for Streaming | ||
| On the Evaluation of Speech Foundation Models for Spoken Language Understanding | ||
| Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition | ||
| Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech | ||
| Pushing the Limits of Zero-shot End-to-End Speech Translation | ||
| Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation | ||
| emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation | ||
| Label-aware Hard Negative Sampling Strategies with Momentum Contrastive Learning for Implicit Hate Speech Detection | ||
| Aligning Speech Segments Beyond Pure Semantics | ||
| CTC-based Non-autoregressive Textless Speech-to-Speech Translation | ||
| MELD-ST: An Emotion-aware Speech Translation Dataset | ||
| Part-of-speech Tagging for Extremely Low-resource Indian Languages |
https://2024.aclweb.org/program/finding_papers/
8 papers
| Paper | Authorlist | Status |
|---|---|---|
| AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou | Long, link |
| StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection | Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli | https://aclanthology.org/2024.acl-long.202.pdf |
| M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset | Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang | https://aclanthology.org/2024.acl-long.489.pdf |
| XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception | HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang | https://aclanthology.org/2024.acl-long.697.pdf |
| MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector | Marta R. Costa-jussà, Mariano Coria Meglioli, Pierre Andrews, David Dale, Prangthip Hansanti, Elahe Kalbassi, Alexandre Mourachko, Christophe Ropers, Carleigh Wood | Findings |
| X-ACE: Explainable and Multi-factor Audio Captioning Evaluation | Qian Wang, Jia-Chen Gu, Zhen-Hua Ling | |
| Deepfake Defense: Constructing and Evaluating a Specialized Urdu Deepfake Audio Dataset | Sheza Munir, Wassay Sajjad, Mukeet Raza, Emaan Mujahid Abbas, Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza | |
| Beyond Orthography: Automatic Recovery of Short Vowels and Dialectal Sounds in Arabic | Yassine El Kheir, Hamdy Mubarak, Ahmed Ali, Shammur Absar Chowdhury | sound |
useful link: https://2024.emnlp.org/program/accepted_main_conference/
https://2024.emnlp.org/program/accepted_findings/
58 papers
| Paper | Authorlist | Status |
|---|---|---|
| When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection | Xiangyu Zhang, Hexin Liu, Kaishuai Xu, Qiquan Zhang, Daijiao Liu, Beena Ahmed, Julien Epps | Main, link |
| Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model | Xiangyu Zhang, Daijiao Liu, Hexin Liu, Qiquan Zhang, Hanyu Meng, Leibny Paola Garcia Perera, EngSiong Chng, Lina Yao | https://aclanthology.org/2024.emnlp-main.9.pdf |
| Scaling Properties of Speech Language Models | Santiago Cuervo, Ricard Marxer | https://aclanthology.org/2024.emnlp-main.21.pdf |
| EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models | Maureen de Seyssel, Antony D’Avirro, Adina Williams, Emmanuel Dupoux | https://aclanthology.org/2024.emnlp-main.30.pdf |
| Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering | Helena Bonaldi, Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata, Marco Guerini | https://aclanthology.org/2024.emnlp-main.201.pdf |
| AlignCap: Aligning Speech Emotion Captioning to Human Preferences | Ziqi Liang, Haoxiang Shi, Hanhui Chen | https://aclanthology.org/2024.emnlp-main.224.pdf |
| F$^2$RL: Factuality and Faithfulness Reinforcement Learning Framework for Claim-Guided Evidence-Supported Counterspeech Generation | Haiyang Wang, Yuchen Pan, Xin Song, Xuechen Zhao, Minghao Hu, Bin Zhou | https://aclanthology.org/2024.emnlp-main.255.pdf |
| Outcome-Constrained Large Language Models for Countering Hate Speech | Lingzi Hong, Pengcheng Luo, Eduardo Blanco, Xiaoying Song | https://aclanthology.org/2024.emnlp-main.260.pdf |
| On Mitigating Performance Disparities in Multilingual Speech Recognition | Monorama Swain, Anna Katrine van Zee, Anders Søgaard | https://aclanthology.org/2024.emnlp-main.323.pdf |
| Methods of Automatic Matrix Language Determination for Code-Switched Speech | Olga Iakovenko, Thomas Hain | https://aclanthology.org/2024.emnlp-main.330.pdf |
| EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning | Ashish Seth, Ramaneswaran S, S Sakshi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha | https://aclanthology.org/2024.emnlp-main.366.pdf |
| Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models | Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales | https://aclanthology.org/2024.emnlp-main.430.pdf |
| Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning | Ming Shan Hee, Aditi Kumaresan, Roy Ka-Wei Lee | https://aclanthology.org/2024.emnlp-main.445.pdf |
| Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition | Hsuan Su, Hua Farn, Fan-Yun Sun, Shang-Tse Chen, Hung-yi Lee | https://aclanthology.org/2024.emnlp-main.503.pdf |
| ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers | Yuzhe Gu, Enmao Diao | https://aclanthology.org/2024.emnlp-main.562.pdf |
| Towards Robust Speech Representation Learning for Thousands of Languages | William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe | https://aclanthology.org/2024.emnlp-main.570.pdf |
| Speechworthy Instruction-tuned Language Models | Hyundong Justin Cho, Nicolaas Paul Jedema, Leonardo F. R. Ribeiro, Karishma Sharma, Pedro Szekely, Alessandro Moschitti, Ruben Janssen, Jonathan May | https://aclanthology.org/2024.emnlp-main.595.pdf |
| Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights | Hao Yang, Lizhen Qu, Ehsan Shareghi, Reza Haf | https://aclanthology.org/2024.emnlp-main.614.pdf |
| Integrating Argumentation and Hate-Speech-based Techniques for Countering Misinformation | Sougata Saha, Rohini Srihari | https://aclanthology.org/2024.emnlp-main.622.pdf |
| Unveiling the Role of Pretraining in Direct Speech Translation | Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà | https://aclanthology.org/2024.emnlp-main.630.pdf |
| Multi-Level Cross-Modal Alignment for Speech Relation Extraction | Liang Zhang, Zhen Yang, Biao Fu, Ziyao Lu, Liangying Shao, Shiyu Liu, Fandong Meng, Jie Zhou, Xiaoli Wang, Jinsong Su | https://aclanthology.org/2024.emnlp-main.668.pdf |
| Self-Powered LLM Modality Expansion for Large Speech-Text Models | Tengfei Yu, Xuebo Liu, Zhiyi Hou, Liang Ding, Dacheng Tao, Min Zhang | https://aclanthology.org/2024.emnlp-main.690.pdf |
| Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach | Siqi Li, Danni Liu, Jan Niehues | https://aclanthology.org/2024.emnlp-main.708.pdf |
| Towards an Open-Source Speech Foundation Model for EU: 950,000 Hours of Open-Source Compliant Speech Data for EU Languages | Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri | https://aclanthology.org/2024.emnlp-main.771.pdf |
| VHASR: A Multimodal Speech Recognition System With Vision Hotwords | Jiliang Hu, Zuchao Li, Ping Wang, Haojun Ai, Lefei Zhang, hai zhao | https://aclanthology.org/2024.emnlp-main.821.pdf |
| AudioVSR: Enhancing Video Speech Recognition with Audio Data | Xiaoda Yang, Xize Cheng, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin | https://aclanthology.org/2024.emnlp-main.858.pdf |
| Hate Personified: Investigating the role of LLMs in content moderation pipeline for hate speech | Sarah Masud, Sahajpreet Singh, Viktor Hangya, Alexander Fraser, Tanmoy Chakraborty | https://aclanthology.org/2024.emnlp-main.886.pdf |
| Please note that I’m just an AI: Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification | Esra Dönmez, Thang Vu, Agnieszka Falenska | https://aclanthology.org/2024.emnlp-main.1019.pdf |
| BLSP-Emo: Towards Empathetic Large Speech-Language Models | Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang | https://aclanthology.org/2024.emnlp-main.1070.pdf |
| Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection | Camilla Casula, Sebastiano Vecellio Salto, Alan Ramponi, Sara Tonelli | |
| Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech | Guan-Ting Lin, Wei Ping Huang, Hung-yi Lee | |
| Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding | YeonJoon Jung, Jaeseong Lee, Seungtaek Choi, Dohyeon Lee, Minsoo Kim, seung-won hwang | |
| Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities | Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang | |
| PREDICT: Multi-Agent-based Debate Simulation for Generalized Hate Speech Detection | Someen Park, Jaehoon Kim, Seungwan Jin, Sohyun Park, Kyungsik Han | |
| TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR | Shashi Kumar, Srikanth Madikeri, Juan Pablo Zuluaga Gomez, Iuliia Thorbecke, Esaú VILLATORO-TELLO, Sergio Burdisso, Petr Motlicek, Karthik Pandia D S, Aravind Ganapathiraju | |
| Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps | Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, Dirk Hovy | |
| Casablanca: Data and Models for Multidialectal Arabic Speech Recognition | Bashar Talafha, Karima Kadaoui, Samar Mohamed Magdy, Mariem Habiboullah, Chafei Mohamed Chafei, Ahmed Oumar El-Shangiti, et al. | |
| SpeechQE: Estimating the Quality of Direct Speech Translation | HyoJung Han, Kevin Duh, Marine Carpuat | |
| Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model | Mana Makinae, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe | |
| Is Child-Directed Speech Effective Training Data for Language Models? | Steven Y. Feng, Noah Goodman, Michael Frank | |
| HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models | Huy Nghiem, Hal Daumé III | Findings |
| PolyWER: A Holistic Evaluation Framework for Code-Switched Speech Recognition | Karima Kadaoui, Maryam Al Ali, Hawau Olamide Toyin, Ibrahim Mohammed, Hanan Aldarmaki | |
| STTATTS: Unified Speech-To-Text And Text-To-Speech Model | Hawau Olamide Toyin, Hao Li, Hanan Aldarmaki | |
| Contextualized Graph Representations for Generating Counter-Narrative against Hate Speech | Selene Baez Santamaria, Helena Gomez Adorno, Ilia Markov | |
| LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models | Zuhair hasan shaik, Pradyoth Hegde, Prashant Bannulmath, Deepak K T | |
| MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech | Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo | |
| Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | Jeonghun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro | |
| Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation | G M Shahariar, Jia Chen, Jiachen Li, Yue Dong | |
| Breaking the Boundaries: A Unified Framework for Chinese Named Entity Recognition Across Text and Speech | Jinzhong Ning, Yuanyuan Sun, Bo Xu, Zhihao Yang, Ling Luo, Hongfei Lin | |
| Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech | Youngjae Kim, Yejin Jeon, Gary Lee | |
| Modeling Gender and Dialect Bias in Automatic Speech Recognition | Camille Harris, Chijioke Mgbahurike, Neha Kumar, Diyi Yang | |
| LLM generated responses to mitigate the impact of hate speech | Jakub Podolak, Szymon Łukasik, Paweł Balawender, Jan Ossowski, Jan Piotrowski, Katarzyna Bąkowicz, Piotr Sankowski | |
| BLASER 2.0: a metric for evaluation and quality estimation of massively multilingual speech and text translation | David Dale, Marta R. Costa-jussà | |
| Textless Speech-to-Speech Translation With Limited Parallel Data | Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi | |
| PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems | Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada | |
| Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS. | Onkar Kishor Susladkar, Vishesh Tripathi, Biddwan Ahmed | |
| Recent Advances in Online Hate Speech Moderation: Multimodality and the Role of Large Models | Ming Shan Hee, Shivam Sharma, RUI CAO, Palash Nandi, Preslav Nakov, Tanmoy Chakraborty, Roy Ka-Wei Lee | |
| WavLLM: Towards Robust and Adaptive Speech Large Language Model | Shujie HU, Long Zhou, Shujie LIU, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei | |
22 papers
| Paper | Authorlist | Status |
|---|---|---|
| IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding | Pengcheng Li, Xulong Zhang, Jing Xiao, Jianzong Wang | Main |
| Cross-Domain Audio Deepfake Detection: Dataset and Analysis | Yuang Li, Min Zhang, Mengxin Ren, Xiaosong Qiao, Miaomiao Ma, Daimeng Wei, Hao Yang | |
| GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha | |
| OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation | Tanvir Mahmud, Diana Marculescu | |
| AudioVSR: Enhancing Video Speech Recognition with Audio Data | Xiaoda Yang, Xize Cheng, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin | |
| PALM: Few-Shot Prompt Learning for Audio Language Models | Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, Hanan Aldarmaki | |
| TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control | Yu Zhang, Ziyue Jiang, Ruiqi Li, Changhao Pan, Jinzheng He, Rongjie Huang, Chuxin Wang, Zhou Zhao | voice |
| Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects | Orevaoghene Ahia, Anuoluwapo Aremu, Diana Abagyan, Hila Gonen, David Ifeoluwa Adelani, Daud Abolade, Noah A. Smith, Yulia Tsvetkov | voice |
| EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control | Haozhe Chen, Run Chen, Julia Hirschberg | voice |
| Voices in a Crowd: Searching for clusters of unique perspectives | Nikolas Vitsakis, Amit Parekh, Ioannis Konstas | voice |
| With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models | Tyler Loakman, YUCHENG LI, Chenghua Lin | sound |
| Adaptive Immune-based Sound-Shape Code Substitution for Adversarial Chinese Text Attacks | Ao Wang, Xinghao Yang, Chen Li, Bao-di Liu, Weifeng Liu | sound |
| A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick | Nishant Balepur, Matthew Shu, Alexander Hoyle, Alison Robey, Shi Feng, Seraphina Goldfarb-Tarrant, Jordan Lee Boyd-Graber | sound |
| A Fast and Sound Tagging Method for Discontinuous Named-Entity Recognition | Caio Filippo Corro | sound |
| Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models | Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D’Haro, Robby T. Tan, Haizhou Li | Findings |
| AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon | |
| Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Review | Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, Aman Chadha | |
| Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech | Youngjae Kim, Yejin Jeon, Gary Lee | |
| SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering | Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang | |
| PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain | Jianyi Chen, Zheqi Dai, Zhen Ye, Xu Tan, Qifeng Liu, Yike Guo, Wei Xue | |
| Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion | Giuseppe Ruggiero, Matteo Testa, Jurgen Van de Walle, Luigi Di Caro | voice |
| HSDreport: Heart Sound Diagnosis with Echocardiography Reports | Zihan Zhao, Pingjie Wang, Liudan Zhao, Yuchen Yang, Ya Zhang, Kun Sun, Xin Sun, Xin Zhou, Yu Wang, Yanfeng Wang | sound |
useful link: https://2025.naacl.org/program/accepted_papers/
| Paper | Authorlist | Status |
|---|---|---|
| Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech | Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Teh-Hwa Kao | |
| Decoding Hate: Exploring Language Models’ Reactions to Hate Speech | Paloma Piot, Javier Parapar | |
| Multi$^3$Hate: Multimodal, Multilingual, and Multicultural Hate Speech Detection with Vision–Language Models | Minh Duc Bui, Katharina von der Wense, Anne Lauscher | |
| CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs | Amey Hengle, Aswini Kumar Padhi, Anil Bandhakavi, Tanmoy Chakraborty | |
| MAD Speech: Measures of Acoustic Diversity of Speech | Matthieu Futeral, Andrea Agostinelli, Marco Tagliasacchi, Neil Zeghidour, Eugene Kharitonov | |
| Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond | Mardhiyah Sanni, Tassallah Abdullahi, Devendra Deepak Kayande, Emmanuel Ayodele, Naome A Etori, Michael Samwel Mollel, Moshood O. Yekini, Chibuzor Okocha, Lukman Enegi Ismaila, Folafunmi Omofoye, Boluwatife A. Adewale, Tobi Olatunji | |
| Wav2Prompt: End-to-End Speech Prompt Learning and Task-based Fine-tuning for Text-based LLMs | Keqi Deng, Guangzhi Sun, Phil Woodland | |
| On the Role of Speech Data in Reducing Toxicity Detection Bias | Samuel Bell, Mariano Coria Meglioli, Megan Richards, Eduardo Sánchez, Christophe Ropers, Skyler Wang, Adina Williams, Levent Sagun, Marta R. Costa-jussà | |
| Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment | Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, David R Mortensen | |
| StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion | Yinghao Aaron Li, Xilin Jiang, Cong Han, Nima Mesgarani | |
| AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages | Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Saminu Mohammad Aliyu, Paul Röttger, Abigail Oppong, Andiswa Bukula, et al. | |
| Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison | Tsz Kin Lam, Marco Gaido, Sara Papi, Luisa Bentivogli, Barry Haddow | |
| ProSE: Diffusion Priors for Speech Enhancement | Sonal Kumar, Sreyan Ghosh, Utkarsh Tyagi, Anton Jeran Ratnarajah, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha | |
| VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning | Yifan Peng, Krishna C Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg | |
| DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition | Wonjun Lee, Solee Im, Heejin Do, Yunsu Kim, Jungseul Ok, Gary Lee | |
| How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations | Hyunji Lee, Danni Liu, Supriti Sinhamahapatra, Jan Niehues | short |
| Developing multilingual speech synthesis system for Ojibwe, Mi’kmaq, and Maliseet | Shenran Wang, Changbing Yang, Michael L Parkhill, Chad Quinn, Christopher Hammerly, Jian Zhu | |
| Cross-Lingual Transfer Learning for Speech Translation | Rao Ma, Mengjie Qian, Yassir Fathullah, Siyuan Tang, Mark Gales, Kate Knill | |
| kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech | Karl El Hajal, Ajinkya Kulkarni, Enno Hermann, Mathew Magimai Doss | |
| WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching | Tianze Luo, Xingchen Miao, Wenbo Duan | |
| DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility | Yifan Liu, Yu Fang, Zhouhan Lin | Findings |
| BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla | Fabiha Haider, Fariha Tanjim Shifat, Md Farhan Ishmam, Md Sakib Ul Rahman Sourove, Deeparghya Dutta Barua, Md Fahim, Md Farhad Alam Bhuiyan | |
| CDB: A Unified Framework for Hope Speech Detection Through Counterfactual, Desire and Belief | Tulio Ferreira Leite da Silva, Gonzalo Freijedo Aduna, Farah Benamara, Alda Mari, Zongmin Li, Li Yue, Jian Su | |
| Untangling Hate Speech Definitions: A Semantic Componential Analysis Across Cultures and Domains | Katerina Korre, Arianna Muti, Federico Ruggeri, Alberto Barrón-Cedeño | |
| Exploring Large Language Models for Hate Speech Detection in Rioplatense Spanish | Juan Manuel Pérez, Paula Miguel, Viviana Cotik | |
| Unsupervised Speech-text word-level alignment with Dynamic Programming | Tianshu Yu, Zihan Gong, Minghuan Tan, Guhong Chen, Min Yang | |
| Prompt-Guided Selective Masking Loss for Context-Aware Emotive Text-to-Speech | Yejin Jeon, Youngjae Kim, Jihyun Lee, Gary Lee | |
| Echoes of Discord: Forecasting Hater Reactions to Counterspeech | Xiaoying Song, Sharon Lisseth Perez, Xinchen Yu, Eduardo Blanco, Lingzi Hong | |
| Continuous Speech Tokenizer in Text To Speech | Yixing Li, Ruobing Xie, Xingwu Sun, Yu Cheng, Zhanhui Kang | |
| CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation | Xi Xu, Wenda Xu, Siqi Ouyang, Lei Li | |
| Gender Bias in Instruction-Guided Speech Synthesis Models | Chun-Yi Kuan, Hung-yi Lee | |
| Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection | Koji Inoue, Divesh Lala, Gabriel Skantze, Tatsuya Kawahara | voice |
| Playing with Voices: Tabletop Role-Playing Game Recordings as a Diarization Challenge | Lian Remme, Kevin Tang | voice |
| Paper | Authorlist | Status |
|---|---|---|
| Ihquin tlahtouah in Tetelahtzincocah: An annotated, multi-purpose audio and text corpus of Western Sierra Puebla Nahuatl | Robert Pugh, Cheyenne Wing, María Ximena Juárez Huerta, Angeles Márquez Hernandez, Francis M. Tyers | |
| PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification | Ashish Seth, Ramaneswaran Selvakumar, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha | |
| AudioBench: A Universal Benchmark for Audio Large Language Models | Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen | |
| Audio Is the Achilles’ Heel: Red Teaming Audio Large Multimodal Models | Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari | |
| Do Audio-Language Models Understand Linguistic Variations? | Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri, Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha | |
| Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection | Yassine El Kheir, Younes Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller | |
| Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies | Yingqiang Gao, Lukas Fischer, Alexa Lintner, Sarah Ebling | |
| Synthetic Audio Helps for Cognitive State Tasks | Adil Soubki, John Murzaku, Peter Zeng, Owen Rambow | |
useful link: https://ijcai24.org/main-track-accepted-papers/index.html
The number of speech and audio papers at this conference is small.
| Paper | Authorlist | Status |
|---|---|---|
| Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction | Zhaoxi Mu, Xinyu Yang | |
| Bridge to Non-Barrier Communication: Gloss-Prompted Fine-Grained Cued Speech Gesture Generation with Diffusion Model | Wentao Lei, Li Liu, Jun Wang | |
| Two-stage Semi-supervised Speaker Recognition with Gated Label Learning | Xingmei Wang, Jiaxiang Meng, Kong Aik Lee, Boquan Li, Jinghan Liu | |
| Discriminative Feature Decoupling Enhancement for Speech Forgery Detection | Yijun Bei, Xing Zhou, Erteng Liu, Yang Gao, Sen Lin, Kewei Gao, Zunlei Feng | |
| Innovative Directional Encoding in Speech Processing: Leveraging Spherical Harmonics Injection for Multi-Channel Speech Enhancement | Jiahui Pan, Pengjie Shen, Hui Zhang, Xueliang Zhang | |
| Contextualized Speech Recognition: Rethinking Second-Pass Rescoring with Generative Large Language Models | Yixuan Tang, Anthony K. H. Tung | |
| Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis | Zhoulin Ji, Chenhao Lin, Hang Wang, Chao Shen | |
| Decoupling Breaks Data Barriers: A Decoupled Pre-training Framework for Multi-intent Spoken Language Understanding | Libo Qin, Qiguang Chen, Jingxuan Zhou, Qinzheng Li, Chunlin Lu, Wanxiang Che | |
| Recent Advances in End-to-End Simultaneous Speech Translation | Xiaoqian Liu, Guoqiang Hu, Yangfan Du, Erfeng He, YingFeng Luo, Chen Xu, Tong Xiao, Jingbo Zhu | Survey |
| Paper | Authorlist | Status |
|---|---|---|
| EAT: Self-Supervised Pre-Training with Efficient Audio Transformer | Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen | |
| Generating More Audios for End-to-End Spoken Language Understanding | Xuxin Cheng, Yuexian Zou | |
| BATON: Aligning Text-to-Audio Model Using Human Preference Feedback | Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Qinmei Xu, Zunnan Xu, Jingquan Liu, Jiasheng Lu, Xiu Li | |
| HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis | Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, Liang Hu | |
| InstructME: An Instruction Guided Music Edit Framework with Latent Diffusion Models | Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song | |
| Paper | Authorlist | Status |
|---|---|---|
| MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement | | |
| Paper | Authorlist | Status |
|---|---|---|
| StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model | | |
- Expressive TTS: https://github.com/01Zhangbw/Awesome-Expressive-speech-synthesis
- Disordered Speech: https://github.com/01Zhangbw/Awesome-Disordered-Speech
- Neural Codec & Speech Language Models: https://github.com/LqNoob/Neural-Codec-and-Speech-Language-Models
- Controllable TTS: https://github.com/imxtx/awesome-controllabe-speech-synthesis
- Large Audio Model: https://github.com/EmulationAI/awesome-large-audio-models
- Codec-SuperB: https://github.com/voidful/Codec-SUPERB
- Next Token Prediction: https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction
- Paper daily: https://github.com/halsay/ASR-TTS-paper-daily
- Audio LLM: https://github.com/AudioLLMs/Awesome-Audio-LLM
- Speech Trident: https://github.com/ga642381/speech-trident
- Speech Pretrained: https://github.com/ddlBoJack/Awesome-Speech-Pretraining
- TTS: https://github.com/wenet-e2e/speech-synthesis-paper
- Speech Language model: https://github.com/ddlBoJack/Awesome-Speech-Language-Model
- Amphion
- InterSpeech23-24: https://github.com/DmitryRyumin/INTERSPEECH-2023-24-Papers
- ICASSP23-24: https://github.com/DmitryRyumin/ICASSP-2023-24-Papers
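- Amphion v0.2 technical report: https://arxiv.org/abs/2501.15442
- Emilia-Large: a larger version, with more experimental results and details: https://arxiv.org/abs/2501.15907
- AnyEnhance: a single model that handles speech enhancement, singing voice enhancement, speaker extraction, and other tasks: https://arxiv.org/abs/2501.15417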
If you find this repository helpful, please consider citing:
@misc{Zhang2025SpeechAudio,
title = {Speech-and-audio-papers-Top-Conference},
author = {Bowen Zhang},
year = {2025},
howpublished = {\url{https://github.com/01Zhangbw/Speech-and-audio-papers-Top-Conference}},
}
This repository is released under the MIT license.