The INTERSPEECH 2025 MLC-SLM Challenge Dataset, curated by Datatang, is derived from fifteen proprietary conversational speech corpora. It is designed for multilingual automatic speech recognition (ASR) and long-context comprehension, with an emphasis on annotation accuracy and reliability. The recordings capture real-world conversational phenomena such as spontaneous interruptions and speaker overlap across 11 languages (1,500 hours in total), providing training resources for ASR systems intended for real-world use.
For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1892?source=Github
Format: 16 kHz, 16-bit, uncompressed WAV, mono channel;
Recording environment: quiet indoor environment, without echo;
Recording content: dozens of topics are specified, and the speakers hold conversations on those topics during recording;
Annotation: transcription text, speaker identification, and gender;
Device: Android mobile phone, iPhone;
Language: American English, British English, Filipino English, Australian English, Indian English, French, German, Italian, Japanese, Korean, Portuguese (Europe), Russian, Spanish (Spain), Thai, Vietnamese.
License: Commercial License
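
As a quick sanity check, a downloaded file can be compared against the audio format stated above (16 kHz, 16-bit, mono, uncompressed WAV). The sketch below uses only Python's standard `wave` module; the function name `check_wav_format` and the constants are illustrative assumptions and are not part of the dataset or any official tooling.

```python
import wave

# Expected values, taken from the format stated in this README.
EXPECTED_SAMPLE_RATE = 16000  # 16 kHz
EXPECTED_SAMPLE_WIDTH = 2     # 16 bit = 2 bytes per sample
EXPECTED_CHANNELS = 1         # mono


def check_wav_format(path):
    """Return a list of mismatches between a WAV file and the stated format.

    Hypothetical helper for illustration. The wave module reads only
    uncompressed PCM, so a compressed file raises wave.Error on open.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width is {wav.getsampwidth() * 8} bit")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channel count is {wav.getnchannels()}")
    return problems
```

An empty list means the file matches the stated format; any other result lists the properties that differ. The path passed in is whatever local file you want to check, since the dataset's directory layout is not described here.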