This repository was archived by the owner on Mar 8, 2023. It is now read-only.

LIST OF AUDIO+TEXT DATASETS #114

@nefastosaturo

LIST OF ALL ITALIAN DATASETS FOUND

Following issue #90, I am collecting here all the datasets that have been discovered so far.
Some of them are plug-and-play for DeepSpeech; others need to be built from scratch (the audio has to be split into sentences).
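
For context, "plug-and-play" here roughly means the dataset can be turned into the CSV manifest that the DeepSpeech importers produce, with the columns `wav_filename`, `wav_filesize` and `transcript`. Below is a minimal sketch of writing such a manifest, assuming you already have one WAV file per sentence. The helper name `write_manifest` is hypothetical, and lowercasing the transcript is an assumption about the target alphabet, not a requirement stated in this issue.

```python
import csv
import os


def write_manifest(pairs, csv_path):
    """Write a DeepSpeech-style CSV from (wav_path, transcript) pairs.

    Returns the number of rows written. `write_manifest` is a hypothetical
    helper for illustration, not part of DeepSpeech itself.
    """
    fieldnames = ["wav_filename", "wav_filesize", "transcript"]
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for wav_path, transcript in pairs:
            writer.writerow({
                "wav_filename": os.path.abspath(wav_path),
                # File size in bytes, as read from disk.
                "wav_filesize": os.path.getsize(wav_path),
                # Assumption: transcripts are trimmed and lowercased.
                "transcript": transcript.strip().lower(),
            })
            count += 1
    return count
```

Datasets that ship long recordings need to be cut into sentence-level clips before a manifest like this makes sense.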

Feel free to pick up one that has not been done yet and check it out.

NOTE

If one of these datasets needs a deeper analysis, please do not start a discussion here. Open a new issue instead and I will update this table with the issue reference.

DATASETS

| dataset | hrs | url | plug-n-play | TODOs | doing | done | note |
|---|---|---|---|---|---|---|---|
| MLS | 279.43h | | | | | | HOT!!!! |
| VoxForge #111 | 20h | | | replace the URL in the DS import_voxforge.py script; fix the import sys error | | | |
| MAILABS | 127h40m | | | | | | |
| Evalita2009 | 5h | | | | | | |
| MSPKA | 3h | | | | | | |
| SIWIS | 4.5h | | | | | | |
| SUGAR | 1.5h | | | | | | sentences are not useful |
| VociParlateWikipedia #34 | ? | | | sync audio with its page revision | | | |
| EMOVO | ~12m | | | align filename codes with their sentences | | | interesting for emotions (disgust, happy..) |
| ZIta | <1h | | | | | | transcriptions do not follow recordings (e.g. Lett_Z_Sp1_zero.wav) |
| LIM_Veneti | <1h | | | | | | no audio files? |
| split-MDb | ~46m | | | parse and clean the .wrd files | | | based on CLIPS |
| tg60 | 1h30m | | | long audio files to be split | | | maybe the info files contain timings that could be used for splitting? |
| PraTiD | 1h12m | | | long audio files to be split | | | from CLIPS; maybe the info files contain timings that could be used for splitting? |
| ParlatoCinematografico | ? | | | long audio files to be split | | | .lab files with speaker timings |
| PerugiaCorpusPEC | ? | | | | | | a login is needed. License? |
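
Several entries above have "long audio files to be split" as a TODO. As a rough sketch of that step, the snippet below cuts one WAV file into clips with Python's standard `wave` module, given a list of `(start, end)` times in seconds. Where those times come from is left open: the layout of the `.lab` and info files has not been checked here, so parsing them is not shown. The helper name `split_wav` and the output naming scheme are hypothetical.

```python
import wave
from pathlib import Path


def split_wav(src_path, segments, out_dir):
    """Cut src_path into one WAV per (start_s, end_s) pair.

    Returns the list of written paths. Segments that are empty, inverted,
    or start past the end of the file are skipped. `split_wav` is a
    hypothetical helper for illustration.
    """
    src_path = Path(src_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for index, (start_s, end_s) in enumerate(segments):
            # Convert seconds to frame offsets, clamped to the file bounds.
            first = max(0, int(start_s * rate))
            last = min(total, int(end_s * rate))
            if last <= first:
                continue
            src.setpos(first)
            frames = src.readframes(last - first)
            out_path = out_dir / f"{src_path.stem}_{index:04d}.wav"
            with wave.open(str(out_path), "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)
    return written
```

For example, `split_wav("recording.wav", [(0.0, 3.2), (3.2, 7.5)], "clips")` would write `clips/recording_0000.wav` and `clips/recording_0001.wav`; each clip still needs its own transcript before it can be used for training.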
