LIST OF AUDIO+TEXT DATASETS

## LIST OF ALL ITALIAN DATASETS FOUND
Following issue #90, I'm collecting here all the datasets that have been discovered so far.
Some of them are plug-and-play for DeepSpeech; others need to be built from scratch (the audio has to be split into sentences).

Feel free to pick up any dataset that has not been done yet and check it out.

### NOTE
If one of these datasets needs a deeper analysis, please do not start a discussion here; open a new issue instead and I will update this table with the issue reference.

## DATASETS

| dataset  | hrs | url | plug-n-play | TODOs | doing | done | note |
| ------------- | -------------	| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
|**MLS**  	| 279.43 h		| [&#x2197;](http://openslr.org/94/)| |  | |  | **HOT!!!!**
|VoxForge #111 	| 20h		| [&#x2197;](http://www.repository.voxforge1.org/downloads/it/Trunk/Audio/Main/16kHz_16bit/)| &#10004; | <ul><li>- [x] url replace in DS import_voxforge.py script</li><li>- [x] fix import sys error </li></ul>  | &#10004; | |
|MAILABS	| 127h40m	| [&#x2197;](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/)|  &#10004; | | | &#10004; |
|Evalita2009	| 5h		| [&#x2197;](http://www.evalita.it/2009/tasks/digits)|  | | | &#10004; |
|MSPKA		| 3h		| [&#x2197;](http://www.mspkacorpus.it/)| | | | &#10004; |
|SIWIS		| 4.5h		| [&#x2197;](https://phonogenres.unige.ch/index.php?page=téléchargement)| | | | &#10004; |
|SUGAR		| 1.5h		| [&#x2197;](https://github.com/evalitaunina/SUGAR_Corpus)| | | | | sentences are not useful
|VociParlateWikipedia #34 | ?		| [&#x2197;](https://it.wikipedia.org/wiki/Categoria:Voci_parlate)| |<ul><li>- [ ] sync audio with its page revision</li></ul>  | | | 
|EMOVO		| ~12m		| [&#x2197;](http://voice.fub.it/activities/corpora/emovo/index.html)| |<ul><li>- [ ] align filename codes with their sentences </li></ul> | | | interesting for emotions (disgust, happy..)
|ZIta		|  <1hr		| [&#x2197;](https://github.com/ChMeluzzi/ZIta)| | | | | transcriptions do not follow recordings (eg: Lett_Z_Sp1_zero.wav)
|LIM_Veneti		|  <1hr		| [&#x2197;](https://github.com/ChMeluzzi/LIM_Veneti)| | | | | no audio files?
|SpIt-MDb		|  ~46m		| [&#x2197;](http://www.parlaritaliano.it/index.php/en/corpora/644-spit-mdb-spoken-italian-multilevel-database)| |<ul><li>- [ ] parse & clean the .wrd files </li></ul> | | | based on CLIPS
|tg60		|  1h30m		| [&#x2197;](http://www.parlaritaliano.it/index.php/it/dati/650-corpus-di-parlato-telegiornalistico-anni-sessanta-vs-2005)| |<ul><li>- [ ] long audio files to be split </li></ul> | | | maybe among the info files there are some timings that could be useful for splitting up?
|PraTiD		|  1h12m		| [&#x2197;](http://www.parlaritaliano.it/index.php/en/corpora/645-corpus-pratid)| |<ul><li>- [ ] long audio files to be split </li></ul> | | | From CLIPS; maybe among the info files there are some timings that could be useful for splitting up?
|ParlatoCinematografico	|  ?		| [&#x2197;](http://www.parlaritaliano.it/index.php/it/dati/659-corpus-di-parlato-cinematografico)| |<ul><li>- [ ] long audio files to be split </li></ul> | | | .lab files with speakers timings
|PerugiaCorpusPEC	|  ?		| [&#x2197;](https://www.unistrapg.it/cqpwebnew/)| | | | | a login is needed. License?
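
Several TODOs above say "long audio files to be split". As a rough starting point, here is a minimal sketch of that step, assuming the timings have already been parsed into `(start_sec, end_sec, transcript)` tuples. The function names `split_wav` and `write_csv`, the `clip_NNNN.wav` naming, and the segment format are all hypothetical; how to read the timings out of the `.lab`, `.wrd`, or info files is not covered because their layout still has to be checked per dataset. The CSV header uses the `wav_filename`, `wav_filesize`, `transcript` columns that the DeepSpeech importers produce.

```python
import csv
import os
import wave


def split_wav(src_path, segments, out_dir):
    """Cut src_path into one clip per (start_sec, end_sec, transcript) segment.

    Hypothetical helper: returns a list of
    (wav_filename, wav_filesize, transcript) rows, one per clip.
    """
    os.makedirs(out_dir, exist_ok=True)
    rows = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end, text) in enumerate(segments):
            first = int(start * rate)          # first frame of the clip
            count = int(end * rate) - first    # number of frames to copy
            src.setpos(first)
            frames = src.readframes(count)
            clip_path = os.path.join(out_dir, f"clip_{i:04d}.wav")
            with wave.open(clip_path, "wb") as dst:
                # Same channels, sample width, and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            rows.append((clip_path, os.path.getsize(clip_path), text))
    return rows


def write_csv(rows, csv_path):
    """Write the rows with the column names used by DeepSpeech CSVs."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        writer.writerows(rows)
```

For example, `write_csv(split_wav("long.wav", [(0.0, 2.5, "buongiorno")], "clips"), "train.csv")` would write one clip and a one-row CSV. The sketch only handles uncompressed WAV input and does not resample, so any conversion to 16 kHz mono would still have to happen before or after it.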

