Utilities for working with the Common Voice dataset
Developed with Python 3.9.6
The scripts in this repo can be used for:
- Scraping the download URLs and downloading each language dataset from the Common Voice website (https://commonvoice.mozilla.org/en/datasets)
- Creating symlinks to the validated audio files by reading the validated.tsv file and matching each entry with the corresponding audio file in the clips directory
- Renaming the validated TSV files (just housekeeping really)
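The symlink step can be sketched roughly as follows. This is a minimal illustration, not the repo's actual script: the function name link_validated_clips, the output directory name validated_clips, and the demo layout are hypothetical, and it assumes validated.tsv is tab-separated with a path column holding each clip's filename.

```python
import csv
import tempfile
from pathlib import Path


def link_validated_clips(tsv_path, clips_dir, out_dir):
    """Symlink every clip listed in validated.tsv into out_dir.

    Returns the list of links that were created on this run.
    """
    clips_dir = Path(clips_dir).resolve()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE because transcribed sentences may contain quote characters.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            source = clips_dir / row["path"]  # assumed column name
            link = out_dir / source.name
            # Skip clips missing from disk and links left over from earlier runs.
            if not source.is_file() or link.is_symlink() or link.exists():
                continue
            link.symlink_to(source)
            created.append(link)
    return created


def demo():
    """Build a throwaway dataset layout and link its validated clips."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "clips").mkdir()
        (root / "clips" / "a.mp3").write_bytes(b"")
        (root / "clips" / "b.mp3").write_bytes(b"")
        (root / "validated.tsv").write_text(
            "client_id\tpath\tsentence\n"
            "1\ta.mp3\thello\n"
            "2\tmissing.mp3\tno such clip\n",
            encoding="utf-8",
        )
        links = link_validated_clips(
            root / "validated.tsv", root / "clips", root / "validated_clips"
        )
        return sorted(link.name for link in links)


if __name__ == "__main__":
    print(demo())
```

Clips that are listed in the TSV but missing from disk are skipped rather than raising, so the function can be re-run on a partially downloaded dataset.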
Modifications and comments welcome.
This project requires Python 3.8 or newer.
This project uses webdriver-manager to handle Chromedriver installation automatically.
No manual Chromedriver download required!
- Clone this repository
- Install dependencies:
pip install -r requirements.txt
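Once the dependencies are installed, a browser session can be started without a separately downloaded Chromedriver. The sketch below shows the usual webdriver-manager pattern and assumes Selenium 4 (which takes a Service object); the helper name make_chrome_driver and the headless default are illustrative and may differ from what the scripts in this repo actually do.

```python
def make_chrome_driver(headless=True):
    """Return a Chrome WebDriver whose Chromedriver binary is fetched on demand."""
    # Imported inside the function so this module can be imported
    # even where the browser dependencies are not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless")
    # install() downloads a Chromedriver (if not already cached)
    # and returns the path to the binary.
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```

A caller would typically create the driver, load https://commonvoice.mozilla.org/en/datasets with driver.get(...), and call driver.quit() in a finally block so the browser process is always closed.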