Description
Currently, the upper limit on the duration of audio chunks taken as input by Persephone is 10 seconds. This is an issue for the real-world deployment of Persephone, because many documents in archives such as the Pangloss Collection are divided into longer chunks.
For example, the document “Romanmangan, the fairy from the other world” has a duration of 1,890 seconds and is divided into 212 sentences. Seventy of these sentences, amounting to more than half of the total duration of this substantial story, exceed the 10-second limit and are therefore not used in training.
A reviewer of a paper at SLTU suggested performing Voice Activity Detection (VAD) to distinguish silence from non-silence, and then cutting each long waveform into smaller pieces at the silent parts. This way, all of the data could still be used for training.
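
To make the suggestion concrete, here is a minimal sketch of how it might look, assuming a simple energy-threshold VAD over fixed-size frames. Nothing here is part of Persephone's API: the function names `find_silence_cuts` and `split_long_chunk`, the 20 ms frame size, the energy threshold, and the 200 ms minimum silence length are all hypothetical choices for illustration, and a dedicated VAD tool would likely be more robust on noisy field recordings.

```python
import numpy as np


def find_silence_cuts(samples, sample_rate, frame_ms=20,
                      energy_threshold=1e-4, min_silence_ms=200):
    """Return sample indices at the midpoint of each long-enough silent run.

    Assumes `samples` is a 1-D float array scaled to roughly [-1, 1].
    All default values are illustrative assumptions, not tuned settings.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = np.asarray(samples[:n_frames * frame_len], dtype=np.float64)
    frames = frames.reshape(n_frames, frame_len)
    # Mean energy per frame; frames below the threshold count as silence.
    silent = np.mean(frames ** 2, axis=1) < energy_threshold
    min_frames = max(1, min_silence_ms // frame_ms)

    cuts = []
    run_start = None
    # Append a non-silent sentinel so a trailing silent run is closed.
    for i, is_silent in enumerate(np.append(silent, False)):
        if is_silent and run_start is None:
            run_start = i
        elif not is_silent and run_start is not None:
            if i - run_start >= min_frames:
                cuts.append(((run_start + i) // 2) * frame_len)
            run_start = None
    return cuts


def split_long_chunk(samples, sample_rate, max_seconds=10.0, **vad_kwargs):
    """Greedily split a waveform at silence so pieces fit within max_seconds.

    Returns a list of (start, end) sample indices. A piece can still exceed
    the limit if no silence was detected inside it.
    """
    max_len = int(max_seconds * sample_rate)
    boundaries = find_silence_cuts(samples, sample_rate, **vad_kwargs)
    boundaries.append(len(samples))

    pieces = []
    start, prev = 0, 0
    for b in boundaries:
        # If extending to b would be too long, cut at the previous candidate.
        if b - start > max_len and prev > start:
            pieces.append((start, prev))
            start = prev
        prev = b
    pieces.append((start, len(samples)))
    return pieces
```

Calling `split_long_chunk(waveform, sample_rate)` on a sentence longer than 10 seconds would return sample ranges that can be sliced out of the waveform. Note that this only splits the audio; the corresponding transcription would still need to be divided to match the new pieces.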