Hi,
Thanks for creating this great project.
annotate_audio.py works well for producing emotion annotations, but it does not output an accompanying transcription of the speech. This may be my misunderstanding of your use of the word 'transcription': your example output suggests that caption contains the transcribed text, but it is actually a high-level description of the audio. It would be worth improving the naming conventions, or adding a transcription of the spoken text, to avoid further confusion.
Thanks,
Caspar
E.g. my_audio_file.json:
"caption": "A medium-quality recording of a male speaker describing a painting. The speaker sounds calm and informative, with a slightly nostalgic tone. The recording quality is decent, with no noticeable background noise.",
There are no obvious error messages in the console, although perhaps these warnings are relevant:
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
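In case it is useful, here is a minimal sketch of how a verbatim transcription could be produced alongside the caption, using the Hugging Face automatic-speech-recognition pipeline with a Whisper checkpoint. The checkpoint name and the `transcription` key are my assumptions, not part of annotate_audio.py; passing `language='en'` also addresses the first warning quoted above.

```python
# Sketch only: the checkpoint name and the "transcription" key are assumptions.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # any multilingual Whisper checkpoint
)

# language="en" forces English transcription instead of auto-detection,
# which avoids the language-detection warning quoted above.
result = asr(
    "my_audio_file.wav",
    generate_kwargs={"language": "en", "task": "transcribe"},
)

annotation = {
    "caption": "...",                 # the existing high-level description
    "transcription": result["text"],  # proposed: the verbatim spoken text
}
```

That way "caption" could keep its current meaning while a separate "transcription" field carries the actual words spoken.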