Difference between transcribing whole audio and transcribing the same audio split into segments #322
-
Hi, this happens because whisper uses context information from previously recognized audio chunks, and that context is lost when each segment is transcribed on its own. If you want to pass context information to whisper, you can use this CLI argument:

parser.add_argument("--initial_prompt", type=str, default=None, help="optional text to provide as a prompt for the first window.")

or set the corresponding decoding options directly in Python:

prompt: Optional[Union[str, List[int]]] = None  # text or tokens for the previous context
prefix: Optional[Union[str, List[int]]] = None  # text or tokens to prefix the current context

You can also check discussion #117.
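For example, here is a minimal sketch of doing this from Python; the segment file names are hypothetical, while `initial_prompt` and `transcribe` are whisper's own API. Each segment's text is fed to the next call, roughly restoring the context that transcribing the whole file would provide:

```python
import whisper

model = whisper.load_model("base")

# Hypothetical pre-cropped segment files, in playback order.
segment_files = ["segment_0.wav", "segment_1.wav", "segment_2.wav"]
context = None  # no prior context for the first segment
texts = []

for path in segment_files:
    # initial_prompt seeds the decoder with the previous segment's text.
    result = model.transcribe(path, language="it", initial_prompt=context)
    texts.append(result["text"])
    context = result["text"]  # carry this segment's text into the next call

print(" ".join(texts))
```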
-
@AntonioBuccola BTW, a possible solution could be to put a silence block between the diarized segments before transcribing; more info in https://github.com/Majdoddin/nlp
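A minimal sketch of that idea, assuming the diarizer yields (start, end) pairs in milliseconds; pydub is an assumption here, any library that can concatenate waveforms works:

```python
from pydub import AudioSegment

SPACER_MS = 2000  # hypothetical gap length between speaker turns

audio = AudioSegment.from_wav("audio.wav")
# (start_ms, end_ms) pairs assumed to come from your diarizer.
turns = [(0, 5_300), (5_300, 9_800), (9_800, 15_000)]

# Join the diarized segments with silence blocks, so whisper can
# transcribe them as one file and keep its context across turns.
spacer = AudioSegment.silent(duration=SPACER_MS)
combined = AudioSegment.empty()
for start_ms, end_ms in turns:
    combined += audio[start_ms:end_ms] + spacer

combined.export("combined.wav", format="wav")
# Transcribe "combined.wav" with whisper, then map each transcribed
# segment back to its speaker by subtracting the accumulated spacer time.
```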
-
Hi @AntonioBuccola,
-
Hello, I am using whisper to transcribe audio in Italian and I noticed a huge difference between these two cases.

In the first one, the transcription is obtained by passing the whole audio to the `model.transcribe` method, and it works fine; on the other hand, when I crop the audio into segments and then transcribe each of them separately, the results are awful.

I need to perform the cropping because I am facing a speaker diarization problem: although whisper already performs voice activity detection, the results of speaker identification using the segments provided by whisper are not satisfactory (a diarization error rate of about 20%, against a rate of about 8% I had already obtained).

For completeness, I am using the `spectralcluster` module to perform speaker identification after audio embedding with `pyannote`.

Is there a reason why this happens?
Thanks in advance.
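To make the two cases concrete, here is a minimal sketch; file names and segment count are hypothetical:

```python
import whisper

model = whisper.load_model("base")

# Case 1: whole audio in a single call. whisper carries decoding context
# from one 30-second window to the next, and the result is fine.
whole = model.transcribe("audio.wav", language="it")

# Case 2: the same audio pre-cropped into segments, each transcribed on
# its own. Every call starts with no context, and the results are awful.
parts = [
    model.transcribe(f"crop_{i}.wav", language="it") for i in range(3)
]

print(whole["text"])
print(" ".join(p["text"] for p in parts))
```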