-
I have tried Whisper on audio with overlapping speech, i.e., multiple people talking simultaneously (check it here). Whisper is quite able to separate overlapping speech, but it only generates a transcription for one of the speakers (I don't know how it chooses which one). So my question is: is it an architectural limitation that Whisper has to ignore one of the overlapping speakers? Or can Whisper be fine-tuned to generate transcriptions for both?
-
Any chance of a comment from OpenAI? @jongwook?
-
This is a limitation of the model, because the training data often had a transcription for only one speaker while treating other voices as background noise. You might be able to nudge the model to produce overlapping transcription by prompting it with what each speaker would have been saying 30 seconds ago, potentially with speaker labels, like:
```
--prefix "[Bob] So I was saying that [Alice] But there was always this"
```
Please note, `[` and `]` are suppressed from the output by default, and you'll need to edit this line to re-enable them: whisper/whisper/tokenizer.py, Line 248 in 9f70a35.
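For reference, here is a minimal sketch of the same idea through the Python API (the audio filename is hypothetical). The `prefix` decoding option carries the speaker-labelled context; if I read the decoding code right, passing an empty `suppress_tokens` list should also stop `[` and `]` from being filtered, as an alternative to editing tokenizer.py:

```python
import whisper

model = whisper.load_model("base")

# Seed the decoder with speaker-labelled context via the `prefix` option.
# suppress_tokens defaults to "-1", which suppresses non-speech symbols
# (including [ and ]); an empty list disables that filtering.
result = model.transcribe(
    "overlapping_speech.wav",  # hypothetical file
    prefix='[Bob] So I was saying that [Alice] But there was always this',
    suppress_tokens=[],
)
print(result["text"])
```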
If you have a dataset of overlapping speech (or create one yourself by mixing single-speaker audio), I think fine-tuning on it can be a more effective/reliable/less hacky way to transcribe overlapping audio. This could be an interesting research project!
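As a starting point for building such a dataset, one could overlay pairs of single-speaker clips. A minimal sketch, assuming mono files with a shared sample rate and using `soundfile` for I/O (the function and file names are my own, not from this thread):

```python
import numpy as np
import soundfile as sf

def mix_overlapping(path_a: str, path_b: str, out_path: str, gain_b: float = 1.0) -> None:
    """Overlay two single-speaker recordings to synthesize overlapping speech."""
    a, sr_a = sf.read(path_a)
    b, sr_b = sf.read(path_b)
    assert sr_a == sr_b, "resample first if the sample rates differ"
    n = max(len(a), len(b))
    mixed = np.zeros(n, dtype=np.float32)
    mixed[: len(a)] += a.astype(np.float32)
    mixed[: len(b)] += gain_b * b.astype(np.float32)
    mixed /= max(1.0, float(np.abs(mixed).max()))  # normalize to avoid clipping
    sf.write(out_path, mixed, sr_a)

mix_overlapping("bob.wav", "alice.wav", "overlap.wav")
```

The matching target text could then be the two reference transcripts joined with speaker tags, e.g. `[Bob] ... [Alice] ...`, mirroring the prefix format above.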