-
I have tried Whisper on audio with overlapping speech, i.e., multiple people talking simultaneously (check it here). Whisper is quite able to separate overlapping speech, but it only generates a transcription for one of the speakers (I don't know how it chooses which one). So my question is: is it an architectural limitation that Whisper has to ignore one of the overlapping speakers? Or can Whisper be fine-tuned to generate transcriptions for both?
-
Any chance of a comment from OpenAI? @jongwook?
-
This is a limitation of the model, because the training data often had a transcription for only one speaker while treating other voices as background noise. You might be able to nudge the model to produce overlapping transcription by prompting it with what each speaker would have been saying 30 seconds ago, potentially with speaker labels, like:
```
--prefix "[Bob] So I was saying that [Alice] But there was always this"
```
Please note, `[` and `]` are suppressed from the output by default, and you'll need to edit this line to re-enable them: whisper/whisper/tokenizer.py, Line 248 in 9f70a35.
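For reference, here is a minimal sketch of the same idea through the Python API (the audio filename is hypothetical). The `prefix` decoding option carries the speaker-labelled context; if I read the decoding code right, passing an empty `suppress_tokens` list should also stop `[` and `]` from being filtered, as an alternative to editing tokenizer.py:

```python
import whisper

model = whisper.load_model("base")

# Seed the decoder with speaker-labelled context via the `prefix` option.
# suppress_tokens defaults to "-1", which suppresses non-speech symbols
# (including [ and ]); an empty list disables that filtering.
result = model.transcribe(
    "overlapping_speech.wav",  # hypothetical file
    prefix='[Bob] So I was saying that [Alice] But there was always this',
    suppress_tokens=[],
)
print(result["text"])
```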
If you have a dataset of overlapping speech (or create one yourself by mixing single-speaker audio), I think fine-tuning on it can be a more effective/reliable/less hacky way to transcribe overlapping audio. This could be an interesting research project!
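As a starting point for building such a dataset, one could overlay pairs of single-speaker clips. A minimal sketch, assuming mono files with a shared sample rate and using `soundfile` for I/O (the function and file names are my own, not from this thread):

```python
import numpy as np
import soundfile as sf

def mix_overlapping(path_a: str, path_b: str, out_path: str, gain_b: float = 1.0) -> None:
    """Overlay two single-speaker recordings to synthesize overlapping speech."""
    a, sr_a = sf.read(path_a)
    b, sr_b = sf.read(path_b)
    assert sr_a == sr_b, "resample first if the sample rates differ"
    n = max(len(a), len(b))
    mixed = np.zeros(n, dtype=np.float32)
    mixed[: len(a)] += a.astype(np.float32)
    mixed[: len(b)] += gain_b * b.astype(np.float32)
    mixed /= max(1.0, float(np.abs(mixed).max()))  # normalize to avoid clipping
    sf.write(out_path, mixed, sr_a)

mix_overlapping("bob.wav", "alice.wav", "overlap.wav")
```

The matching target text could then be the two reference transcripts joined with speaker tags, e.g. `[Bob] ... [Alice] ...`, mirroring the prefix format above.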