Pre-processings to reduce hallucinations from noisy audio #2378
-
Did you use the GUI for MDX-Net, or were you able to just refer to that specific model in your code?
-
I have noticed that this `.transcribe` call should have filtered out everything above 2.1, but when I inspect the resulting segments I get some crazy values. I recommend that you filter your segments manually after transcribing; anything above ~2-2.2 is likely a hallucination.
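For example, here is a minimal sketch of that manual filtering with the openai-whisper Python API, assuming the value being thresholded is each segment's `compression_ratio` (the file name and the 2.1 cutoff are illustrative):

```python
import whisper

model = whisper.load_model("large-v2")   # model choice is illustrative
result = model.transcribe("audio.wav")

# Drop segments whose compression ratio exceeds the cutoff; very high
# compression ratios usually mean repetitive, hallucinated text.
CUTOFF = 2.1
kept = [seg for seg in result["segments"] if seg["compression_ratio"] <= CUTOFF]

for seg in kept:
    print(f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text'].strip()}")
```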
-
For me, OpenAI Whisper is simply unusable: even running speech-to-text on a clean 50 min monologue I get missing text. All the smaller models, including turbo (but not large-v2/v3), make English word recognition errors, so they are unusable too. Thanks for https://github.com/jhj0517/Whisper-WebUI - using ONLY Silero VAD I get perfect speech-to-text on my 50 min monologue without cutouts.
-
Hi everyone. Thanks for all your great efforts on this really cool open source project.
Here's my experience in reducing hallucinations from noisy audio.
This is the sample that can reproduce hallucinations:
https://www.youtube.com/watch?v=Eek0cOjLrV0
There's a long stretch of noise between 0:03 and 1:05 in the sample.
The main thing I focused on was avoiding this noise part as much as possible, so that Whisper would produce fewer hallucinations.
This is a human-made transcription with no errors:
Below is the transcription using whisper `large-v2`. The reason I used `large-v2` is that I got much worse results with `large-v3`:

For the hyperparameters, `beam_size` is 5 and all other defaults are used. Each line break represents a different segment detected by `large-v2`.
Whisper transcribed the long noise part (0:03 ~ 1:05) into `♪♪`s. The problem is that `large-v2` also transcribed some of the speech as just `♪♪`s.
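For reference, the setup described above roughly corresponds to a call like this (a minimal sketch assuming the `openai-whisper` Python package; the file name is illustrative):

```python
import whisper

# large-v2 with beam_size=5; every other option is left at its default.
model = whisper.load_model("large-v2")
result = model.transcribe("noisy_sample.wav", beam_size=5)

# Each element of result["segments"] is one segment detected by the model.
for seg in result["segments"]:
    print(seg["text"].strip())
```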
s.What I used to reduce such hallucinations is Silero-VAD to detect voices and MDX-Net model from UVR to remove the noise itself from the audio.
This is the result with VAD-only:
VAD successfully skipped the long noise part (0:03 ~ 1:05), but it also skipped some speech (the "Love. It's why we're all here" part). This is because the long noise part of the audio misleads Silero VAD as well. I made more attempts with different VAD parameters, but couldn't get a better result. In my experience, tweaking hyperparameters often leads to unexpected hallucinations, so I prefer pre-processing whenever possible.
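For context, a rough sketch of the VAD-only variant using the Silero VAD model from torch.hub: detect the speech regions, keep only those chunks, and transcribe the result (file names are illustrative):

```python
import torch
import whisper

# Load Silero VAD and its helper utilities.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

SAMPLE_RATE = 16000
wav = read_audio("noisy_sample.wav", sampling_rate=SAMPLE_RATE)

# Detect speech regions and concatenate them, dropping the noise-only parts.
speech_timestamps = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLE_RATE)
save_audio("speech_only.wav", collect_chunks(speech_timestamps, wav), sampling_rate=SAMPLE_RATE)

# Transcribe only the detected speech.
model = whisper.load_model("large-v2")
result = model.transcribe("speech_only.wav", beam_size=5)
print(result["text"])
```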
In my opinion, just removing the noise with the MDX-Net model (or any UVR model that can separate the noise from the audio; I haven't tested them all) is the best way to reduce hallucinations in such cases.
Here's the result with MDX-Net + VAD:
It skipped the long noise part (0:03 ~ 1:05) and didn't miss the few lines of speech either.
The one hallucination this version made was adding "Welcome to the" to the very first line; it's the most accurate result so far.
Since UVR models need a GPU (about ~8 GB of VRAM in my test) to run at a reasonable speed, and they are not as lightweight as Silero VAD (which is super fast on CPU: about 1 ms per 30-second audio chunk on a single CPU thread), adding this pre-processing pipeline might feel like a hassle. But it gives me the best result so far.
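If you'd rather script the MDX-Net step than use the UVR GUI, one option is the `audio-separator` package (python-audio-separator), which can run UVR's MDX-Net checkpoints. Below is a rough sketch of the full MDX-Net + VAD + Whisper pipeline under that assumption; the exact `Separator` API, the model filename, and the output file naming may differ between versions, so treat it as a starting point rather than a drop-in recipe:

```python
import torch
import whisper
from audio_separator.separator import Separator

# 1) Remove the noise with an MDX-Net checkpoint from UVR.
#    The model filename is illustrative; use whichever UVR model separates your noise best.
separator = Separator()
separator.load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")
output_files = separator.separate("noisy_sample.wav")
vocals_path = next(p for p in output_files if "Vocals" in p)  # pick the vocal stem

# 2) Run Silero VAD on the cleaned audio and keep only the speech chunks.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, _, collect_chunks = utils
wav = read_audio(vocals_path, sampling_rate=16000)
timestamps = get_speech_timestamps(wav, vad_model, sampling_rate=16000)
save_audio("speech_only.wav", collect_chunks(timestamps, wav), sampling_rate=16000)

# 3) Transcribe the pre-processed audio.
result = whisper.load_model("large-v2").transcribe("speech_only.wav", beam_size=5)
print(result["text"])
```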
If you want to try these opt-in pre-processing steps with Whisper, you can try them in Whisper-WebUI.