A possible solution to Whisper hallucination #679
Replies: 33 comments · 136 replies
-
Thanks @KaiserChr

```python
# current working name for a threshold to determine permissible chunk length for a healthy transcript
lucid_threshold = 0.3

# first chunk (ergo no context), or the next chunk will be fully within num_frames
if ((seek + N_FRAMES) / num_frames < 1.0) or (seek == 0):
    if "prompt" in decode_options:
        decode_options["prompt"] = all_tokens[prompt_reset_since:]
else:
    # next chunk is not the first chunk and will not be fully within num_frames (i.e. last chunk): calculate lucid_score
    lucid_score = (num_frames - seek) / N_FRAMES
    if lucid_score < lucid_threshold and "prompt" in decode_options:
        decode_options["prompt"] = []
    else:
        decode_options["prompt"] = all_tokens[prompt_reset_since:]
```
-
Well done @KaiserChr! What is the right step to suggest merging this into main Whisper code? Would be great to get a solution like this as early as possible.
-
But the output is still the same:
Am I doing something wrong?
-
Yes, I didn't set that in the previous example. I have now launched Whisper with it, but the result is still not good:
Thank you for your effort.
-
In my shop we hacked together a Python script to clean up the VTT output to make it a bit more normative (always add 00: hour timestamps), add cue IDs, shorten the line lengths, add some NOTE metadata, and make misc other changes. For this problem we read the file into the webvtt library and compare the text of each cue to the previous one. If they are exactly equal, we drop the current cue and move on to the next. Not perfect by any stretch, and you will still have the first ghost, but it beats having the endless repeats. A rough sketch of the dedup step is below.
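A minimal sketch of that dedup step, assuming the webvtt-py package (file names are placeholders, and the original script does more than this):

```python
# Drop any cue whose text exactly matches the previous cue's text.
import webvtt

vtt = webvtt.read("input.vtt")
deduped = webvtt.WebVTT()

previous_text = None
for cue in vtt:
    if cue.text != previous_text:
        deduped.captions.append(cue)
    previous_text = cue.text

deduped.save("output.vtt")
```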
-
The best way is to split the audio file up into audio slices using VAD and feed those into Whisper. There is also a parameter, condition_on_previous_text: set it to False to force the model to forget. The problem when this parameter is True is that the model remembers what it output previously, and if the current chunk cannot produce anything, it will just reuse the last output. With the parameter set to False it forgets, and if it cannot decipher the audio it will simply output a blank. But I find that it is still best to transcribe slices of audio, especially if the material is conversational, because every statement expresses an idea on its own, unlike long essays where the whole paragraph expresses an idea. Being able to infer from the previous statement may not be the best idea for conversations. Remember Whisper is trained mostly on subtitles from YouTube, so you will get funny outputs sometimes.
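For reference, a minimal sketch of turning off prompt conditioning in openai-whisper (the model size and file name are placeholder assumptions):

```python
import whisper

model = whisper.load_model("medium")
# condition_on_previous_text=False stops the decoder from being primed with the
# previous chunk's text, which is what carries hallucinations forward between chunks.
result = model.transcribe("audio.wav", condition_on_previous_text=False)
print(result["text"])
```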
-
yeah this is a riot, trying it against a few difficult cases now.
-
cool, will have a look at that.
-
as soon as i get a chance to test this, i will post here. we have 10M samples to work from LOL.
-
Hi, I looked into the problem some more and found out a couple of things. The settings

```python
kwargs['language'] = 'de'
kwargs['verbose'] = True
kwargs['task'] = 'transcribe'
kwargs['temperature'] = 0
kwargs['best_of'] = None
kwargs['beam_size'] = None
kwargs['patience'] = None
kwargs['length_penalty'] = None
kwargs['suppress_tokens'] = "-1"
kwargs['initial_prompt'] = None
kwargs['condition_on_previous_text'] = False
kwargs['fp16'] = True  # for GPU
kwargs['compression_ratio_threshold'] = 2.4
kwargs['logprob_threshold'] = -0.5
kwargs['no_speech_threshold'] = 0.2
```

seem to work really well IF you use CUDA (for the medium model). The performance on CPU was so slow that the queue was filling up in an unacceptable manner, because the model sometimes tries to understand a silent part of the audio multiple times (a behavior I tried to curb with the parameters given above), and without a GPU that is just too slow. Further debugging makes me believe the decoder is the source of this slowdown. With a GPU this loop is so fast that no significant slowdown is noticeable, albeit at a large hardware cost. Once we have a working version with Silero I plan to put the code we use up here on GitHub so anyone interested can take a look. Please note that this version does not require the code I posted at the start of the thread; with a VAD, utterances are typically much less than 30 seconds, and therefore it seems more stable to deactivate `condition_on_previous_text`.
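For completeness, one way these kwargs might be applied (the model size and input path are assumptions; the rest is the standard transcribe call):

```python
import whisper

model = whisper.load_model("medium", device="cuda")
result = model.transcribe("audio.wav", **kwargs)
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```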
-
I am sorting out the worst case recordings we have to be able to test this, will post back when done.
-
correct. it may not be possible, but we are also looking at the inference side of whisper to see which settings could be optimized.
-
I looked into the best combination of parameters some more, and right now I find that utterance splitting via a VAD is the most stable solution for clean chunking. The current version of the code I use for real-time transcription in German uses the GPU and the medium model, as this gives a very nice combination of speed and precision. Find the code attached below if you want to try it for yourself!

```python
# asr using whisper and Silero-VAD (https://github.com/snakers4/silero-vad)
# structure based on the very nice work of Oliver Guhr over at https://github.com/oliverguhr/wav2vec2-live
import pyaudio
import numpy as np
import threading
import time
from sys import exit
from queue import Queue
import matplotlib.pylab as plt
import wave
import whisper
import struct
import multiprocessing
import torch

filename = 'audio_provided.wav'       # for debugging: save the audio stream that was provided to whisper after sending through the queue
filename_orig = 'audio_recorded.wav'  # for debugging: save the audio stream that was actually recorded pre sending


class Realtime_Whisper():
    exit_event = threading.Event()

    def __init__(self, model_name, device_name="default"):
        self.model_name = model_name
        self.device_name = device_name

    def stop(self):
        """stop the asr process"""
        Realtime_Whisper.exit_event.set()
        self.asr_input_queue.put("close")
        print("asr stopped")

    def start(self):
        """start the asr process"""
        manager = multiprocessing.Manager()
        self.asr_output_queue = Queue()
        self.asr_input_queue = Queue()
        self.visualization_input_queue = manager.Queue()  # currently not used, the queue is still in for convenience...
        self.asr_process = threading.Thread(target=Realtime_Whisper.asr_process, args=(
            self.model_name, self.asr_input_queue, self.asr_output_queue,))
        self.asr_process.daemon = True
        self.asr_process.start()
        time.sleep(5)  # start vad after the asr model is loaded
        self.vad_process = threading.Thread(target=Realtime_Whisper.vad_process, args=(
            self.device_name, self.asr_input_queue, self.visualization_input_queue,))
        self.vad_process.daemon = True
        self.vad_process.start()
        # Debug: optional visualization
        # self.visualization_process = multiprocessing.Process(target=Realtime_Whisper.plot_stream, args=(
        #     self.visualization_input_queue,))
        # self.visualization_process = threading.Thread(target=Realtime_Whisper.plot_stream, args=(
        #     self.visualization_input_queue,))
        # self.visualization_process.daemon = True
        # self.visualization_process.start()

    def int2float(sound):
        """convert the wav pcm16 format to one suitable for silero vad"""
        _sound = np.copy(sound)  # may not be necessary
        # abs_max = np.abs(_sound).max()
        abs_max = 32767
        _sound = _sound.astype('float32')
        if abs_max > 0:
            _sound *= 1 / abs_max
        _sound = _sound.squeeze()  # depends on the use case
        return _sound

    def plot_stream(instream):
        """plot audio stream via matplotlib"""
        CHUNK = 160
        CHANNELS = 1
        RATE = 16000
        fig, ax = plt.subplots()
        x = np.arange(0, 2 * CHUNK, 2)
        line, = ax.plot(x, np.random.rand(CHUNK), 'r')
        ax.set_ylim(-20000, 20000)
        ax.set_xlim(0, CHUNK)
        fig.show()
        while True:
            data = instream.get()
            dataInt = struct.unpack(str(CHUNK) + 'h', data)
            line.set_ydata(dataInt)
            fig.canvas.draw()
            fig.canvas.flush_events()

    def vad_process(device_name, asr_input_queue, vis_input_queue):
        """voice activity detection using silero-vad"""
        model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                      model='silero_vad',
                                      force_reload=False,
                                      onnx=False)
        (get_speech_timestamps,
         save_audio,
         read_audio,
         VADIterator,
         collect_chunks) = utils
        # not sure this is useful, but I leave it in for now...
        vad_iterator = VADIterator(model)
        audio = pyaudio.PyAudio()
        FORMAT = pyaudio.paInt16
        CHANNELS = 1
        RATE = 16000
        FRAME_DURATION = 60
        CHUNK = int(RATE * FRAME_DURATION / 1000)
        SPEECH_PROB_THRESHOLD = 0.2  # this probably needs a bit of tweaking
        microphones = Realtime_Whisper.list_microphones(audio)
        selected_input_device_id = Realtime_Whisper.get_input_device_id(
            device_name, microphones)
        print('input device id')
        print(microphones)
        print(selected_input_device_id)
        # this should be the default mic, tweak as needed...
        selected_input_device_id = 1
        stream = audio.open(input_device_index=selected_input_device_id,
                            format=FORMAT,
                            channels=CHANNELS,
                            rate=RATE,
                            input=True,
                            frames_per_buffer=CHUNK)
        # frame buffer for the queue
        frames = b''
        # master frame buffer for saving the data sent to asr
        masterframes_asr = b''
        last_speech_prob = 0
        while True:
            if Realtime_Whisper.exit_event.is_set():
                break
            frame = stream.read(CHUNK, exception_on_overflow=False)
            frame_tensor = torch.from_numpy(Realtime_Whisper.int2float(np.frombuffer(frame, dtype=np.int16)))
            speech_prob = model(frame_tensor, RATE).item()
            # turn this on for debugging and tweaking the threshold...
            # print(speech_prob)
            # accumulate frames in the frame buffer if speech is detected and the total length is < 30 s (max size of a whisper chunk)
            if speech_prob > SPEECH_PROB_THRESHOLD and len(frames) < 480000:  # THIS NEEDS TO BE LOOKED AT AGAIN, MAYBE A FULL 30s WHISPER CHUNK IS TOO MUCH
                frames += frame
            # if there was speech and now there is none (i.e. an utterance has finished) or the max length is exceeded, write to queue
            elif (speech_prob <= SPEECH_PROB_THRESHOLD < last_speech_prob) or (len(frames) >= 480000):
                asr_input_queue.put(frames)
                masterframes_asr += frames
                frames = b''
            last_speech_prob = speech_prob
        stream.stop_stream()
        stream.close()
        audio.terminate()
        # open and set the data of the WAV file
        file = wave.open(filename_orig, 'wb')
        file.setnchannels(1)
        file.setsampwidth(2)
        file.setframerate(16000)
        # write and close the file
        file.writeframes(b''.join(np.frombuffer(masterframes_asr, dtype=np.int16)))
        file.close()

    def asr_process(model_name, in_queue, output_queue):
        """transcribe using whisper"""
        model = whisper.load_model(model_name, device='cuda')  # use cuda for everything > base model
        # with current settings this always excepts to 0, but left in to play around with the setting...
        temperature_increment_on_fallback = 0
        temperature = 0
        try:
            temperature = tuple(np.arange(temperature, 1.0 + 1e-6, temperature_increment_on_fallback))
        except:
            temperature = 0
        kwargs = {}
        kwargs['language'] = 'de'
        kwargs['verbose'] = True
        kwargs['task'] = 'transcribe'
        kwargs['temperature'] = temperature
        kwargs['best_of'] = None
        kwargs['beam_size'] = None
        kwargs['patience'] = None
        kwargs['length_penalty'] = None
        kwargs['suppress_tokens'] = "-1"
        kwargs['initial_prompt'] = None
        kwargs['condition_on_previous_text'] = False  # seems to be a source of false transcripts
        kwargs['fp16'] = True  # set False if using cpu
        kwargs['compression_ratio_threshold'] = None  # 2.4
        kwargs['logprob_threshold'] = None  # -1.0 #-0.5
        kwargs['no_speech_threshold'] = None  # 0.6 #0.2
        # masterframes = ''
        masterframes = b''
        while True:
            audio_file = in_queue.get()
            if audio_file == "close":
                break
            print("\nlistening to your beautiful voice\n")
            masterframes += audio_file
            audio_tensor = torch.from_numpy(Realtime_Whisper.int2float(np.frombuffer(audio_file, dtype=np.int16)))
            result = model.transcribe(audio_tensor, **kwargs)
            if result != "":
                output_queue.put(result["segments"])
        # open and set the data of the WAV file
        file = wave.open(filename, 'wb')
        file.setnchannels(1)
        file.setsampwidth(2)
        file.setframerate(16000)
        file.writeframes(b''.join(np.frombuffer(masterframes, dtype=np.int16)))
        file.close()

    def get_input_device_id(device_name, microphones):
        for device in microphones:
            if device_name in device[1]:
                return device[0]

    def list_microphones(pyaudio_instance):
        info = pyaudio_instance.get_host_api_info_by_index(0)
        numdevices = info.get('deviceCount')
        result = []
        for i in range(0, numdevices):
            if (pyaudio_instance.get_device_info_by_host_api_device_index(0, i).get('maxInputChannels')) > 0:
                name = pyaudio_instance.get_device_info_by_host_api_device_index(
                    0, i).get('name')
                result += [[i, name]]
        return result

    def get_last_text(self):
        """returns the text, sample length and inference time in seconds."""
        return self.asr_output_queue.get()


if __name__ == "__main__":
    print("Live ASR")
    # param is model size
    asr = Realtime_Whisper("medium")
    asr.start()
    last_text = 'Start'
    try:
        while True:
            lastresult = asr.get_last_text()
            for segment in lastresult:
                print('ID: ' + str(segment['id']) + ' START: ' + str(round(segment['start'], 1)) + ' END: ' + str(round(segment['end'], 1)) + ' TEXT: ' + segment['text'])
    except KeyboardInterrupt:
        asr.stop()
        exit()
```
-
Not sure this is the same root cause, but besides the duplicates, everything starting at 28:26:00 is total rubbish (it translates to "subtitles on behalf of ZDF"). Obviously the source audio does not include anyone saying anything about subtitles. I used the small model on the CLI. I haven't tried --condition_on_previous_text yet, but I'm trying the medium model now.
-
My small contribution to this subject: each of the following reduces the amount of "hallucinated text" (but the problem still occurs).
-
Loudness normalization may also be useful, by adding it to the ffmpeg filter chain.
PS: updated. With ffmpeg, multiple filters should be provided at once, comma-separated.
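A possible shape for that preprocessing step (the specific filters here, highpass plus loudnorm, are illustrative assumptions rather than the ones referred to above; the point is that both go into a single -af argument, comma-separated):

```python
# Normalize loudness (and optionally filter) with ffmpeg before handing the file to Whisper.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y", "-i", "input.wav",
        "-af", "highpass=f=100,loudnorm",  # multiple filters, comma separated
        "-ar", "16000", "-ac", "1",        # 16 kHz mono, what Whisper expects
        "clean.wav",
    ],
    check=True,
)
```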
-
As said above, hallucinations are certainly due to bad sync between sounds and texts in the subtitle training data (and copyright notices added to them). Thus, they most probably appear at the beginning or end of the output. I observed that hallucinations are very dependent on very particular sound configurations: if the input is changed a bit, it is highly probable that the hallucination disappears. My new idea is to add markers, easily recognized by Whisper, at the beginning and end of the sound. If the markers come out correctly at the beginning and end of the output, simply remove them. If not, something went wrong, most likely a hallucination was added; try the same thing with the markers inverted (to try something a bit different). If the markers are still not properly reproduced in the output, try with the original sound (to avoid trouble possibly added by the markers). If timestamps are needed, it is easy to restore them knowing the time length of the markers. Here is the code:
Used markers (GitHub does not accept attached WAV files):
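A minimal sketch of the marker idea described above (not the author's WhisperHallu code; marker.wav, input.wav and MARKER_TEXT are assumptions, and punctuation handling around the marker text is ignored):

```python
import numpy as np
import whisper

MARKER_TEXT = "beep"  # hypothetical: whatever Whisper reliably outputs for the marker clip

model = whisper.load_model("medium")
marker = whisper.load_audio("marker.wav")  # 16 kHz float32
audio = whisper.load_audio("input.wav")
marker_len_s = len(marker) / whisper.audio.SAMPLE_RATE

# Wrap the audio in markers, transcribe, then check the markers survived.
wrapped = np.concatenate([marker, audio, marker])
result = model.transcribe(wrapped)
text = result["text"].strip().lower()

if text.startswith(MARKER_TEXT) and text.endswith(MARKER_TEXT):
    # Markers intact: drop them and shift timestamps back by the leading marker's length.
    segments = [
        {**seg, "start": seg["start"] - marker_len_s, "end": seg["end"] - marker_len_s}
        for seg in result["segments"]
        if MARKER_TEXT not in seg["text"].lower()
    ]
else:
    # Something went wrong (possibly a hallucination): retry with swapped markers,
    # or fall back to the original, unwrapped audio.
    segments = None
```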
-
@dgoryeo If processing time is not a problem for you, here is perhaps the ultimate way to get a good SRT.
:-)
-
A new version of WhisperHallu is available. It adds Deezer Spleeter to extract the voice track and eliminate noise.
-
works for me.
-
"condition_on_previous_text": False, will probably degrade quality |
-
Is an "optimal" value for
-
Me during an online Blender class, hoping it didn't crash:
Meanwhile, Whisper...

```
01:31:44.240 --> 01:31:44.880
01:32:14.240 --> 01:32:17.240
01:32:44.240 --> 01:32:47.240
01:33:14.240 --> 01:33:17.240
01:33:44.240 --> 01:33:47.240
01:34:14.240 --> 01:34:17.240
01:34:44.240 --> 01:34:47.240
01:35:14.240 --> 01:35:17.240
01:35:18.240 --> 01:35:21.240
01:35:22.240 --> 01:35:25.240
01:35:26.240 --> 01:35:29.240
01:35:30.240 --> 01:35:33.240
01:35:34.240 --> 01:35:37.240
01:35:38.240 --> 01:35:41.240
```

... and so on. The language is Italian. Thank you.
-
Noise removal + VAD + removing segments with a low likelihood of speech is working for me. The last of these was not mentioned in this thread, and it is what gave me a significant improvement in the results; it is mentioned here: #928 (comment)
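A rough sketch of that last filtering step, using the per-segment statistics Whisper already returns (the thresholds are illustrative assumptions, not values from the linked comment):

```python
import whisper

model = whisper.load_model("medium")
result = model.transcribe("audio.wav")

# Keep only segments that look like actual speech.
kept = [
    seg for seg in result["segments"]
    if seg["no_speech_prob"] < 0.6 and seg["avg_logprob"] > -1.0
]
for seg in kept:
    print(f"{seg['start']:.2f} --> {seg['end']:.2f} {seg['text']}")
```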
-
Hi everyone, this is the first time I'm commenting here, but I've been following this topic for some time, and I believe I achieved a very good result in my code. I combined several techniques already mentioned in this topic and others that perhaps haven't been thought of yet. I apologize for some sentences in Portuguese, but I believe I have translated all the variables into English so everyone can understand the code. Anyone who wants to test it will have to adapt it to their own code, as this is just an excerpt from mine; my idea is to show how I achieved a good result, avoid Whisper's hallucinations almost 100% of the time, and keep it fast enough for real-time transcription. I want to thank the people at Faster Whisper for the excellent work converting it to float16.

class AudioTranscriber:

The "speech_filter" function serves to filter out some hallucinations and hide that output; however, with my latest modifications this was no longer necessary. But for whoever wants to use it, it follows below.

def filtro_de_fala(texto):
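Their full AudioTranscriber class and filtro_de_fala helper are not reproduced above. As a rough illustration of the kind of combination described (faster-whisper with its built-in Silero VAD filter and prompt conditioning turned off; the model size and options are assumptions, not their code):

```python
from faster_whisper import WhisperModel

# float16 on GPU is what makes this fast enough for near-real-time use.
model = WhisperModel("medium", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,                   # drop non-speech before decoding
    condition_on_previous_text=False,  # don't carry hallucinations across chunks
)
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```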
-
There is a bug introduced with #1279 that can cause bad hallucination loops. There is a bugfix in #1903 [not merged at the moment].
-
Just a quick thought: since this tends to happen on short audios, checking the duration per word might be a way to detect anomalies. For example, when I get:
for a recording of 0.128 seconds duration, then it is definitely a hallucination.
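A hedged sketch of that sanity check (the 0.1 s-per-word threshold and file name are illustrative assumptions):

```python
import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("clip.wav")
duration_s = len(audio) / whisper.audio.SAMPLE_RATE

result = model.transcribe(audio)
words = result["text"].split()
if words and duration_s / len(words) < 0.1:
    print(f"Likely hallucination: {len(words)} words in {duration_s:.3f} s of audio")
```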
-
We also thought along those lines and specifically created a finetune with better timestamps and dedicated and trained alignment heads to make hallucination detection even more robust. Feel free to check it out.
-
I have no solution.
-
Hi all,
I was having trouble with Whisper creating "ghost transcripts" at the end of a given sound file. These often consist of repeats and shuffles of the text of previous chunks and turned out to be quite detrimental to the overall quality of the transcript.
I looked into this a little and found a possible solution:
The parameter 'condition_on_previous_text' is set to True by default and helps Whisper (to my understanding) keep the context going between chunks.
My working hypothesis is that the problem arises when the last chunk is short (a couple of seconds) compared to the text initializing the next chunk's transcription. The model then seems to have a problem with disambiguation and starts "seeing things".
A more elegant solution than just setting condition_on_previous_text to False would be something like this (not properly debugged yet):
After line 178 of whisper/transcribe.py, exchange
`decode_options["prompt"] = all_tokens[prompt_reset_since:]`
with something like this:
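(The snippet in question is the one quoted in the first reply at the top of this thread; it is reproduced here for readability.)

```python
# current working name for a threshold to determine permissible chunk length for a healthy transcript
lucid_threshold = 0.3

# first chunk (ergo no context), or the next chunk will be fully within num_frames
if ((seek + N_FRAMES) / num_frames < 1.0) or (seek == 0):
    if "prompt" in decode_options:
        decode_options["prompt"] = all_tokens[prompt_reset_since:]
else:
    # next chunk is not the first chunk and will not be fully within num_frames (i.e. last chunk): calculate lucid_score
    lucid_score = (num_frames - seek) / N_FRAMES
    if lucid_score < lucid_threshold and "prompt" in decode_options:
        decode_options["prompt"] = []
    else:
        decode_options["prompt"] = all_tokens[prompt_reset_since:]
```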
I'm still tinkering and debugging, but so far it has worked for me and I just wanted to post it in case it helps anybody else!
p.s.
I was encountering the problem when working with German audio on the medium and large models.
EDIT: 14.12.2022 further debugged and formatted code a bit better
UPDATE: 21.12.2022 Approach using VAD, see my post below