-
I am using Whisper to transcribe subtitles, and the lines end up being fairly long. Is it possible to nudge Whisper to output shorter lines or break them more often?
-
You can reformat the output in your shell by piping it through a line-wrapping command, as described here:
https://unix.stackexchange.com/questions/146089/format-output-to-a-specific-line-length
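For anyone post-processing in Python rather than in the shell, the same kind of fixed-width wrapping can be sketched with the standard-library `textwrap` module. This is only an illustration: `wrap_caption` is a hypothetical helper name, and the width of 42 is an assumed value, not something Whisper prescribes. Note that it only re-wraps text and does not touch the timestamps.

```python
import textwrap


def wrap_caption(text, width=42):
    """Re-wrap a caption so that no line is longer than `width` characters.

    Hypothetical helper: breaks only at whitespace and leaves timing untouched.
    """
    return textwrap.fill(text, width=width)


if __name__ == "__main__":
    print(wrap_caption("I am using Whisper to transcribe subtitles, "
                       "and the lines end up being fairly long."))
```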
-
The current decoding strategy for timestamps opts to sample a timestamp token when the summed probability over all timestamp tokens is above that of every individual text token (whisper/whisper/decoding.py, lines 431 to 437 in 0b1ba3d).
You could tune the threshold in line 436, e.g. relatively:
if timestamp_logprob > max_text_token_logprob * 0.1:
or absolutely:
if timestamp_logprob > 0.01:
to tune how likely the model is to sample timestamp tokens, which will determine the length of each phrase.
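To make the rule concrete, here is a minimal pure-Python sketch of that comparison on a toy vocabulary. It is not the library code: `prefers_timestamp` and its `bias` parameter are hypothetical names, and the additive offset in log space is an assumed alternative to the multiplicative factor suggested above. A positive `bias` makes timestamps more likely, which should give shorter phrases.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def logsumexp(values):
    """Log of the summed probabilities, given log-probabilities."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def prefers_timestamp(logits, timestamp_begin, bias=0.0):
    """Return True if a timestamp token should be sampled.

    Tokens at index >= timestamp_begin are treated as timestamp tokens.
    `bias` is a hypothetical tuning knob added to the timestamp log-probability.
    """
    logprobs = log_softmax(logits)
    timestamp_logprob = logsumexp(logprobs[timestamp_begin:])
    max_text_token_logprob = max(logprobs[:timestamp_begin])
    return timestamp_logprob + bias > max_text_token_logprob


# Toy example: 2 text tokens followed by 3 timestamp tokens.
toy_logits = [2.0, 0.5, 0.0, 0.0, 0.0]
print(prefers_timestamp(toy_logits, timestamp_begin=2))            # default rule
print(prefers_timestamp(toy_logits, timestamp_begin=2, bias=1.0))  # biased towards timestamps
```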
-
Are there any strategies for getting segments on a per-word basis? Regardless of how much I change the coefficient, I can never get segments smaller than 3-5 words.
-
Managed to play around and get it working.
Try messing around with numbers in the range of the tensor() output.
-
I, too, want this, because I'm creating karaoke files. We don't want an entire verse as one line, so I would also like to nudge it to use smaller segments.
-
Here is the configuration for caption text segments. I have tested this code and it works fine for me.
import whisper
from whisper.utils import get_writer
audio = './audio.mp3'
# load_model takes the model name as its first argument (the parameter is called `name`)
model = whisper.load_model('small')
# word_timestamps=True is needed for the line-width options below to take effect
result = model.transcribe(audio=audio, language='en', word_timestamps=True, task="transcribe")
# Limit each VTT cue to a single line of at most 42 characters
word_options = {
    "highlight_words": False,
    "max_line_count": 1,
    "max_line_width": 42
}
vtt_writer = get_writer(output_format='vtt', output_dir='./')
vtt_writer(result, audio, word_options)
-
Had this problem as well, so I just recursively halved subtitles that were too long.
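The code from this reply was not preserved, so the following is only a sketch of what recursive halving could look like, not the poster's implementation. It assumes each segment carries a word list shaped like Whisper's `word_timestamps=True` output (dicts with `word`, `start`, and `end`); `halve_segment` and `max_chars` are hypothetical names.

```python
def halve_segment(words, max_chars=42):
    """Recursively split a list of timed words until each piece fits in max_chars.

    `words` is assumed to be a list of dicts with "word", "start" and "end"
    keys, where "word" may carry a leading space.
    """
    text = "".join(w["word"] for w in words).strip()
    # Stop when the text fits, or when there is only one word left to split.
    if len(text) <= max_chars or len(words) < 2:
        return [{"start": words[0]["start"], "end": words[-1]["end"], "text": text}]
    mid = len(words) // 2
    return halve_segment(words[:mid], max_chars) + halve_segment(words[mid:], max_chars)


sample_words = [
    {"word": " one", "start": 0.0, "end": 0.5},
    {"word": " two", "start": 0.5, "end": 1.0},
    {"word": " three", "start": 1.0, "end": 1.5},
    {"word": " four", "start": 1.5, "end": 2.0},
]
for piece in halve_segment(sample_words, max_chars=8):
    print(piece)
```

Splitting by word count keeps the timestamps exact at every cut, at the cost of sometimes breaking in the middle of a phrase.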
-
You can do it yourself, since word timestamps are available. The words include punctuation such as . and ?, so you can detect the end of each sentence from them and split one segment into several as you wish.
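A rough sketch of that idea, assuming the word list has the shape produced by `word_timestamps=True` (dicts with `word`, `start`, and `end`). The function name `split_on_sentence_end` is hypothetical, and only the two punctuation marks mentioned above are treated as sentence endings.

```python
def split_on_sentence_end(words, enders=(".", "?")):
    """Group timed words into segments that end at sentence-final punctuation."""
    groups, current = [], []
    for w in words:
        current.append(w)
        if w["word"].strip().endswith(enders):
            groups.append(current)
            current = []
    if current:  # trailing words without closing punctuation
        groups.append(current)
    return [
        {
            "start": g[0]["start"],
            "end": g[-1]["end"],
            "text": "".join(w["word"] for w in g).strip(),
        }
        for g in groups
    ]


timed_words = [
    {"word": " Hi", "start": 0.0, "end": 0.3},
    {"word": " there.", "start": 0.3, "end": 0.8},
    {"word": " How", "start": 1.0, "end": 1.2},
    {"word": " are", "start": 1.2, "end": 1.4},
    {"word": " you?", "start": 1.4, "end": 1.9},
    {"word": " Fine", "start": 2.1, "end": 2.5},
]
for segment in split_on_sentence_end(timed_words):
    print(segment)
```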
-
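After spending longer than I'd like to admit trying to do this with an algorithm, I thought, "why not just ask the AI?" This works perfectly for me:
from openai import OpenAI
# The next three lines are assumed setup that the original post does not show
client = OpenAI()  # reads OPENAI_API_KEY from the environment
language = "en"
audio_file = open("audio.mp3", "rb")
prompt = "Please transcribe this audio. Please make segments no longer than 25 words."
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="vtt",
    language=language,
    prompt=prompt,
    temperature=0.7
)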