A way to change segment length? #223

neongreen · 2022-10-02T01:30:18Z

neongreen
Oct 2, 2022

I am using Whisper to transcribe subtitles, and the lines end up being fairly long. Is it possible to nudge Whisper to output shorter lines / break them more often?

Answered by jongwook

Oct 3, 2022


glangford · 2022-10-03T19:07:58Z

glangford
Oct 3, 2022

You can reformat output in your shell by piping to fold, for example

echo "This line is longer than I would like to have in the output from whisper" | fold -s -w 30

results in

This line is longer than I 
would like to have in the 
output from whisper

https://unix.stackexchange.com/questions/146089/format-output-to-a-specific-line-length

1 reply

jongwook Oct 3, 2022
Maintainer

I think the OP wanted shorter phrases each with begin/end timestamps, so a simple fold wouldn't be sufficient.

jongwook · 2022-10-03T20:18:11Z

jongwook
Oct 3, 2022
Maintainer

The current decoding strategy for timestamps samples a timestamp token whenever the summed probability over all timestamp tokens is higher than the probability of any single text token:

whisper/whisper/decoding.py

Lines 431 to 437 in 0b1ba3d

    
    # if sum of probability over timestamps is above any other token, sample timestamp
    logprobs = F.log_softmax(logits.float(), dim=-1)
    for k in range(tokens.shape[0]):
        timestamp_logprob = logprobs[k, self.tokenizer.timestamp_begin :].logsumexp(dim=-1)
        max_text_token_logprob = logprobs[k, : self.tokenizer.timestamp_begin].max()
        if timestamp_logprob > max_text_token_logprob:
            logits[k, : self.tokenizer.timestamp_begin] = -np.inf

You could tune the threshold in line 436 with, e.g. relatively:

if timestamp_logprob > max_text_token_logprob * 0.1:

or absolutely:

if timestamp_logprob > 0.01:

to tune how likely the model is to sample timestamp tokens, which determines the length of each phrase.
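One thing worth checking when picking a threshold: both quantities compared on that line are log-probabilities, so they are never positive. Below is a minimal, self-contained sketch in plain Python (no Whisper or PyTorch) of the same decision rule with an optional absolute cut-off. The function names, the `threshold` parameter, and the toy logits are made up for illustration and are not part of Whisper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def logsumexp(values):
    """log(sum(exp(v))) computed without overflow."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forces_timestamp(logits, timestamp_begin, threshold=None):
    """Return True if the rule would mask every text token at this step.

    threshold=None mimics the stock comparison against the best text token;
    a float is an absolute cut-off in log space (hypothetical parameter).
    """
    logprobs = log_softmax(logits)
    timestamp_logprob = logsumexp(logprobs[timestamp_begin:])
    max_text_token_logprob = max(logprobs[:timestamp_begin])
    if threshold is None:
        return timestamp_logprob > max_text_token_logprob
    return timestamp_logprob > threshold


# Toy decoding step: 3 text tokens followed by 2 timestamp tokens.
logits = [2.0, 1.0, 0.5, 0.2, 0.1]
print(forces_timestamp(logits, timestamp_begin=3))                  # stock rule
print(forces_timestamp(logits, timestamp_begin=3, threshold=0.01))  # positive cut-off
print(forces_timestamp(logits, timestamp_begin=3, threshold=-5.0))  # negative cut-off
```

With these toy numbers the summed timestamp log-probability is about -1.8, so a cut-off of 0.01 can never trigger, while a negative cut-off such as -5 does. For the same reason, multiplying a negative log-probability by 0.1 moves it toward zero, which raises the bar for a timestamp rather than lowering it.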

3 replies

timminata Jan 6, 2023

I tried tweaking that line and for some reason it did not affect two different audio files I was testing with. The one file is 27 seconds long and does not get broken down at all and the other file is 24 seconds and gets broken down into 3 segments. Ideally I'd want every file to be broken down into 5-10 second segments. Also since this involves editing the source code, it won't work when I install via poetry in production. Any suggestions for how I should approach this or if it might be a parameter at some point?

yeetus1992 Feb 27, 2023

I am wondering this too!

couchpotatochip21 Feb 22, 2024

This does not work, and trying smaller values does not work either.

DanielHabib · 2022-11-23T17:33:28Z

DanielHabib
Nov 23, 2022

Are there any strategies for getting segments on a per-word basis? Regardless of how much I change the coefficient I can never get segments smaller than 3-5 words

1 reply

strukturedkaos Apr 20, 2023

@DanielHabib did you ever figure out how to do this?

harryy38 · 2023-01-09T08:56:05Z

harryy38
Jan 9, 2023

Managed to play around and get it working.

# if sum of probability over timestamps is above any other token, sample timestamp
logprobs = F.log_softmax(logits.float(), dim=-1)
for k in range(tokens.shape[0]):
    timestamp_logprob = logprobs[k, self.tokenizer.timestamp_begin :].logsumexp(dim=-1)
    max_text_token_logprob = logprobs[k, : self.tokenizer.timestamp_begin].max()
    #if timestamp_logprob > max_text_token_logprob:
    print(timestamp_logprob,max_text_token_logprob)
    if timestamp_logprob > -5 :
        logits[k, : self.tokenizer.timestamp_begin] = -np.inf

Try messing around with numbers in the range of the tensor() output

print(timestamp_logprob,max_text_token_logprob)

5 replies

yeetus1992 Feb 27, 2023

Hi, I am currently using whisper for a subtitles bot and got everything working. I, too, want to change the segment length, though. If I want to make the changes you described, do I need to install the entire GitHub repository for whisper? Because currently, I only did

import whisper
model = whisper.load_model("tiny.en")
result = model.transcribe("speech.mp3")
so, how do I configure these changes?
You can probably tell from this question that I am a beginner! You don't know how much you'd help me with an answer,

sincerely:)

harryy38 Feb 27, 2023

So you need Whisper installed. cd into this directory (/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/whisper) and go to decoding.py. Head to line 441 and change the number in the if statement; if you mess around with it you should be able to find something you can work with!

yeetus1992 Mar 1, 2023

Thank you for the reply, and very much for the info, too!

Yet, when I run
whisper file.mp3 --language English
I still get the same 6-second-ish timestamps...
[00:00.000 --> 00:06.000] Scientists believe that with global warming, we can expect more severe weather patterns
[00:06.000 --> 00:11.940] including heat waves, hurricanes, floods, and drought.
[00:11.940 --> 00:14.860] The oceans may become more acid.
[00:14.860 --> 00:22.260] Weather events like these can increase health risks, damage economies, destroy habitats,
[00:22.260 --> 00:29.260] and affect our quality of life.

Even when tweaking the mentioned line.
Could you help me on what I'm still doing wrong?

Sincerely

Wait, maybe the issue is that I didn't change the installed whisper library and instead changed something in the cloned git repository in VS Code. Don't waste your time on my reply, I'll try to fix it now and let you know!

timminata Mar 1, 2023

I also don't know how I would deploy this since I'm installing my dependencies via poetry in a dockerfile - so tweaking the source code doesn't really seem like a great way to go about it - I would have to copy the source code into my repository and edit it there I suppose.

yeetus1992 Mar 1, 2023

But how is it even possible to change the source code in /site-packages/whisper? I always get a permission denied error.
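Before editing anything, it can help to confirm which copy of the package Python actually imports, since a cloned repository and the installed copy in site-packages are separate files. The sketch below uses only the standard library; `package_dir` is a hypothetical helper name, not something Whisper provides.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory Python would import `name` from, or None.

    None means the package is not installed for this interpreter
    (or has no file-based origin).
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


if __name__ == "__main__":
    # Prints the folder that holds the decoding.py in use, if whisper is installed.
    print(package_dir("whisper"))
```

If that folder belongs to a system-wide Python, a permission error on save is expected. Installing into a virtual environment, or running `pip install -e .` inside the cloned repository so that the clone itself becomes the imported copy, are two common ways around it.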

ClaireCJS · 2023-07-06T08:20:56Z

ClaireCJS
Jul 6, 2023

I, too, want this.

Because I'm creating karaoke files.

We don't want an entire verse as one line

I, too, would like to nudge it to use smaller segments.

1 reply

MatteoFasulo Jul 6, 2023

Hi,
Take a look at https://github.com/jianfch/stable-ts

Actually with this you can have a timestamp for each word and then highlight them for karaoke.

rexsateesh · 2023-10-10T11:01:05Z

rexsateesh
Oct 10, 2023

Here is the caption text segment configuration. I have tested this code and it works fine for me.

import whisper
from whisper.utils import get_writer 

audio = './audio.mp3'
model = whisper.load_model(model='small')
result = model.transcribe(audio=audio, language='en', word_timestamps=True, task="transcribe")

# Set VTT Line and words width
word_options = {
    "highlight_words": False,
    "max_line_count": 1,
    "max_line_width": 42
}
vtt_writer = get_writer(output_format='vtt', output_dir='./')
vtt_writer(result, audio, word_options)
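For anyone who wants the same kind of regrouping without going through the writer, here is a rough sketch of splitting by width using the word timestamps directly. It assumes the word dictionaries have the `word`, `start` and `end` keys found in `result["segments"][i]["words"]` when `word_timestamps=True`. The function name `split_by_width` is made up, and this is not how Whisper's writer is implemented.

```python
def split_by_width(words, max_line_width=42):
    """Group word dicts into cues whose text stays within max_line_width.

    Each word keeps its leading space, so the count is slightly
    conservative compared with the stripped text of the cue.
    """
    cues, current, length = [], [], 0
    for w in words:
        text = w["word"]
        # Start a new cue when adding this word would exceed the limit.
        if current and length + len(text) > max_line_width:
            cues.append(current)
            current, length = [], 0
        current.append(w)
        length += len(text)
    if current:
        cues.append(current)
    return [
        {
            "start": cue[0]["start"],
            "end": cue[-1]["end"],
            "text": "".join(w["word"] for w in cue).strip(),
        }
        for cue in cues
    ]


# Hand-written sample in the shape of Whisper's word-level output.
sample_words = [
    {"word": " The", "start": 0.0, "end": 0.3},
    {"word": " oceans", "start": 0.3, "end": 0.8},
    {"word": " may", "start": 0.8, "end": 1.0},
    {"word": " become", "start": 1.0, "end": 1.5},
]
for cue in split_by_width(sample_words, max_line_width=12):
    print(cue)
```

Each cue takes its start time from its first word and its end time from its last word, so shorter lines automatically get tighter timestamps.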

1 reply

MatteoFasulo Oct 11, 2023

Is there any way to also change the format of subtitles?

In utils.py it seems to use just the underline HTML formatting with <u> </u>

if "words" in result["segments"][0]:
    for subtitle in iterate_subtitles():
        subtitle_start = self.format_timestamp(subtitle[0]["start"])
        subtitle_end = self.format_timestamp(subtitle[-1]["end"])
        subtitle_text = "".join([word["word"] for word in subtitle])
        if highlight_words:
            last = subtitle_start
            all_words = [timing["word"] for timing in subtitle]
            for i, this_word in enumerate(subtitle):
                start = self.format_timestamp(this_word["start"])
                end = self.format_timestamp(this_word["end"])
                if last != start:
                    yield last, start, subtitle_text

                yield start, end, "".join(
                    [
                        re.sub(r"^(\s*)(.*)$", r"\1<u>\2</u>", word)
                        if j == i
                        else word
                        for j, word in enumerate(all_words)
                    ]
                )
                last = end
        else:
            yield subtitle_start, subtitle_end, subtitle_text
else:
    for segment in result["segments"]:
        segment_start = self.format_timestamp(segment["start"])
        segment_end = self.format_timestamp(segment["end"])
        segment_text = segment["text"].strip().replace("-->", "->")
        yield segment_start, segment_end, segment_text

Just for instance, something like this:

Bold formatting is still working but there is no way to change it with custom options

Serbernari · 2024-02-21T22:49:11Z

Serbernari
Feb 21, 2024

Had this problem as well, so I just recursively halved subtitles that are too long:

# Eliminate lines that are too long by repeatedly halving them
import copy
import pysubs2

subs_fr = pysubs2.load("long_subtitles.srt")

triggered = True
while triggered:
    triggered = False
    new_subtitles = pysubs2.SSAFile()
    for event in subs_fr.events:
        # Split only events that are long in text and last at least 2 seconds
        if len(event.text) >= 100 and event.duration >= 2000:
            triggered = True
            words = event.text.split(" ")
            half = len(words) // 2
            new_event1 = copy.deepcopy(event)
            new_event2 = copy.deepcopy(event)

            # First half of the words goes to the first event, the rest to the second
            new_event1.text = " ".join(words[:half])
            new_event2.text = " ".join(words[half:])

            # Split the time range at its midpoint (times are in milliseconds)
            delta_time = (event.end - event.start) // 2
            new_event1.end = event.start + delta_time
            new_event2.start = event.start + delta_time

            new_subtitles.append(new_event1)
            new_subtitles.append(new_event2)
        else:
            new_subtitles.append(copy.deepcopy(event))
    # Run another pass over the result until nothing is split any more
    subs_fr = new_subtitles
subs_fr.save("short_subs.srt")

0 replies

dqshll · 2024-04-12T01:26:18Z

dqshll
Apr 12, 2024

You can do it yourself, since word timestamps are available. The words carry punctuation such as . and ?, so you can detect the end of each sentence from them and split one segment into several as you like.
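A minimal sketch of that idea, assuming the word dictionaries have the `word`, `start` and `end` keys that `transcribe(..., word_timestamps=True)` puts in `result["segments"][i]["words"]`. The function name and the choice of sentence-ending characters are illustrative, not part of Whisper.

```python
SENTENCE_END = (".", "?", "!")


def split_on_sentences(words):
    """Regroup word dicts into one segment per sentence.

    A sentence ends at any word whose text ends in '.', '?' or '!'.
    Trailing words without closing punctuation form a final segment.
    """
    groups, current = [], []
    for w in words:
        current.append(w)
        if w["word"].strip().endswith(SENTENCE_END):
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return [
        {
            "start": g[0]["start"],
            "end": g[-1]["end"],
            "text": "".join(w["word"] for w in g).strip(),
        }
        for g in groups
    ]


# Hand-written sample in the shape of Whisper's word-level output.
# With a real result: words = [w for seg in result["segments"] for w in seg["words"]]
sample = [
    {"word": " Is", "start": 0.0, "end": 0.2},
    {"word": " it", "start": 0.2, "end": 0.3},
    {"word": " late?", "start": 0.3, "end": 0.7},
    {"word": " Yes.", "start": 0.9, "end": 1.2},
    {"word": " Go", "start": 1.4, "end": 1.6},
]
for segment in split_on_sentences(sample):
    print(segment)
```

Abbreviations and decimal numbers also end in a period, so a real pipeline would probably need an extra check before treating every period as a sentence boundary.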

0 replies

phoehnel · 2024-06-08T20:13:00Z

phoehnel
Jun 8, 2024

After spending longer than I'd like to admit trying to do this with an algorithm, I thought "why not just ask the AI?".

This works perfectly:

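from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
language = "en"    # placeholder: language code of the audio

prompt = "Please transcribe this audio. Please make segments no longer than 25 words."
with open("audio.mp3", "rb") as audio_file:  # placeholder path
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="vtt",
        language=language,
        prompt=prompt,
        temperature=0.7,
    )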

0 replies
