Add support for Whisper timestamps and task/language configuration #238

jonatanklosko · 2023-09-08T10:44:28Z

This adds the following options to the speech-to-text serving: :timestamps, :language, :task. By default no language is assumed and the model infers it on its own.

I deprecated Audio.speech_to_text in favour of Audio.speech_to_text_whisper. Initially I added extra_options: [...] to the serving and delegated some of the logic to the Text.Generation behaviour that Audio.Whisper implements, but that could be confusing from the user's perspective, since they would need to look up options in other modules. Also, there were still some Whisper-specific bits, so I think it's most practical to have a separate serving.

We have the %Text.GenerationConfig{} struct for loading generation options (sequence length, some token information, sampling options), but Whisper has a number of specific options of its own. I don't think it makes sense to add those fields to the generic struct, so instead we load the config as %Text.GenerationConfig{extra_config: %Text.WhisperGenerationConfig{}}.
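To illustrate the nesting described above, here is a minimal sketch of a generic config struct with one slot for model-specific extras. The `Sketch.*` modules, their fields, and the token IDs are all hypothetical placeholders chosen for the example; they are not Bumblebee's actual structs or values.

```elixir
defmodule Sketch.WhisperExtra do
  # Hypothetical Whisper-only options; not Bumblebee's real fields.
  defstruct no_timestamps_token_id: nil, language_to_token_id: %{}
end

defmodule Sketch.GenerationConfig do
  # Generic options shared by all text generation models, plus one
  # open slot (`extra_config`) for model-specific extras.
  defstruct max_length: nil, eos_token_id: nil, extra_config: nil
end

defmodule Sketch.Whisper do
  # Looks up the token ID for a language, reading only from the
  # Whisper extras nested inside the generic config.
  def language_token_id(
        %Sketch.GenerationConfig{extra_config: %Sketch.WhisperExtra{} = extra},
        language
      ) do
    Map.fetch(extra.language_to_token_id, language)
  end
end

# Placeholder values, for illustration only.
config =
  struct!(Sketch.GenerationConfig,
    max_length: 100,
    extra_config:
      struct!(Sketch.WhisperExtra, language_to_token_id: %{"en" => 1, "fr" => 2})
  )

IO.inspect(Sketch.Whisper.language_token_id(config, "en"))
```

With this layout the generic struct stays free of Whisper-only fields, and Whisper-specific code can pattern match on the nested struct to make sure it received the right kind of config.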

examples/phoenix/speech_to_text.exs

lib/bumblebee/audio/speech_to_text_whisper.ex

josevalim · 2023-09-10T20:25:53Z

lib/bumblebee/audio.ex

+        * `:translate` - generate translation of the given speech in
+          English
+
+    * `:timestamps` - when `true`, the model predicts timestamps for


Would there be a reason to not have this always on? If it is slower, then perhaps we can allow it to be turned off, but I would have it on by default. Also please update the examples, so we know how to match on timestamps, and so that we also specify its format (ms? s?). :)

Also, it is generally a bad practice to change the output based on an option, which I assume is the case here. This may be particularly annoying once we have the type system. So we should consider either different entry-point functions or, when timestamps is false, we use bogus timestamps (maybe -1 to -1)?

Would there be a reason to not have this always on? If it is slower, then perhaps we can allow it to be turned off, but I would have it on by default

I thought the same, but the difference is that with timestamps disabled we enforce the <notimestamps> token and so the model does not generate timestamps at all, so we do not "waste" model iterations. In practice it doesn't seem to make much difference though. Note that we can also add timestamps: :word for per-word timestamps, so making the user opt-in as needed may make more sense.

and so that we also specify its format (ms? s?)

start_timestamp_seconds, end_timestamp_seconds?

Also, it is generally a bad practice to change the output based on an option, which I assume is the case here.

It's not! I need to update the example :D It was one of the reasons for a separate serving, now it's fine to have a more whisper-specific output spec. We just allow timestamps to be nil. The only weird thing is that without timestamps we return :chunks, which is a single element with nil start and end, but that should be fine.
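As a rough sketch of how a caller might handle this output, the snippet below formats chunks whether or not they carry timestamps. It assumes each chunk is a map with `:text`, `:start_timestamp_seconds`, and `:end_timestamp_seconds` keys, following the names proposed earlier in this thread; the `ChunkFormat` module and the sample data are hypothetical.

```elixir
defmodule ChunkFormat do
  # Chunk with both timestamps present: prefix the text with the time range.
  def format(%{text: text, start_timestamp_seconds: from, end_timestamp_seconds: to})
      when is_number(from) and is_number(to) do
    "[#{from}s - #{to}s] #{String.trim(text)}"
  end

  # Any other chunk (for example nil start and end): return just the text.
  def format(%{text: text}), do: String.trim(text)
end

# Made-up output with timestamps enabled.
with_timestamps = [
  %{text: " Hello there.", start_timestamp_seconds: 0.0, end_timestamp_seconds: 1.5},
  %{text: " How are you?", start_timestamp_seconds: 1.5, end_timestamp_seconds: 3.0}
]

# Made-up output with timestamps disabled: a single chunk with nil bounds.
without_timestamps = [
  %{text: " Hello there. How are you?", start_timestamp_seconds: nil, end_timestamp_seconds: nil}
]

Enum.each(with_timestamps ++ without_timestamps, &IO.puts(ChunkFormat.format(&1)))
```

Because the shape of a chunk is the same in both cases, the caller only needs one extra function clause rather than a separate code path per option.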

If we will have timestamps: :words, maybe this should be timestamps: :sentences?

Should we still return the text if we are computing the chunks? It may be that we are building the text, only to never use it. I also see the chunks and the texts are slightly different when it comes to spacing, but I assume that's easy to post-process.

What if we always return chunks and we have a function called BBB.Audio.chunks_to_string?

Always returning chunks may help make it consistent with streams too. I am fine if you want to postpone this decision until we have streaming.

If we will have timestamps: :words, maybe this should be timestamps: :sentences?

It is not really sentences, the model outputs timestamps whenever it feels like it. It could be timestamps: :segments, just a bit vague?

Should we still return the text if we are computing the chunks?

I wasn't sure, but thinking about streaming I am leaning towards that. FWIW the post processing is just join + trim, so it's fine to leave this up to the user.
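The join + trim step mentioned above could look roughly like this on the user's side. This is a sketch only: the `ChunkText` module is a hypothetical helper, not part of Bumblebee, and it assumes each chunk is a map with a `:text` key whose value may start with a space.

```elixir
defmodule ChunkText do
  # Concatenates the text of every chunk, then trims the leading and
  # trailing whitespace left over from the per-chunk spacing.
  def chunks_to_string(chunks) do
    chunks
    |> Enum.map_join(& &1.text)
    |> String.trim()
  end
end

# Made-up chunks for illustration.
chunks = [
  %{text: " Hello there.", start_timestamp_seconds: 0.0, end_timestamp_seconds: 1.5},
  %{text: " How are you?", start_timestamp_seconds: 1.5, end_timestamp_seconds: 3.0}
]

IO.puts(ChunkText.chunks_to_string(chunks))
```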

:segments is good. Agreed on everything else too!

Updated, I will remove :text later with streaming :)

josevalim

Awesome job, this was a massive amount of work. Are we good to ship a new Nx too? :)

jonatanklosko · 2023-09-11T04:20:37Z

Are we good to ship a new Nx too? :)

I will look at streaming next, so if we want to play it safe, we can wait for that :)

Add support for Whisper timestamps and task/language configuration

ed947e1

josevalim reviewed Sep 10, 2023

josevalim approved these changes Sep 10, 2023

jonatanklosko added 2 commits September 11, 2023 11:27

Use the right generation config

89c4feb

Merge iterations

44d5bdd

josevalim approved these changes Sep 11, 2023

Update naming and examples

9a9ab49

jonatanklosko merged commit 2358aff into main Sep 11, 2023
2 checks passed

jonatanklosko deleted the jk-timestamps branch September 11, 2023 10:45

jonatanklosko mentioned this pull request Sep 11, 2023

Expand speech-to-text #187

Closed

3 tasks
