-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Whisper timestamps and task/language configuration #238
Conversation
lib/bumblebee/audio.ex
Outdated
* `:translate` - generate translation of the given speech in | ||
English | ||
|
||
* `:timestamps` - when `true`, the model predicts timestamps for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be a reason to not have this always on? If it is slower, then perhaps we can allow it to be turned off, but I would have it on by default. Also please update the examples, so we know how to match on timestamps, and so that we also specify its format (ms? s?). :)
Also, it is generally a bad practice to change the output based on an option, which I assume is the case here. This may particularly annoying once we have the type system. So we should consider either different entry-point functions or, when timestamps is false, we use bogus timestamps (maybe -1 to -1)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be a reason to not have this always on? If it is slower, then perhaps we can allow it to be turned off, but I would have it on by default
I thought the same, but it the difference is that with timestamps disabled we enforce the <notimestamps>
token and so the model does not generate timestamps at all, so we do not "waste" model iterations. In practice it doesn't seem to make much difference though. Note that we can also add timestamps: :word
for per-word timestamps, so making the user opt-in as needed may make more sense.
and so that we also specify its format (ms? s?)
start_timestamp_seconds
, end_timestamp_seconds
?
Also, it is generally a bad practice to change the output based on an option, which I assume is the case here.
It's not! I need to update the example :D It was one of the reasons for a separate serving, now it's fine to have a more whisper-specific output spec. We just allow timestamps to be nil
. The only weird thing is that without timestamps we return :chunks
, which is a single element with nil start and end, but that should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we will have timestamps: :words
, maybe this should be timestamps: :sentences
?
Should we still return the text if we are computing the chunks? It may be the that we are building the text, only to never use it. I also see the chunks and the texts are slightly different when it comes to spacing, but I assume that's easy to post-process.
What if we always returns chunks and we have a function called BBB.Audio.chunks_to_string
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always returning chunks may help make it consistent with streams too. I am fine if you want to postpone this decision until we have streaming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we will have timestamps: :words, maybe this should be timestamps: :sentences?
It is not really sentences, the model outputs timestamps whenever it feels like. It could be timestamps: :segments
, just a bit vague?
Should we still return the text if we are computing the chunks?
I wasn't sure, but thinking about streaming I am leaning towards that. FWIW the post processing is just join + trim, so it's fine to leave this up to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:segments is good. Agreed on everything else too!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, I will remove :text
later with streaming :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome job, this was a massive amount of work. Are we good to ship a new Nx too? :)
I will look at streaming next, so if we want to play safe, we can wait for that :) |
This adds the following options to the speech-to-text serving:
:timestamps
,:language
,:task
. By default no language is assumed and the model infers it on its own.I deprecated
Audio.speech_to_text
in favour ofAudio.speech_to_text_whisper
. Initially I addedextra_options: [...]
to the serving and delegated some of the logic to theText.Generation
behaviour thatAudio.Whisper
implements, but that could be confusing form the user perspective, since they would need to lookup options in other modules. Also, there were still some Whisper-specific bits, so I think it's most practical to have a separate serving.We have the
%Text.GenerationConfig{}
struct for loading generation options (sequence length, some token information, sampling options), but Whisper has a number of specific options on its own. I don't think it makes sense to add those fields to the generic struct, so instead we load the config as%Text.Generation{extra_config: %Text.WhisperGeneration{}}
.