
Add support for Whisper timestamps and task/language configuration #238

Merged 4 commits into main on Sep 11, 2023

Conversation

jonatanklosko (Member)
This adds the following options to the speech-to-text serving: `:timestamps`, `:language`, and `:task`. By default no language is assumed and the model infers it on its own.

I deprecated Audio.speech_to_text in favour of Audio.speech_to_text_whisper. Initially I added extra_options: [...] to the serving and delegated some of the logic to the Text.Generation behaviour that Audio.Whisper implements, but that could be confusing from the user's perspective, since they would need to look up options in other modules. Also, there were still some Whisper-specific bits, so I think it's most practical to have a separate serving.

We have the %Text.GenerationConfig{} struct for loading generation options (sequence length, some token information, sampling options), but Whisper has a number of specific options of its own. I don't think it makes sense to add those fields to the generic struct, so instead we load the config as %Text.Generation{extra_config: %Text.WhisperGeneration{}}.
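With these changes, usage of the new serving might look roughly like the following. This is a sketch based on the options described above; the model repository and the input format are illustrative, not prescribed by this PR:

```elixir
# Hypothetical usage sketch; the "openai/whisper-tiny" repo and the
# {:file, path} input are illustrative examples.
{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text_whisper(
    whisper,
    featurizer,
    tokenizer,
    generation_config,
    # Options added by this PR
    timestamps: :segments,
    language: "en",
    task: :transcribe
  )

Nx.Serving.run(serving, {:file, "speech.wav"})
```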

* `:translate` - generate translation of the given speech in
English

* `:timestamps` - when `true`, the model predicts timestamps for
Contributor

Would there be a reason to not have this always on? If it is slower, then perhaps we can allow it to be turned off, but I would have it on by default. Also please update the examples, so we know how to match on timestamps, and so that we also specify its format (ms? s?). :)

Also, it is generally a bad practice to change the output based on an option, which I assume is the case here. This may be particularly annoying once we have the type system. So we should consider either different entry-point functions or, when timestamps is false, using bogus timestamps (maybe -1 to -1)?

jonatanklosko (Member, Author) · Sep 11, 2023

> Would there be a reason to not have this always on? If it is slower, then perhaps we can allow it to be turned off, but I would have it on by default

I thought the same, but the difference is that with timestamps disabled we enforce the <notimestamps> token, so the model does not generate timestamps at all and we do not "waste" model iterations on them. In practice it doesn't seem to make much difference though. Note that we can also add timestamps: :word for per-word timestamps, so making the user opt in as needed may make more sense.

> and so that we also specify its format (ms? s?)

start_timestamp_seconds, end_timestamp_seconds?

> Also, it is generally a bad practice to change the output based on an option, which I assume is the case here.

It's not! I need to update the example :D This was one of the reasons for a separate serving; now it's fine to have a more Whisper-specific output spec. We just allow timestamps to be nil. The only weird thing is that without timestamps we still return :chunks, as a single element with nil start and end, but that should be fine.
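The output shape being discussed could look roughly like this. The map structure and field names below follow the discussion (start_timestamp_seconds / end_timestamp_seconds were proposed earlier in the thread); the text values are made up:

```elixir
# With timestamps enabled, each chunk carries its span (in seconds):
%{
  chunks: [
    %{text: " Hello there.", start_timestamp_seconds: 0.0, end_timestamp_seconds: 2.4},
    %{text: " How are you?", start_timestamp_seconds: 2.4, end_timestamp_seconds: 4.1}
  ]
}

# Without timestamps, a single chunk with nil boundaries:
%{
  chunks: [
    %{
      text: " Hello there. How are you?",
      start_timestamp_seconds: nil,
      end_timestamp_seconds: nil
    }
  ]
}
```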

Contributor

If we will have timestamps: :words, maybe this should be timestamps: :sentences?

Should we still return the text if we are computing the chunks? It may be that we are building the text only to never use it. I also see the chunks and the text differ slightly when it comes to spacing, but I assume that's easy to post-process.

What if we always return chunks and we have a function called BBB.Audio.chunks_to_string?

Contributor

Always returning chunks may help make it consistent with streams too. I am fine if you want to postpone this decision until we have streaming.

Member Author

> If we will have timestamps: :words, maybe this should be timestamps: :sentences?

It is not really sentences; the model outputs timestamps whenever it feels like it. It could be timestamps: :segments, though that is a bit vague?

> Should we still return the text if we are computing the chunks?

I wasn't sure, but thinking about streaming I am leaning towards that. FWIW the post-processing is just join + trim, so it's fine to leave this up to the user.
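The "join + trim" post-processing mentioned here could be as simple as the following sketch (the chunks_to_text name and the :text field are assumptions for illustration):

```elixir
# Hypothetical helper: concatenates the chunk texts and trims the
# surrounding whitespace left over from joining.
defmodule SpeechHelpers do
  def chunks_to_text(chunks) do
    chunks
    |> Enum.map_join(& &1.text)
    |> String.trim()
  end
end

SpeechHelpers.chunks_to_text([
  %{text: " Hello there."},
  %{text: " How are you?"}
])
# "Hello there. How are you?"
```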

Contributor

:segments is good. Agreed on everything else too!

Member Author

Updated, I will remove :text later with streaming :)

Contributor

josevalim left a comment

Awesome job, this was a massive amount of work. Are we good to ship a new Nx too? :)

jonatanklosko (Member, Author)

> Are we good to ship a new Nx too? :)

I will look at streaming next, so if we want to play it safe, we can wait for that :)

jonatanklosko merged commit 2358aff into main on Sep 11, 2023; 2 checks passed.
jonatanklosko deleted the jk-timestamps branch on September 11, 2023 at 10:45.
jonatanklosko mentioned this pull request on Sep 11, 2023.