According to the video-SALMONN code, for speech the waveform is converted to a mel-spectrogram and features are extracted with the Whisper encoder, whose operations are convolution + Transformer; for audio the waveform is converted to fbank features and features are extracted with the BEATs encoder, which is also a Transformer. What is the difference in meaning between the audio features and the speech features extracted this way?
It is simply because Whisper and BEATs expect different input features.
Then why is the audio processed in two different ways here (Whisper and BEATs)?
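For context on how similar the two front-ends actually are: both the "mel-spectrogram" fed to Whisper and the "fbank" fed to BEATs are log-mel-filterbank features; they differ mainly in framing parameters and normalization, and each encoder was pretrained on its own variant. The sketch below is not the actual video-SALMONN code — it is a minimal NumPy illustration of a log-mel front-end, using the commonly cited Whisper-style parameters (16 kHz, 25 ms window, 10 ms hop, 80 mel bins) as an assumption:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):                      # rising slope
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):                      # falling slope
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

def log_mel_spectrogram(wav, sr=16000, n_fft=400, hop=160, n_mels=80):
    # 25 ms window / 10 ms hop at 16 kHz (assumed Whisper-style framing).
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack(
        [wav[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2   # power spectrum
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T   # mel-weighted energies
    return np.log10(np.maximum(mel, 1e-10))             # log compression

# Example: 1 second of a 440 Hz tone -> (98 frames, 80 mel bins)
wav = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000)
feat = log_mel_spectrogram(wav)
print(feat.shape)  # (98, 80)
```

A Kaldi-style fbank pipeline (as used for BEATs) follows the same structure but with its own defaults (e.g. pre-emphasis, dither, per-frame mean subtraction), which is why each encoder keeps its own preprocessing even though both consume log-mel energies.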