SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model based Text-To-Speech Synthesis
This repository contains the inference scripts for SoCodec, an ultra-low-bitrate speech codec designed for speech language models. It was introduced in the paper "SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model based Text-To-Speech Synthesis".
Paper
📈 Demo Site
⚙ Model Weights
👉 With SoCodec, you can compress audio into discrete codes at an ultra-low bitrate of 0.47 kbps with a 120 ms frame shift.
👌 It can be used as a drop-in replacement for EnCodec or other multi-stream codecs for speech language modeling applications.
📚 The released checkpoint currently supports only Chinese. Training of a multilingual version is in progress.
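The 0.47 kbps figure can be checked from the released model's configuration (4 streams, codebook size 16384, 120 ms frame shift). The sketch below assumes each stream emits exactly one code per frame and that no side information is transmitted; the function name `codec_bitrate_bps` is illustrative and not part of this repository.

```python
import math


def codec_bitrate_bps(codebook_size: int, num_streams: int, frame_shift_s: float) -> float:
    """Bits per second, assuming one code per stream per frame (an assumption)."""
    bits_per_frame = num_streams * math.log2(codebook_size)  # 4 * 14 = 56 bits
    return bits_per_frame / frame_shift_s


# Values taken from the released checkpoint name: 16384x4, 120 ms.
bitrate = codec_bitrate_bps(codebook_size=16384, num_streams=4, frame_shift_s=0.120)
print(f"{bitrate / 1000:.2f} kbps")  # prints "0.47 kbps"
```

Each frame carries 4 × 14 = 56 bits, and there are 1 / 0.12 ≈ 8.33 frames per second, which gives roughly 467 bits per second.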
- Sep 2024 (v1.0):
  - We have released the checkpoint and inference code of SoCodec.
Clone the repository and download the pretrained checkpoints into `ckpts`:
git clone https://github.com/hhguo/SoCodec
cd SoCodec
mkdir ckpts && cd ckpts
wget https://huggingface.co/TencentGameMate/chinese-hubert-large/resolve/main/chinese-hubert-large-fairseq-ckpt.pt
wget https://huggingface.co/hhguo/SoCodec/resolve/main/socodec_16384x4_120ms_16khz_chinese.safetensors
wget https://huggingface.co/hhguo/SoCodec/resolve/main/mel_vocoder_80dim_10ms_16khz.safetensors
# For analysis-synthesis
python example.py -i ground_truth.wav -o synthesis.wav
# For speech analysis
python example.py -i ground_truth.wav -o features.pt
# For token-to-audio synthesis
python example.py -i features.pt -o synthesis.wav
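To run speech analysis over many files, the single-file command above can be wrapped in a small driver. The following is a minimal sketch, assuming only the `-i` and `-o` flags shown above; the helper names `build_analysis_commands` and `run_all` and the `wavs` and `features` directories are hypothetical and not provided by this repository.

```python
import subprocess
import sys
from pathlib import Path


def build_analysis_commands(wav_paths, out_dir, script="example.py"):
    """Build one `python example.py -i <wav> -o <stem>.pt` command per input file."""
    out_dir = Path(out_dir)
    commands = []
    for wav in wav_paths:
        wav = Path(wav)
        target = out_dir / (wav.stem + ".pt")  # a .pt output selects speech analysis
        commands.append([sys.executable, script, "-i", str(wav), "-o", str(target)])
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Example usage (hypothetical directories, run from the repository root):
#   Path("features").mkdir(exist_ok=True)
#   run_all(build_analysis_commands(sorted(Path("wavs").glob("*.wav")), "features"))
```

Note that each call starts a new Python process and reloads the checkpoints, so this is convenient rather than fast.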
We provide the pretrained models on Hugging Face Collections.
Model Name | Frame Shift | Codebook Size | Number of Streams | Dataset |
---|---|---|---|---|
socodec_16384x4_120ms_16khz_chinese | 120ms | 16384 | 4 | WenetSpeech4TTS |
We also provide a pretrained vocoder that converts the Mel spectrogram produced by SoCodec into a waveform.
Model Name | Sampling Rate | Mel Bins | fmax (Hz) | Upsampling Ratio | Dataset |
---|---|---|---|---|---|
mel_vocoder_80dim_10ms_16khz | 16 kHz | 80 | 8000 | 160 | WenetSpeech4TTS |
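The numbers in the two tables are related by simple arithmetic, sketched below. It assumes the 16 kHz sampling rate and 10 ms Mel frame shift indicated by the vocoder's name; the rounding used for the number of codec frames is also an assumption, since the actual padding behavior of SoCodec is not described here.

```python
SAMPLE_RATE_HZ = 16_000      # from the model names (16khz)
VOCODER_UPSAMPLING = 160     # waveform samples generated per Mel frame
CODEC_FRAME_SHIFT_MS = 120   # SoCodec frame shift
NUM_STREAMS = 4              # codes per SoCodec frame

# One Mel frame covers 160 / 16000 s = 10 ms, matching "10ms" in the vocoder name.
mel_frame_shift_ms = 1000 * VOCODER_UPSAMPLING / SAMPLE_RATE_HZ

# One codec frame therefore spans 12 Mel frames, or 1920 waveform samples.
mel_frames_per_codec_frame = CODEC_FRAME_SHIFT_MS / mel_frame_shift_ms
samples_per_codec_frame = SAMPLE_RATE_HZ * CODEC_FRAME_SHIFT_MS // 1000


def num_codec_frames(duration_ms: int) -> int:
    """Codec frames needed to cover the audio, rounding up (an assumption)."""
    return -(-duration_ms // CODEC_FRAME_SHIFT_MS)


# A 10-second utterance: 84 frames, i.e. 84 * 4 = 336 codes in total.
frames = num_codec_frames(10_000)
total_codes = frames * NUM_STREAMS
```

For a language model this means roughly 8.3 frames per second of speech, each carrying 4 codes.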
- Provide the checkpoint and inference code of multi-stream LLM
- Provide the single-codebook version
- Provide a higher-quality neural vocoder
- Provide a multi-lingual version (Chinese, English, etc.)