Speech, Language, Audio, Music Processing with Large Language Model

X-LANCE/SLAM-LLM

SLAM-LLM

SLAM-LLM is a deep learning toolkit that allows researchers and developers to train custom multimodal large language models (MLLMs), focusing on speech, language, audio, and music processing. We provide detailed recipes for training and high-performance checkpoints for inference.

Table of Contents

  1. News
  2. Installation
  3. Usage
  4. Features
  5. Acknowledge
  6. Citation

News

Installation

git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout tags/v4.35.2
pip install -e .
cd ..
git clone https://github.com/huggingface/peft.git
cd peft
git checkout tags/v0.6.0
pip install -e .
cd ..
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
git clone https://github.com/ddlBoJack/SLAM-LLM.git
cd SLAM-LLM
pip install -e .
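
After running the commands above, it can be useful to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch using only the Python standard library; the helper name `check_pins` is hypothetical and not part of SLAM-LLM, and the version numbers are copied from the installation commands above.

```python
from importlib import metadata

# Versions pinned by the installation commands above.
PINNED = {
    "transformers": "4.35.2",
    "peft": "0.6.0",
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "torchaudio": "2.0.2",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that does not match.

    `check_pins` is a hypothetical helper for illustration only.
    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # CUDA wheels may carry a local suffix such as "+cu118";
        # compare only the part before the "+".
        public = found.split("+")[0] if found else None
        if public != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = check_pins(PINNED)
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```

An empty result means every pinned package matches. Any reported mismatch suggests rerunning the corresponding install step.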

Some examples require fairseq. Install it before SLAM-LLM with the following commands:

# you need to install fairseq before SLAM-LLM
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

We also provide a Docker image for convenience:

# build docker image
docker build -t slam-llm:latest .

# run docker image with gpu
docker run -it --gpus all --name slam --shm-size=256g slam-llm:latest /bin/bash

Usage

List of Recipes

We provide reference implementations of various LLM-based speech, audio, and music tasks:

Configuration Priority

Configuration is hierarchical: each source overrides the ones to its right, so command-line values take the highest priority and dataclass defaults the lowest:

command-line (shell file) > Hydra configuration (yaml file) > dataclass configuration (Python file)
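
The priority order can be illustrated with a small plain-Python sketch. This is not how Hydra or SLAM-LLM implement the merge; it only mimics the override order with dictionaries. The `TrainConfig` fields, the `resolve` helper, and all values are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class TrainConfig:
    # Lowest priority: defaults declared in the Python dataclass.
    # These field names are hypothetical, chosen only for illustration.
    lr: float = 1e-4
    batch_size: int = 4
    use_peft: bool = False


def resolve(defaults, yaml_values, cli_values):
    """Merge three layers so that later layers override earlier ones.

    Order applied: dataclass defaults, then yaml values, then command-line values.
    """
    merged = asdict(defaults)
    for layer in (yaml_values, cli_values):
        for key, value in layer.items():
            if key not in merged:
                raise KeyError(f"unknown configuration key: {key}")
            merged[key] = value
    return TrainConfig(**merged)


# Stand-in for values read from a yaml file (middle priority).
yaml_values = {"lr": 5e-5, "batch_size": 8}
# Stand-in for overrides passed in a shell script (highest priority).
cli_values = {"batch_size": 16}

cfg = resolve(TrainConfig(), yaml_values, cli_values)
print(cfg)
```

Here `batch_size` ends up as 16 (command line wins over yaml), `lr` as 5e-5 (yaml wins over the dataclass default), and `use_peft` keeps its dataclass default because no other layer sets it.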

Features

  • Easily extend to new models and tasks.
  • Detailed recipes for training and high-performance checkpoints for inference.
  • Mixed-precision training, which trains faster and uses less GPU memory on NVIDIA Tensor Cores.
  • Multi-GPU training with data and model parallelism, supporting DDP, FSDP, and DeepSpeed (DeepSpeed support still needs improvement).
  • Flexible configuration based on Hydra and dataclass allowing a combination of code, command-line and file based configuration.

Acknowledge

  • We borrow code from Llama-Recipes for the training process.
  • We borrow code from Fairseq for deepspeed configuration.
  • We thank the contributors for providing diverse recipes.

Citation

Speech Task

SLAM-ASR:

@article{ma2024embarrassingly,
  title={An Embarrassingly Simple Approach for LLM with Strong ASR Capacity},
  author={Ma, Ziyang and Yang, Guanrou and Yang, Yifan and Gao, Zhifu and Wang, Jiaming and Du, Zhihao and Yu, Fan and Chen, Qian and Zheng, Siqi and Zhang, Shiliang and others},
  journal={arXiv preprint arXiv:2402.08846},
  year={2024}
}

MaLa-ASR:

@article{yang2024mala,
  title={MaLa-ASR: Multimedia-Assisted LLM-Based ASR},
  author={Yang, Guanrou and Ma, Ziyang and Yu, Fan and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
  journal={Proc. INTERSPEECH},
  year={2024}
}

LLM-Based Contextual ASR:

@article{yang2024ctc,
  title={CTC-Assisted LLM-Based Contextual ASR},
  author={Yang, Guanrou and Ma, Ziyang and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
  journal={Proc. SLT},
  year={2024}
}

CoT-ST:

@article{du2024cot,
  title={CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought},
  author={Du, Yexing and Ma, Ziyang and Yang, Yifan and Deng, Keqi and Chen, Xie and Yang, Bo and Xiang, Yang and Liu, Ming and Qin, Bing},
  journal={arXiv preprint arXiv:2409.19510},
  year={2024}
}

Audio Task

SLAM-AAC:

@article{chen2024slam,
  title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},
  author={Chen, Wenxi and Ma, Ziyang and Li, Xiquan and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Yu, Kai and Chen, Xie},
  journal={arXiv preprint arXiv:2410.09503},
  year={2024}
}

DRCap:

@article{li2024drcap,
  title={DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning},
  author={Li, Xiquan and Chen, Wenxi and Ma, Ziyang and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Kong, Qiuqiang and Chen, Xie},
  journal={arXiv preprint arXiv:2410.09472},
  year={2024}
}

BAT:

@article{zheng2024bat,
  title={BAT: Learning to Reason about Spatial Sounds with Large Language Models},
  author={Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},
  journal={Proc. ICML},
  year={2024}
}