inference_with_transformers_en
We provide two scripts for inference with native Transformers: a command-line interface and a web graphical interface.
The following examples show how to load the Chinese-LLaMA-2-7B/Chinese-Alpaca-2-7B models.
If you are using the full model, or have already merged the LoRA model with the original Llama-2-hf using `merge_llama2_with_chinese_lora_low_mem.py`, you can perform inference directly (do NOT specify `--lora_model`).
```bash
python scripts/inference/inference_hf.py \
    --base_model path_to_merged_llama2_or_alpaca2_hf_dir \
    --with_prompt \
    --interactive
```
If you wish to use the LoRA model together with the original Llama-2-hf, use the following command.
```bash
python scripts/inference/inference_hf.py \
    --base_model path_to_original_llama_2_hf_dir \
    --lora_model path_to_chinese_llama2_or_alpaca2_lora \
    --with_prompt \
    --interactive
```
This method also supports using vLLM for LLM inference and serving. You need to install vLLM first:

```bash
pip install vllm
```
Currently vLLM does not support loading LoRA files (`--lora_model`), nor is it compatible with 8-bit mode (`--load_in_8bit`) or pure CPU inference (`--only_cpu`).
```bash
python scripts/inference/inference_hf.py \
    --base_model path_to_merged_llama2_or_alpaca2_hf_dir \
    --with_prompt \
    --interactive \
    --use_vllm
```
This method also supports using speculative sampling for LLM inference. You can use a small model (Chinese-LLaMA-2-1.3B or Chinese-Alpaca-2-1.3B) as the draft model to accelerate inference of the larger LLM.
See the Speculative Sampling section below for method details.
```bash
python scripts/inference/inference_hf.py \
    --base_model path_to_merged_llama2_or_alpaca2_hf_dir \
    --with_prompt \
    --interactive \
    --speculative_sampling \
    --draft_k 4 \
    --draft_base_model path_to_llama2_1.3b_or_alpaca2_1.3b_hf_dir
```
- `--base_model {base_model}`: Directory containing the LLaMA-2 model weights and configuration files in HF format.
- `--lora_model {lora_model}`: Directory of the decompressed Chinese LLaMA-2/Alpaca-2 LoRA files, or the 🤗Model Hub model name. If this parameter is not provided, only the model specified by `--base_model` will be loaded.
- `--tokenizer_path {tokenizer_path}`: Directory containing the corresponding tokenizer. If this parameter is not provided, its default value is the same as `--lora_model`; if the `--lora_model` parameter is not provided either, its default value is the same as `--base_model`.
- `--with_prompt`: Whether to merge the input with the prompt template. If you are loading an Alpaca model, be sure to enable this option!
- `--interactive`: Launch interactively for multiple single-round question-answer sessions (this is not the contextual dialogue in llama.cpp).
- `--data_file {file_name}`: In non-interactive mode, read the content of `file_name` line by line for prediction (see the example after this list).
- `--predictions_file {file_name}`: In non-interactive mode, write the prediction results in JSON format to `file_name`.
- `--only_cpu`: Only use the CPU for inference.
- `--gpus {gpu_ids}`: The GPU id(s) to use, default 0. You can specify multiple GPUs, for instance `0,1,2`.
- `--alpha {alpha}`: The coefficient of the NTK scaling method, which can effectively increase the maximum context size. The default value is 1. If you do not know how to set this parameter, leave it at the default or set it to `"auto"`.
- `--load_in_8bit` or `--load_in_4bit`: Load the model in 8-bit or 4-bit mode.
- `--system_prompt {system_prompt}`: Set the system prompt. The default value is the string in `alpaca-2.txt`.
- `--use_vllm`: Use vLLM as the LLM backend for inference and serving.
- `--guidance_scale {guidance_scale}`: The strength of Classifier-Free Guidance (CFG) sampling. The default is 1. CFG sampling is enabled if `guidance_scale > 1`. CFG sampling is not compatible with `--use_vllm`.
- `--negative_prompt {negative_prompt}`: Negative prompt for CFG sampling. The default is `None`.
- `--speculative_sampling`: Use speculative sampling for LLM inference.
- `--draft_k {draft_k}`: The number of tokens generated per step by the small model in speculative sampling. The default value is 0. When a non-positive integer is provided, a dynamically changing `draft_k` is used.
- `--draft_base_model {draft_base_model}`: Directory containing the small LLaMA-2 model weights and configuration files in HF format.
- `--draft_lora_model {draft_lora_model}`: Directory of the Chinese LLaMA-2/Alpaca-2 LoRA files for the small model. If this parameter is not provided, only the model specified by `--draft_base_model` will be loaded.
- `--draft_model_load_in_8bit` or `--draft_model_load_in_4bit`: Load the small model in 8-bit or 4-bit mode.
- `--use_flash_attention_2`: Use Flash-Attention 2 to speed up inference.
- `--use_ntk`: Use dynamic NTK scaling to extend the context window. Does not work with the 64K long-context models.
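For example, a non-interactive batch run might look like the following. This is only an illustration combining the documented flags; `samples.txt` (one prompt per line) and `predictions.json` are placeholder file names.

```bash
# Hypothetical batch-prediction example; samples.txt and predictions.json are placeholder names.
python scripts/inference/inference_hf.py \
    --base_model path_to_merged_llama2_or_alpaca2_hf_dir \
    --with_prompt \
    --data_file samples.txt \
    --predictions_file predictions.json \
    --gpus 0
```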
This method starts a web front-end page for interaction and supports multi-turn conversations. In addition to Transformers, you need to install Gradio and mdtex2html:

```bash
pip install gradio
pip install mdtex2html
```
If you are using the full model, or have already merged the LoRA model with the original Llama-2-hf using `merge_llama2_with_chinese_lora_low_mem.py`, you can perform inference directly (do NOT specify `--lora_model`).
```bash
python scripts/inference/gradio_demo.py --base_model path_to_merged_alpaca2_hf_dir
```
If you wish to use the LoRA model together with the original Llama-2-hf, use the following command.
```bash
python scripts/inference/gradio_demo.py \
    --base_model path_to_original_llama_2_hf_dir \
    --lora_model path_to_chinese_alpaca2_lora
```
This method also supports using vLLM for LLM inference and serving. You need to install vLLM first (installation takes about 8-10 minutes):

```bash
pip install vllm
```
Currently vLLM does not support loading LoRA files (`--lora_model`), nor is it compatible with 8-bit mode (`--load_in_8bit`) or pure CPU inference (`--only_cpu`).
```bash
python scripts/inference/gradio_demo.py --base_model path_to_merged_alpaca2_hf_dir --use_vllm
```
This method also supports using speculative sampling for LLM inference. You can use a small model (Chinese-LLaMA-2-1.3B or Chinese-Alpaca-2-1.3B) as the draft model to accelerate inference of the larger LLM.
See the Speculative Sampling section below for method details.
```bash
python scripts/inference/gradio_demo.py \
    --base_model path_to_merged_alpaca2_hf_dir \
    --speculative_sampling \
    --draft_base_model path_to_llama2_1.3b_or_alpaca2_1.3b_hf_dir
```
- `--base_model {base_model}`: Directory containing the LLaMA-2 model weights and configuration files in HF format.
- `--lora_model {lora_model}`: Directory of the decompressed Chinese LLaMA-2/Alpaca-2 LoRA files, or the 🤗Model Hub model name. If this parameter is not provided, only the model specified by `--base_model` will be loaded.
- `--tokenizer_path {tokenizer_path}`: Directory containing the corresponding tokenizer. If this parameter is not provided, its default value is the same as `--lora_model`; if the `--lora_model` parameter is not provided either, its default value is the same as `--base_model`.
- `--use_cpu`: Only use the CPU for inference.
- `--gpus {gpu_ids}`: The GPU id(s) to use, default 0. You can specify multiple GPUs, for instance `0,1,2`.
- `--alpha {alpha}`: The coefficient of the NTK scaling method, which can effectively increase the maximum context size. The default value is 1. If you do not know how to set this parameter, leave it at the default or set it to `"auto"`.
- `--load_in_8bit`: Load the model in 8-bit mode.
- `--load_in_4bit`: Load the model in 4-bit mode (see the launch example after this list).
- `--max_memory`: The maximum number of history tokens to keep in the multi-turn dialogue. The default value is 1024.
- `--use_vllm`: Use vLLM as the LLM backend for inference and serving.
- `--speculative_sampling`: Use speculative sampling for LLM inference.
- `--draft_base_model {draft_base_model}`: Directory containing the small LLaMA-2 model weights and configuration files in HF format.
- `--draft_lora_model {draft_lora_model}`: Directory of the Chinese LLaMA-2/Alpaca-2 LoRA files for the small model. If this parameter is not provided, only the model specified by `--draft_base_model` will be loaded.
- `--draft_model_load_in_8bit` or `--draft_model_load_in_4bit`: Load the small model in 8-bit or 4-bit mode.
- `--flash_attn`: Use Flash-Attention to speed up inference.
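As an illustration of combining these options, launching the demo in 4-bit mode with Flash-Attention on a single GPU might look like this. The flags are the documented ones above; the model path is a placeholder.

```bash
# Hypothetical launch example; the model path is a placeholder.
python scripts/inference/gradio_demo.py \
    --base_model path_to_merged_alpaca2_hf_dir \
    --load_in_4bit \
    --flash_attn \
    --gpus 0
```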
- Due to differences in decoding implementation details between frameworks, this script is not guaranteed to reproduce the decoding results of llama.cpp.
- This script is intended for a quick and convenient experience only and has not been optimized for fast inference.
- When running 7B model inference on a CPU, make sure you have 32GB of RAM; when running 7B model inference on a single GPU, make sure you have 16GB of VRAM.
We have implemented speculative sampling in `scripts/inference/speculative_sample.py`. Speculative sampling is a decoding acceleration strategy: in each decoding round, a small model that is less capable but faster (the draft model) predicts several tokens; these tokens are then verified by the large model (the target model), and the parts consistent with its distribution are accepted, ensuring that one or more tokens are generated per round. This accelerates the inference of the large model. For usage instructions, please refer to the relevant instructions in the Command-line Interface and Web Graphical Interface sections above.
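The acceptance rule behind this strategy can be sketched as follows. This is only a simplified illustration using toy next-token distributions, not the implementation in `scripts/inference/speculative_sample.py`; `toy_distribution` is a placeholder for a real model's softmax output.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size


def toy_distribution(context):
    """Stand-in for a model's next-token distribution.

    A real draft/target model would condition on `context`; this toy
    version just returns a random categorical distribution.
    """
    return rng.dirichlet(np.ones(VOCAB))


def speculative_round(prefix, draft_k=4):
    """One round of speculative sampling over toy distributions.

    The draft model proposes `draft_k` tokens; each proposal x is accepted
    with probability min(1, p(x) / q(x)), where q is the draft distribution
    and p is the target distribution for the same prefix. On the first
    rejection, one token is resampled from the residual max(p - q, 0),
    so at least one token is produced per round.
    """
    tokens = list(prefix)
    drafts, q_dists = [], []

    # 1) The draft model proposes draft_k tokens autoregressively.
    for _ in range(draft_k):
        q = toy_distribution(tokens)          # q(. | prefix), draft model
        x = int(rng.choice(VOCAB, p=q))
        drafts.append(x)
        q_dists.append(q)
        tokens.append(x)

    # 2) The target model scores the same prefixes (a single parallel
    #    forward pass in a real implementation).
    p_dists = [toy_distribution(list(prefix) + drafts[:i]) for i in range(draft_k + 1)]

    # 3) Accept or reject the drafted tokens from left to right.
    accepted = list(prefix)
    for i, x in enumerate(drafts):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)                # consistent with the target model
        else:
            residual = np.maximum(p - q, 0.0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return accepted                   # stop at the first rejection

    # 4) All drafts accepted: sample one bonus token from the target model.
    accepted.append(int(rng.choice(VOCAB, p=p_dists[draft_k])))
    return accepted


print(speculative_round(prefix=[1, 2, 3], draft_k=4))
```

Resampling rejected positions from the residual distribution is what keeps the overall output distribution identical to sampling from the target model alone, so the method speeds up decoding without changing the distribution of the results.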
The table below shows, for reference, the speedup achieved by using speculative sampling with Chinese-LLaMA-2-1.3B and Chinese-Alpaca-2-1.3B as draft models to accelerate the 7B and 13B LLaMA and Alpaca models. Testing was conducted on a single A40-48G GPU, and the reported numbers are the average generation time per token (ms/token). The number of tokens generated per step by the small model (`draft_k`) varies dynamically during decoding. The test data for the LLaMA models is a Chinese summarization task, while the test data for the Alpaca models consists of instruction-following tasks.
| Draft Model | Draft Model Speed (ms/token) | Target Model | Target Model Speed (ms/token) | Speculative Sampling Speed (ms/token) |
|---|---|---|---|---|
| Chinese-LLaMA-2-1.3B | 7.6 | Chinese-LLaMA-2-7B | 49.3 | 36.0 (1.37x) |
| Chinese-LLaMA-2-1.3B | 7.6 | Chinese-LLaMA-2-13B | 66.0 | 47.1 (1.40x) |
| Chinese-Alpaca-2-1.3B | 8.1 | Chinese-Alpaca-2-7B | 50.2 | 34.9 (1.44x) |
| Chinese-Alpaca-2-1.3B | 8.2 | Chinese-Alpaca-2-13B | 67.0 | 41.6 (1.61x) |
Under the dynamically changing `draft_k` setting, both Chinese-LLaMA-2-1.3B and Chinese-Alpaca-2-1.3B accelerate decoding for the 7B and 13B LLaMA and Alpaca models. If you wish to use a fixed `draft_k` instead, you can adjust the `--draft_k` parameter when running the `scripts/inference/inference_hf.py` script, or modify the `draft_k` field on the web page when using the `scripts/inference/gradio_demo.py` script (set it to `0` for dynamic adjustment). Depending on the usage scenario, tuning the value of `draft_k` may yield a higher speedup.
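As an illustration (the paths are placeholders, reused from the examples above), a fixed `draft_k` can be passed on the command line like this:

```bash
# Example with a fixed draft_k of 8; paths are placeholders.
python scripts/inference/inference_hf.py \
    --base_model path_to_merged_llama2_or_alpaca2_hf_dir \
    --with_prompt \
    --interactive \
    --speculative_sampling \
    --draft_k 8 \
    --draft_base_model path_to_llama2_1.3b_or_alpaca2_1.3b_hf_dir
```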