Multi-Modal

This document shows how to run multimodal pipelines with TensorRT-LLM, e.g. from image+text input modalities to text output.

Compared to an LLM-only build command, the LLM part of a multimodal model takes one additional parameter, --max_multimodal_len. Under the hood, max_multimodal_len and max_prompt_embedding_table_size are effectively the same concept: both reserve room for embeddings (multimodal feature embeddings or prompt tuning embeddings) that are prepended/concatenated to the LLM input embeddings. The multimodal features produced by the visual encoder, of shape [batch_size, num_visual_features, visual_hidden_dim], are flattened to [batch_size * num_visual_features, visual_hidden_dim] and passed like a prompt embedding table.
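
For example, in the BLIP2 pipelines below each image is represented by 32 visual features (the Q-Former's query embeddings), so with max_batch_size 8 the flattened table holds 8 * 32 = 256 rows and --max_multimodal_len is set to 256.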

BLIP2-T5

  1. Download Huggingface weights and convert the original checkpoint to TRT-LLM checkpoint format following the example in examples/enc_dec/README.md.

    export MODEL_NAME=flan-t5-xl
    git clone https://huggingface.co/google/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
    
    python ../enc_dec/t5/convert.py -i tmp/hf_models/${MODEL_NAME} \
        -o tmp/trt_models/${MODEL_NAME} --weight_data_type float32 \
        --inference_tensor_para_size 1
  2. Build TRT-LLM engine from TRT-LLM checkpoint

    python ../enc_dec/build.py --model_type t5 \
        --weight_dir tmp/trt_models/${MODEL_NAME}/tp1 \
        --output_dir trt_engines/${MODEL_NAME}/1-gpu \
        --engine_name ${MODEL_NAME} \
        --remove_input_padding \
        --use_bert_attention_plugin \
        --use_gpt_attention_plugin \
        --use_gemm_plugin \
        --dtype bfloat16 \
        --max_beam_width 1 \
        --max_batch_size 8 \
        --max_encoder_input_len 924 \
        --max_output_len 100 \
        --max_multimodal_len 256 # 8 (max_batch_size) * 32 (num_visual_features)

    NOTE: max_multimodal_len = max_batch_size * num_visual_features, so if you change max_batch_size, --max_multimodal_len MUST be changed accordingly.

    The built T5 engines are located in ./trt_engines/${MODEL_NAME}/1-gpu/bfloat16/tp1.

  3. Build TensorRT engines for visual components

    python build_visual_engine.py --model_name ${MODEL_NAME} --model_path tmp/hf_models/${MODEL_NAME} --max_batch_size 8

    The built engines are located in ./visual_engines/${MODEL_NAME}.

    To run the BLIP2 pipeline with batch size > 1, change the --max_batch_size argument to build_visual_engine.py accordingly.
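
    For example, to run with batch size 16, both engines can be rebuilt with matching sizes (a sketch reusing the flags from steps 2 and 3; 16 * 32 = 512):

    python build_visual_engine.py --model_name ${MODEL_NAME} --model_path tmp/hf_models/${MODEL_NAME} --max_batch_size 16

    # Then re-run the build.py command from step 2, changing only:
    #     --max_batch_size 16 \
    #     --max_multimodal_len 512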

  4. Assemble everything into BLIP2 pipeline

    python run.py \
        --blip2_encoder \
        --max_new_tokens 30 \
        --input_text "Question: which city is this? Answer:" \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --visual_engine_dir visual_engines/${MODEL_NAME} \
        --llm_engine_dir trt_engines/${MODEL_NAME}/1-gpu/bfloat16/tp1

BLIP2-OPT

The OPT pipeline needs a few minor changes from the T5 pipeline:

  1. Convert Huggingface weights to TRT-LLM checkpoint format following examples/opt/README.md.

  2. Use the trtllm-build command to build the TRT-LLM engine for OPT.

  3. Add the --decoder_llm argument to the inference script, since OPT is a decoder-only LLM.

  4. The full list of commands is as follows:

    export MODEL_NAME=opt-2.7b
    git clone https://huggingface.co/facebook/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
    
    python ../opt/convert_checkpoint.py \
        --model_dir tmp/hf_models/${MODEL_NAME} \
        --dtype float16 \
        --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu
    
    trtllm-build \
        --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
        --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
        --gemm_plugin float16 \
        --max_beam_width 1 \
        --max_batch_size 8 \
        --max_multimodal_len 256 \
        --max_input_len 924 \
        --max_output_len 100
    
    python build_visual_engine.py --model_name ${MODEL_NAME} --model_path tmp/hf_models/${MODEL_NAME}
    
    python run.py \
        --blip2_encoder \
        --max_new_tokens 30 \
        --input_text "Question: which city is this? Answer:" \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --visual_engine_dir visual_engines/${MODEL_NAME} \
        --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
        --decoder_llm
  5. INT8/INT4 weight-only quantization for OPT can be enabled as follows (INT4 is shown here; INT8 is the default precision for weight-only quantization):

    python ../opt/convert_checkpoint.py \
        --model_dir tmp/hf_models/${MODEL_NAME} \
        --dtype float16 \
        --output_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
        --use_weight_only \
        --weight_only_precision int4
    
    trtllm-build \
        --checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
        --output_dir trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu \
        --gemm_plugin float16 \
        --max_beam_width 1 \
        --max_batch_size 8 \
        --max_multimodal_len 256 \
        --max_input_len 924 \
        --max_output_len 100

    The built OPT engines are located in trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu. Pass this directory as the --llm_engine_dir argument to run.py.
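
    For example, reusing the run.py command from step 4 with only the engine directory changed:

    python run.py \
        --blip2_encoder \
        --max_new_tokens 30 \
        --input_text "Question: which city is this? Answer:" \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --visual_engine_dir visual_engines/${MODEL_NAME} \
        --llm_engine_dir trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu \
        --decoder_llm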

    NOTE: The INT8/INT4 options are not supported for BLIP2-T5, because quantization support has not been added for encoder-decoder models yet.

LLaVA

  1. Download Huggingface model weights. Unlike the BLIP2 examples, which download only the LLM component from Huggingface, this model repository contains both the LLM and visual components.

    export MODEL_NAME="llava-1.5-7b-hf"
    git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
  2. Generate the TRT-LLM engine for LLaMA following the example in examples/llama/README.md.

    python ../llama/convert_checkpoint.py \
        --model_dir tmp/hf_models/${MODEL_NAME} \
        --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
        --dtype float16
    
    trtllm-build \
        --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
        --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
        --gpt_attention_plugin float16 \
        --gemm_plugin float16 \
        --max_batch_size 1 \
        --max_input_len 2048 \
        --max_output_len 512 \
        --max_multimodal_len 576 # 1 (max_batch_size) * 576 (num_visual_features)
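
    NOTE: for llava-1.5-7b-hf, 576 appears to come from its CLIP ViT-L/14 vision tower with 336x336 inputs: (336 / 14)^2 = 24 * 24 = 576 visual features per image, so max_batch_size 1 gives max_multimodal_len = 576.
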
  3. Build TensorRT engines for visual components

    python build_visual_engine.py --model_name ${MODEL_NAME} --model_path tmp/hf_models/${MODEL_NAME}
  4. Add the --decoder_llm argument to the inference script, since LLaMA is a decoder-only LLM.

    python run.py \
        --max_new_tokens 30 \
        --input_text "Question: which city is this? Answer:" \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --visual_engine_dir visual_engines/${MODEL_NAME} \
        --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
        --decoder_llm
  5. INT8/INT4 weight-only quantization for LLaMA can be enabled as follows (INT4 is shown here; INT8 is the default precision for weight-only quantization):

    python ../llama/convert_checkpoint.py \
        --model_dir tmp/hf_models/${MODEL_NAME} \
        --dtype float16 \
        --output_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
        --use_weight_only \
        --weight_only_precision int4
    
    trtllm-build \
        --checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
        --output_dir trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu \
        --gpt_attention_plugin float16 \
        --gemm_plugin float16 \
        --max_batch_size 1 \
        --max_input_len 924 \
        --max_output_len 100 \
        --max_multimodal_len 576

    The built engines are located in trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu. Pass this directory as the --llm_engine_dir argument to run.py.

Nougat

  1. Download Huggingface weights

    export MODEL_NAME="nougat-base" # or nougat-small
    git clone https://huggingface.co/facebook/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
  2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using the scripts in examples/enc_dec.

    Nougat uses the mBART architecture but replaces the LLM encoder with a Swin Transformer image encoder. To achieve this, an extra --nougat flag (on top of the mBART example) is passed to bart/convert.py and build.py in examples/enc_dec.

    python ../enc_dec/bart/convert.py -i tmp/hf_models/${MODEL_NAME} \
        -o tmp/trt_models/${MODEL_NAME} --weight_data_type float32 \
        --inference_tensor_para_size 1 --nougat
    
    python ../enc_dec/build.py \
        --model_type bart \
        --weight_dir tmp/trt_models/${MODEL_NAME}/tp1 \
        -o trt_engines/${MODEL_NAME}/1-gpu \
        --engine_name $MODEL_NAME \
        --use_bert_attention_plugin \
        --use_gpt_attention_plugin \
        --use_gemm_plugin \
        --dtype bfloat16 \
        --max_beam_width 1 \
        --max_batch_size 1 \
        --nougat \
        --max_output_len 100 \
        --max_multimodal_len 588 # 1 (max_batch_size) * 588 (num_visual_features)
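
    NOTE: 588 appears to come from Nougat's Swin encoder: 896x672 input images downsampled by a factor of 32 give a (896 / 32) * (672 / 32) = 28 * 21 = 588 feature grid, so max_batch_size 1 gives max_multimodal_len = 588.
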
  3. Generate TensorRT engines for visual components and combine everything into the final pipeline.

    python build_visual_engine.py --model_name ${MODEL_NAME} --model_path tmp/hf_models/${MODEL_NAME}
    
    python run.py \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --visual_engine_dir visual_engines/${MODEL_NAME} \
        --llm_engine_dir trt_engines/${MODEL_NAME}/1-gpu/bfloat16/tp1 \
        --nougat