Provide comprehensive inference benchmarking for LLM backends (vLLM, TensorRT-LLM, TGI) #3868

Open
yuzisun opened this issue Aug 18, 2024 · 0 comments

/kind feature

Describe the solution you'd like
Provide comprehensive benchmarking results for the inference runtimes (vLLM, TensorRT-LLM, and TGI) in a KServe setup.

Metrics (see the measurement sketch below):

  • Latency (TTFT, time to first token)
  • Throughput (token generation rate)

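A minimal sketch of how TTFT and token throughput could be measured, assuming the runtime is fronted by an OpenAI-compatible streaming completions endpoint (one common setup; vLLM exposes such a server natively, the other runtimes may need an adapter). The endpoint URL, model id, and prompt below are placeholders, and the character count is only a rough proxy for a tokenizer-based token count.

```python
# Minimal TTFT / token-throughput probe against an OpenAI-compatible streaming
# endpoint. The URL, model id, and prompt are illustrative placeholders.
import json
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder endpoint
PAYLOAD = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model id
    "prompt": "Explain KServe in one paragraph.",
    "max_tokens": 256,
    "stream": True,
}

start = time.perf_counter()
ttft = None
generated_chars = 0

with requests.post(ENDPOINT, json=PAYLOAD, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events: each payload line looks like b"data: {...}" or b"data: [DONE]".
        if not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first streamed token
        chunk = json.loads(data)
        generated_chars += len(chunk["choices"][0].get("text", ""))

elapsed = time.perf_counter() - start
print(f"TTFT: {ttft:.3f}s  total: {elapsed:.3f}s  chars/s: {generated_chars / elapsed:.1f}")
```

Sweeping the number of concurrent virtual users (listed under Factors below) would amount to running this probe from N parallel workers and aggregating latency/throughput percentiles.
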
Factors:

  • Quantization (AWQ/GPTQ)
  • Number of concurrent virtual users
  • Feature enablement, e.g. prefix caching (see the configuration sketch below)

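As an illustration of how these factors map onto engine configuration, here is a minimal sketch using vLLM's offline `LLM` API; the checkpoint name and tensor-parallel degree are placeholders, and the equivalent TensorRT-LLM and TGI settings would need to be chosen separately so the comparison stays apples-to-apples.

```python
# Sketch: mapping the quantization / prefix-caching factors onto vLLM engine
# arguments. Checkpoint name and tensor-parallel degree are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # placeholder AWQ checkpoint
    quantization="awq",           # factor: quantization scheme (AWQ vs. GPTQ vs. none)
    enable_prefix_caching=True,   # factor: prefix caching on/off
    tensor_parallel_size=4,       # e.g. spread the 70B model across 4 x 80GB GPUs
)

outputs = llm.generate(
    ["Explain KServe in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

In a KServe deployment the same knobs would be passed through the serving runtime's container arguments rather than the offline API, but the factor matrix stays the same.
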
GPUs:

  • A100 80GB GPU
  • H100 80GB GPU

Models:

  • Llama 3.1 8B
  • Llama 3.1 70B (4-bit quantized)
  • Llama 3.1 405B (FP8)

Dataset for testing:

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Links to the design documents:
[Optional, start with the short-form RFC template to outline your ideas and get early feedback.]
[Required, use the longer-form design doc template to specify and discuss your design in more detail]
