- TensorRT-LLM optimization for faster inference
- OpenAI-compatible API endpoints (see the client sketch below)
- Flexible configuration through environment variables
- Support for model parallelism (tensor and pipeline)
- Hugging Face model integration
- Streaming response support
- RunPod serverless deployment ready
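Because the server exposes OpenAI-compatible endpoints with streaming responses, any standard OpenAI client can talk to it. The sketch below uses the `openai` Python package; the base URL, port, endpoint path, and model name are placeholders for illustration, not values taken from this project, so adjust them to match your deployment.

```python
# Minimal streaming request against the OpenAI-compatible endpoint.
# Assumes the server is reachable at localhost:8000 and serves /v1 routes;
# base_url and model are placeholders for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize TensorRT-LLM in one sentence."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```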
- Batch size and sequence length must be determined at engine build time (see the build sketch after this list)
- Dynamic shape support is limited and may impact performance
- KV-cache size is fixed at build time and affects memory usage
- Changing model parameters requires rebuilding the TensorRT engine
- Engine building can take significant time (hours for large models)
- Each combination of build parameters requires a separate engine
- Changes to maximum sequence length or batch size require rebuilding
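To make the build-time nature of these limits concrete, the sketch below shows how they are typically supplied when building an engine with TensorRT-LLM's Python LLM API. The class and argument names (`LLM`, `BuildConfig`, `build_config`, `save`) follow recent TensorRT-LLM releases and may differ between versions; treat this as an assumption-laden illustration, not this project's build code.

```python
# Hedged sketch: limits such as max batch size and max sequence length are
# baked into the engine at build time; changing any of them means building a
# new engine. Names follow recent TensorRT-LLM releases and may vary by version.
from tensorrt_llm import LLM, BuildConfig

build_config = BuildConfig(
    max_batch_size=8,     # must cover the largest batch you expect to serve
    max_seq_len=4096,     # prompt + generated tokens per sequence
    max_num_tokens=8192,  # tokens scheduled per forward pass
    max_beam_width=1,     # beam search width, also fixed at build time
)

# Building for a Hugging Face model; a different max_batch_size or max_seq_len
# requires a separate engine directory.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", build_config=build_config)
llm.save("./engines/llama-3.1-8b-bs8-seq4096")  # persist so the build is not repeated
```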
The server can be configured using the following environment variables:
```
TRTLLM_MODEL                              # Required: Path or name of the model to load
TRTLLM_TOKENIZER                          # Optional: Path or name of the tokenizer (defaults to the model path)
TRTLLM_MAX_BEAM_WIDTH                     # Optional: Maximum beam width for beam search
TRTLLM_MAX_BATCH_SIZE                     # Optional: Maximum batch size for inference
TRTLLM_MAX_NUM_TOKENS                     # Optional: Maximum number of tokens to generate
TRTLLM_MAX_SEQ_LEN                        # Optional: Maximum sequence length
TRTLLM_TP_SIZE                            # Optional: Tensor parallelism size (default: 1)
TRTLLM_PP_SIZE                            # Optional: Pipeline parallelism size (default: 1)
TRTLLM_KV_CACHE_FREE_GPU_MEMORY_FRACTION  # Optional: Fraction of free GPU memory reserved for the KV cache (default: 0.9)
TRTLLM_TRUST_REMOTE_CODE                  # Optional: Whether to trust remote code (default: false)
HF_TOKEN                                  # Optional: Hugging Face API token for gated or private models
```
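To make the required/optional semantics and defaults above concrete, here is a hedged sketch of how these variables could be parsed into typed settings. The variable names and defaults follow the list in this README; the parsing function itself is hypothetical and is not the server's actual configuration code.

```python
# Illustrative only: read the environment variables described above into a dict.
# Defaults mirror the list in this README; the helper itself is a hypothetical sketch.
import os

def load_trtllm_config() -> dict:
    model = os.environ.get("TRTLLM_MODEL")
    if not model:
        raise ValueError("TRTLLM_MODEL is required")

    def opt_int(name: str):
        # Optional integer setting; None means "let the server pick its default".
        value = os.environ.get(name)
        return int(value) if value is not None else None

    return {
        "model": model,
        "tokenizer": os.environ.get("TRTLLM_TOKENIZER", model),
        "max_beam_width": opt_int("TRTLLM_MAX_BEAM_WIDTH"),
        "max_batch_size": opt_int("TRTLLM_MAX_BATCH_SIZE"),
        "max_num_tokens": opt_int("TRTLLM_MAX_NUM_TOKENS"),
        "max_seq_len": opt_int("TRTLLM_MAX_SEQ_LEN"),
        "tp_size": int(os.environ.get("TRTLLM_TP_SIZE", "1")),
        "pp_size": int(os.environ.get("TRTLLM_PP_SIZE", "1")),
        "kv_cache_free_gpu_memory_fraction": float(
            os.environ.get("TRTLLM_KV_CACHE_FREE_GPU_MEMORY_FRACTION", "0.9")
        ),
        "trust_remote_code": os.environ.get("TRTLLM_TRUST_REMOTE_CODE", "false").lower() == "true",
        "hf_token": os.environ.get("HF_TOKEN"),
    }
```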