feat: ORCA Format KV Cache Utilization in Inference Response Header #7839
What does the PR do?
This PR adds code to `HTTPAPIServer::GenerateRequestClass::StartResponse` inside `src/http_server.cc` to add both `kv_cache_utilization` and `max_token_capacity` metrics, composed from the existing Prometheus metrics in TensorRT-LLM Backend's `nv_trt_llm_kv_cache_block_metrics` metric family. This is accomplished by parsing the serialized Prometheus metrics text object, provided to the Triton Server frontend by the Triton Core libraries, into a structured vector of metrics for a specific metric family.
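For illustration, a minimal sketch of that parsing step, assuming a simplified `PromMetric` shape (the actual struct and parsing details in the PR may differ):

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical simplified form of the structured metric produced by
// MetricFamilyExtractor(); the PR's actual definition may differ.
struct PromMetric {
  std::map<std::string, std::string> labels;
  double value = 0.0;
};

// Parse every sample line belonging to `family` out of the serialized
// Prometheus text exposition, e.g.:
//   nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used"} 10
std::vector<PromMetric>
MetricFamilyExtractor(const std::string& serialized, const std::string& family)
{
  std::vector<PromMetric> metrics;
  std::istringstream stream(serialized);
  std::string line;
  while (std::getline(stream, line)) {
    // Skip comment lines (# HELP / # TYPE) and other metric families.
    if (line.empty() || line[0] == '#' || line.rfind(family, 0) != 0) {
      continue;
    }
    PromMetric metric;
    const size_t open = line.find('{');
    const size_t close = line.find('}');
    if (open != std::string::npos && close != std::string::npos) {
      // Split the label block on commas, then each label on '='.
      std::istringstream labels(line.substr(open + 1, close - open - 1));
      std::string label;
      while (std::getline(labels, label, ',')) {
        const size_t eq = label.find('=');
        if (eq == std::string::npos) {
          continue;
        }
        std::string val = label.substr(eq + 1);
        // Drop the quotes around the label value.
        val.erase(std::remove(val.begin(), val.end(), '"'), val.end());
        metric.labels[label.substr(0, eq)] = val;
      }
    }
    // The sample value follows the closing brace (or the bare family name).
    const size_t value_begin =
        (close != std::string::npos) ? close + 1 : family.size();
    metric.value = std::stod(line.substr(value_begin));
    metrics.push_back(metric);
  }
  return metrics;
}
```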
Checklist
Agreement

- PR title is of the format `<commit_type>: <Title>`
- Ran pre-commit locally (`pre-commit install`, `pre-commit run --all`)

Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Where should the reviewer start?
Changes are confined to two files:

- `src/http_server.cc`
- `src/http_server.h` (the former's header file)

The changes start in `HTTPAPIServer::GenerateRequestClass::StartResponse()`, where the environment variable is checked and the header is written. Three other functions follow (sketched below): `MetricFamilyExtractor()`, which parses serialized Prometheus metrics into a vector of `PromMetric` (each holding a map of its metric labels); `ExtractKVMetrics()`, which pulls the values from the structured metrics and calculates the composite KV metrics; and finally `OrcaKVMetricHeader()`, which forms the metrics into an `endpoint-load-metrics` header in the ORCA format specified by `ORCA_METRIC_FORMAT`. If there are no TensorRT-LLM Backend metrics, no metrics found for the header, or an invalid format type, the header is simply not written. The valid values for `ORCA_METRIC_FORMAT` are documented in the feature request (related issue linked below) and in comments in `StartResponse()`.
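A rough sketch of how the latter two helpers could fit together, reusing the `PromMetric` shape from the sketch above. The label names (`kv_cache_block_type` with `used`/`max`/`tokens_per` values), the formulas, and the `TEXT`/`JSON` header prefixes are assumptions for illustration, not necessarily what the PR uses:

```cpp
#include <string>
#include <vector>

// Composite KV metrics computed from the structured samples. Assumed labels
// on nv_trt_llm_kv_cache_block_metrics (illustrative, may not match the PR):
//   kv_cache_block_type="used"       -> blocks currently in use
//   kv_cache_block_type="max"        -> total allocatable blocks
//   kv_cache_block_type="tokens_per" -> tokens held per block
struct KVMetrics {
  double kv_cache_utilization = 0.0;
  double max_token_capacity = 0.0;
  bool valid = false;
};

KVMetrics
ExtractKVMetrics(const std::vector<PromMetric>& metrics)
{
  double used = -1.0, max = -1.0, tokens_per = -1.0;
  for (const auto& m : metrics) {
    const auto it = m.labels.find("kv_cache_block_type");
    if (it == m.labels.end()) {
      continue;
    }
    if (it->second == "used") {
      used = m.value;
    } else if (it->second == "max") {
      max = m.value;
    } else if (it->second == "tokens_per") {
      tokens_per = m.value;
    }
  }
  KVMetrics kv;
  if (used < 0.0 || max <= 0.0 || tokens_per < 0.0) {
    return kv;  // missing samples: leave kv.valid == false, write no header
  }
  kv.kv_cache_utilization = used / max;
  kv.max_token_capacity = max * tokens_per;
  kv.valid = true;
  return kv;
}

// Render the composites as an ORCA endpoint-load-metrics header value.
// Example outputs (illustrative):
//   "json" -> JSON {"kv_cache_utilization":0.4,"max_token_capacity":1024}
//   "http" -> TEXT kv_cache_utilization=0.4, max_token_capacity=1024
std::string
OrcaKVMetricHeader(const std::string& format, const KVMetrics& kv)
{
  if (!kv.valid) {
    return "";  // no metrics found: the header is simply not written
  }
  if (format == "json") {
    return "JSON {\"kv_cache_utilization\":" +
           std::to_string(kv.kv_cache_utilization) +
           ",\"max_token_capacity\":" + std::to_string(kv.max_token_capacity) +
           "}";
  }
  if (format == "http") {
    return "TEXT kv_cache_utilization=" +
           std::to_string(kv.kv_cache_utilization) +
           ", max_token_capacity=" + std::to_string(kv.max_token_capacity);
  }
  return "";  // invalid format type: the header is not written
}
```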
Test plan:
The feature is gated behind a feature flag in the form of the `ORCA_METRIC_FORMAT` environment variable; if unset, the feature is effectively disabled. Beyond that, the changes have been manually tested to not cause issues if either the queried metrics are not present (such as when TensorRT-LLM is not being used as the backend) or the ORCA header metric type is invalid. In either case, nothing is parsed and no header is written. All code changes are wrapped in an `#ifdef` and are only included if metrics are enabled during the Triton Server build (see the sketch after the caveats below).

Caveats:
- This feature only works on Triton Inference Server running with the TensorRT-LLM Backend, as otherwise the KV-cache metrics are not included in the server metrics.
- This change only implements the KV-cache utilization metrics, but the functions it adds allow other metrics to be added easily (including metrics that don't require the TensorRT-LLM Backend).
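As a minimal illustration of the build-time gating mentioned in the test plan, assuming the `TRITON_ENABLE_METRICS` define used for metrics-conditional code in the server build (the exact guard and call site in the PR may differ):

```cpp
#include <cstdlib>  // std::getenv

void
MaybeWriteOrcaHeader()  // hypothetical call site inside StartResponse()
{
#ifdef TRITON_ENABLE_METRICS
  // Only compiled when the server is built with metrics enabled.
  const char* orca_format = std::getenv("ORCA_METRIC_FORMAT");
  if (orca_format != nullptr) {
    // Parse the metrics, compute the composites, and write the
    // endpoint-load-metrics header; an unset variable disables the feature.
  }
#endif  // TRITON_ENABLE_METRICS
}
```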
Background
This doc captures the overall requirements for model servers to integrate with the LLM instance gateway. More details are in the Feature Request below.
Related Issues:
Screenshots
- Response header before changes (or if the `ORCA_METRIC_FORMAT` environment variable is unset): *(screenshot)*
- Response header with `ORCA_METRIC_FORMAT="json"`: *(screenshot)*
- Response header with `ORCA_METRIC_FORMAT="http"`: *(screenshot)*
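Since the screenshots don't carry over here, illustrative header values under the same assumptions as the sketches above (the exact formatting produced by the PR may differ):

```
endpoint-load-metrics: JSON {"kv_cache_utilization":0.4,"max_token_capacity":1024}
endpoint-load-metrics: TEXT kv_cache_utilization=0.4, max_token_capacity=1024
```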
cc @yinggeh @krishung5 @jbkyang-nvi