feat: ORCA Format KV Cache Utilization in Inference Response Header #7839

Open · wants to merge 2 commits into base: r24.10
Conversation

@BenjaminBraunDev commented Nov 27, 2024

What does the PR do?

This PR adds code to HTTPAPIServer::GenerateRequestClass::StartResponse inside src/http_server.cc to report both kv_cache_utilization and max_token_capacity metrics, composed from the existing Prometheus metrics in the TensorRT-LLM Backend's nv_trt_llm_kv_cache_block_metrics metric family.

This is accomplished by parsing the serialized Prometheus metrics text provided to the Triton Server frontend by the Triton Core libraries into a structured vector of metrics for a specific metric family, as illustrated below.
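For illustration, the serialized text being parsed looks roughly like the following (the metric family name comes from this PR, but the label names and values shown here are illustrative, not captured output):

```
# TYPE nv_trt_llm_kv_cache_block_metrics gauge
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm"} 4096
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="free",model="tensorrt_llm"} 4000
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm"} 96
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="tokens_per",model="tensorrt_llm"} 64
```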

Checklist

  • I have read the Contribution guidelines and signed the Contributor License
    Agreement
  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • I ran pre-commit locally (pre-commit install, pre-commit run --all)
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Where should the reviewer start?

Changes are contained in 2 files:

  • src/http_server.cc
  • src/http_server.h (the former's header file)

The changes start in HTTPAPIServer::GenerateRequestClass::StartResponse(), where the environment variable is checked and the header is written. There are three helper functions below it: MetricFamilyExtractor(), which parses serialized Prometheus metrics into a vector of PromMetric objects (each holding a map of its metric labels); ExtractKVMetrics(), which pulls the values from the structured metrics and computes the composite KV metrics; and OrcaKVMetricHeader(), which formats the metrics into an endpoint-load-metrics header in the ORCA format specified by ORCA_METRIC_FORMAT. If there are no TensorRT-LLM Backend metrics, no metrics are found for the header, or the format type is invalid, the header is simply not written. A rough sketch of the helpers follows.
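The struct and function names below mirror this PR, but the bodies, signatures, and the exact composition formulas are assumptions for illustration, not the actual diff:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One parsed Prometheus sample: its label set and its value.
struct PromMetric {
  std::map<std::string, std::string> labels;
  double value;
};

// Illustrative composition of the two KV metrics, assuming
//   kv_cache_utilization = used_blocks / max_blocks
//   max_token_capacity   = max_blocks * tokens_per_block
// Returns false (header not written) if the family is missing or incomplete.
bool
ExtractKVMetrics(
    const std::vector<PromMetric>& family, double* kv_cache_utilization,
    uint64_t* max_token_capacity)
{
  double max_blocks = -1.0, used_blocks = -1.0, tokens_per_block = -1.0;
  for (const auto& metric : family) {
    const auto it = metric.labels.find("kv_cache_block_type");
    if (it == metric.labels.end()) {
      continue;
    }
    if (it->second == "max") {
      max_blocks = metric.value;
    } else if (it->second == "used") {
      used_blocks = metric.value;
    } else if (it->second == "tokens_per") {
      tokens_per_block = metric.value;
    }
  }
  if (max_blocks <= 0.0 || used_blocks < 0.0 || tokens_per_block < 0.0) {
    return false;  // metric family absent or incomplete: write no header
  }
  *kv_cache_utilization = used_blocks / max_blocks;
  *max_token_capacity = static_cast<uint64_t>(max_blocks * tokens_per_block);
  return true;
}
```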

The valid values for ORCA_METRIC_FORMAT are documented in the feature request (related issue linked below) and in the comments in StartResponse(); example header values for each format are sketched below.
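For reference, the resulting header looks along these lines (the values and exact key syntax here are illustrative assumptions; the authoritative formats are defined by the ORCA proposal and described in the feature request):

```
ORCA_METRIC_FORMAT="json":
endpoint-load-metrics: JSON {"kv_cache_utilization": 0.0234, "max_token_capacity": 262144}

ORCA_METRIC_FORMAT="http":
endpoint-load-metrics: TEXT kv_cache_utilization=0.0234, max_token_capacity=262144
```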

Test plan:

The feature is gated behind a feature flag in the form of the ORCA_METRIC_FORMAT environment variable; if unset, the feature is effectively disabled. Beyond that, the changes have been manually tested not to cause issues if either the queried metrics are not present (such as when TensorRT-LLM is not the backend in use) or the ORCA header metric type is invalid; in either case, nothing is parsed and no header is written. All code changes are wrapped in an #ifdef and are only compiled if metrics are enabled during the Triton Server build. A minimal sketch of this gating follows.
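In the sketch below, the helper signature and the wrapper function are hypothetical stand-ins for the PR's code; TRITON_ENABLE_METRICS is the build-time metrics guard:

```cpp
#include <cstdlib>
#include <string>

// Assumed signature of the PR's header builder, for illustration only.
std::string OrcaKVMetricHeader(
    const std::string& serialized_metrics, const std::string& format);

// Returns the endpoint-load-metrics header value, or an empty string
// when the header should not be written.
std::string
MaybeOrcaHeaderValue(const std::string& serialized_metrics)
{
#ifdef TRITON_ENABLE_METRICS
  // Runtime feature flag: unset means the feature is disabled.
  const char* format = std::getenv("ORCA_METRIC_FORMAT");
  if (format == nullptr) {
    return "";
  }
  // Yields an empty string if the TensorRT-LLM metric family is absent
  // or `format` is not a recognized ORCA format.
  return OrcaKVMetricHeader(serialized_metrics, format);
#else
  return "";
#endif  // TRITON_ENABLE_METRICS
}
```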

Caveats:

  1. This feature only works on Triton Inference Server running with the TensorRT-LLM Backend, as otherwise the KV-cache metrics are not included in the server metrics.

  2. This change only implements the KV-cache utilization metrics, but the functions it adds allow other metrics to be added easily (including metrics that don't require the TensorRT-LLM Backend).

Background

This doc captures the overall requirements for model servers to integrate with the LLM Instance Gateway. More details are in the Feature Request below.

Related Issues:

Screenshots

Response header before changes (or if the ORCA_METRIC_FORMAT environment variable is unset):
[screenshot: orca_before]

Response header with ORCA_METRIC_FORMAT="json":
[screenshot: orca_json]

Response header with ORCA_METRIC_FORMAT="http":
[screenshot: orca_http]

cc @yinggeh @krishung5 @jbkyang-nvi

@BenjaminBraunDev changed the title from ORCA Format KV Cache Utilization in Inference Response Header to feat: ORCA Format KV Cache Utilization in Inference Response Header on Dec 10, 2024
Commit: … for use in HandleGenerate to add kv_utilization and max_token_capacity to the inference request response header.
@BenjaminBraunDev marked this pull request as ready for review on December 11, 2024
Commit: …nctionality to HTTPAPIServer::GenerateRequestClass::StartResponse() to extract metrics after the inference request is processed, for up-to-date metrics.
@nnshah1 requested a review from indrajit96 on December 13, 2024
@jbkyang-nvi (Contributor) commented:

@nnshah1 @indrajit96 are we merging into the already frozen r24.10? Or should this be onto main instead?

@nnshah1 (Contributor) commented Dec 13, 2024:

> @nnshah1 @indrajit96 are we merging into the already frozen r24.10? Or should this be onto main instead?

Good catch, we should target main.
