Skip to content

Commit

Permalink
[misc] [doc] [frontend] LLM torch profiler support (vllm-project#7943)
Browse files Browse the repository at this point in the history
  • Loading branch information
SolitaryThinker authored Sep 7, 2024
1 parent 29f49cd commit 12dd715
Show file tree
Hide file tree
Showing 6 changed files with 74 additions and 3 deletions.
20 changes: 17 additions & 3 deletions docs/source/dev/profiling/profiling_index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,14 +17,28 @@ Traces can be visualized using https://ui.perfetto.dev/.
.. tip::

Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.

Example commands:

.. tip::

To stop the profiler - it flushes out all the profile trace files to the directory. This takes time, for example for about 100 requests worth of data for a llama 70b, it takes about 10 minutes to flush out on a H100.
Set the env variable VLLM_RPC_GET_DATA_TIMEOUT_MS to a big number before you start the server. Say something like 30 minutes.
``export VLLM_RPC_GET_DATA_TIMEOUT_MS=1800000``

Example commands and usage:
===========================

Offline Inference:
------------------

Refer to `examples/offline_inference_with_profiler.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_with_profiler.py>`_ for an example.


OpenAI Server:
--------------

.. code-block:: bash
VLLM_TORCH_PROFILER_DIR=/mnt/traces/ python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
benchmark_serving.py:

Expand Down
33 changes: 33 additions & 0 deletions examples/offline_inference_with_profiler.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
import os

from vllm import LLM, SamplingParams

# enable torch profiler, can also be set on cmd line
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"

# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="facebook/opt-125m")

llm.start_profile()

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

llm.stop_profile()

# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
6 changes: 6 additions & 0 deletions vllm/engine/llm_engine.py
Original file line number Diff line number Diff line change
Expand Up @@ -1914,6 +1914,12 @@ def check_health(self) -> None:
self.tokenizer.check_health()
self.model_executor.check_health()

def start_profile(self) -> None:
self.model_executor.start_profile()

def stop_profile(self) -> None:
self.model_executor.stop_profile()

def is_tracing_enabled(self) -> bool:
return self.tracer is not None

Expand Down
6 changes: 6 additions & 0 deletions vllm/entrypoints/llm.py
Original file line number Diff line number Diff line change
Expand Up @@ -560,6 +560,12 @@ def encode(
outputs = self._run_engine(use_tqdm=use_tqdm)
return LLMEngine.validate_outputs(outputs, EmbeddingRequestOutput)

def start_profile(self) -> None:
self.llm_engine.start_profile()

def stop_profile(self) -> None:
self.llm_engine.stop_profile()

# LEGACY
def _convert_v1_inputs(
self,
Expand Down
6 changes: 6 additions & 0 deletions vllm/executor/cpu_executor.py
Original file line number Diff line number Diff line change
Expand Up @@ -296,6 +296,12 @@ def _wait_for_tasks_completion(self, parallel_worker_tasks: Any) -> None:
for result in parallel_worker_tasks:
result.get()

def start_profile(self) -> None:
self.driver_method_invoker(self.driver_worker, "start_profile")

def stop_profile(self) -> None:
self.driver_method_invoker(self.driver_worker, "stop_profile")


class CPUExecutorAsync(CPUExecutor, ExecutorAsyncBase):

Expand Down
6 changes: 6 additions & 0 deletions vllm/executor/gpu_executor.py
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,12 @@ def check_health(self) -> None:
# it's running.
return

def start_profile(self) -> None:
self.driver_worker.start_profile()

def stop_profile(self) -> None:
self.driver_worker.stop_profile()


class GPUExecutorAsync(GPUExecutor, ExecutorAsyncBase):

Expand Down

0 comments on commit 12dd715

Please sign in to comment.