[Feature]: Improve V1 startup error handling #11109

robertgshaw2-neuralmagic · 2024-12-11T17:18:07Z

🚀 The feature, motivation and pitch

Improve error handling during startup for vLLM V0 and V1.

Right now, the flow for V1 is:

Start the EngineCore process in the background. The model is loaded here. It's pretty common for this to fail (e.g. someone puts a model that is too big onto a GPU that doesn't have enough memory, or they have a bad config). When this happens, we log an error and throw an exception so the process dies.
The main process (LLMEngine or AsyncLLM) detects that the EngineCore has died and shuts itself down with a message like EngineCoreProc failed to start. This error prints a big stack trace that the user sees, which means the root cause is hidden. (Here's where it happens: https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/core.py#L178)

It would be a much better user experience if we presented the root cause error clearly at the bottom of the stack trace. I'm not sure there is a clean way to do this in Python other than catching the exception in the EngineCore, sending it over IPC before shutdown, and raising it from the main process, but I wanted to look into it.
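To make the idea concrete, here is a minimal sketch of the catch-and-forward approach using only the standard library. It is not vLLM code: `EngineStartupError`, `load_model`, `engine_core_main`, `wait_for_startup`, and `start_engine_core` are hypothetical names, and the real EngineCore may use a different IPC mechanism than `multiprocessing.Pipe`. The child sends the formatted traceback as text because traceback objects cannot be pickled.

```python
import multiprocessing as mp
import traceback


class EngineStartupError(RuntimeError):
    """Hypothetical wrapper raised in the main process (not a vLLM class)."""


def load_model():
    # Stand-in for model loading; simulates a common startup failure.
    raise MemoryError("model does not fit in GPU memory")


def engine_core_main(conn, startup_fn=load_model):
    """Child entry point: run startup and report the outcome over the pipe."""
    try:
        startup_fn()
        conn.send(("ok", None))
    except Exception as exc:
        # Traceback objects are not picklable, so forward the formatted text.
        conn.send(("error", (type(exc).__name__, str(exc), traceback.format_exc())))
        raise  # still let the child process die, as it does today


def wait_for_startup(conn, timeout=60.0):
    """Parent side: wait for the child's report and surface the root cause."""
    try:
        if not conn.poll(timeout):
            raise EngineStartupError("engine core did not report within timeout")
        status, payload = conn.recv()
    except EOFError:
        # The child died before it could send anything.
        raise EngineStartupError("engine core exited without reporting") from None
    if status == "error":
        name, message, child_tb = payload
        # The root cause is the first line of the error the user sees.
        raise EngineStartupError(f"{name}: {message}\n\nChild traceback:\n{child_tb}")


def start_engine_core(startup_fn=load_model, ctx=None):
    """Start the child and block until it reports success or failure."""
    ctx = ctx or mp.get_context()
    recv_conn, send_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=engine_core_main, args=(send_conn, startup_fn))
    proc.start()
    send_conn.close()  # parent keeps only the read end so EOF is detectable
    try:
        wait_for_startup(recv_conn)
    except EngineStartupError:
        proc.join(timeout=5)
        raise
    return proc
```

With this shape, a caller of `start_engine_core()` would see an `EngineStartupError` whose message begins with the child's exception type and message, followed by the child's traceback, instead of only a generic "failed to start" error. When the start method is `spawn` or `forkserver`, the call needs to sit under an `if __name__ == "__main__":` guard and `startup_fn` must be picklable.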

Alternatives

No response

Additional context

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.


robertgshaw2-neuralmagic added the "good first issue" and "feature request" labels on Dec 11, 2024

Ajay-Satish-01 linked a pull request on Dec 22, 2024 that will close this issue:

[V1] add error handling #11420

Open
