[Bug]: torch.OutOfMemoryError for 0.6.4.post1 but 0.6.3.post1 is working #11251

Open
mces89 opened this issue Dec 17, 2024 · 4 comments
Labels
bug Something isn't working

Comments


mces89 commented Dec 17, 2024

Your current environment

The output of `python collect_env.py`
Your output of `python collect_env.py` here

Model Input Dumps

No response

🐛 Describe the bug

I'm serving the Llama 3.1 8B model on 2x V100 (16 GB each), and I observe some weird memory behavior between 0.6.3.post1 and 0.6.4.post1:

  1. 0.6.4.post1 uses more GPU memory (90% vs. 80%) when launched with the command below; the same is reported in [Bug]: Increased VRAM usage since v0.6.4.post1 (vs v0.6.3.post1) [OOM][KV cache] #11230. The command I use is (with just 1 LoRA module):
    CMD="docker run --gpus all \\ -p ${DEFAULT_PORT}:${DEFAULT_PORT} \\ --ipc=host \\ vllm/vllm-openai:v0.6.x.post1 \\ --model ${MODEL_PATH} \\ --dtype float16 \\ --tensor-parallel_size 2 \\ --gpu-memory-utilization 0.98 \\ --max_model_len 40000 \\ --seed 23 \\ --max-seq-len-to-capture 40000 \\ --port ${DEFAULT_PORT} \\ --disable-log-requests \\ --enable-lora \\ --fully-sharded-loras \\ --max-lora-rank 16 \\ --max-loras ${NUM_LORAS} \\ --lora-modules ${LORA_MODULES}"
  2. If I send a long-context prompt, close to the 40000-token limit above, 0.6.3.post1 works but 0.6.4.post1 fails with torch.OutOfMemoryError, even though 0.6.3.post1 reports fewer GPU KV cache blocks (about 3000) than 0.6.4.post1 (about 5000). The detailed error from 0.6.4.post1 is:
    ```
    INFO 12-16 21:40:15 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241216-214015.pkl...
    (VllmWorkerProcess pid=90) INFO 12-16 21:40:15 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241216-214015.pkl...
    (VllmWorkerProcess pid=90) INFO 12-16 21:40:15 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241216-214015.pkl.
    INFO 12-16 21:40:15 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241216-214015.pkl.
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop.
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] Traceback (most recent call last):
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return func(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1654, in execute_model
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] hidden_or_intermediate_states = model_executable(
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return self._call_impl(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return forward_call(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 553, in forward
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] model_output = self.model(input_ids, positions, kv_caches,
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 143, in call
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return self.forward(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 340, in forward
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] hidden_states, residual = layer(positions, hidden_states,
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return self._call_impl(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return forward_call(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 267, in forward
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] hidden_states = self.mlp(hidden_states)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return self._call_impl(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return forward_call(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 93, in forward
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] x, _ = self.gate_up_proj(x)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return self._call_impl(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in call_impl
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return forward_call(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers.py", line 582, in forward
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] output_parallel = self.apply(input_, bias)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/lora/fully_sharded_layers.py", line 181, in apply
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return _mcp_apply(x, bias, self)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/lora/fully_sharded_layers.py", line 115, in _mcp_apply
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] output = layer.base_layer.quant_method.apply(layer.base_layer, x, bias)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 135, in apply
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return F.linear(x, layer.weight, bias)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 1 has a total capacity of 15.77 GiB of which 1.01 GiB is free. Process 6432 has 14.75 GiB memory in use. Of the allocated memory 13.17 GiB is allocated by PyTorch, with 106.00 MiB allocated in private pools (e.g., CUDA Graphs), and 66.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229]
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] The above exception was the direct cause of the following exception:
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229]
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] Traceback (most recent call last):
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] output = executor(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return func(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 85, in start_worker_execution_loop
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] output = self.execute_model(execute_model_req=None)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 343, in execute_model
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] output = self.model_runner.execute_model(
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] return func(*args, **kwargs)
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] raise type(err)(
    (VllmWorkerProcess pid=90) ERROR 12-16 21:40:15 multiproc_worker_utils.py:229] torch.OutOfMemoryError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241216-214015.pkl): CUDA out of memory. Tried to allocate 1.02 GiB. GPU 1 has a total capacity of 15.77 GiB of which 1.01 GiB is free. Process 6432 has 14.75 GiB memory in use. Of the allocated memory 13.17 GiB is allocated by PyTorch, with 106.00 MiB allocated in private pools (e.g., CUDA Graphs), and 66.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    ERROR 12-16 21:40:15 engine.py:135] OutOfMemoryError('Error in model execution (input dumped to /tmp/err_execute_model_input_20241216-214015.pkl): CUDA out of memory. Tried to allocate 1.02 GiB. GPU 0 has a total capacity of 15.77 GiB of which 1.01 GiB is free. Process 6351 has 14.75 GiB memory in use. Of the allocated memory 13.17 GiB is allocated by PyTorch, with 106.00 MiB allocated in private pools (e.g., CUDA Graphs), and 66.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)')
    ERROR 12-16 21:40:15 engine.py:135] Traceback (most recent call last):
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    ERROR 12-16 21:40:15 engine.py:135] return func(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1654, in execute_model
    ERROR 12-16 21:40:15 engine.py:135] hidden_or_intermediate_states = model_executable(
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    ERROR 12-16 21:40:15 engine.py:135] return self._call_impl(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    ERROR 12-16 21:40:15 engine.py:135] return forward_call(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 553, in forward
    ERROR 12-16 21:40:15 engine.py:135] model_output = self.model(input_ids, positions, kv_caches,
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 143, in call
    ERROR 12-16 21:40:15 engine.py:135] return self.forward(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 340, in forward
    ERROR 12-16 21:40:15 engine.py:135] hidden_states, residual = layer(positions, hidden_states,
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    ERROR 12-16 21:40:15 engine.py:135] return self._call_impl(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    ERROR 12-16 21:40:15 engine.py:135] return forward_call(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 267, in forward
    ERROR 12-16 21:40:15 engine.py:135] hidden_states = self.mlp(hidden_states)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    ERROR 12-16 21:40:15 engine.py:135] return self._call_impl(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    ERROR 12-16 21:40:15 engine.py:135] return forward_call(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 93, in forward
    ERROR 12-16 21:40:15 engine.py:135] x, _ = self.gate_up_proj(x)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    ERROR 12-16 21:40:15 engine.py:135] return self._call_impl(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in call_impl
    ERROR 12-16 21:40:15 engine.py:135] return forward_call(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers.py", line 582, in forward
    ERROR 12-16 21:40:15 engine.py:135] output_parallel = self.apply(input_, bias)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/lora/fully_sharded_layers.py", line 181, in apply
    ERROR 12-16 21:40:15 engine.py:135] return _mcp_apply(x, bias, self)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/lora/fully_sharded_layers.py", line 115, in _mcp_apply
    ERROR 12-16 21:40:15 engine.py:135] output = layer.base_layer.quant_method.apply(layer.base_layer, x, bias)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 135, in apply
    ERROR 12-16 21:40:15 engine.py:135] return F.linear(x, layer.weight, bias)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 0 has a total capacity of 15.77 GiB of which 1.01 GiB is free. Process 6351 has 14.75 GiB memory in use. Of the allocated memory 13.17 GiB is allocated by PyTorch, with 106.00 MiB allocated in private pools (e.g., CUDA Graphs), and 66.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    ERROR 12-16 21:40:15 engine.py:135]
    ERROR 12-16 21:40:15 engine.py:135] The above exception was the direct cause of the following exception:
    ERROR 12-16 21:40:15 engine.py:135]
    ERROR 12-16 21:40:15 engine.py:135] Traceback (most recent call last):
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 133, in start
    ERROR 12-16 21:40:15 engine.py:135] self.run_engine_loop()
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 196, in run_engine_loop
    ERROR 12-16 21:40:15 engine.py:135] request_outputs = self.engine_step()
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 214, in engine_step
    ERROR 12-16 21:40:15 engine.py:135] raise e
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 205, in engine_step
    ERROR 12-16 21:40:15 engine.py:135] return self.engine.step()
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1454, in step
    ERROR 12-16 21:40:15 engine.py:135] outputs = self.model_executor.execute_model(
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 82, in execute_model
    ERROR 12-16 21:40:15 engine.py:135] driver_outputs = self._driver_execute_model(execute_model_req)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 158, in _driver_execute_model
    ERROR 12-16 21:40:15 engine.py:135] return self.driver_worker.execute_model(execute_model_req)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 343, in execute_model
    ERROR 12-16 21:40:15 engine.py:135] output = self.model_runner.execute_model(
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    ERROR 12-16 21:40:15 engine.py:135] return func(*args, **kwargs)
    ERROR 12-16 21:40:15 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^
    ERROR 12-16 21:40:15 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
    ERROR 12-16 21:40:15 engine.py:135] raise type(err)(
    ERROR 12-16 21:40:15 engine.py:135] torch.OutOfMemoryError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241216-214015.pkl): CUDA out of memory. Tried to allocate 1.02 GiB. GPU 0 has a total capacity of 15.77 GiB of which 1.01 GiB is free. Process 6351 has 14.75 GiB memory in use. Of the allocated memory 13.17 GiB is allocated by PyTorch, with 106.00 MiB allocated in private pools (e.g., CUDA Graphs), and 66.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    ERROR: Exception in ASGI application
  • Exception Group Traceback (most recent call last):
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 188, in call
    | await response(scope, wrapped_receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 222, in call
    | async for chunk in self.body_iterator:
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 179, in body_stream
    | raise app_exc
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 149, in coro
    | await self.app(scope, receive_or_disconnect, send_no_error)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in call
    | await self.app(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 62, in call
    | await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    | raise exc
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    | await app(scope, receive, sender)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 715, in call
    | await self.middleware_stack(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 735, in app
    | await route.handle(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 288, in handle
    | await self.app(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 76, in app
    | await wrap_app_handling_exceptions(app, request)(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    | raise exc
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    | await app(scope, receive, sender)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 74, in app
    | await response(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 252, in call
    | async with anyio.create_task_group() as task_group:
    | ^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 763, in aexit
    | raise BaseExceptionGroup(
    | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
    +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    | File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    | result = await app( # type: ignore[func-returns-value]
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in call
    | return await self.app(scope, receive, send)
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in call
    | await super().call(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 113, in call
    | await self.middleware_stack(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in call
    | raise exc
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in call
    | await self.app(scope, receive, _send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 185, in call
    | with collapse_excgroups():
    | ^^^^^^^^^^^^^^^^^^^^
    | File "/usr/lib/python3.12/contextlib.py", line 158, in exit
    | self.gen.throw(value)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_utils.py", line 82, in collapse_excgroups
    | raise exc
    | File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 255, in wrap
    | await func()
    | File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 244, in stream_response
    | async for chunk in self.body_iterator:
    | File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 313, in chat_completion_stream_generator
    | async for res in result_generator:
    | File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 402, in iterate_with_cancellation
    | item = await awaits[0]
    | ^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 633, in _process_request
    | raise request_output
    | vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: OutOfMemoryError('Error in model execution (input dumped to /tmp/err_execute_model_input_20241216-214015.pkl): CUDA out of memory. Tried to allocate 1.02 GiB. GPU 0 has a total capacity of 15.77 GiB of which 1.01 GiB is free. Process 6351 has 14.75 GiB memory in use. Of the allocated memory 13.17 GiB is allocated by PyTorch, with 106.00 MiB allocated in private pools (e.g., CUDA Graphs), and 66.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)').
    +------------------------------------

During handling of the above exception, another exception occurred:

  • Exception Group Traceback (most recent call last):
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_utils.py", line 76, in collapse_excgroups
    | yield
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 186, in call
    | async with anyio.create_task_group() as task_group:
    | ^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 763, in aexit
    | raise BaseExceptionGroup(
    | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
    +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    | File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 259, in call
    | await wrap(partial(self.listen_for_disconnect, receive))
    | File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 255, in wrap
    | await func()
    | File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 232, in listen_for_disconnect
    | message = await receive()
    | ^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 118, in receive_or_disconnect
    | async with anyio.create_task_group() as task_group:
    | ^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 767, in aexit
    | raise exc_val
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 126, in receive_or_disconnect
    | message = await wrap(wrapped_receive)
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 121, in wrap
    | result = await func()
    | ^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 51, in wrapped_receive
    | msg = await self.receive()
    | ^^^^^^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    | await self.message_event.wait()
    | File "/usr/lib/python3.12/asyncio/locks.py", line 212, in wait
    | await fut
    | asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f5b0968adb0
    |
    | During handling of the above exception, another exception occurred:
    |
    | Exception Group Traceback (most recent call last):
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 188, in call
    | await response(scope, wrapped_receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 222, in call
    | async for chunk in self.body_iterator:
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 179, in body_stream
    | raise app_exc
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 149, in coro
    | await self.app(scope, receive_or_disconnect, send_no_error)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in call
    | await self.app(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 62, in call
    | await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    | raise exc
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    | await app(scope, receive, sender)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 715, in call
    | await self.middleware_stack(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 735, in app
    | await route.handle(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 288, in handle
    | await self.app(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 76, in app
    | await wrap_app_handling_exceptions(app, request)(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    | raise exc
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    | await app(scope, receive, sender)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 74, in app
    | await response(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 252, in call
    | async with anyio.create_task_group() as task_group:
    | ^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 763, in aexit
    | raise BaseExceptionGroup(
    | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
    +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    | File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    | result = await app( # type: ignore[func-returns-value]
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in call
    | return await self.app(scope, receive, send)
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in call
    | await super().call(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 113, in call
    | await self.middleware_stack(scope, receive, send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in call
    | raise exc
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in call
    | await self.app(scope, receive, _send)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 185, in call
    | with collapse_excgroups():
    | ^^^^^^^^^^^^^^^^^^^^
    | File "/usr/lib/python3.12/contextlib.py", line 158, in exit
    | self.gen.throw(value)
    | File "/usr/local/lib/python3.12/dist-packages/starlette/_utils.py", line 82, in collapse_excgroups
    | raise exc
    | File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 255, in wrap
    | await func()
    | File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 244, in stream_response
    | async for chunk in self.body_iterator:
    | File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 313, in chat_completion_stream_generator
    | async for res in result_generator:
    | File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 402, in iterate_with_cancellation
    | item = await awaits[0]
    | ^^^^^^^^^^^^^^^
    | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 633, in _process_request
    | raise request_output
    | vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: OutOfMemoryError('Error in model execution (input dumped to /tmp/err_execute_model_input_20241216-214015.pkl): CUDA out of memory. Tried to allocate 1.02 GiB. GPU 0 has a total capacity of 15.77 GiB of which 1.01 GiB is free. Process 6351 has 14.75 GiB memory in use. Of the allocated memory 13.17 GiB is allocated by PyTorch, with 106.00 MiB allocated in private pools (e.g., CUDA Graphs), and 66.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)').
    +------------------------------------

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in call
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 113, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in call
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/base.py", line 185, in call
with collapse_excgroups():
^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 158, in exit
self.gen.throw(value)
File "/usr/local/lib/python3.12/dist-packages/starlette/_utils.py", line 82, in collapse_excgroups
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 255, in wrap
await func()
File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 244, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 313, in chat_completion_stream_generator
async for res in result_generator:
File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 402, in iterate_with_cancellation
item = await awaits[0]
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 633, in _process_request
raise request_output
vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: OutOfMemoryError('Error in model execution (input dumped to /tmp/err_execute_model_input_20241216-214015.pkl): CUDA out of memory. Tried to allocate 1.02 GiB. GPU 0 has a total capacity of 15.77 GiB of which 1.01 GiB is free. Process 6351 has 14.75 GiB memory in use. Of the allocated memory 13.17 GiB is allocated by PyTorch, with 106.00 MiB allocated in private pools (e.g., CUDA Graphs), and 66.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)').
ERROR 12-16 21:40:16 multiproc_worker_utils.py:116] Worker VllmWorkerProcess pid 90 died, exit code: -15
INFO 12-16 21:40:16 multiproc_worker_utils.py:120] Killing local vLLM worker processes
[rank0]:[W1216 21:40:17.017148290 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
ERROR 12-16 21:40:20 client.py:282] RuntimeError('Engine process (pid 25) died.')
ERROR 12-16 21:40:20 client.py:282] NoneType: None
^CINFO 12-16 21:42:13 launcher.py:57] Shutting down FastAPI HTTP server.
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
```
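
The OOM message above itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation. As a hedged mitigation sketch (not a fix for the 0.6.3.post1 → 0.6.4.post1 difference), that variable can be passed into the container with Docker's `-e` flag while keeping the launch flags from the command above; lowering `--gpu-memory-utilization` slightly is another knob worth trying:

```
# Sketch only: same launch command as above, plus the allocator setting that the
# error message suggests. The v0.6.x.post1 tag is kept as a placeholder, as in the
# original command; this does not explain the version-to-version regression.
docker run --gpus all \
    -p ${DEFAULT_PORT}:${DEFAULT_PORT} \
    --ipc=host \
    -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    vllm/vllm-openai:v0.6.x.post1 \
    --model ${MODEL_PATH} \
    --dtype float16 \
    --tensor-parallel_size 2 \
    --gpu-memory-utilization 0.98 \
    --max_model_len 40000 \
    --seed 23 \
    --max-seq-len-to-capture 40000 \
    --port ${DEFAULT_PORT} \
    --disable-log-requests \
    --enable-lora \
    --fully-sharded-loras \
    --max-lora-rank 16 \
    --max-loras ${NUM_LORAS} \
    --lora-modules ${LORA_MODULES}
```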

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@mces89 mces89 added the bug Something isn't working label Dec 17, 2024
@mces89 mces89 changed the title [Bug]: torch.cuda.torch.OutOfMemoryError for 0.6.4.post1 but 0.6.3.post1 is working [Bug]: torch.OutOfMemoryError for 0.6.4.post1 but 0.6.3.post1 is working Dec 17, 2024
@DarkLight1337
Member

Can you try again using the latest code on the main branch? #10511 might fix this.
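
(For reference, one possible way to test the main branch ahead of a release is a source install via pip. This is only a sketch under the assumption of a working CUDA build toolchain in the environment, not the official procedure; the vLLM installation docs describe the supported steps.)

```
# Sketch only: install vLLM from the current main branch to check whether the
# change referenced above resolves the OOM. This compiles from source and can
# take a long time.
pip install -U "git+https://github.com/vllm-project/vllm.git@main"
```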

@philtimmes

Can confirm the change fixed memory usage here. (I am not the OP, but had the same issue)

@mces89
Author

mces89 commented Dec 18, 2024

@philtimmes are you using the newly released v0.6.5? I just tested it and still get the same error.

@DarkLight1337
Member

@youkaichao @joerunde any idea about this?
