Under the main branch, stress testing the in-flight Triton Server with multiple threads can result in the Triton Server getting stuck. #149
Comments
@StarrickLiu, we have been trying to reproduce the hang for a few days and have not been able to so far. Have you been able to identify the update to the …
I see a crash before I see a hang when I run the press.sh script, @StarrickLiu. I don't see the crash if I run with pipeline parallelism enabled. Can you try to build your model with these parameters instead and see if you still see the hang? `--tp_size 2`, with all other parameters the same.
Another thing you can try is adding this to your /xverse/tensorrt_llm/config.pbtxt file. After disabling TRT overlap, I don't see the crashes even with the original (tp_size=8, pp_size=1) model. The block to add begins with `parameters: {`.
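Presumably this refers to the parameter that disables TRT overlap. A minimal sketch of what the full block would look like, assuming the standard `enable_trt_overlap` key and the string-valued parameter format used by the TensorRT-LLM backend (the key name and value are assumptions, since the original snippet was cut off in this thread):

```
# Hypothetical reconstruction: disable TRT overlap in
# triton_model_repo/tensorrt_llm/config.pbtxt
parameters: {
  key: "enable_trt_overlap"
  value: {
    string_value: "False"
  }
}
```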
We have run further tests and resolved the hang by modifying the configuration. In addition to disabling TRT overlap, it is crucial that the batch size in the Triton Server configuration exactly matches the batch size used when the engine was built; with that in place, the hang no longer occurs. Could the hang be caused by some memory misalignment issue?
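As a concrete illustration of the batch-size consistency point, a minimal sketch, assuming the engine was built with `--max_batch_size 8` (the value 8 is only an example):

```
# Hypothetical: this field in tensorrt_llm/config.pbtxt must equal the
# --max_batch_size value passed when the engine was built.
max_batch_size: 8
```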
Encountered the same problem with tag v0.6.1. It can be solved by following StarrickLiu's solution.
NVIDIA/TensorRT-LLM#865 can reproduce the hang. It can be solved by the workaround in #149 (comment).
I was trying the same with
It gets stuck at this point with the following GPU stats. This is how I convert the HF weights:
This is how I build the engine:
These are the parameters that I've changed in tensorrt_llm/config.pbtxt. The rest of the config is the same as here.
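The commenter's actual diff was not captured in this thread. For orientation only, these are the kinds of fields typically edited in the tensorrtllm_backend tensorrt_llm/config.pbtxt for in-flight batching; every value below is a placeholder, not what was actually used:

```
# Hypothetical excerpt of tensorrt_llm/config.pbtxt; all values are placeholders.
model_transaction_policy {
  decoupled: true            # true for streaming clients, false otherwise
}
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/path/to/engines" }
}
```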
Am I missing something? Thanks in advance.
I encountered the same problem on the latest main branch even though trt_overlap is disabled and the batch size is consistent. The server hangs with a tp_size=2 engine while load testing for several hours.
@kaiyux @Shixiaowei02 @byshiue can you help us? |
I have the same error on the latest main branch even though trt_overlap is disabled.
Hi @MrD005 @ARomoH @jellysnack, this is related to #390.
Thanks @lkm2835, |
I'm facing a similar issue: #577.
As indicated by the title, on the main branch I used 40 threads to send inference requests simultaneously to the in-flight batching Triton Server, which resulted in the Triton Server getting stuck.
The specific behavior is as follows: GPU utilization in nvidia-smi stays at 100% while power consumption sits between 80 W and 95 W, and during this time none of the threads receive responses. The situation is illustrated in the nvidia-smi screenshot attached to the issue.
In my testing, there is no such issue when sending requests from a single thread. With a larger number of threads, however, the problem appears after approximately one minute. If I switch back to the release/0.5.0 branch, the Triton Server stays healthy even under continuous stress testing with 40 threads for 16 hours.
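For reference, a minimal way to generate this kind of multi-threaded load is to fire many concurrent requests at Triton's HTTP generate endpoint. The sketch below is an illustration only, not the author's actual test script; it assumes the standard `ensemble` model name and the `text_input`/`max_tokens`/`bad_words`/`stop_words` fields exposed by the tensorrtllm_backend example, and that Triton is listening on localhost:8000.

```bash
#!/bin/bash
# Hypothetical stress test: 40 concurrent clients, each looping requests
# against Triton's HTTP generate endpoint. Model name, endpoint, and input
# field names are assumptions based on the tensorrtllm_backend ensemble example.
URL="http://localhost:8000/v2/models/ensemble/generate"
for i in $(seq 1 40); do
  (
    while true; do
      curl -s -X POST "$URL" \
        -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": ""}' \
        > /dev/null
    done
  ) &
done
wait
```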
I am happy to provide more information if needed.
This is how I build the engine:
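The exact build command was not captured in this thread. As a rough sketch of what a build for this setup might look like, assuming a Llama-family model, the v0.5/v0.6-era examples/llama/build.py script, and the tp_size=8, pp_size=1 configuration mentioned earlier in the thread (all paths and values are illustrative, not the author's actual command):

```bash
# Hypothetical engine build for in-flight batching with tp_size=8.
# Script name and flags follow the TensorRT-LLM examples/llama/build.py
# of that era; adjust to the model and version actually used.
python3 examples/llama/build.py \
    --model_dir /path/to/hf_model \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_inflight_batching \
    --paged_kv_cache \
    --remove_input_padding \
    --max_batch_size 8 \
    --world_size 8 \
    --tp_size 8 \
    --output_dir /path/to/engines/8-gpu
# Note: --max_batch_size should match max_batch_size in the Triton
# tensorrt_llm/config.pbtxt, as discussed in the comments above.
```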