Under the main branch, stress testing the in-flight Triton Server with multiple threads can result in the Triton Server getting stuck. #149
Comments
@StarrickLiu, we have been trying to reproduce the hang for a few days and have not been able to so far. Have you been able to identify the update to the …
I see a crash before I see a hang when I run the press.sh script, @StarrickLiu. I don't see the crash if I run with pipeline parallelism enabled. Can you try to build your model with these parameters instead and see if you still see the hang? `--tp_size 2`, with all other parameters the same.
Another thing you can try is adding this to your /xverse/tensorrt_llm/config.pbtxt file. After disabling TRT overlap, I don't see the crashes even with the original (tp_size=8, pp_size=1) model. The block to add begins with `parameters: {`.
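Presumably this refers to the parameter that disables TRT overlap. A minimal sketch of what the full block would look like, assuming the standard `enable_trt_overlap` key and the string-valued parameter format used by the TensorRT-LLM backend (the key name and value are assumptions, since the original snippet was cut off in this thread):

```
# Hypothetical reconstruction: disable TRT overlap in
# triton_model_repo/tensorrt_llm/config.pbtxt
parameters: {
  key: "enable_trt_overlap"
  value: {
    string_value: "False"
  }
}
```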
We have run further tests and resolved the hang by modifying the configuration. In addition to disabling TRT overlap, it is crucial that the batch size in the Triton Server configuration exactly matches the batch size used when the engine was built; with that in place, the hang no longer occurs. Could the hang be caused by some memory misalignment issue?
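As a concrete illustration of the batch-size consistency point, a minimal sketch, assuming the engine was built with `--max_batch_size 8` (the value 8 is only an example):

```
# Hypothetical: this field in tensorrt_llm/config.pbtxt must equal the
# --max_batch_size value passed when the engine was built.
max_batch_size: 8
```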
Encountered the same problem with tag v0.6.1. It can be solved by following StarrickLiu's solution.
NVIDIA/TensorRT-LLM#865 can reproduce the hang. It can be solved by the workaround in #149 (comment).
I was trying the same with
It gets stuck at this point with the following GPU stats. This is how I convert the HF weights:
This is how I build the engine:
These are the parameters that I've changed in tensorrt_llm/config.pbtxt. The rest of the config is the same as here.
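The commenter's actual diff was not captured in this thread. For orientation only, these are the kinds of fields typically edited in the tensorrtllm_backend tensorrt_llm/config.pbtxt for in-flight batching; every value below is a placeholder, not what was actually used:

```
# Hypothetical excerpt of tensorrt_llm/config.pbtxt; all values are placeholders.
model_transaction_policy {
  decoupled: true            # true for streaming clients, false otherwise
}
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/path/to/engines" }
}
```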
Am I missing something? Thanks in advance.
I encountered the same problem on the latest main branch even though trt_overlap is disabled and the batch size is consistent. The server hangs with a tp_size=2 engine while load testing for several hours.
@kaiyux @Shixiaowei02 @byshiue can you help us? |
I have the same error on the latest main branch even though trt_overlap is disabled.
Hi @MrD005 @ARomoH @jellysnack, this is related to #390.
Thanks @lkm2835, |
I'm facing a similar issue: #577.
As indicated by the title, on the main branch I used 40 threads to send inference requests simultaneously to the in-flight batching Triton Server, which resulted in the Triton Server getting stuck.
The specific behavior is as follows: GPU utilization in nvidia-smi stays at 100% while power consumption sits between 80 W and 95 W, and during this time none of the threads receive responses. The situation is illustrated in the nvidia-smi screenshot attached to the issue.
In my testing, there is no such issue when sending requests from a single thread. With a larger number of threads, however, the problem appears after approximately one minute. If I switch back to the release/0.5.0 branch, the Triton Server stays healthy even under continuous stress testing with 40 threads for 16 hours.
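For reference, a minimal way to generate this kind of multi-threaded load is to fire many concurrent requests at Triton's HTTP generate endpoint. The sketch below is an illustration only, not the author's actual test script; it assumes the standard `ensemble` model name and the `text_input`/`max_tokens`/`bad_words`/`stop_words` fields exposed by the tensorrtllm_backend example, and that Triton is listening on localhost:8000.

```bash
#!/bin/bash
# Hypothetical stress test: 40 concurrent clients, each looping requests
# against Triton's HTTP generate endpoint. Model name, endpoint, and input
# field names are assumptions based on the tensorrtllm_backend ensemble example.
URL="http://localhost:8000/v2/models/ensemble/generate"
for i in $(seq 1 40); do
  (
    while true; do
      curl -s -X POST "$URL" \
        -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": ""}' \
        > /dev/null
    done
  ) &
done
wait
```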
I am happy to provide more information if needed.
This is how I build the engine:
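The exact build command was not captured in this thread. As a rough sketch of what a build for this setup might look like, assuming a Llama-family model, the v0.5/v0.6-era examples/llama/build.py script, and the tp_size=8, pp_size=1 configuration mentioned earlier in the thread (all paths and values are illustrative, not the author's actual command):

```bash
# Hypothetical engine build for in-flight batching with tp_size=8.
# Script name and flags follow the TensorRT-LLM examples/llama/build.py
# of that era; adjust to the model and version actually used.
python3 examples/llama/build.py \
    --model_dir /path/to/hf_model \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_inflight_batching \
    --paged_kv_cache \
    --remove_input_padding \
    --max_batch_size 8 \
    --world_size 8 \
    --tp_size 8 \
    --output_dir /path/to/engines/8-gpu
# Note: --max_batch_size should match max_batch_size in the Triton
# tensorrt_llm/config.pbtxt, as discussed in the comments above.
```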