[Performance]: 5x slower throughput with openAI client/server than native one #7935
Comments
Thanks @stas00. There may be scalability issues with the client, but we are also aware of significant overhead in the current OpenAI server layer that we are actively working on addressing. It's probably also not a good idea to create multiple client instances in the same proc; I'd suggest using a single client with asyncio or multiple threads.
Thank you for the suggestions, Nick. I have already tried using multi-proc; it makes a very marginal improvement over multiple threads. I have been using this approach:
I checked that the node I run the clients on is totally underloaded - 200+ CPU cores and 2TB of RAM. Running the k6 JS client on the same instance is a breeze - it scales great with high GPU util. So what would you recommend I use as a Python client that I could crank up the concurrency with?
But I have no scalability issue with your openai server: if I use the k6 client I get high GPU util and high throughput - it's the openAI completions client that causes the problem somehow. It feels like the client is somehow blocking the server from continuing its compute, as in the server comm layer is blocking compute - shouldn't they be async? i.e. the server should continue its compute without waiting for the client to receive the generated tokens so far.
I was suggesting that instead of creating a client per worker, you try having them all use the same client instance. I'm not sure whether the client is threadsafe, so this may or may not work. An alternative would be to use the async variant of the client.
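For reference, a minimal sketch of the shared-async-client approach described above, using the openai package's AsyncOpenAI against a local vLLM OpenAI-compatible endpoint (the base URL, model name, and prompts are illustrative assumptions, not the actual benchmark):

```python
import asyncio
from openai import AsyncOpenAI

# One shared async client pointed at the local vLLM OpenAI-compatible server.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(prompt: str) -> str:
    resp = await client.completions.create(
        model="meta-llama/Llama-2-7b-hf",  # placeholder model name
        prompt=prompt,
        max_tokens=128,
    )
    return resp.choices[0].text

async def main() -> None:
    prompts = [f"Tell me a fact about the number {i}." for i in range(50)]
    # Single client instance, many in-flight requests via asyncio.gather.
    outputs = await asyncio.gather(*(one_request(p) for p in prompts))
    print(f"received {len(outputs)} completions")

asyncio.run(main())
```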
Have you seen the example in benchmarks/benchmark_serving.py?
Thank you for this suggestion, Nick. I wasn't aware of the asyncio openAI API. Should it be mentioned somewhere in the performance section? I think it's critical for vllm, since if the client is the bottleneck it's vllm's adoption that will suffer. So I extended my benchmark to include an asyncio version - it did improve the speed marginally, but we are still miles away from high GPU util using it. This is a 10% improvement vs the 500% needed to match other clients.
going to look into
@robertgshaw2-neuralmagic, your suggestion to use https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py seems to be a good one! I need to study it to see where my naive implementation is inefficient. Perhaps it'd be useful to document it as something a user could use to benchmark their use cases? I somehow missed it as it looked dev-facing and thus started writing my own, but it looks to be totally ready to be used by end users.

Also, how does the benchmark linked from the top-level README.md get updated? https://buildkite.com/vllm/performance-benchmark/builds/4068 - it seems to be quite old; many versions have been published since it was made. I'm not sure how to find the most recent one - I see only partial benchmark reports on that website.

But good news: now that I have this tool I can go and try your AutoAWQ suggestion, Robert.
Couple things:
RE: quantization - please try out LLM-compressor!
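For anyone following along, a rough sketch of what an LLM-compressor FP8 one-shot quantization run can look like, based on the llm-compressor examples around the 0.1.x releases; the model id, save directory, and exact modifier arguments are assumptions and may differ between versions:

```python
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model choice

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Dynamic FP8 weight+activation quantization for all Linear layers except the LM head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save a checkpoint that vLLM can then load directly.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```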
That's super useful to know, Robert. Do you by chance know if this is documented somewhere, in particular what the current limit is and what it depends on? CPU cores, RAM, something else? Is there a plan to fix that (is there an issue tracking it)? I'm aware that this is outside of vllm. And what concurrency did you hit in your experiments where it was still OK? i.e. what is the size of the subset you mentioned?
Appreciate the notes - let me experiment some more with this new tool so that I feel comfortable with getting consistent results (as I see some fluctuations which would impact the measuring of optimization features). Any suggestions on how I should run this benchmark so that I get more consistent outputs if I repeat the same benchmark multiple times in a row? Right now I have been using:
Probably more prompts? Also, is there a way this benchmark can tell me the upper limit of concurrent requests vllm can handle before it starts impacting per-user throughput / TTFT, other than doing multiple tries and finding the concurrency where TTFT + decode throughput is still reasonable?
Yes, now I'm going to re-try them all. I tried them first with a naive single client and couldn't tell any difference. So now I should be good to see the actual impact.
I found out when I was observing the vllm logs and then inspecting the openai client source code. I don't quite understand the use case for sending N requests from the same client though, other than for benchmarking. Can you elaborate more on your use case?
The key thing with quantization is that performance is a function of the request rate on the application. So, the sample you have here is effectively an offline batch use case. Personally, I like to look at TPOT and TTFT as a function of queries per second on the model. In my blog, I discuss how various quantization schemes impact performance, and I included details on how to replicate the benchmarks:
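To make the metrics concrete, here is a small sketch of how TTFT and TPOT are commonly derived from a single streamed completion (the endpoint, model name, and the one-chunk-per-token approximation are assumptions for illustration):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local vLLM server

start = time.perf_counter()
first_token_time = None
num_chunks = 0

# Stream so we can observe when the first token arrives.
stream = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model name
    prompt="Explain KV cache quantization in one paragraph.",
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if first_token_time is None:
        first_token_time = time.perf_counter()
    num_chunks += 1
end = time.perf_counter()

ttft = first_token_time - start                            # time to first token
tpot = (end - first_token_time) / max(num_chunks - 1, 1)   # avg time per output token (approx.)
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms/token")
```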
One more note: we have some performance overheads in the OpenAI server that we are very close to resolving. I can ping you again once these are complete.
Please feel free to post your results here once you have them; I can take a look and help you understand what is going on.
We are seeing this as well. Reverting to 0.5.3.
@binarycrayon This issue does not have anything to do with vLLM version - can you elaborate on what you are seeing? |
I'm running vllm with 9 LoRA adapters with the openai server against our product.
Thank you. Could you try running on v0.5.5 with --disable-frontend-multiprocessing in the launch script?
Benchmarking. There are ~6 different quantization techniques theoretically supported by vllm, so how would you know which one works best if you don't measure performance (assuming quality is on par)? It's a big thing - all these competing frameworks - so it's critical to be able to quickly measure which one delivers the best speed on given hardware while keeping quality. I think ideally I'd have an abstraction layer where the server software can use multiple frameworks and switch between them at will, depending on the use case. For example, if we look at https://buildkite.com/vllm/performance-benchmark/builds/4068 there is no clear winner - and even if there were one, a few months later the winner is likely to be the loser and vice versa. So, as I flagged earlier, the benchmark you shared is 2 months old and surely at least vllm has improved since then, so what's the point of showing information that is probably no longer true?
Yes, please, Robert!
This is much appreciated, Robert - I will do that!
And I found your vllm/benchmarks/backend_request_func.py (lines 222 to 298 in 61e5927).
OK, so I will be running various setups all around a llama2-8b model w/ the baseline:

❌ AutoAWQ (worse than the baseline)
❌ BNB (much worse than the baseline)
And the good results:

✅ INT8 W8A8 (better than the baseline)
✅ FP8 W8A8 (on-the-fly / dynamic quantization) (better than the baseline)
✅ FP8 W8A8 (llmcompressor==0.1.0) (better than the baseline)
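For context, a minimal sketch of how the on-the-fly FP8 variant above can be exercised with vLLM's offline Python API (the model name and sampling parameters are placeholders, not the exact benchmark configuration):

```python
from vllm import LLM, SamplingParams

# quantization="fp8" asks vLLM to quantize the weights to FP8 on the fly.
# For a pre-quantized llm-compressor checkpoint, you would instead point `model`
# at the saved directory and let vLLM pick up the quantization config itself.
llm = LLM(model="meta-llama/Llama-2-7b-hf", quantization="fp8")

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize the benefits of FP8 quantization."], params)
print(outputs[0].outputs[0].text)
```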
What QPS rate are you running at? I would suggest running the serving experiments for ~2 minutes, so that the server can come into equilibrium.
I suppose it's QPS=50 since I have 50 prompts sent all at once - or do you measure QPS differently? Here is how I run the benchmark:
The problem is that if I make the num-prompts much higher it's likely to hit a queue, no? Then the measurements would be wrong for the purpose of the benchmark. That's why earlier I asked how to find the threshold at which vllm starts queueing up the requests. I think this is a crucial metric as well, since it really tells the server's capacity. Once a request is queued, the TTFT is going to be bad. It sounds like the benchmark needs a new config option to tell it how long to run for? A typical benchmark usually has a warmup period. Though how would one warm up here? I guess run the first X requests and don't count them towards the stats?
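To illustrate the difference between firing all prompts at once and offering load at a target request rate, here is a small sketch of Poisson-paced dispatch with the async client. It mirrors the idea behind a request-rate setting rather than reproducing benchmark_serving.py's code; the server URL, model, and values are assumptions:

```python
import asyncio
import random
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

async def send(prompt: str):
    return await client.completions.create(
        model="meta-llama/Llama-2-7b-hf",  # placeholder model name
        prompt=prompt,
        max_tokens=128,
    )

async def run(prompts, request_rate: float):
    """Dispatch requests with exponential inter-arrival times (a Poisson process)."""
    tasks = []
    for prompt in prompts:
        tasks.append(asyncio.create_task(send(prompt)))
        await asyncio.sleep(random.expovariate(request_rate))  # mean gap = 1 / request_rate seconds
    return await asyncio.gather(*tasks)

prompts = [f"Question {i}: what is {i} squared?" for i in range(200)]
results = asyncio.run(run(prompts, request_rate=10.0))  # ~10 requests/second of offered load
print(f"completed {len(results)} requests")
```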
KV cache quantization experiment results:
Proposal to improve performance
I've been trying to write a reliable benchmark to be used with vllm, and I discovered that when I use the openAI client it can't scale. If I try to use 50 concurrent clients, the GPU load goes down to 5% and the throughput is extremely slow. The more clients I add, the worse things get. With a single client there is no problem.
I then used the same benchmark switching to the vllm native client/server, and I'm getting 60-70% GPU util and 5x higher throughput.
I checked that I had the same SamplingParams reported by the server in both cases. In parallel with those, I was using https://github.com/grafana/k6 against both use cases - with the openAI entrypoints and with the native entrypoint - and I can confirm that the server isn't the problem: in both cases I get high GPU util and high throughput with the k6 client.
I thought that perhaps streaming was the cause but disabling it made a very small difference.
So everything points to the openAI client - I know that it's not your product but you recommend using it with the openAI entrypoint:
So perhaps you have some insights into what I'm missing? I'm just using your examples as is.
vllm==0.5.5 here
Thank you!
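For reference, a minimal sketch of the kind of many-concurrent-clients setup described above, with one synchronous openai client per thread (the server URL, model, and prompts are placeholders; this is not the actual benchmark script):

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

NUM_CLIENTS = 50  # mirrors the 50 concurrent clients mentioned above

def worker(idx: int) -> int:
    # One client instance per worker - the pattern that showed poor scaling.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.completions.create(
        model="meta-llama/Llama-2-7b-hf",  # placeholder model name
        prompt=f"Write a haiku about request number {idx}.",
        max_tokens=128,
    )
    return len(resp.choices[0].text)

with ThreadPoolExecutor(max_workers=NUM_CLIENTS) as pool:
    lengths = list(pool.map(worker, range(NUM_CLIENTS)))
print(f"finished {len(lengths)} requests")
```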