Streaming openai api support #43
Conversation
@lealaxy Can you add usage of the new feature to the documentation?
Hello, I have added the docs and modified the format. Now, by running:
When I test the API with curl:
curl http://localhost:19327/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user","content": "给我讲一些有关杭州的故事吧"}
],
"repetition_penalty": 1.0, "stream":true
}'
an error occurs:
@lealaxy Do you have any idea? Could it be related to the package version? My env:
Yes, it is because of the version you are using. Additionally, my env:
After installing pydantic==1.10.9 (other packages remain the same):
Test without streaming:
curl http://localhost:19327/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user","content": "中国的首都在哪里?"}
],
"repetition_penalty": 1.0
}'
Output (copied from terminal):
Test with streaming:
curl http://localhost:19327/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user","content": "中国的首都在哪里?"}
],
"repetition_penalty": 1.0,
"stream":true
}'
Output (copied from terminal; ping info omitted):
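For reference, the stream can also be consumed from Python instead of curl. This is a minimal sketch of parsing OpenAI-style server-sent-event lines, assuming the server emits `data: {...}` chunks and a final `data: [DONE]` as in the OpenAI streaming format; the function name and sample payloads are illustrative, not part of this project:

```python
import json

def parse_sse_chunks(lines):
    """Collect the content deltas from OpenAI-style SSE lines (sketch)."""
    deltas = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive / ping lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            deltas.append(delta["content"])
    return deltas

# Illustrative sample of what a streamed response might look like.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Beijing"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_chunks(sample)))  # -> Beijing
```

In a real client the lines would come from `requests.post(..., stream=True).iter_lines()` against the endpoint shown in the curl examples above.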
It looks like the output is incomplete and mixed up with some extra tokens.
Sorry. This was due to an error during the split-generate process. I have fixed the bug now.
There are still some extra tokens...
Output:
Fixed.
Input:
Output:
Input 2:
Output 2:
But why is there an extra (
The bug causing the repetition of the last word has been fixed.
I tested the API in stream mode, but the responses were strange. My input:
curl http://localhost:19327/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user","content": "告诉我中国的首都在哪里"}
],
"repetition_penalty": 1.0,
"stream": true
}'
Output: the previous responses all returned
Sorry. Fixed.
It seems that not every token is returned immediately when generated. Instead, every once in a while, all tokens generated during that period are returned together.
This is due to differences in GPU computing power: different GPUs have different token generation speeds. The generated tokens are added to a buffer, and I use the TextIteratorStreamer to read from it. In your case, it is possible that your GPU generated a large number of tokens in a short period, and reading from the buffer returned multiple tokens at once. Then, due to CUDA calls or other reasons, there was an idle period, followed by another short burst of tokens. However, this does not affect usage. On my machine (with server cooling) using an A100 GPU, token generation time is consistent.
I tested on a P40 GPU, which has lower compute capability than the A100. When generating, it returns several None outputs. My shell command:
python scripts/openai_server_demo/openai_api_server.py --base_model [model_path] --gpus 0
Input and output:
I guess this is related to the package version.
You are right. The issue you encountered occurred in earlier versions; in the latest version it has been fixed.
That's the reason. After updating the version, it works.
Description
Use `transformers` `TextIteratorStreamer` to support streaming responses for the OpenAI API.
Ref: https://huggingface.co/docs/transformers/internal/generation_utils
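To illustrate the other half of the pipeline, tokens read from the streamer can be wrapped into OpenAI-style chunks before being sent as SSE. This is a hedged sketch: the field subset mirrors the OpenAI streaming format, and the default model name is a placeholder, not necessarily what the server in this PR emits:

```python
import json

def sse_events(tokens, model="chinese-alpaca"):
    """Yield OpenAI-style SSE chunk lines for an iterator of text tokens
    (illustrative sketch, not the PR's actual implementation)."""
    for token in tokens:
        chunk = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [
                {"index": 0, "delta": {"content": token}, "finish_reason": None}
            ],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    # End-of-stream sentinel expected by OpenAI-compatible clients.
    yield "data: [DONE]\n\n"

events = list(sse_events(["Beijing", "."]))
print(events[0].strip())
```

In a FastAPI server such a generator would typically be returned via a streaming response with the `text/event-stream` media type.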
Related Issue
None