Streaming OpenAI API support #43

Merged (10 commits, Aug 9, 2023)

Conversation

yunhaoli24 (Contributor)

Description

Use transformers' TextIteratorStreamer to support streaming responses for the OpenAI API.

Ref https://huggingface.co/docs/transformers/internal/generation_utils
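
For context, the core pattern is roughly the following (a minimal sketch, not the PR's exact code; model_path and prompt are placeholders):

from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# The streamer buffers generated tokens; iterating over it yields decoded text.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer(prompt, return_tensors="pt")

# generate() blocks, so it runs in a background thread while the server consumes the stream.
thread = Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512))
thread.start()

for new_text in streamer:  # each item is whatever text became printable since the last read
    pass  # wrap new_text in a chat.completion.chunk and send it as an SSE event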

Related Issue

None

@airaria (Contributor) commented Aug 3, 2023

@lealaxy Can you add usage instructions for the new feature to the documentation scripts/openai_server_demo/README.md?

@yunhaoli24 (Contributor, Author)

> @lealaxy Can you add usage instructions for the new feature to the documentation scripts/openai_server_demo/README.md?

Hello, I have added the docs and modified the format of the chat/completions API request body.

Now, by running openai_api_server.py, you can use chinese-llama-alpaca-2 as the backend for any ChatGPT-style frontend application.
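
For example, with the legacy openai Python SDK (<1.0, current at the time of this PR) pointed at the local server, the stream can be consumed like this (a sketch; the dummy api_key assumes the demo server does not validate it):

import openai

openai.api_base = "http://localhost:19327/v1"
openai.api_key = "none"  # assumed placeholder; the demo server is presumed not to check it

response = openai.ChatCompletion.create(
    model="chinese-llama-alpaca-2",
    messages=[{"role": "user", "content": "给我讲一些有关杭州的故事吧"}],
    stream=True,
)
for chunk in response:
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)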

@airaria (Contributor) commented Aug 3, 2023

When I test the API with curl:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "给我讲一些有关杭州的故事吧"}
    ],
    "repetition_penalty": 1.0, "stream":true
  }'

an error occurs:

Traceback (most recent call last):
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 429, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/sse_starlette/sse.py", line 251, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 574, in __aexit__
    raise exceptions[0]
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/sse_starlette/sse.py", line 240, in wrap
    await func()
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/sse_starlette/sse.py", line 225, in stream_response
    async for data in self.body_iterator:
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/concurrency.py", line 63, in iterate_in_threadpool
    yield await anyio.to_thread.run_sync(_next, iterator)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/concurrency.py", line 53, in _next
    return next(iterator)
  File "/Users/yangziqing/Documents/projects/llama/PR/Chinese-LLaMA-Alpaca-2/scripts/openai_server_demo/openai_api_server.py", line 228, in stream_predict
    yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/typing_extensions.py", line 2509, in wrapper
    return __arg(*args, **kwargs)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/pydantic/main.py", line 945, in json
    raise TypeError('`dumps_kwargs` keyword arguments are no longer supported.')

@lealaxy Do you have any idea? Could it be related to the package version?

my env:
python 3.9.16
pydantic 2.1.1
fastapi 0.100.1
uvicorn 0.21.1
sse-starlette 1.6.1
starlette 0.27.0

@yunhaoli24
Copy link
Contributor Author

Yes, it is because you are using pydantic>2.0.0, while in my environment I have pydantic 1.10.9 installed.

Additionally, deepspeed currently requires pydantic<2.0.0. To keep the inference and training environments consistent, I suggest installing pydantic<2.0.0.

my env:
Python 3.10.11
pydantic 1.10.9
fastapi 0.100.1
uvicorn 0.22.0
sse-starlette 1.6.1
starlette 0.27.0
torch 2.0.1
deepspeed 0.10.0
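
For anyone who needs to stay on pydantic 2.x, a version-tolerant serialization helper could look like this (my suggestion, not part of this PR; in v2, .json() rejects extra dumps_kwargs and model_dump_json replaces it):

def chunk_to_json(chunk) -> str:
    """Serialize a pydantic model across the v1/v2 API split."""
    if hasattr(chunk, "model_dump_json"):  # pydantic >= 2
        return chunk.model_dump_json(exclude_unset=True)
    # pydantic 1.x: .json() forwards kwargs such as ensure_ascii to json.dumps
    return chunk.json(exclude_unset=True, ensure_ascii=False)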

@airaria (Contributor) commented Aug 3, 2023

After installing pydantic==1.10.9 (other packages unchanged), I ran the server (on macOS):

python openai_api_server.py --base_model ./chinese-alpaca-2-7b --only_cpu

Test without streaming:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "中国的首都在哪里?"}
    ],
    "repetition_penalty": 1.0
  }'

Output (copied from terminal):

{"id":"chatcmpl-N4hHBcHfx7WjwcEsQaNG8k","object":"chat.completion","created":1691067769,"model":"chinese-llama-alpaca-2","choices":[{"index":0,"message":{"role":"user","content":"中国的首都是哪里?"}},{"index":1,"message":{"role":"assistant","content":"中国的首都是北京。"}}]}

Test with streaming:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "中国的首都在哪里?"}
    ],
    "repetition_penalty": 1.0,
    "stream":true
  }'

Output (copied from terminal; ping info omitted):

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "的"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "首都"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "是"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "北京"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "京。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]

It looks like the output is incomplete and garbled with some stray tokens.
Am I using it the right way?

@yunhaoli24 (Contributor, Author)

Sorry, this was due to an error in how the generated text was split into chunks. I have fixed the bug now.

@airaria (Contributor) commented Aug 4, 2023

There are still some extra tokens...
Input:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "中国的首都是哪里"},
      {"role": "assistant","content": "北京。"},
      {"role": "user","content": "法国的呢"}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

Output:

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": " "}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "法国"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "的"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "首都"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "是"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "巴黎"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "黎。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]

@yunhaoli24 (Contributor, Author)

Fixed. I think that "" should not be counted as an extra token.

Input:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "中国的首都是哪里"},
      {"role": "assistant","content": "北京。"},
      {"role": "user","content": "法国的呢"}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

Output:

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "法国"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "的"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "首都"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "是"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "巴黎"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "黎。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]

Input 2:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "中国的首都在哪里?"}
    ],
    "repetition_penalty": 1.0,
    "stream":true
  }'

Output 2:

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "中国"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "的"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "首都"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "是"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "北京"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "京。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]

@airaria (Contributor) commented Aug 4, 2023

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "北京"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "京。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

But why is there an extra "京。" (and "黎。"), and a repeated ""?
Can you fix this?

@yunhaoli24 (Contributor, Author)

The repeated "" is normal behavior during the model's generation process: the streamer sometimes returns an empty string, but this doesn't affect the final answer once the deltas are concatenated.

The bug causing the repetition of the last word has been fixed.

@airaria requested a review from GoGoJoestar on August 7, 2023
@GoGoJoestar (Collaborator)

I tested the API in stream mode, but the responses were strange:

My input:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "告诉我中国的首都在哪里"}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

Output:

[screenshot of the streamed responses]

The previous responses all returned ""; only the last one returned output, and that output lost the first token.

@yunhaoli24 (Contributor, Author)

Sorry. Fixed.

@GoGoJoestar (Collaborator) commented Aug 8, 2023

It seems that not every token is returned immediately when generated. Instead, every once in a while, all tokens generated during that period are returned together.

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
... # many lines of repeated empty responses omitted
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "亚洲是一个广阔的洲,拥有许多美丽的国家和城市。以下是一些亚洲国家及其首都:\n"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "1. "}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
...
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "印度:新德里\n"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "2. "}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
...
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "巴基斯坦:伊斯兰堡\n"}, "finish_reason": null}]}
...

@yunhaoli24 (Contributor, Author)

This is due to differences in GPU computing power: different GPUs have different token generation speeds.

The generated tokens are added to a buffer, and I use TextIteratorStreamer to read from that buffer in a loop and return the text.

In your case, your GPU may have generated a large number of tokens in a short period, so a single read from the buffer returned multiple tokens at once. Then, due to CUDA calls or other reasons, there was an idle period, followed by another burst of tokens.

However, this does not affect usage. On my machine (a server with proper cooling) with an A100 GPU, token generation timing is consistent.
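
For reference, the buffer-reading loop amounts to something like the sketch below (in the spirit of the PR's stream_predict, not its exact code; the chunk fields mirror the output shown in this thread):

import json

def sse_events(streamer, model_name="chinese-llama-alpaca-2"):
    def event(payload):
        return "data: " + json.dumps(payload, ensure_ascii=False) + "\n\n"

    # The first event announces the assistant role, matching OpenAI's stream format.
    yield event({"object": "chat.completion.chunk", "model": model_name,
                 "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]})
    for new_text in streamer:
        # new_text may hold several tokens at once if generation briefly
        # outpaces this reader; clients simply concatenate the deltas.
        yield event({"object": "chat.completion.chunk", "model": model_name,
                     "choices": [{"index": 0, "delta": {"content": new_text}, "finish_reason": None}]})
    yield event({"object": "chat.completion.chunk", "model": model_name,
                 "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]})
    yield "data: [DONE]\n\n"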

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "告诉我中国的首都在哪里"}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "中国"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "的"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "首都"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "是"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "北京"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ",位于"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "华北"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "地区"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]

@GoGoJoestar (Collaborator)

I tested on a P40 GPU, which has lower compute capability than the A100. When generating, it returns several empty ("content": "") responses, then stalls and waits before returning a long chunk consisting of many tokens. If I input an English instruction, it returns tokens more frequently.

My shell command:

python scripts/openai_server_demo/openai_api_server.py --base_model [model_path] --gpus 0

Input and output:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "告诉我中国的首都在哪里"}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "Tell me where is the capital of China."}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

# Only the "content" fields are pasted here
"content": ""
"content": ""
"content": ""
"content": ""
"content": ""
"content": ""
"content": "中国的首都是北京。"

"content": ""
"content": "The"
"content": "capital "
"content": "of "
"content": "China "
"content": "is "
"content": ""
"content": ""
"content": ""
"content": ""
"content": "Beijing."

The count of "content": "" responses equals the token count of the following text after tokenization (excluding the first "content": "" returned for each request).

[screenshot of the tokenized text]

I guess TextIteratorStreamer only returns text once the tokens can be merged into a complete word, which leads to discontinuous output. However, generation on your device looks right. Can you check whether something is wrong with my usage?

@yunhaoli24 (Contributor, Author)

You are right: TextIteratorStreamer avoids emitting incomplete words.

The issue you encountered occurs with transformers<4.29.0 and was fixed in #22664.

In recent versions of transformers, TextIteratorStreamer handles Chinese text correctly, so Chinese tokens stream smoothly.
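
The buffering rule, paraphrased (a simplified sketch of the idea, not transformers' actual code): the streamer withholds decoded text until it ends at a safe boundary, and the fix adds a CJK branch so Chinese characters flush immediately:

def printable_delta(decoded: str, print_len: int) -> tuple[str, int]:
    """Return the newly printable text and the updated offset into `decoded`."""
    if decoded.endswith("\n"):            # a newline always flushes the buffer
        return decoded[print_len:], len(decoded)
    if decoded and _is_cjk(decoded[-1]):  # the fix: CJK characters stand alone
        return decoded[print_len:], len(decoded)
    cut = decoded.rfind(" ") + 1          # otherwise wait for a word boundary,
    return decoded[print_len:cut], cut    # so English words aren't split mid-word

def _is_cjk(ch: str) -> bool:
    # Basic CJK Unified Ideographs range only; the real check covers more blocks.
    return "\u4e00" <= ch <= "\u9fff"

Without the CJK branch, a run of Chinese tokens yields only empty deltas until a newline arrives, which matches the behavior observed above.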

@GoGoJoestar (Collaborator)

That's the reason. After updating transformers, everything is OK!

@ymcui merged commit 7b19c67 into ymcui:main on Aug 9, 2023