
add support for 4bit inference #47

Merged
merged 7 commits into from
Aug 2, 2023

Conversation

iMountTai
Collaborator

@iMountTai iMountTai commented Aug 2, 2023

Description

This PR adds support for 4-bit inference, which effectively reduces VRAM usage. VRAM usage after loading the model is as follows (measured on a Tesla P40):

| load_in_4bit | load_in_8bit | fp16 |
|--------------|--------------|--------|
| 4999M        | 7989M        | 13759M |

Usage
Just add `--load_in_4bit` to the launch command. For example:

python scripts/inference/inference_hf.py \
    --base_model path_to_merged_llama2_or_alpaca2_hf_dir \
    --with_prompt \
    --load_in_4bit \
    --interactive
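
As a rough illustration of how such a flag typically feeds into model loading, here is a minimal sketch of mapping a `--load_in_4bit` / `--load_in_8bit` command-line switch to `from_pretrained`-style keyword arguments. The helper name `build_load_kwargs` and the exact kwargs are illustrative assumptions, not taken from this PR's script:

```python
# Hypothetical sketch: translate quantization flags into model-loading
# kwargs. Only one of 4-bit / 8-bit should be active at a time.
import argparse


def build_load_kwargs(argv):
    """Parse quantization flags and return kwargs for model loading."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--load_in_4bit", action="store_true")
    parser.add_argument("--load_in_8bit", action="store_true")
    args = parser.parse_args(argv)

    # fp16 is the baseline; quantization flags narrow it further.
    kwargs = {"torch_dtype": "float16"}
    if args.load_in_4bit:
        kwargs["load_in_4bit"] = True
    elif args.load_in_8bit:
        kwargs["load_in_8bit"] = True
    return kwargs
```

With `--load_in_4bit` the returned kwargs would request 4-bit weights, matching the roughly 4999M footprint reported in the table above.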

Related Issue

None.

@iMountTai iMountTai requested a review from airaria August 2, 2023 07:26
@airaria
Contributor

airaria commented Aug 2, 2023

  1. Update requirements.txt.
  2. Since this is a breaking change, please make sure all occurrences of `load_in_8bit` in the Wiki and scripts are replaced with `load_in_kbit 8`.
  3. Can `load_in_kbit` be added to openai_api_server.py and openai_api_server_vllm.py?
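
The suggested rename generalizes the boolean `load_in_8bit` flag into a single k-valued option. A minimal sketch of how such a `load_in_kbit` parameter might dispatch to quantization settings (the helper name and accepted values are assumptions for illustration):

```python
# Hypothetical sketch: one load_in_kbit parameter replacing separate
# boolean flags; 16 means no quantization (plain fp16).
def quantization_kwargs(load_in_kbit=16):
    """Map a bit-width choice to model-loading quantization kwargs."""
    if load_in_kbit == 4:
        return {"load_in_4bit": True}
    if load_in_kbit == 8:
        return {"load_in_8bit": True}
    if load_in_kbit == 16:
        return {}
    raise ValueError("load_in_kbit must be 4, 8, or 16")
```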

@iMountTai
Collaborator Author

iMountTai commented Aug 2, 2023

@airaria

  • The requirements and wiki have been updated, and the `load_in_8bit` setting has been retained.
  • `load_in_kbit` has been added to openai_api_server.py; openai_api_server_vllm.py does not support `load_in_kbit`.

@airaria
Contributor

airaria commented Aug 2, 2023

TODO: Update wiki w.r.t. this PR. @iMountTai

@ymcui ymcui merged commit 7b3e5ab into ymcui:main Aug 2, 2023