
[Doc]: AutoAWQ quantization example fails #7717

Closed
stas00 opened this issue Aug 21, 2024 · 5 comments · Fixed by #7937
Labels: documentation (Improvements or additions to documentation)

Comments

stas00 (Contributor) commented Aug 21, 2024

📚 The doc issue

The quantization example at https://docs.vllm.ai/en/latest/quantization/auto_awq.html can't be run: AutoAWQ looks for safetensors files, and https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main doesn't have any.

    return model_class.from_pretrained(
  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3477, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named model.safetensors found in directory /data/huggingface/hub/models--lmsys--vicuna-7b-v1.5/snapshots/3321f76e3f527bd14065daf69dad9344000a201d.

autoawq=0.2.6

Suggest a potential alternative/fix

I tried another model that does have .safetensors files, but then it fails with:

  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/datasets/data_files.py", line 332, in resolve_pattern
    fs, _, _ = get_fs_token_paths(pattern, storage_options=storage_options)
  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/fsspec/core.py", line 681, in get_fs_token_paths
    paths = [f for f in sorted(fs.glob(paths)) if not fs.isdir(f)]
  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 417, in glob
    return super().glob(path, **kwargs)
  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/fsspec/spec.py", line 613, in glob
    pattern = glob_translate(path + ("/" if ends_with_sep else ""))
  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/fsspec/utils.py", line 732, in glob_translate
    raise ValueError(
ValueError: Invalid pattern: '**' can only be an entire path component

I see that this example was copied from https://github.com/casper-hansen/AutoAWQ?tab=readme-ov-file#examples; it's identical there and broken at the source as well.

edit: I think the issue is the datasets version; I'm able to run this version https://github.com/casper-hansen/AutoAWQ/blob/6f14fc7436d9a3fb5fc69299e4eb37db4ee9c891/examples/quantize.py with datasets==2.21.0.

The version from https://docs.vllm.ai/en/latest/quantization/auto_awq.html still fails as explained above.
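
For anyone hitting the same ValueError, a quick sanity check of the installed versions (a minimal sketch; 2.21.0 is just the combination that happened to work for me, not a hard requirement):

# Sanity check: the fsspec "Invalid pattern: '**'" error above went away for me
# once datasets was at 2.21.0, so check what is actually installed before digging further.
import datasets
import fsspec

print("datasets:", datasets.__version__)  # 2.21.0 worked in my environment
print("fsspec:", fsspec.__version__)      # glob_translate (where the error is raised) lives in fsspec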

stas00 added the documentation label on Aug 21, 2024
stas00 changed the title from "[Doc]: AWQ example is broken" to "[Doc]: AutoAWQ quantization example fails" on Aug 21, 2024
stas00 (Contributor, Author) commented Aug 21, 2024

So the vLLM example probably needs to be updated to one that actually works, e.g.:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
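
For completeness, the quantized output can then be loaded in vLLM roughly like this (a minimal sketch, not part of the doc example; the model path is the quant_path directory saved above):

# Load the AWQ checkpoint produced above with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="mistral-instruct-v0.2-awq", quantization="awq")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)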

I have filed a PR there to fix the datasets version (casper-hansen/AutoAWQ#593) and another to fix the example (casper-hansen/AutoAWQ#595).

robertgshaw2-neuralmagic (Collaborator) commented:

Can you post a PR with the change?

robertgshaw2-neuralmagic (Collaborator) commented Aug 22, 2024

@stas00 AWQ is great, BTW. However, if you have high-QPS or offline workloads, I would suggest using activation quantization to get the best performance. With activation quantization we can use the lower-bit tensor cores, which have 2x the FLOPs, so we can accelerate the compute-bound regime (which becomes the bottleneck). AWQ 4-bit will still get the best possible latency in very low-QPS regimes (e.g. QPS = 1), but outside of that, activation quantization will dominate.

Some benchmarks analyzing this result are in this blog:

Here are some examples of how to make activation-quantized models for vLLM:

I figured this might be useful for you.
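
As a rough illustration only (not one of the linked examples), an FP8 dynamic activation-quantization flow with the llm-compressor library looks approximately like the sketch below; module and class names are from memory and may differ between versions:

# Rough sketch of an FP8 dynamic (weights + activations) quantization flow for vLLM,
# assuming the llm-compressor library; exact names may differ across versions.
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # reusing the model from the comment above
model = SparseAutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8_DYNAMIC: FP8 weights with activation scales computed at runtime,
# so no calibration dataset is needed.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = "mistral-instruct-v0.2-fp8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

The resulting directory can then be passed to vLLM as the model path.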

stas00 (Contributor, Author) commented Aug 28, 2024

> Can you post a PR with the change?

done: #7937

I'd love to experiment with your suggestions, Robert. Do I need to use your fork for that?

But first I need to figure out how to reliably measure performance so that I can measure the impact; currently, as I reported in #7935, it doesn't scale when using the OpenAI client. What benchmarks do you use to compare the performance of various quantization techniques?

Thank you!

robertgshaw2-neuralmagic (Collaborator) commented:

> I'd love to experiment with your suggestions, Robert. Do I need to use your fork for that?

Nope, you do not need the fork. These methods are all supported in vLLM.

Re: OpenAI performance, Nick and I are working on it.
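
As a quick way to try activation quantization without producing a checkpoint first, vLLM can also quantize weights to FP8 on the fly with dynamic activation scales. A minimal sketch (assumes a GPU with FP8 support to get the full speedup):

# On-the-fly FP8 quantization, handy for comparing against the AWQ checkpoint from earlier.
from vllm import LLM

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", quantization="fp8")
print(llm.generate(["The capital of France is"])[0].outputs[0].text)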
