Qiang navi4x fp8 llama #9674
Conversation
* adds wvSpltK optimization for skinny gemm.
  Co-authored-by: Hashem Hashemi <[email protected]>
Fix 8K decode latency jump issue.
* add quantization_weights_path for fp8 weights
* fix lint
* fix lint
* change to quantized_weights_path
* fix lint
* Moving custom skinny gemm heuristic before hipblas or rocblas solutions. Disabling the now obsolete LLMM1 path
* Simplified the decision logic
* Added back one case when LLMM1 can be used. Defaulting to adding bias separately
* Moved bias addition inside tgemm
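For readers skimming the change, here is a rough sketch of the dispatch order this commit describes, assuming selection happens in a Python-level tgemm wrapper; the predicates and the matmul stand-ins are placeholders for illustration, not the fork's actual wvSpltK/LLMM1/hipBLASLt code:

```python
# Illustrative sketch of the dispatch order only; the real heuristic lives in
# the gradlib tgemm wrapper, and every condition/kernel here is a placeholder.
import torch

def is_skinny(m: int, n: int, k: int) -> bool:
    # Placeholder: decode-time GEMMs with only a handful of tokens (small m)
    # are the "skinny" shapes the custom kernel targets.
    return m <= 4 and n >= 1024

def llmm1_still_applies(m: int, n: int, k: int) -> bool:
    # Placeholder condition for the single case where LLMM1 was added back.
    return m == 1 and k <= 1024

def tgemm(a, weight, bias=None):
    m, k = a.shape
    n = weight.shape[0]
    if is_skinny(m, n, k):
        out = a @ weight.t()   # stand-in for the wvSpltK custom kernel
    elif llmm1_still_applies(m, n, k):
        out = a @ weight.t()   # stand-in for the remaining LLMM1 case
    else:
        out = a @ weight.t()   # stand-in for a hipBLASLt/rocBLAS solution
    if bias is not None:
        out = out + bias       # bias addition now happens inside tgemm
    return out

print(tgemm(torch.randn(2, 8), torch.randn(16, 8), torch.zeros(16)).shape)  # torch.Size([2, 16])
```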
* [Kernel] Enable custom AR on ROCm
* Install amdsmi in Docker in preparation for custom all reduce (cherry picked from commit f6cfb9bf31e9feeefbdedecf2165f80dd0564b75)
* Fix for yapf
* Linting and small fixes to vLLM syntax (cherry picked from commit 2cf8103bfb0afce59b28a06c5bbe905983c42728)
  Co-authored-by: Matthew Wong <[email protected]>
* Fix 1-hop XGMI detection * Fix numpy versioning
* adding input type
* merge gradlib_fp8 to gradlib
* using fp8
* fix lint
* fix lint
* Workaround for SWDEV-470361. Calling the version of setProblem that does not cause integer overflow on large gemm shapes
* clang-format
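To see why this bites on large shapes: once an operand's element count (or a product of dimensions) exceeds 2^31 - 1, a 32-bit descriptor wraps around. A quick back-of-the-envelope check, not the hipBLASLt code itself, and the exact quantity that overflowed in SWDEV-470361 may differ:

```python
# Quick arithmetic on why 32-bit problem descriptors can overflow for large
# GEMM shapes (illustration only).
INT32_MAX = 2**31 - 1             # 2_147_483_647

m, n, k = 8, 131072, 16384        # a skinny but very wide GEMM, for example
b_elements = n * k                # element count of the weight operand
print(b_elements)                 # 2_147_483_648
print(b_elements > INT32_MAX)     # True -> wraps around if stored in a 32-bit int
```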
* Enabling some basic tests for ROCm 6.2
* Use strict xfail for ROCm 6.2 test repairs
* Use lenient xfail instead
  Co-authored-by: Alexei V. Ivanov <[email protected]>
….2 metrics test (#73)
* Dockerfile updates: base image; preemptive uninstalls
* Remove ROCm 6.2 xfails from metrics test
Let's hope float64 internal to the pandas DataFrame is good enough.
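For reference, float64 carries roughly 15-16 significant decimal digits, which should be ample for throughput/latency metrics. A hedged sketch of the kind of tolerance-based comparison being relied on, with made-up column names and values:

```python
# Sketch of a tolerance-based metrics comparison that relies on pandas'
# default float64 columns; the column name and numbers are made up.
import numpy as np
import pandas as pd

observed = pd.DataFrame({"tokens_per_s": [1523.4000000001, 1498.7]})
expected = pd.DataFrame({"tokens_per_s": [1523.4, 1498.7]})

assert observed["tokens_per_s"].dtype == np.float64   # ~15-16 significant digits
assert np.allclose(observed["tokens_per_s"], expected["tokens_per_s"], rtol=1e-6)
```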
[Build/CI] tests for rocm/vllm:main as of 2024-06-28
* fix gradlib fp8 output
* add condition check for existing tune result
* fix linter
* fix import order
* fix lint
* Initializing hipblaslt workspace for fp8 gemms
* make workspace size configurable
* assign default value for workspace pointer
* fix clang-format
* fix clang-format
  Co-authored-by: Gregory Shtrasberg <[email protected]>
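A minimal sketch of what "initialize a workspace and make its size configurable" can look like from the Python side, assuming the size is read from an environment variable; the variable name and the default below are placeholders rather than the fork's actual setting:

```python
# Placeholder sketch: allocate a reusable scratch buffer for hipBLASLt fp8
# GEMMs, with the size taken from an environment variable. The variable name
# and the 32 MiB default are assumptions.
import os
import torch

DEFAULT_WORKSPACE_BYTES = 32 * 1024 * 1024  # 32 MiB

def init_fp8_gemm_workspace() -> torch.Tensor:
    size = int(os.environ.get("FP8_GEMM_WORKSPACE_BYTES", DEFAULT_WORKSPACE_BYTES))
    # Raw byte buffer on the GPU that the GEMM call can use as scratch space.
    return torch.empty(size, dtype=torch.uint8, device="cuda")
```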
* update tuning script to match new api
* add mi308 configs for TP=8,4,2
* nit: ruff isort and argparse fix
* nit: make yapf happy
* nit: yapf happy-2
* remove elementwise kernel * fix lint
* cuda graph + num-scheduler-steps bug fix * cuda graph + num-scheduler-steps bug fix * linting
* fix code path logic to load mllama model
* fix lint error
* fix lint error
  Co-authored-by: tjtanaa <[email protected]>
* prefix-enabled FA perf issue * split ENC, DEC/ENC_DEC * lint
* add option to adjust partition size
* changed CPA partition size to 256 in rocm attention backend
* support context length 128K with partition size 256
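The arithmetic behind that choice: paged attention splits each sequence's KV cache into fixed-size partitions, so the per-sequence partition count is a ceiling division of context length by partition size; at 128K context a partition size of 256 gives 512 partitions:

```python
# Per-sequence partition count for paged attention (illustrative helper,
# mirroring a ceiling division, not vLLM's code).
def num_partitions(context_len: int, partition_size: int) -> int:
    return (context_len + partition_size - 1) // partition_size

print(num_partitions(128 * 1024, 256))  # 512 partitions per sequence at 128K context
```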
* Not important variation to create a dummy PR for CI testing.
* Scipy & numba updates in a timely manner.
* fixing environment variables
* Changing the installation to use requirements-rocm.txt
  Co-authored-by: Alexei Ivanov <[email protected]>
  Co-authored-by: Gregory Shtrasberg <[email protected]>
…llama3.2 (#241)
* improved handling of output to be the same as before
* after merge correction
  Co-authored-by: Aleksandr Malyshev <[email protected]>
Upstream merge 24 10 21
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
[Misc] Add FP8 support for Llama model family on Navi4x
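For anyone trying the feature, vLLM exposes FP8 through the `quantization` engine argument; a minimal usage sketch follows (the model id is only an example, and Navi4x-specific defaults may differ):

```python
# Minimal usage sketch: run a Llama-family model with FP8 quantization via
# vLLM's `quantization` engine argument. Whether weights are quantized on the
# fly or loaded pre-quantized depends on the checkpoint and engine configuration.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)
```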