Qiang navi4x fp8 llama #9674
Conversation
* adds wvSpltK optimization for skinny gemm.
  Co-authored-by: Hashem Hashemi <[email protected]>
Fix 8K decode latency jump issue.
* add quantization_weights_path for fp8 weights
* fix lint
* fix lint
* change to quantized_weights_path
* fix lint
* Moving custom skinny gemm heuristic before hipblas or rocblas solutions. Disabling the now obsolete LLMM1 path
* Simplified the decision logic
* Added back one case when LLMM1 can be used. Defaulting to adding bias separately
* Moved bias addition inside tgemm
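For readers skimming the change, here is a rough sketch of the dispatch order this commit describes, assuming selection happens in a Python-level tgemm wrapper; the predicates and the matmul stand-ins are placeholders for illustration, not the fork's actual wvSpltK/LLMM1/hipBLASLt code:

```python
# Illustrative sketch of the dispatch order only; the real heuristic lives in
# the gradlib tgemm wrapper, and every condition/kernel here is a placeholder.
import torch

def is_skinny(m: int, n: int, k: int) -> bool:
    # Placeholder: decode-time GEMMs with only a handful of tokens (small m)
    # are the "skinny" shapes the custom kernel targets.
    return m <= 4 and n >= 1024

def llmm1_still_applies(m: int, n: int, k: int) -> bool:
    # Placeholder condition for the single case where LLMM1 was added back.
    return m == 1 and k <= 1024

def tgemm(a, weight, bias=None):
    m, k = a.shape
    n = weight.shape[0]
    if is_skinny(m, n, k):
        out = a @ weight.t()   # stand-in for the wvSpltK custom kernel
    elif llmm1_still_applies(m, n, k):
        out = a @ weight.t()   # stand-in for the remaining LLMM1 case
    else:
        out = a @ weight.t()   # stand-in for a hipBLASLt/rocBLAS solution
    if bias is not None:
        out = out + bias       # bias addition now happens inside tgemm
    return out

print(tgemm(torch.randn(2, 8), torch.randn(16, 8), torch.zeros(16)).shape)  # torch.Size([2, 16])
```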
* [Kernel] Enable custom AR on ROCm
* Install amdsmi in Docker in preparation for custom all reduce (cherry picked from commit f6cfb9bf31e9feeefbdedecf2165f80dd0564b75)
* Fix for yapf
* Linting and small fixes to vLLM syntax (cherry picked from commit 2cf8103bfb0afce59b28a06c5bbe905983c42728)
  Co-authored-by: Matthew Wong <[email protected]>
* Fix 1-hop XGMI detection * Fix numpy versioning
* adding input type
* merge gradlib_fp8 to gradlib
* using fp8
* fix lint
* fix lint
* Workaround for SWDEV-470361. Calling the version of setProblem that does not cause integer overflow on large gemm shapes
* clang-format
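To see why this bites on large shapes: once an operand's element count (or a product of dimensions) exceeds 2^31 - 1, a 32-bit descriptor wraps around. A quick back-of-the-envelope check, not the hipBLASLt code itself, and the exact quantity that overflowed in SWDEV-470361 may differ:

```python
# Quick arithmetic on why 32-bit problem descriptors can overflow for large
# GEMM shapes (illustration only).
INT32_MAX = 2**31 - 1             # 2_147_483_647

m, n, k = 8, 131072, 16384        # a skinny but very wide GEMM, for example
b_elements = n * k                # element count of the weight operand
print(b_elements)                 # 2_147_483_648
print(b_elements > INT32_MAX)     # True -> wraps around if stored in a 32-bit int
```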
* Enabling some basic tests for ROCm 6.2
* Use strict xfail for ROCm 6.2 test repairs
* Use lenient xfail instead
  Co-authored-by: Alexei V. Ivanov <[email protected]>
….2 metrics test (#73)
* Dockerfile updates: base image; preemptive uninstalls
* Remove ROCm 6.2 xfails from metrics test
Let's hope float64 internal to the pandas DataFrame is good enough.
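For reference, float64 carries roughly 15-16 significant decimal digits, which should be ample for throughput/latency metrics. A hedged sketch of the kind of tolerance-based comparison being relied on, with made-up column names and values:

```python
# Sketch of a tolerance-based metrics comparison that relies on pandas'
# default float64 columns; the column name and numbers are made up.
import numpy as np
import pandas as pd

observed = pd.DataFrame({"tokens_per_s": [1523.4000000001, 1498.7]})
expected = pd.DataFrame({"tokens_per_s": [1523.4, 1498.7]})

assert observed["tokens_per_s"].dtype == np.float64   # ~15-16 significant digits
assert np.allclose(observed["tokens_per_s"], expected["tokens_per_s"], rtol=1e-6)
```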
[Build/CI] tests for rocm/vllm:main as of 2024-06-28
* fix gradlib fp8 output
* add condition check for existing tune result
* fix linter
* fix import order
* fix lint
* Initializing hipblaslt workspace for fp8 gemms
* make workspace size configurable
* assign default value for workspace pointer
* fix clang-format
* fix clang-format
  Co-authored-by: Gregory Shtrasberg <[email protected]>
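A minimal sketch of what "initialize a workspace and make its size configurable" can look like from the Python side, assuming the size is read from an environment variable; the variable name and the default below are placeholders rather than the fork's actual setting:

```python
# Placeholder sketch: allocate a reusable scratch buffer for hipBLASLt fp8
# GEMMs, with the size taken from an environment variable. The variable name
# and the 32 MiB default are assumptions.
import os
import torch

DEFAULT_WORKSPACE_BYTES = 32 * 1024 * 1024  # 32 MiB

def init_fp8_gemm_workspace() -> torch.Tensor:
    size = int(os.environ.get("FP8_GEMM_WORKSPACE_BYTES", DEFAULT_WORKSPACE_BYTES))
    # Raw byte buffer on the GPU that the GEMM call can use as scratch space.
    return torch.empty(size, dtype=torch.uint8, device="cuda")
```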
* update tuning script to match new api
* add mi308 configs for TP=8,4,2
* nit: ruff isort and argparse fix
* nit: make yapf happy
* nit: yapf happy-2
* remove elementwise kernel * fix lint
* cuda graph + num-scheduler-steps bug fix * cuda graph + num-scheduler-steps bug fix * linting
* fix code path logic to load mllama model
* fix lint error
* fix lint error
  Co-authored-by: tjtanaa <[email protected]>
* prefix-enabled FA perf issue * split ENC, DEC/ENC_DEC * lint
* add option to adjust partition size
* changed CPA partition size to 256 in rocm attention backend
* support context length 128K with partition size 256
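The arithmetic behind that choice: paged attention splits each sequence's KV cache into fixed-size partitions, so the per-sequence partition count is a ceiling division of context length by partition size; at 128K context a partition size of 256 gives 512 partitions:

```python
# Per-sequence partition count for paged attention (illustrative helper,
# mirroring a ceiling division, not vLLM's code).
def num_partitions(context_len: int, partition_size: int) -> int:
    return (context_len + partition_size - 1) // partition_size

print(num_partitions(128 * 1024, 256))  # 512 partitions per sequence at 128K context
```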
* Not important variation to create a dummy PR for CI testing.
* Scipy & numba updates in a timely manner.
* fixing environment variables
* Changing the installation to use requirements-rocm.txt
  Co-authored-by: Alexei Ivanov <[email protected]>
  Co-authored-by: Gregory Shtrasberg <[email protected]>
…llama3.2 (#241)
* improved handling of output to be the same as before
* after merge correction
  Co-authored-by: Aleksandr Malyshev <[email protected]>
Upstream merge 24 10 21
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
[Misc] Add FP8 support for Llama model family on Navi4x
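For anyone trying the feature, vLLM exposes FP8 through the `quantization` engine argument; a minimal usage sketch follows (the model id is only an example, and Navi4x-specific defaults may differ):

```python
# Minimal usage sketch: run a Llama-family model with FP8 quantization via
# vLLM's `quantization` engine argument. Whether weights are quantized on the
# fly or loaded pre-quantized depends on the checkpoint and engine configuration.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)
```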