
vulkan: copy iq4_nl LUT into shared memory #10409

Merged 1 commit into ggerganov:master on Nov 20, 2024
Conversation

jeffbolznv (Collaborator)

NVIDIA hardware has performance issues with non-uniform indexing of constant loads (e.g. see https://resources.nvidia.com/en-us-nsight-developer-tools-mc/en-us-nsight-developer-tools/ldc-divergence). The iq4_nl lookup table was suffering from this. This change copies the LUT into shared memory.
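Roughly, the pattern looks like the sketch below, written in CUDA for illustration only (the actual change is to the Vulkan GLSL shaders, and the kernel name and parameters here are hypothetical): each block copies the 16-entry LUT into shared memory once, so the later data-dependent, per-thread lookups hit shared memory instead of non-uniformly indexed constant loads. In GLSL the equivalent is presumably a shared array filled at the top of the compute shader, followed by a barrier before the lookups.

```cuda
// Hedged sketch (CUDA analogue, not the PR's GLSL): copy a small constant LUT
// into shared memory once per block, then index it with per-thread values.
#include <cstdint>

// The 16-entry iq4_nl non-linear codebook (int8 values as used by ggml).
__device__ static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Hypothetical dequantization kernel; assumes blockDim.x >= 16.
__global__ void dequant_iq4_nl_sketch(const uint8_t *q, float *out, float scale, int n) {
    __shared__ int8_t lut[16];
    if (threadIdx.x < 16) {
        lut[threadIdx.x] = kvalues_iq4nl[threadIdx.x];  // one-time copy per block
    }
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each byte packs two 4-bit indices; the index differs per thread,
        // so the lookup is non-uniform -- cheap when served from shared memory.
        uint8_t byte = q[i / 2];
        uint8_t nib  = (i & 1) ? (byte >> 4) : (byte & 0x0F);
        out[i] = scale * (float) lut[nib];
    }
}
```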

before:
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 2556 runs -   503.27 us/run - 117.44 MFLOP/run - 233.35 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                112 runs -  9026.09 us/run -  60.13 GFLOP/run -   6.66 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     | 100 |         tg128 |         17.59 ± 0.01 |

after:
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 9372 runs -   111.01 us/run - 117.44 MFLOP/run -   1.06 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                164 runs -  6123.31 us/run -  60.13 GFLOP/run -   9.82 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     | 100 |         tg128 |         59.50 ± 1.28 |

I'm guessing this is probably about neutral on AMD/Intel, but I'm not sure.

jeffbolznv requested a review from 0cc4m on November 19, 2024.
0cc4m (Collaborator) commented Nov 20, 2024

It seems AMD and Intel suffer from the same issue. I wasn't aware of LDC divergence at all. Do you know why CUDA and ROCm don't run into this issue? Do they move the static LUT into shared memory automatically?


Radeon Pro VII

Before:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 2556 runs -   492.23 us/run - 117.44 MFLOP/run - 238.59 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 56 runs - 18246.21 us/run -  60.13 GFLOP/run -   3.30 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        245.65 ± 0.77 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         14.32 ± 0.02 |

After:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 9372 runs -   109.94 us/run - 117.44 MFLOP/run -   1.07 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 72 runs - 14139.26 us/run -  60.13 GFLOP/run -   4.25 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        308.47 ± 1.75 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         50.21 ± 0.07 |

Intel Arc A770

Before:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 3408 runs -   335.60 us/run - 117.44 MFLOP/run - 349.94 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 26 runs - 41285.58 us/run -  60.13 GFLOP/run -   1.46 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        115.42 ± 0.04 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         20.66 ± 0.02 |

After:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 6816 runs -   159.13 us/run - 117.44 MFLOP/run - 738.03 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 42 runs - 24920.62 us/run -  60.13 GFLOP/run -   2.41 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        184.50 ± 0.10 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         32.64 ± 0.03 |

NVIDIA RTX 3090

Before:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 3408 runs -   362.48 us/run - 117.44 MFLOP/run - 323.99 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                190 runs -  5276.25 us/run -  60.13 GFLOP/run -  11.40 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        778.45 ± 1.97 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         22.61 ± 0.01 |

After:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                17892 runs -    58.70 us/run - 117.44 MFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                202 runs -  4959.23 us/run -  60.13 GFLOP/run -  12.12 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        813.52 ± 2.06 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         89.34 ± 0.13 |

0cc4m merged commit 8fd4b7f into ggerganov:master on Nov 20, 2024 (53 of 54 checks passed).
jeffbolznv (Collaborator, Author)

Interesting, I didn't know AMD/Intel had similar issues.

> Do you know why CUDA and ROCm don't run into this issue? Do they move the static LUT into shared memory automatically?

I'm not sure. I see the LUT uses static constexpr __device__ and not __constant__, so maybe it was just getting written to some device memory that stays in the cache and this has been good enough?
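For reference, a minimal sketch of the two declaration styles in question (CUDA; this is my reading of the behavior, not verified against the ggml-cuda source): a `__device__` table lives in ordinary global memory and is served by the regular L1/L2 data caches even under divergent indexing, whereas a `__constant__` table goes through the constant cache, which is broadcast-optimized and replays divergent reads serially, essentially the same pitfall as the constant-load divergence described above.

```cuda
#include <cstdint>

// What the CUDA backend reportedly uses: a table in global device memory.
// Divergent per-thread indexing is handled by the normal L1/L2 caches.
static constexpr __device__ int8_t lut_device[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// The alternative: __constant__ memory. Reads are broadcast-optimized, so a
// warp whose lanes index different entries replays the load serially.
__constant__ int8_t lut_constant[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};
```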

slaren (Collaborator) commented Nov 20, 2024

> I'm not sure. I see the LUT uses static constexpr __device__ and not __constant__, so maybe it was just getting written to some device memory that stays in the cache and this has been good enough?

That's probably the case. There is some context here: #4773 (comment)

0cc4m (Collaborator) commented Nov 20, 2024

Ah yeah, interesting how these kinds of details show up across APIs.
