
vulkan: copy iq4_nl LUT into shared memory #10409

Merged 1 commit into ggerganov:master on Nov 20, 2024
Conversation

jeffbolznv (Collaborator)

NVIDIA hardware has performance issues with non-uniform indexing of constant loads (e.g. see https://resources.nvidia.com/en-us-nsight-developer-tools-mc/en-us-nsight-developer-tools/ldc-divergence). The iq4_nl lookup table was suffering from this. This change copies the LUT into shared memory.
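Roughly, the pattern looks like the sketch below, written in CUDA for illustration only (the actual change is to the Vulkan GLSL shaders, and the kernel name and parameters here are hypothetical): each block copies the 16-entry LUT into shared memory once, so the later data-dependent, per-thread lookups hit shared memory instead of non-uniformly indexed constant loads. In GLSL the equivalent is presumably a shared array filled at the top of the compute shader, followed by a barrier before the lookups.

```cuda
// Hedged sketch (CUDA analogue, not the PR's GLSL): copy a small constant LUT
// into shared memory once per block, then index it with per-thread values.
#include <cstdint>

// The 16-entry iq4_nl non-linear codebook (int8 values as used by ggml).
__device__ static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Hypothetical dequantization kernel; assumes blockDim.x >= 16.
__global__ void dequant_iq4_nl_sketch(const uint8_t *q, float *out, float scale, int n) {
    __shared__ int8_t lut[16];
    if (threadIdx.x < 16) {
        lut[threadIdx.x] = kvalues_iq4nl[threadIdx.x];  // one-time copy per block
    }
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each byte packs two 4-bit indices; the index differs per thread,
        // so the lookup is non-uniform -- cheap when served from shared memory.
        uint8_t byte = q[i / 2];
        uint8_t nib  = (i & 1) ? (byte >> 4) : (byte & 0x0F);
        out[i] = scale * (float) lut[nib];
    }
}
```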

before:
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 2556 runs -   503.27 us/run - 117.44 MFLOP/run - 233.35 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                112 runs -  9026.09 us/run -  60.13 GFLOP/run -   6.66 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     | 100 |         tg128 |         17.59 ± 0.01 |

after:
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 9372 runs -   111.01 us/run - 117.44 MFLOP/run -   1.06 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                164 runs -  6123.31 us/run -  60.13 GFLOP/run -   9.82 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     | 100 |         tg128 |         59.50 ± 1.28 |

I'm guessing this is probably about neutral on AMD/Intel, but I'm not sure.

jeffbolznv requested a review from 0cc4m on November 19, 2024.
0cc4m (Collaborator) commented Nov 20, 2024

It seems AMD and Intel suffer from the same issue. I wasn't aware of LDC divergence at all. Do you know why CUDA and ROCm don't run into this issue? Do they move the static LUT into shared memory automatically?


Radeon Pro VII

Before:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 2556 runs -   492.23 us/run - 117.44 MFLOP/run - 238.59 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 56 runs - 18246.21 us/run -  60.13 GFLOP/run -   3.30 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        245.65 ± 0.77 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         14.32 ± 0.02 |

After:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 9372 runs -   109.94 us/run - 117.44 MFLOP/run -   1.07 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 72 runs - 14139.26 us/run -  60.13 GFLOP/run -   4.25 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        308.47 ± 1.75 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         50.21 ± 0.07 |

Intel Arc A770

Before:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 3408 runs -   335.60 us/run - 117.44 MFLOP/run - 349.94 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 26 runs - 41285.58 us/run -  60.13 GFLOP/run -   1.46 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        115.42 ± 0.04 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         20.66 ± 0.02 |

After:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 6816 runs -   159.13 us/run - 117.44 MFLOP/run - 738.03 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 42 runs - 24920.62 us/run -  60.13 GFLOP/run -   2.41 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        184.50 ± 0.10 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         32.64 ± 0.03 |

NVIDIA RTX 3090

Before:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 3408 runs -   362.48 us/run - 117.44 MFLOP/run - 323.99 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                190 runs -  5276.25 us/run -  60.13 GFLOP/run -  11.40 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        778.45 ± 1.97 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         22.61 ± 0.01 |

After:

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                17892 runs -    58.70 us/run - 117.44 MFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                202 runs -  4959.23 us/run -  60.13 GFLOP/run -  12.12 TFLOPS

| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         pp512 |        813.52 ± 2.06 |
| llama 8B IQ4_NL - 4.5 bpw      |   4.35 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         89.34 ± 0.13 |

0cc4m merged commit 8fd4b7f into ggerganov:master on Nov 20, 2024 (53 of 54 checks passed).
jeffbolznv (Collaborator, Author)

Interesting, I didn't know AMD/Intel had similar issues.

> Do you know why CUDA and ROCm don't run into this issue? Do they move the static LUT into shared memory automatically?

I'm not sure. I see the LUT uses static constexpr __device__ and not __constant__, so maybe it was just getting written to some device memory that stays in the cache and this has been good enough?
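For reference, a minimal sketch of the two declaration styles in question (CUDA; this is my reading of the behavior, not verified against the ggml-cuda source): a `__device__` table lives in ordinary global memory and is served by the regular L1/L2 data caches even under divergent indexing, whereas a `__constant__` table goes through the constant cache, which is broadcast-optimized and replays divergent reads serially, essentially the same pitfall as the constant-load divergence described above.

```cuda
#include <cstdint>

// What the CUDA backend reportedly uses: a table in global device memory.
// Divergent per-thread indexing is handled by the normal L1/L2 caches.
static constexpr __device__ int8_t lut_device[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// The alternative: __constant__ memory. Reads are broadcast-optimized, so a
// warp whose lanes index different entries replays the load serially.
__constant__ int8_t lut_constant[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};
```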

slaren (Collaborator) commented Nov 20, 2024

> I'm not sure. I see the LUT uses static constexpr __device__ and not __constant__, so maybe it was just getting written to some device memory that stays in the cache and this has been good enough?

That's probably the case. There is some context here: #4773 (comment)

0cc4m (Collaborator) commented Nov 20, 2024

Ah yeah, interesting how these kinds of details show up across APIs.
