vulkan: further optimize mul_mat_vec using larger loads #10387

Merged
4 commits merged into ggerganov:master on Nov 20, 2024

Conversation

jeffbolznv
Collaborator

There are a few things in this PR:

  • Use pipeline_robustness to disable bounds-checking for some pipelines (reduces integer instruction load).
  • Apply the same kind of optimizations I did to Q4_K in #10296 (vulkan: Optimize some mat-vec mul quant shaders) to Q5_K/Q6_K.
  • Add vec4 dequant functions that use 16b loads and use them to do 8 K values per thread per iteration in mul_mat_vec, to help reduce load on the memory system.

Some performance results on RTX 4070. Note that the directed tests tend to fit in L2 on this system, so they are much less memory-limited and don't reflect the real performance gain in networks (which tend to be framebuffer/memory-bandwidth-limited and therefore benefit less from these optimizations). Also, the "Q4_K" network I tested uses all of Q4_K/Q5_K/Q6_K, so it is showing some benefit from the optimizations on those shaders as well (i.e. it's not all from the robustness change).

before:
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   11076 runs -   484.48 us/run - 117.44 MFLOP/run - 242.40 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   20448 runs -   247.18 us/run - 117.44 MFLOP/run - 475.11 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  65604 runs -    76.82 us/run - 117.44 MFLOP/run -   1.53 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  58788 runs -    86.02 us/run - 117.44 MFLOP/run -   1.37 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  36636 runs -   137.31 us/run - 117.44 MFLOP/run - 855.27 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  34080 runs -   147.13 us/run - 117.44 MFLOP/run - 798.20 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  36636 runs -   139.35 us/run - 117.44 MFLOP/run - 842.79 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  51120 runs -    97.95 us/run - 117.44 MFLOP/run -   1.20 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  40044 runs -   126.78 us/run - 117.44 MFLOP/run - 926.32 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  88608 runs -    56.83 us/run - 117.44 MFLOP/run -   2.07 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  40896 runs -   123.38 us/run - 117.44 MFLOP/run - 951.87 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  40896 runs -   123.12 us/run - 117.44 MFLOP/run - 953.83 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                11076 runs -   466.42 us/run - 117.44 MFLOP/run - 251.79 GFLOPS
  
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 100 |         tg128 |        105.07 ± 0.19 |
| baichuan 13B Q4_0              |   7.44 GiB |    13.90 B | Vulkan     | 100 |         tg128 |         39.87 ± 0.41 |
| starcoder2 7B Q4_0             |   3.76 GiB |     7.17 B | Vulkan     | 100 |         tg128 |         65.04 ± 0.23 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     | 100 |         tg128 |         90.59 ± 0.55 |

after:
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   10224 runs -   492.94 us/run - 117.44 MFLOP/run - 238.24 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   20448 runs -   250.89 us/run - 117.44 MFLOP/run - 468.10 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  97980 runs -    51.10 us/run - 117.44 MFLOP/run -   2.30 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  98832 runs -    50.76 us/run - 117.44 MFLOP/run -   2.31 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  61344 runs -    81.83 us/run - 117.44 MFLOP/run -   1.44 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  51120 runs -    99.18 us/run - 117.44 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  37488 runs -   135.20 us/run - 117.44 MFLOP/run - 868.67 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  51972 runs -    96.68 us/run - 117.44 MFLOP/run -   1.21 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  44304 runs -   113.30 us/run - 117.44 MFLOP/run -   1.04 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  94572 runs -    52.91 us/run - 117.44 MFLOP/run -   2.22 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  48564 runs -   104.13 us/run - 117.44 MFLOP/run -   1.13 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  47712 runs -   106.42 us/run - 117.44 MFLOP/run -   1.10 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                11076 runs -   457.97 us/run - 117.44 MFLOP/run - 256.43 GFLOPS

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 100 |         tg128 |        109.65 ± 0.33 |
| baichuan 13B Q4_0              |   7.44 GiB |    13.90 B | Vulkan     | 100 |         tg128 |         46.00 ± 0.50 |
| starcoder2 7B Q4_0             |   3.76 GiB |     7.17 B | Vulkan     | 100 |         tg128 |         74.85 ± 1.05 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     | 100 |         tg128 |         94.51 ± 1.52 |

vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec.

Add some early returns for nonexistent rows in mul_mat_vec shaders. These
can only be hit when dispatching a 2D grid of workgroups. Fix the logic
for the 2D grid of workgroups to round up.

Enable the pipeline robustness extension if it's available, and use it to
disable robustness for these pipelines. The instructions to do the bounds
checking contend for the same ALU resources as the bit twiddling dequant
instructions.
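
As a rough illustration of the early-return part (a sketch with placeholder identifiers, not the shader's exact code): once the workgroup count is rounded up, the last workgroup along the grid can start past the final row and should bail out before touching memory.

```glsl
// Sketch only: NUM_ROWS is the rows handled per workgroup; total_rows is a
// placeholder for the real row count. With the dispatch rounded up, a trailing
// workgroup can start entirely past the end, so return before any loads/stores.
const uint first_row = NUM_ROWS * (gl_WorkGroupID.x + gl_NumWorkGroups.x * gl_WorkGroupID.z);
if (first_row >= total_rows) {
    return;
}
```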

vulkan: Add GLSL structure aliases for quant types to allow larger loads

In Vulkan it's not possible to cast pointer types, so instead you have to
declare an aliased binding for the memory with a different type. This
commit adds aliases for the quant formats using 16b ints, and in a few
places where the struct size is a multiple of 4 also using 32b ints.
Currently only q4_k's aliases are used, but others will be used in
subsequent commits.
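
For readers unfamiliar with the pattern, a minimal sketch of such an alias is below. The struct layouts are written as illustrative assumptions rather than the repository's exact declarations (data_a_packed16 is the name that appears in the diff further down):

```glsl
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// Sketch: two buffer blocks bound to the same binding alias the same memory.
// The packed16 view lets the shader fetch two quant bytes (or the fp16 scale)
// with a single 16-bit load instead of byte-sized accesses.
struct block_q4_0          { float16_t d; uint8_t  qs[16]; };
struct block_q4_0_packed16 { float16_t d; uint16_t qs[8];  };

layout (binding = 0) readonly buffer A          { block_q4_0          data_a[]; };
layout (binding = 0) readonly buffer A_packed16 { block_q4_0_packed16 data_a_packed16[]; };
```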

vulkan: use larger loads in q5_k and q6_k shaders.

Similar to the optimization I did in q4_k recently, this vectorizes some loads
and reduces the number of bit twiddling instructions.

vulkan: use larger K step per iteration in mul_mat_vec.

Add vec4 dequantization functions, and use them to do K=8 per iteration in
mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B
which helps reduce the load on the memory system.

The K_PER_ITER==2 logic is still there, just for F16/F32, and really only
because they support unaligned sizes.

Tweak the num_iters/unrolling logic to be simpler and catch a couple missed
unrolling opportunities.
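
On the B side, the 128b loads just mean reading the activation vector through a vec4 view, so each K=8 iteration consumes two 16-byte loads; a rough sketch (the binding and index names here are assumptions, not necessarily the shader's real ones):

```glsl
// Sketch: a vec4 view of B turns four scalar loads into one 128-bit load.
layout (binding = 1) readonly buffer BV4 { vec4 data_b_v4[]; };

// b_offset/iybs/iqs stand in for the usual B indexing of the inner loop.
const vec4 bv0 = data_b_v4[(b_offset + iybs + iqs) / 4];
const vec4 bv1 = data_b_v4[(b_offset + iybs + iqs) / 4 + 1];
// bv0.xyzw and bv1.xyzw then feed the eight fma operations per row.
```
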
@netrunnereve
Collaborator

Wow this is really nice. Here are my numbers on the RX 570 for this PR.

Before:

| model                 |     size |   params | backend | ngl | threads |  test |          t/s |
| --------------------- | -------: | -------: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_0         | 4.33 GiB |   8.03 B | Vulkan  | 100 |       8 | tg128 | 11.61 ± 0.05 |
| llama 8B Q8_0         | 7.95 GiB |   8.03 B | Vulkan  | 100 |       8 | tg128 |  6.60 ± 0.02 |
| llama 8B Q4_K - Small | 4.36 GiB |   8.03 B | Vulkan  | 100 |       8 | tg128 |  9.18 ± 0.02 |

After:

| model                 |     size |   params | backend | ngl | threads |  test |          t/s |
| --------------------- | -------: | -------: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_0         | 4.33 GiB |   8.03 B | Vulkan  | 100 |       8 | tg128 | 16.66 ± 0.05 |
| llama 8B Q8_0         | 7.95 GiB |   8.03 B | Vulkan  | 100 |       8 | tg128 | 11.67 ± 0.00 |
| llama 8B Q4_K - Small | 4.36 GiB |   8.03 B | Vulkan  | 100 |       8 | tg128 |  9.62 ± 0.01 |

There isn't that much change with the K-quants, but Q8 now runs faster than the old Q4! 🎉

Bonus results:

| model                                   |     size |   params | backend | ngl | threads |  test |          t/s |
| --------------------------------------- | -------: | -------: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_0 (8 rows/workgroup)        | 4.33 GiB |   8.03 B | Vulkan  | 100 |       8 | tg128 | 18.23 ± 0.05 |
| llama 8B Q4_0 (8 rows + move out delta) | 4.33 GiB |   8.03 B | Vulkan  | 100 |       8 | tg128 | 19.66 ± 0.02 |

As discussed in #10296, calculating 8 rows at a time helps with the 570. From what I've read online, AMD GCN apparently likes longer-running shaders. If I additionally pull the delta (scale) multiply out of the per-value FMA chain using the code below, I get nearly 20 t/s - and that's all I can come up with for tonight.

------------ ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs.comp ------------
index 5fc1ba4a..7f3cb33a 100644
@@ -30,9 +30,9 @@ vec2 dequantize(uint ib, uint iqs, uint a_offset) {
     return (vec2(vui & 0xF, vui >> 4) - 8.0f) * d;
 }
 vec4 dequantize4(uint ib, uint iqs, uint a_offset) {
-    const float d = float(data_a_packed16[a_offset + ib].d);
+    //const float d = float(data_a_packed16[a_offset + ib].d);
     const uint vui = uint(data_a_packed16[a_offset + ib].qs[iqs/2]);
-    return (vec4(vui & 0xF, (vui >> 4) & 0xF, (vui >> 8) & 0xF, (vui >> 12) & 0xF) - 8.0f) * d;
+    return (vec4(vui & 0xF, (vui >> 4) & 0xF, (vui >> 8) & 0xF, (vui >> 12) & 0xF) - 8.0f);
 }
 #endif
 
@@ -95,10 +95,10 @@ vec2 dequantize(uint ib, uint iqs, uint a_offset) {
     return vec2(int(data_a[a_offset + ib].qs[iqs]), int(data_a[a_offset + ib].qs[iqs + 1])) * d;
 }
 vec4 dequantize4(uint ib, uint iqs, uint a_offset) {
-    const float d = float(data_a_packed16[a_offset + ib].d);
+    //const float d = float(data_a_packed16[a_offset + ib].d);
     uint32_t v0 = data_a_packed16[a_offset + ib].qs[iqs/2];
     uint32_t v1 = data_a_packed16[a_offset + ib].qs[iqs/2 + 1];
-    return vec4(int8_t(v0 & 0xFF), int8_t((v0 >> 8) & 0xFF), int8_t(v1 & 0xFF), int8_t((v1 >> 8) & 0xFF)) * d;
+    return vec4(int8_t(v0 & 0xFF), int8_t((v0 >> 8) & 0xFF), int8_t(v1 & 0xFF), int8_t((v1 >> 8) & 0xFF));
 }
 #endif
 

------------- ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec.comp -------------
index 00807a06..6a15bce5 100644
@@ -73,16 +73,22 @@ void iter(inout FLOAT_TYPE temp[NUM_ROWS], const uint first_row, const uint num_
 #if K_PER_ITER == 8
         const vec4 v = dequantize4(ib, iqs, a_offset);
         const vec4 v2 = dequantize4(ib, iqs+(4/QUANT_R), a_offset);
+        FLOAT_TYPE rowtmp = 0;
 
         // matrix multiplication
-        temp[n] = fma(FLOAT_TYPE(v.x), b0, temp[n]);
-        temp[n] = fma(FLOAT_TYPE(v.y), b1, temp[n]);
-        temp[n] = fma(FLOAT_TYPE(v.z), b2, temp[n]);
-        temp[n] = fma(FLOAT_TYPE(v.w), b3, temp[n]);
-        temp[n] = fma(FLOAT_TYPE(v2.x), b4, temp[n]);
-        temp[n] = fma(FLOAT_TYPE(v2.y), b5, temp[n]);
-        temp[n] = fma(FLOAT_TYPE(v2.z), b6, temp[n]);
-        temp[n] = fma(FLOAT_TYPE(v2.w), b7, temp[n]);
+        rowtmp = FLOAT_TYPE(v.x) * b0;
+        rowtmp = fma(FLOAT_TYPE(v.y), b1, rowtmp);
+        rowtmp = fma(FLOAT_TYPE(v.z), b2, rowtmp);
+        rowtmp = fma(FLOAT_TYPE(v.w), b3, rowtmp);
+        rowtmp = fma(FLOAT_TYPE(v2.x), b4, rowtmp);
+        rowtmp = fma(FLOAT_TYPE(v2.y), b5, rowtmp);
+        rowtmp = fma(FLOAT_TYPE(v2.z), b6, rowtmp);
+        rowtmp = fma(FLOAT_TYPE(v2.w), b7, rowtmp);
+#if defined(DATA_A_Q4_0) || defined(DATA_A_Q8_0)
+        const float d = float(data_a_packed16[a_offset + ib].d);
+        rowtmp *= d;
+#endif
+        temp[n] += rowtmp;
 #else
         const vec2 v = dequantize(ib, iqs, a_offset);

@jeffbolznv
Collaborator Author

I can split out the scale factor multiply in a follow-on change if there's interest. I haven't seen cases where it's a bottleneck on the RTX 4070, but it couldn't hurt.

@0cc4m
Collaborator

0cc4m commented Nov 19, 2024

I'll add some model benchmarks later, but here are the test-backend-ops perf results.

Looks like a significant step forward, most effective on Nvidia (not surprising, since that's what you're working on), but also significantly positive on AMD. I'm not sure why the gap between AMD Vulkan and ROCm is larger than the one between Nvidia Vulkan and CUDA.

My Intel A770 also really liked the changes in q4_0, q4_1, q5_0 and q5_1, but not q8_0 and q6_k; I'm not sure what's going on there.

[test-backend-ops performance comparison charts: NVIDIA GeForce RTX 3090, AMD Radeon RX 6800 XT, AMD Radeon (TM) Pro VII, Intel Arc A770 Graphics (DG2)]

@0cc4m
Collaborator

0cc4m commented Nov 20, 2024

Here are the results for Llama3 8B q4_0 and q4_k_s and Mistral Nemo q6_k. Looks good. The gap to CUDA/ROCm is larger than I remembered. Intel didn't like the change in some areas, so at a later point we might want to make the K_PER_ITER value a specialization constant.
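
If we go that route, it might look roughly like the sketch below (illustrative only; today K_PER_ITER is a compile-time constant, the constant_id value is arbitrary, and the existing #if K_PER_ITER blocks would have to become regular branches):

```glsl
// Sketch: expose K_PER_ITER as a specialization constant so the host can pick
// 2 or 8 per device at pipeline-creation time instead of baking it into the SPIR-V.
layout (constant_id = 10) const uint K_PER_ITER = 8;
```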

[model benchmark comparison charts: AMD Radeon RX 6800 XT, AMD Radeon (TM) Pro VII, Intel Arc A770 Graphics (DG2, Mesa driver), NVIDIA GeForce RTX 3090]

0cc4m merged commit 1bacb9f into ggerganov:master on Nov 20, 2024
53 of 54 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024