vulkan: Optimize soft_max #10301

Merged
merged 2 commits into from
Nov 19, 2024

jeffbolznv
Collaborator

Large soft_max cases could already saturate memory bandwidth, but small and medium sizes were pretty slow. Most of the gain for those sizes comes from using a smaller workgroup size; making the workgroup size match the subgroup size also makes the barriers much cheaper.

Cache some values in locals to avoid refetching/recomputing, and stamp out a few "template instantiations" so that smaller cases fully unroll.
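To illustrate the "template instantiation" idea in plain C++ (the actual kernel is a Vulkan shader, and every name and size below is hypothetical): making the per-thread trip count a compile-time constant lets the compiler fully unroll the loop, and a small runtime switch picks the smallest instantiation that covers the row. This is only a CPU-side sketch of the technique, not the code in this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical block size standing in for the workgroup size.
constexpr uint32_t BLOCK_SIZE = 32;

// Row-wise max with a compile-time trip count per simulated thread.
// NUM_ITERS > 0:  the inner loop bound is a constant, so it can be fully unrolled.
// NUM_ITERS == 0: generic fallback that derives the trip count at run time.
template <uint32_t NUM_ITERS>
float row_max(const float * row, uint32_t ncols) {
    const uint32_t iters = NUM_ITERS ? NUM_ITERS : (ncols + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float m = -INFINITY;
    for (uint32_t tid = 0; tid < BLOCK_SIZE; ++tid) {   // one pass per simulated thread
        for (uint32_t i = 0; i < iters; ++i) {
            const uint32_t col = i * BLOCK_SIZE + tid;
            if (col < ncols) {                          // guard the tail of the row
                m = std::max(m, row[col]);
            }
        }
    }
    return m;
}

// Runtime dispatch to the smallest instantiation that covers the row.
float row_max_dispatch(const float * row, uint32_t ncols) {
    const uint32_t iters = (ncols + BLOCK_SIZE - 1) / BLOCK_SIZE;
    if (iters <= 1) return row_max<1>(row, ncols);
    if (iters <= 2) return row_max<2>(row, ncols);
    if (iters <= 4) return row_max<4>(row, ncols);
    if (iters <= 8) return row_max<8>(row, ncols);
    return row_max<0>(row, ncols);  // not stamped out: generic, non-unrolled loop
}
```

Any row width without a matching instantiation falls through to the generic path, so gaps in the set of stamped-out cases show up as slower sizes.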

Add a missing early return for out-of-bounds (OOB) rows. These occur when there are more than 512 rows and the dispatch is 512 x H.
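A small C++ model of why the early return is needed, assuming one workgroup per row and a flattened row index of `y * 512 + x` (the index formula and all names here are assumptions for illustration, not taken from the shader): once the row count is rounded up to a 512 x H grid, any row count that is not a multiple of 512 leaves some workgroups without a row.

```cpp
#include <cstdint>

// Cap on the X dimension of the dispatch, taken from the "512 x H" wording above.
constexpr uint32_t MAX_X = 512;

struct Dispatch {
    uint32_t x;
    uint32_t y;
};

// One workgroup per row; rows beyond MAX_X spill into the Y dimension.
Dispatch make_dispatch(uint32_t nrows) {
    if (nrows <= MAX_X) {
        return { nrows, 1 };
    }
    return { MAX_X, (nrows + MAX_X - 1) / MAX_X };
}

// Counts the workgroups whose flattened row index is out of bounds.
// In a shader, these are the workgroups that must return before touching memory.
uint32_t count_oob_workgroups(uint32_t nrows) {
    const Dispatch d = make_dispatch(nrows);
    uint32_t oob = 0;
    for (uint32_t y = 0; y < d.y; ++y) {
        for (uint32_t x = 0; x < d.x; ++x) {
            const uint32_t row = y * d.x + x;   // assumed flattening
            if (row >= nrows) {
                ++oob;                          // the "early return" case
            }
        }
    }
    return oob;
}
```

With 600 rows this model dispatches 512 x 2 workgroups, of which 424 have no row to process. The benchmark shapes with 4096 x 5 rows are an exact multiple of 512 and produce none.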

The sizes I benchmarked came from a Stable Diffusion network I was looking at a while back.

Before:
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 728 runs -  1472.19 us/run -   655360 kB/run -  428.62 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                  5448 runs -   351.30 us/run -    12320 kB/run -   33.45 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):               4510 runs -   238.39 us/run -    81920 kB/run -  328.11 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):                10896 runs -   177.49 us/run -     6160 kB/run -   33.10 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,scale=1.000000,max_bias=0.000000):                13108 runs -    92.21 us/run -    10240 kB/run -  105.92 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                  40955 runs -    29.06 us/run -      640 kB/run -   21.00 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                  40955 runs -    28.83 us/run -      770 kB/run -   25.47 GB/s
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=1,scale=1.000000,max_bias=0.000000):                 470 runs -  2164.47 us/run -   720896 kB/run -  320.70 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=1,scale=1.000000,max_bias=0.000000):                  4952 runs -   349.47 us/run -    13552 kB/run -   36.99 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=1,scale=1.000000,max_bias=0.000000):               4301 runs -   247.73 us/run -    86016 kB/run -  331.54 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=1,scale=1.000000,max_bias=0.000000):                10376 runs -   176.68 us/run -     6468 kB/run -   34.92 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=1,scale=1.000000,max_bias=0.000000):                12788 runs -    92.78 us/run -    10496 kB/run -  107.91 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=1,scale=1.000000,max_bias=0.000000):                  40955 runs -    28.73 us/run -      656 kB/run -   21.78 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=1,scale=1.000000,max_bias=0.000000):                  40955 runs -    28.92 us/run -      789 kB/run -   26.02 GB/s
  
After:
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 676 runs -  1558.61 us/run -   655360 kB/run -  404.85 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 46308 runs -    21.79 us/run -    12320 kB/run -  539.30 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):               5330 runs -   198.13 us/run -    81920 kB/run -  394.80 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):                81720 runs -    12.47 us/run -     6160 kB/run -  471.09 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,scale=1.000000,max_bias=0.000000):               108141 runs -     9.40 us/run -    10240 kB/run - 1038.94 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                 270303 runs -     3.71 us/run -      640 kB/run -  164.51 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                 278494 runs -     3.68 us/run -      770 kB/run -  199.55 GB/s
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=1,scale=1.000000,max_bias=0.000000):                 376 runs -  2963.52 us/run -   720896 kB/run -  234.23 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=1,scale=1.000000,max_bias=0.000000):                 49520 runs -    21.00 us/run -    13552 kB/run -  615.68 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=1,scale=1.000000,max_bias=0.000000):               5083 runs -   204.13 us/run -    86016 kB/run -  402.34 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=1,scale=1.000000,max_bias=0.000000):                88196 runs -    12.03 us/run -     6468 kB/run -  513.00 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=1,scale=1.000000,max_bias=0.000000):               102304 runs -    10.00 us/run -    10496 kB/run - 1001.46 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=1,scale=1.000000,max_bias=0.000000):                 278494 runs -     3.61 us/run -      656 kB/run -  173.13 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=1,scale=1.000000,max_bias=0.000000):                 270303 runs -     3.75 us/run -      789 kB/run -  200.55 GB/s
  
CUDA:
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 676 runs -  1491.82 us/run -   655360 kB/run -  422.98 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 21792 runs -    46.83 us/run -    12320 kB/run -  250.96 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):               3280 runs -   333.83 us/run -    81920 kB/run -  234.31 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):                43584 runs -    24.65 us/run -     6160 kB/run -  238.38 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,scale=1.000000,max_bias=0.000000):                42601 runs -    25.40 us/run -    10240 kB/run -  384.60 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                 311258 runs -     3.27 us/run -      640 kB/run -  186.94 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                 229348 runs -     4.44 us/run -      770 kB/run -  165.44 GB/s
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=1,scale=1.000000,max_bias=0.000000):                 470 runs -  2208.65 us/run -   720896 kB/run -  314.29 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=1,scale=1.000000,max_bias=0.000000):                 22284 runs -    48.65 us/run -    13552 kB/run -  265.73 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=1,scale=1.000000,max_bias=0.000000):               3128 runs -   338.29 us/run -    86016 kB/run -  242.78 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=1,scale=1.000000,max_bias=0.000000):                41504 runs -    25.95 us/run -     6468 kB/run -  237.73 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=1,scale=1.000000,max_bias=0.000000):                41561 runs -    25.66 us/run -    10496 kB/run -  390.13 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=1,scale=1.000000,max_bias=0.000000):                 311258 runs -     3.25 us/run -      656 kB/run -  192.73 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=1,scale=1.000000,max_bias=0.000000):                 221157 runs -     4.66 us/run -      789 kB/run -  161.51 GB/s

@jeffbolznv jeffbolznv requested a review from 0cc4m November 15, 2024 02:37
@github-actions github-actions bot added the testing Everything test related label Nov 15, 2024
@0cc4m
Collaborator

0cc4m commented Nov 17, 2024

I can confirm this significantly improves softmax performance for small and medium sizes, but I also see a regression for large sizes. On your GPU and on my AMD Radeon Pro VII it seems to be minimal, but in the 4096, 4096, 5, 1 tests for Intel A770 and Nvidia RTX 3090 I see a big difference. Do you have an idea what causes this?

[Images: performance comparison charts for AMD Radeon (TM) Pro VII, Intel(R) Arc(tm) A770 Graphics (DG2), and NVIDIA GeForce RTX 3090]

@jeffbolznv
Collaborator Author

The big difference between Ampere and Ada that comes to mind is the much larger L2 cache size on Ada. I'm guessing that with the larger workgroup size there are fewer rows being processed at a time and it can hit in the smaller L2 cache. I probably need to bring back the larger block size and use it when the rows are large enough. I have an RTX 3070 I can try this out on tomorrow.
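A rough host-side sketch of what "use the larger block size when the rows are large enough" could look like. The 512 figure comes from this thread; the cutoff value and the function name are purely hypothetical, and the real threshold would have to come from measurements.

```cpp
#include <cstdint>

// The larger workgroup size mentioned in this thread.
constexpr uint32_t LARGE_WG_SIZE = 512;

// Hypothetical cutoff: rows wider than this use the large workgroup.
constexpr uint32_t LARGE_ROW_THRESHOLD = 1024;

// Small and medium rows: one subgroup per row, so more rows are in flight and barriers are cheap.
// Large rows: fewer rows in flight at a time, which may fit better in a smaller L2 cache.
uint32_t pick_wg_size(uint32_t ncols, uint32_t subgroup_size) {
    return ncols > LARGE_ROW_THRESHOLD ? LARGE_WG_SIZE : subgroup_size;
}
```

With a subgroup size of 32, a 77-wide row would use 32 invocations per workgroup and a 4096-wide row would use 512.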

@jeffbolznv
Collaborator Author

I haven't had a chance to test on the RTX 3070, but I tested a wider variety of sizes on a 4070 and was able to see some similar effects. I brought back the 512 workgroup size for larger rows and added some of the missing cases to unroll, and perf is a lot more consistent now. With the previous commit you can see perf dips where there were missing cases to unroll, and for large enough rows the perf fell to about 1/3 of the bandwidth limit, presumably due to not fitting in the cache.

I tested with this code:

    // Row widths 32..992 in steps of 32, 256 rows x 5; no mask, scale 1.0, max_bias 0.0.
    for (uint32_t i = 1; i < 32; ++i) {
        test_cases.emplace_back(new test_soft_max(GGML_TYPE_F32, {32*i, 256, 5, 1}, false, 1.0f, 0.0f));
    }

    // Row widths 128..3968 in steps of 128, 1024 rows x 5.
    for (uint32_t i = 1; i < 32; ++i) {
        test_cases.emplace_back(new test_soft_max(GGML_TYPE_F32, {128*i, 1024, 5, 1}, false, 1.0f, 0.0f));
    }

    // Row widths 1024..19456 in steps of 1024, 4096 rows x 5.
    for (uint32_t i = 1; i < 20; ++i) {
        test_cases.emplace_back(new test_soft_max(GGML_TYPE_F32, {1024*i, 4096, 5, 1}, false, 1.0f, 0.0f));
    }

master:
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 728 runs -  1472.51 us/run -   655360 kB/run -  428.53 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                  5448 runs -   353.28 us/run -    12320 kB/run -   33.26 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):               4510 runs -   238.41 us/run -    81920 kB/run -  328.09 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):                10896 runs -   177.79 us/run -     6160 kB/run -   33.04 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,scale=1.000000,max_bias=0.000000):                13108 runs -    92.53 us/run -    10240 kB/run -  105.56 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                  40955 runs -    29.30 us/run -      640 kB/run -   20.83 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                  40955 runs -    28.86 us/run -      770 kB/run -   25.45 GB/s
  SOFT_MAX(type=f32,ne=[32,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                  40955 runs -    28.52 us/run -      320 kB/run -   10.70 GB/s
  SOFT_MAX(type=f32,ne=[64,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                  40955 runs -    28.67 us/run -      640 kB/run -   21.29 GB/s
  SOFT_MAX(type=f32,ne=[96,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                  40955 runs -    28.84 us/run -      960 kB/run -   31.74 GB/s
  SOFT_MAX(type=f32,ne=[128,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 40955 runs -    28.97 us/run -     1280 kB/run -   42.14 GB/s
  SOFT_MAX(type=f32,ne=[160,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 40955 runs -    29.27 us/run -     1600 kB/run -   52.13 GB/s
  SOFT_MAX(type=f32,ne=[192,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 40955 runs -    29.45 us/run -     1920 kB/run -   62.18 GB/s
  SOFT_MAX(type=f32,ne=[224,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 40955 runs -    29.59 us/run -     2240 kB/run -   72.19 GB/s
  SOFT_MAX(type=f32,ne=[256,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 40955 runs -    29.78 us/run -     2560 kB/run -   81.99 GB/s
  SOFT_MAX(type=f32,ne=[288,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 40955 runs -    30.02 us/run -     2880 kB/run -   91.49 GB/s
  SOFT_MAX(type=f32,ne=[320,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 40955 runs -    30.19 us/run -     3200 kB/run -  101.10 GB/s
  SOFT_MAX(type=f32,ne=[352,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 40955 runs -    30.36 us/run -     3520 kB/run -  110.56 GB/s
  SOFT_MAX(type=f32,ne=[384,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 32764 runs -    30.57 us/run -     3840 kB/run -  119.80 GB/s
  SOFT_MAX(type=f32,ne=[416,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 40330 runs -    30.74 us/run -     4160 kB/run -  129.06 GB/s
  SOFT_MAX(type=f32,ne=[448,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 37450 runs -    31.01 us/run -     4480 kB/run -  137.78 GB/s
  SOFT_MAX(type=f32,ne=[480,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 34955 runs -    31.21 us/run -     4800 kB/run -  146.69 GB/s
  SOFT_MAX(type=f32,ne=[512,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 32770 runs -    31.34 us/run -     5120 kB/run -  155.83 GB/s
  SOFT_MAX(type=f32,ne=[544,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 37014 runs -    32.16 us/run -     5440 kB/run -  161.33 GB/s
  SOFT_MAX(type=f32,ne=[576,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 34956 runs -    32.50 us/run -     5760 kB/run -  169.01 GB/s
  SOFT_MAX(type=f32,ne=[608,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 33114 runs -    32.69 us/run -     6080 kB/run -  177.41 GB/s
  SOFT_MAX(type=f32,ne=[640,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 31458 runs -    32.78 us/run -     6400 kB/run -  186.23 GB/s
  SOFT_MAX(type=f32,ne=[672,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 34958 runs -    33.02 us/run -     6720 kB/run -  194.11 GB/s
  SOFT_MAX(type=f32,ne=[704,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 33369 runs -    33.26 us/run -     7040 kB/run -  201.90 GB/s
  SOFT_MAX(type=f32,ne=[736,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 31920 runs -    33.40 us/run -     7360 kB/run -  210.15 GB/s
  SOFT_MAX(type=f32,ne=[768,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 30590 runs -    33.63 us/run -     7680 kB/run -  217.81 GB/s
  SOFT_MAX(type=f32,ne=[800,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 33560 runs -    33.82 us/run -     8000 kB/run -  225.59 GB/s
  SOFT_MAX(type=f32,ne=[832,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 32264 runs -    34.10 us/run -     8320 kB/run -  232.71 GB/s
  SOFT_MAX(type=f32,ne=[864,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 31072 runs -    34.14 us/run -     8640 kB/run -  241.39 GB/s
  SOFT_MAX(type=f32,ne=[896,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 29960 runs -    34.37 us/run -     8960 kB/run -  248.63 GB/s
  SOFT_MAX(type=f32,ne=[928,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 28928 runs -    34.67 us/run -     9280 kB/run -  255.30 GB/s
  SOFT_MAX(type=f32,ne=[960,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 31464 runs -    34.87 us/run -     9600 kB/run -  262.61 GB/s
  SOFT_MAX(type=f32,ne=[992,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 30447 runs -    35.01 us/run -     9920 kB/run -  270.28 GB/s
  SOFT_MAX(type=f32,ne=[128,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                13108 runs -    89.83 us/run -     5120 kB/run -   54.36 GB/s
  SOFT_MAX(type=f32,ne=[256,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                13108 runs -    92.59 us/run -    10240 kB/run -  105.49 GB/s
  SOFT_MAX(type=f32,ne=[384,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                10925 runs -    95.41 us/run -    15360 kB/run -  153.57 GB/s
  SOFT_MAX(type=f32,ne=[512,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                11473 runs -    98.35 us/run -    20480 kB/run -  198.66 GB/s
  SOFT_MAX(type=f32,ne=[640,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                10488 runs -   102.81 us/run -    25600 kB/run -  237.56 GB/s
  SOFT_MAX(type=f32,ne=[768,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 9837 runs -   105.60 us/run -    30720 kB/run -  277.57 GB/s
  SOFT_MAX(type=f32,ne=[896,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 9370 runs -   108.72 us/run -    35840 kB/run -  314.54 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                8200 runs -   122.10 us/run -    40960 kB/run -  320.13 GB/s
  SOFT_MAX(type=f32,ne=[1152,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                8019 runs -   135.38 us/run -    46080 kB/run -  324.84 GB/s
  SOFT_MAX(type=f32,ne=[1280,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                7216 runs -   141.99 us/run -    51200 kB/run -  344.15 GB/s
  SOFT_MAX(type=f32,ne=[1408,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                7152 runs -   151.77 us/run -    56320 kB/run -  354.20 GB/s
  SOFT_MAX(type=f32,ne=[1536,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                6564 runs -   154.22 us/run -    61440 kB/run -  380.28 GB/s
  SOFT_MAX(type=f32,ne=[1664,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                6060 runs -   177.12 us/run -    66560 kB/run -  358.73 GB/s
  SOFT_MAX(type=f32,ne=[1792,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                5628 runs -   183.94 us/run -    71680 kB/run -  372.04 GB/s
  SOFT_MAX(type=f32,ne=[1920,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                5244 runs -   195.45 us/run -    76800 kB/run -  375.16 GB/s
  SOFT_MAX(type=f32,ne=[2048,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                5330 runs -   197.30 us/run -    81920 kB/run -  396.45 GB/s
  SOFT_MAX(type=f32,ne=[2176,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4632 runs -   220.91 us/run -    87040 kB/run -  376.25 GB/s
  SOFT_MAX(type=f32,ne=[2304,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4745 runs -   222.43 us/run -    92160 kB/run -  395.67 GB/s
  SOFT_MAX(type=f32,ne=[2432,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4485 runs -   232.77 us/run -    97280 kB/run -  399.14 GB/s
  SOFT_MAX(type=f32,ne=[2560,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4264 runs -   237.07 us/run -   102400 kB/run -  412.56 GB/s
  SOFT_MAX(type=f32,ne=[2688,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4069 runs -   257.08 us/run -   107520 kB/run -  399.50 GB/s
  SOFT_MAX(type=f32,ne=[2816,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3874 runs -   261.81 us/run -   112640 kB/run -  410.99 GB/s
  SOFT_MAX(type=f32,ne=[2944,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3705 runs -   272.23 us/run -   117760 kB/run -  413.27 GB/s
  SOFT_MAX(type=f32,ne=[3072,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3836 runs -   278.63 us/run -   122880 kB/run -  421.35 GB/s
  SOFT_MAX(type=f32,ne=[3200,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3419 runs -   297.58 us/run -   128000 kB/run -  411.00 GB/s
  SOFT_MAX(type=f32,ne=[3328,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3289 runs -   305.28 us/run -   133120 kB/run -  416.68 GB/s
  SOFT_MAX(type=f32,ne=[3456,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3159 runs -   316.94 us/run -   138240 kB/run -  416.82 GB/s
  SOFT_MAX(type=f32,ne=[3584,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3290 runs -   324.31 us/run -   143360 kB/run -  422.47 GB/s
  SOFT_MAX(type=f32,ne=[3712,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2938 runs -   340.72 us/run -   148480 kB/run -  416.51 GB/s
  SOFT_MAX(type=f32,ne=[3840,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3066 runs -   348.33 us/run -   153600 kB/run -  421.49 GB/s
  SOFT_MAX(type=f32,ne=[3968,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2968 runs -   360.64 us/run -   158720 kB/run -  420.71 GB/s
  SOFT_MAX(type=f32,ne=[1024,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2255 runs -   472.98 us/run -   163840 kB/run -  331.16 GB/s
  SOFT_MAX(type=f32,ne=[2048,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                1339 runs -   771.48 us/run -   327680 kB/run -  407.03 GB/s
  SOFT_MAX(type=f32,ne=[3072,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 966 runs -  1101.89 us/run -   491520 kB/run -  428.49 GB/s
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 728 runs -  1472.70 us/run -   655360 kB/run -  428.47 GB/s
  SOFT_MAX(type=f32,ne=[5120,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 574 runs -  1832.80 us/run -   819200 kB/run -  431.46 GB/s
  SOFT_MAX(type=f32,ne=[6144,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 455 runs -  2198.60 us/run -   983040 kB/run -  432.50 GB/s
  SOFT_MAX(type=f32,ne=[7168,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 390 runs -  2564.36 us/run -  1146880 kB/run -  433.63 GB/s
  SOFT_MAX(type=f32,ne=[8192,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 364 runs -  2930.23 us/run -  1310720 kB/run -  434.79 GB/s
  SOFT_MAX(type=f32,ne=[9216,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 299 runs -  3375.61 us/run -  1474560 kB/run -  425.65 GB/s
  SOFT_MAX(type=f32,ne=[10240,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                252 runs -  4080.17 us/run -  1638400 kB/run -  392.07 GB/s
  SOFT_MAX(type=f32,ne=[11264,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                247 runs -  4067.74 us/run -  1802240 kB/run -  433.65 GB/s
  SOFT_MAX(type=f32,ne=[12288,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                234 runs -  4394.57 us/run -  1966080 kB/run -  438.51 GB/s
  SOFT_MAX(type=f32,ne=[13312,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                208 runs -  5023.31 us/run -  2129920 kB/run -  417.00 GB/s
  SOFT_MAX(type=f32,ne=[14336,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                210 runs -  5123.70 us/run -  2293760 kB/run -  441.17 GB/s
  SOFT_MAX(type=f32,ne=[15360,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                182 runs -  5494.51 us/run -  2457600 kB/run -  441.80 GB/s
  SOFT_MAX(type=f32,ne=[16384,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                182 runs -  5851.55 us/run -  2621440 kB/run -  443.67 GB/s
  SOFT_MAX(type=f32,ne=[17408,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                169 runs -  6229.83 us/run -  2785280 kB/run -  442.77 GB/s
  SOFT_MAX(type=f32,ne=[18432,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                156 runs -  6600.72 us/run -  2949120 kB/run -  443.84 GB/s
  SOFT_MAX(type=f32,ne=[19456,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                154 runs -  6955.88 us/run -  3112960 kB/run -  446.20 GB/s

first commit in this PR:
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 676 runs -  1553.47 us/run -   655360 kB/run -  406.19 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 46308 runs -    21.81 us/run -    12320 kB/run -  538.71 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):               5330 runs -   199.98 us/run -    81920 kB/run -  391.15 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):                81720 runs -    12.50 us/run -     6160 kB/run -  470.07 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,scale=1.000000,max_bias=0.000000):               108141 runs -     9.49 us/run -    10240 kB/run - 1029.62 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                 270303 runs -     3.81 us/run -      640 kB/run -  160.38 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                 270303 runs -     3.72 us/run -      770 kB/run -  197.28 GB/s
  SOFT_MAX(type=f32,ne=[32,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 278494 runs -     3.66 us/run -      320 kB/run -   83.39 GB/s
  SOFT_MAX(type=f32,ne=[64,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 270303 runs -     3.75 us/run -      640 kB/run -  162.57 GB/s
  SOFT_MAX(type=f32,ne=[96,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 270303 runs -     3.75 us/run -      960 kB/run -  244.42 GB/s
  SOFT_MAX(type=f32,ne=[128,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                262112 runs -     3.93 us/run -     1280 kB/run -  310.65 GB/s
  SOFT_MAX(type=f32,ne=[160,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                237539 runs -     4.23 us/run -     1600 kB/run -  360.33 GB/s
  SOFT_MAX(type=f32,ne=[192,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                237539 runs -     4.28 us/run -     1920 kB/run -  428.05 GB/s
  SOFT_MAX(type=f32,ne=[224,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                229348 runs -     4.51 us/run -     2240 kB/run -  474.08 GB/s
  SOFT_MAX(type=f32,ne=[256,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                221157 runs -     4.59 us/run -     2560 kB/run -  532.17 GB/s
  SOFT_MAX(type=f32,ne=[288,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 90101 runs -    11.20 us/run -     2880 kB/run -  245.17 GB/s
  SOFT_MAX(type=f32,ne=[320,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 90101 runs -    11.97 us/run -     3200 kB/run -  255.04 GB/s
  SOFT_MAX(type=f32,ne=[352,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 81910 runs -    13.02 us/run -     3520 kB/run -  257.80 GB/s
  SOFT_MAX(type=f32,ne=[384,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 73719 runs -    13.71 us/run -     3840 kB/run -  267.19 GB/s
  SOFT_MAX(type=f32,ne=[416,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 72594 runs -    14.75 us/run -     4160 kB/run -  269.07 GB/s
  SOFT_MAX(type=f32,ne=[448,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 67410 runs -    15.59 us/run -     4480 kB/run -  274.14 GB/s
  SOFT_MAX(type=f32,ne=[480,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 62919 runs -    16.56 us/run -     4800 kB/run -  276.51 GB/s
  SOFT_MAX(type=f32,ne=[512,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                157296 runs -     6.47 us/run -     5120 kB/run -  754.32 GB/s
  SOFT_MAX(type=f32,ne=[544,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 61690 runs -    17.92 us/run -     5440 kB/run -  289.55 GB/s
  SOFT_MAX(type=f32,ne=[576,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 58260 runs -    18.40 us/run -     5760 kB/run -  298.57 GB/s
  SOFT_MAX(type=f32,ne=[608,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 55190 runs -    19.03 us/run -     6080 kB/run -  304.70 GB/s
  SOFT_MAX(type=f32,ne=[640,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 52430 runs -    19.35 us/run -     6400 kB/run -  315.41 GB/s
  SOFT_MAX(type=f32,ne=[672,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 54934 runs -    19.92 us/run -     6720 kB/run -  321.75 GB/s
  SOFT_MAX(type=f32,ne=[704,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 52437 runs -    20.48 us/run -     7040 kB/run -  327.93 GB/s
  SOFT_MAX(type=f32,ne=[736,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 50160 runs -    21.11 us/run -     7360 kB/run -  332.54 GB/s
  SOFT_MAX(type=f32,ne=[768,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 48070 runs -    21.60 us/run -     7680 kB/run -  339.07 GB/s
  SOFT_MAX(type=f32,ne=[800,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 46145 runs -    22.21 us/run -     8000 kB/run -  343.56 GB/s
  SOFT_MAX(type=f32,ne=[832,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 44363 runs -    22.93 us/run -     8320 kB/run -  346.13 GB/s
  SOFT_MAX(type=f32,ne=[864,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 42724 runs -    23.68 us/run -     8640 kB/run -  347.94 GB/s
  SOFT_MAX(type=f32,ne=[896,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 41195 runs -    24.34 us/run -     8960 kB/run -  351.09 GB/s
  SOFT_MAX(type=f32,ne=[928,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 43392 runs -    25.07 us/run -     9280 kB/run -  353.05 GB/s
  SOFT_MAX(type=f32,ne=[960,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 41952 runs -    25.78 us/run -     9600 kB/run -  355.23 GB/s
  SOFT_MAX(type=f32,ne=[992,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 40596 runs -    26.59 us/run -     9920 kB/run -  355.82 GB/s
  SOFT_MAX(type=f32,ne=[128,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):               124526 runs -     8.23 us/run -     5120 kB/run -  593.58 GB/s
  SOFT_MAX(type=f32,ne=[256,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):               108141 runs -     9.50 us/run -    10240 kB/run - 1027.77 GB/s
  SOFT_MAX(type=f32,ne=[384,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                30590 runs -    35.03 us/run -    15360 kB/run -  418.27 GB/s
  SOFT_MAX(type=f32,ne=[512,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                67199 runs -    15.03 us/run -    20480 kB/run - 1300.03 GB/s
  SOFT_MAX(type=f32,ne=[640,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                19665 runs -    51.44 us/run -    25600 kB/run -  474.75 GB/s
  SOFT_MAX(type=f32,ne=[768,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                17488 runs -    59.06 us/run -    30720 kB/run -  496.26 GB/s
  SOFT_MAX(type=f32,ne=[896,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                14992 runs -    67.80 us/run -    35840 kB/run -  504.42 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                9840 runs -   108.17 us/run -    40960 kB/run -  361.33 GB/s
  SOFT_MAX(type=f32,ne=[1152,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                8748 runs -   123.30 us/run -    46080 kB/run -  356.67 GB/s
  SOFT_MAX(type=f32,ne=[1280,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                7872 runs -   137.52 us/run -    51200 kB/run -  355.34 GB/s
  SOFT_MAX(type=f32,ne=[1408,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                7152 runs -   150.21 us/run -    56320 kB/run -  357.87 GB/s
  SOFT_MAX(type=f32,ne=[1536,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                6564 runs -   162.54 us/run -    61440 kB/run -  360.82 GB/s
  SOFT_MAX(type=f32,ne=[1664,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                6060 runs -   175.00 us/run -    66560 kB/run -  363.09 GB/s
  SOFT_MAX(type=f32,ne=[1792,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                5628 runs -   187.00 us/run -    71680 kB/run -  365.94 GB/s
  SOFT_MAX(type=f32,ne=[1920,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                5244 runs -   201.43 us/run -    76800 kB/run -  364.04 GB/s
  SOFT_MAX(type=f32,ne=[2048,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4920 runs -   212.24 us/run -    81920 kB/run -  368.54 GB/s
  SOFT_MAX(type=f32,ne=[2176,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4632 runs -   222.64 us/run -    87040 kB/run -  373.32 GB/s
  SOFT_MAX(type=f32,ne=[2304,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4380 runs -   234.01 us/run -    92160 kB/run -  376.10 GB/s
  SOFT_MAX(type=f32,ne=[2432,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4140 runs -   248.35 us/run -    97280 kB/run -  374.10 GB/s
  SOFT_MAX(type=f32,ne=[2560,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3936 runs -   261.68 us/run -   102400 kB/run -  373.76 GB/s
  SOFT_MAX(type=f32,ne=[2688,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3756 runs -   278.45 us/run -   107520 kB/run -  368.84 GB/s
  SOFT_MAX(type=f32,ne=[2816,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3576 runs -   291.87 us/run -   112640 kB/run -  368.66 GB/s
  SOFT_MAX(type=f32,ne=[2944,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3420 runs -   306.99 us/run -   117760 kB/run -  366.46 GB/s
  SOFT_MAX(type=f32,ne=[3072,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3288 runs -   315.66 us/run -   122880 kB/run -  371.93 GB/s
  SOFT_MAX(type=f32,ne=[3200,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3156 runs -   331.83 us/run -   128000 kB/run -  368.57 GB/s
  SOFT_MAX(type=f32,ne=[3328,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3036 runs -   339.35 us/run -   133120 kB/run -  374.85 GB/s
  SOFT_MAX(type=f32,ne=[3456,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2916 runs -   354.57 us/run -   138240 kB/run -  372.59 GB/s
  SOFT_MAX(type=f32,ne=[3584,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2820 runs -   365.14 us/run -   143360 kB/run -  375.22 GB/s
  SOFT_MAX(type=f32,ne=[3712,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2712 runs -   381.60 us/run -   148480 kB/run -  371.89 GB/s
  SOFT_MAX(type=f32,ne=[3840,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2628 runs -   392.71 us/run -   153600 kB/run -  373.86 GB/s
  SOFT_MAX(type=f32,ne=[3968,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2544 runs -   413.59 us/run -   158720 kB/run -  366.84 GB/s
  SOFT_MAX(type=f32,ne=[1024,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2665 runs -   381.96 us/run -   163840 kB/run -  410.07 GB/s
  SOFT_MAX(type=f32,ne=[2048,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                1339 runs -   764.97 us/run -   327680 kB/run -  410.49 GB/s
  SOFT_MAX(type=f32,ne=[3072,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 897 runs -  1122.21 us/run -   491520 kB/run -  420.73 GB/s
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 676 runs -  1531.61 us/run -   655360 kB/run -  411.99 GB/s
  SOFT_MAX(type=f32,ne=[5120,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 492 runs -  2197.07 us/run -   819200 kB/run -  359.92 GB/s
  SOFT_MAX(type=f32,ne=[6144,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 315 runs -  3413.09 us/run -   983040 kB/run -  278.60 GB/s
  SOFT_MAX(type=f32,ne=[7168,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 240 runs -  4763.44 us/run -  1146880 kB/run -  233.44 GB/s
  SOFT_MAX(type=f32,ne=[8192,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 182 runs -  6243.60 us/run -  1310720 kB/run -  204.05 GB/s
  SOFT_MAX(type=f32,ne=[9216,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 138 runs -  8293.17 us/run -  1474560 kB/run -  173.25 GB/s
  SOFT_MAX(type=f32,ne=[10240,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                105 runs - 10097.20 us/run -  1638400 kB/run -  158.43 GB/s
  SOFT_MAX(type=f32,ne=[11264,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 95 runs - 11999.02 us/run -  1802240 kB/run -  147.01 GB/s
  SOFT_MAX(type=f32,ne=[12288,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 90 runs - 13278.80 us/run -  1966080 kB/run -  145.12 GB/s
  SOFT_MAX(type=f32,ne=[13312,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 80 runs - 13390.11 us/run -  2129920 kB/run -  156.44 GB/s
  SOFT_MAX(type=f32,ne=[14336,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 75 runs - 14477.32 us/run -  2293760 kB/run -  156.14 GB/s
  SOFT_MAX(type=f32,ne=[15360,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 70 runs - 15585.04 us/run -  2457600 kB/run -  155.76 GB/s
  SOFT_MAX(type=f32,ne=[16384,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 65 runs - 16686.80 us/run -  2621440 kB/run -  155.58 GB/s
  SOFT_MAX(type=f32,ne=[17408,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 65 runs - 17906.31 us/run -  2785280 kB/run -  154.05 GB/s
  SOFT_MAX(type=f32,ne=[18432,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 60 runs - 18854.50 us/run -  2949120 kB/run -  155.38 GB/s
  SOFT_MAX(type=f32,ne=[19456,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 55 runs - 20265.47 us/run -  3112960 kB/run -  153.15 GB/s

second commit in this PR:
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 728 runs -  1463.61 us/run -   655360 kB/run -  431.13 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 46308 runs -    22.76 us/run -    12320 kB/run -  516.26 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):               5330 runs -   188.12 us/run -    81920 kB/run -  415.80 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):                81720 runs -    12.97 us/run -     6160 kB/run -  452.85 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,scale=1.000000,max_bias=0.000000):               104864 runs -     9.54 us/run -    10240 kB/run - 1023.40 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                 270303 runs -     3.79 us/run -      640 kB/run -  160.98 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                 270303 runs -     3.79 us/run -      770 kB/run -  193.87 GB/s
  SOFT_MAX(type=f32,ne=[32,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 278494 runs -     3.60 us/run -      320 kB/run -   84.87 GB/s
  SOFT_MAX(type=f32,ne=[64,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 270303 runs -     3.72 us/run -      640 kB/run -  164.18 GB/s
  SOFT_MAX(type=f32,ne=[96,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 262112 runs -     3.86 us/run -      960 kB/run -  236.99 GB/s
  SOFT_MAX(type=f32,ne=[128,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                253921 runs -     4.01 us/run -     1280 kB/run -  304.28 GB/s
  SOFT_MAX(type=f32,ne=[160,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                237539 runs -     4.26 us/run -     1600 kB/run -  358.32 GB/s
  SOFT_MAX(type=f32,ne=[192,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                245730 runs -     4.15 us/run -     1920 kB/run -  441.48 GB/s
  SOFT_MAX(type=f32,ne=[224,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                237539 runs -     4.34 us/run -     2240 kB/run -  492.43 GB/s
  SOFT_MAX(type=f32,ne=[256,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                221157 runs -     4.54 us/run -     2560 kB/run -  538.37 GB/s
  SOFT_MAX(type=f32,ne=[288,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                204775 runs -     5.01 us/run -     2880 kB/run -  548.47 GB/s
  SOFT_MAX(type=f32,ne=[320,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                196584 runs -     5.19 us/run -     3200 kB/run -  588.61 GB/s
  SOFT_MAX(type=f32,ne=[352,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                188393 runs -     5.37 us/run -     3520 kB/run -  625.75 GB/s
  SOFT_MAX(type=f32,ne=[384,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                180202 runs -     5.55 us/run -     3840 kB/run -  659.69 GB/s
  SOFT_MAX(type=f32,ne=[416,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                177452 runs -     5.72 us/run -     4160 kB/run -  693.33 GB/s
  SOFT_MAX(type=f32,ne=[448,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                172270 runs -     5.97 us/run -     4480 kB/run -  715.77 GB/s
  SOFT_MAX(type=f32,ne=[480,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                167784 runs -     6.12 us/run -     4800 kB/run -  747.63 GB/s
  SOFT_MAX(type=f32,ne=[512,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                157296 runs -     6.52 us/run -     5120 kB/run -  748.91 GB/s
  SOFT_MAX(type=f32,ne=[544,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                129549 runs -     7.79 us/run -     5440 kB/run -  665.79 GB/s
  SOFT_MAX(type=f32,ne=[576,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                128172 runs -     8.14 us/run -     5760 kB/run -  674.67 GB/s
  SOFT_MAX(type=f32,ne=[608,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                121418 runs -     8.41 us/run -     6080 kB/run -  689.33 GB/s
  SOFT_MAX(type=f32,ne=[640,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                115346 runs -     8.75 us/run -     6400 kB/run -  697.65 GB/s
  SOFT_MAX(type=f32,ne=[672,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                114862 runs -     9.09 us/run -     6720 kB/run -  705.28 GB/s
  SOFT_MAX(type=f32,ne=[704,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                104874 runs -     9.67 us/run -     7040 kB/run -  694.51 GB/s
  SOFT_MAX(type=f32,ne=[736,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                100320 runs -     9.99 us/run -     7360 kB/run -  702.87 GB/s
  SOFT_MAX(type=f32,ne=[768,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 96140 runs -    10.51 us/run -     7680 kB/run -  697.10 GB/s
  SOFT_MAX(type=f32,ne=[800,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 96485 runs -    10.81 us/run -     8000 kB/run -  706.14 GB/s
  SOFT_MAX(type=f32,ne=[832,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 88726 runs -    11.37 us/run -     8320 kB/run -  697.75 GB/s
  SOFT_MAX(type=f32,ne=[864,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 89332 runs -    11.70 us/run -     8640 kB/run -  704.41 GB/s
  SOFT_MAX(type=f32,ne=[896,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 82390 runs -    12.43 us/run -     8960 kB/run -  687.40 GB/s
  SOFT_MAX(type=f32,ne=[928,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 79552 runs -    12.91 us/run -     9280 kB/run -  685.83 GB/s
  SOFT_MAX(type=f32,ne=[960,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 76912 runs -    13.52 us/run -     9600 kB/run -  677.37 GB/s
  SOFT_MAX(type=f32,ne=[992,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 71043 runs -    14.11 us/run -     9920 kB/run -  670.67 GB/s
  SOFT_MAX(type=f32,ne=[128,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):               124526 runs -     8.36 us/run -     5120 kB/run -  583.90 GB/s
  SOFT_MAX(type=f32,ne=[256,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):               108141 runs -     9.53 us/run -    10240 kB/run - 1025.19 GB/s
  SOFT_MAX(type=f32,ne=[384,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                80845 runs -    12.70 us/run -    15360 kB/run - 1153.93 GB/s
  SOFT_MAX(type=f32,ne=[512,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                67199 runs -    15.17 us/run -    20480 kB/run - 1288.24 GB/s
  SOFT_MAX(type=f32,ne=[640,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                44574 runs -    22.75 us/run -    25600 kB/run - 1073.76 GB/s
  SOFT_MAX(type=f32,ne=[768,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                36069 runs -    28.17 us/run -    30720 kB/run - 1040.64 GB/s
  SOFT_MAX(type=f32,ne=[896,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                28110 runs -    36.16 us/run -    35840 kB/run -  945.68 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):               12300 runs -    85.08 us/run -    40960 kB/run -  459.43 GB/s
  SOFT_MAX(type=f32,ne=[1152,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                8019 runs -   133.29 us/run -    46080 kB/run -  329.91 GB/s
  SOFT_MAX(type=f32,ne=[1280,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                7872 runs -   134.76 us/run -    51200 kB/run -  362.62 GB/s
  SOFT_MAX(type=f32,ne=[1408,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                7152 runs -   141.15 us/run -    56320 kB/run -  380.85 GB/s
  SOFT_MAX(type=f32,ne=[1536,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                7111 runs -   147.80 us/run -    61440 kB/run -  396.81 GB/s
  SOFT_MAX(type=f32,ne=[1664,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                6565 runs -   156.84 us/run -    66560 kB/run -  405.12 GB/s
  SOFT_MAX(type=f32,ne=[1792,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                6566 runs -   161.69 us/run -    71680 kB/run -  423.24 GB/s
  SOFT_MAX(type=f32,ne=[1920,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                6118 runs -   173.03 us/run -    76800 kB/run -  423.78 GB/s
  SOFT_MAX(type=f32,ne=[2048,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                5740 runs -   180.95 us/run -    81920 kB/run -  432.28 GB/s
  SOFT_MAX(type=f32,ne=[2176,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                5404 runs -   192.27 us/run -    87040 kB/run -  432.29 GB/s
  SOFT_MAX(type=f32,ne=[2304,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                5110 runs -   203.37 us/run -    92160 kB/run -  432.77 GB/s
  SOFT_MAX(type=f32,ne=[2432,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4830 runs -   214.68 us/run -    97280 kB/run -  432.77 GB/s
  SOFT_MAX(type=f32,ne=[2560,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4592 runs -   225.64 us/run -   102400 kB/run -  433.46 GB/s
  SOFT_MAX(type=f32,ne=[2688,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4382 runs -   237.18 us/run -   107520 kB/run -  433.02 GB/s
  SOFT_MAX(type=f32,ne=[2816,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4172 runs -   248.21 us/run -   112640 kB/run -  433.52 GB/s
  SOFT_MAX(type=f32,ne=[2944,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3990 runs -   259.43 us/run -   117760 kB/run -  433.65 GB/s
  SOFT_MAX(type=f32,ne=[3072,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3836 runs -   275.99 us/run -   122880 kB/run -  425.38 GB/s
  SOFT_MAX(type=f32,ne=[3200,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3682 runs -   287.38 us/run -   128000 kB/run -  425.58 GB/s
  SOFT_MAX(type=f32,ne=[3328,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3542 runs -   298.59 us/run -   133120 kB/run -  426.02 GB/s
  SOFT_MAX(type=f32,ne=[3456,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3402 runs -   310.16 us/run -   138240 kB/run -  425.93 GB/s
  SOFT_MAX(type=f32,ne=[3584,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3290 runs -   316.09 us/run -   143360 kB/run -  433.45 GB/s
  SOFT_MAX(type=f32,ne=[3712,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3164 runs -   327.07 us/run -   148480 kB/run -  433.90 GB/s
  SOFT_MAX(type=f32,ne=[3840,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3066 runs -   338.26 us/run -   153600 kB/run -  434.04 GB/s
  SOFT_MAX(type=f32,ne=[3968,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2968 runs -   349.41 us/run -   158720 kB/run -  434.23 GB/s
  SOFT_MAX(type=f32,ne=[1024,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2870 runs -   364.87 us/run -   163840 kB/run -  429.27 GB/s
  SOFT_MAX(type=f32,ne=[2048,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                1442 runs -   728.23 us/run -   327680 kB/run -  431.21 GB/s
  SOFT_MAX(type=f32,ne=[3072,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 966 runs -  1099.38 us/run -   491520 kB/run -  429.47 GB/s
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 728 runs -  1463.92 us/run -   655360 kB/run -  431.04 GB/s
  SOFT_MAX(type=f32,ne=[5120,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 574 runs -  1829.72 us/run -   819200 kB/run -  432.19 GB/s
  SOFT_MAX(type=f32,ne=[6144,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 490 runs -  2193.34 us/run -   983040 kB/run -  433.54 GB/s
  SOFT_MAX(type=f32,ne=[7168,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 420 runs -  2538.13 us/run -  1146880 kB/run -  438.11 GB/s
  SOFT_MAX(type=f32,ne=[8192,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 364 runs -  2877.24 us/run -  1310720 kB/run -  442.80 GB/s
  SOFT_MAX(type=f32,ne=[9216,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 322 runs -  3236.41 us/run -  1474560 kB/run -  443.96 GB/s
  SOFT_MAX(type=f32,ne=[10240,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                294 runs -  3595.31 us/run -  1638400 kB/run -  444.94 GB/s
  SOFT_MAX(type=f32,ne=[11264,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                247 runs -  4313.91 us/run -  1802240 kB/run -  408.91 GB/s
  SOFT_MAX(type=f32,ne=[12288,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                234 runs -  4309.84 us/run -  1966080 kB/run -  447.14 GB/s
  SOFT_MAX(type=f32,ne=[13312,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                224 runs -  4756.63 us/run -  2129920 kB/run -  440.38 GB/s
  SOFT_MAX(type=f32,ne=[14336,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                210 runs -  5118.93 us/run -  2293760 kB/run -  441.58 GB/s
  SOFT_MAX(type=f32,ne=[15360,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                196 runs -  5492.59 us/run -  2457600 kB/run -  441.95 GB/s
  SOFT_MAX(type=f32,ne=[16384,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                182 runs -  5867.84 us/run -  2621440 kB/run -  442.44 GB/s
  SOFT_MAX(type=f32,ne=[17408,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                169 runs -  6232.28 us/run -  2785280 kB/run -  442.60 GB/s
  SOFT_MAX(type=f32,ne=[18432,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                132 runs -  8092.53 us/run -  2949120 kB/run -  362.02 GB/s
  SOFT_MAX(type=f32,ne=[19456,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                132 runs -  7732.15 us/run -  3112960 kB/run -  401.40 GB/s

cuda:
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 676 runs -  1492.47 us/run -   655360 kB/run -  422.79 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 21792 runs -    46.52 us/run -    12320 kB/run -  252.61 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):               3280 runs -   334.33 us/run -    81920 kB/run -  233.96 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,scale=1.000000,max_bias=0.000000):                43584 runs -    24.37 us/run -     6160 kB/run -  241.10 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,scale=1.000000,max_bias=0.000000):                39324 runs -    25.49 us/run -    10240 kB/run -  383.25 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                 319449 runs -     3.19 us/run -      640 kB/run -  191.31 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,scale=1.000000,max_bias=0.000000):                 229348 runs -     4.46 us/run -      770 kB/run -  164.75 GB/s
  SOFT_MAX(type=f32,ne=[32,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 393168 runs -     2.56 us/run -      320 kB/run -  119.17 GB/s
  SOFT_MAX(type=f32,ne=[64,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 319449 runs -     3.16 us/run -      640 kB/run -  192.99 GB/s
  SOFT_MAX(type=f32,ne=[96,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 229348 runs -     4.45 us/run -      960 kB/run -  205.82 GB/s
  SOFT_MAX(type=f32,ne=[128,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                221157 runs -     4.63 us/run -     1280 kB/run -  263.62 GB/s
  SOFT_MAX(type=f32,ne=[160,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                147438 runs -     7.13 us/run -     1600 kB/run -  214.16 GB/s
  SOFT_MAX(type=f32,ne=[192,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                139247 runs -     7.33 us/run -     1920 kB/run -  249.76 GB/s
  SOFT_MAX(type=f32,ne=[224,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                139247 runs -     7.44 us/run -     2240 kB/run -  287.26 GB/s
  SOFT_MAX(type=f32,ne=[256,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                131056 runs -     7.66 us/run -     2560 kB/run -  318.64 GB/s
  SOFT_MAX(type=f32,ne=[288,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 81910 runs -    13.00 us/run -     2880 kB/run -  211.27 GB/s
  SOFT_MAX(type=f32,ne=[320,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 81910 runs -    13.11 us/run -     3200 kB/run -  232.74 GB/s
  SOFT_MAX(type=f32,ne=[352,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 81910 runs -    13.25 us/run -     3520 kB/run -  253.41 GB/s
  SOFT_MAX(type=f32,ne=[384,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 81910 runs -    13.42 us/run -     3840 kB/run -  272.85 GB/s
  SOFT_MAX(type=f32,ne=[416,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 72594 runs -    13.78 us/run -     4160 kB/run -  287.96 GB/s
  SOFT_MAX(type=f32,ne=[448,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 74900 runs -    13.89 us/run -     4480 kB/run -  307.71 GB/s
  SOFT_MAX(type=f32,ne=[480,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 76901 runs -    13.86 us/run -     4800 kB/run -  330.23 GB/s
  SOFT_MAX(type=f32,ne=[512,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 72094 runs -    13.97 us/run -     5120 kB/run -  349.43 GB/s
  SOFT_MAX(type=f32,ne=[544,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 30845 runs -    32.64 us/run -     5440 kB/run -  158.94 GB/s
  SOFT_MAX(type=f32,ne=[576,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 34956 runs -    32.66 us/run -     5760 kB/run -  168.19 GB/s
  SOFT_MAX(type=f32,ne=[608,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 33114 runs -    32.75 us/run -     6080 kB/run -  177.09 GB/s
  SOFT_MAX(type=f32,ne=[640,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 31458 runs -    32.99 us/run -     6400 kB/run -  185.03 GB/s
  SOFT_MAX(type=f32,ne=[672,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 34958 runs -    33.07 us/run -     6720 kB/run -  193.81 GB/s
  SOFT_MAX(type=f32,ne=[704,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 33369 runs -    33.63 us/run -     7040 kB/run -  199.66 GB/s
  SOFT_MAX(type=f32,ne=[736,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 31920 runs -    33.75 us/run -     7360 kB/run -  208.01 GB/s
  SOFT_MAX(type=f32,ne=[768,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 30590 runs -    33.70 us/run -     7680 kB/run -  217.39 GB/s
  SOFT_MAX(type=f32,ne=[800,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 33560 runs -    33.93 us/run -     8000 kB/run -  224.88 GB/s
  SOFT_MAX(type=f32,ne=[832,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 32264 runs -    34.16 us/run -     8320 kB/run -  232.29 GB/s
  SOFT_MAX(type=f32,ne=[864,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 31072 runs -    34.41 us/run -     8640 kB/run -  239.49 GB/s
  SOFT_MAX(type=f32,ne=[896,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 29960 runs -    34.58 us/run -     8960 kB/run -  247.17 GB/s
  SOFT_MAX(type=f32,ne=[928,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 28928 runs -    34.97 us/run -     9280 kB/run -  253.12 GB/s
  SOFT_MAX(type=f32,ne=[960,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 31464 runs -    35.10 us/run -     9600 kB/run -  260.89 GB/s
  SOFT_MAX(type=f32,ne=[992,256,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 30447 runs -    35.29 us/run -     9920 kB/run -  268.13 GB/s
  SOFT_MAX(type=f32,ne=[128,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                78648 runs -    13.82 us/run -     5120 kB/run -  353.26 GB/s
  SOFT_MAX(type=f32,ne=[256,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                39324 runs -    25.53 us/run -    10240 kB/run -  382.54 GB/s
  SOFT_MAX(type=f32,ne=[384,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                21850 runs -    47.48 us/run -    15360 kB/run -  308.60 GB/s
  SOFT_MAX(type=f32,ne=[512,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                21307 runs -    50.40 us/run -    20480 kB/run -  387.65 GB/s
  SOFT_MAX(type=f32,ne=[640,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 9177 runs -   125.86 us/run -    25600 kB/run -  194.05 GB/s
  SOFT_MAX(type=f32,ne=[768,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 8744 runs -   128.85 us/run -    30720 kB/run -  227.47 GB/s
  SOFT_MAX(type=f32,ne=[896,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 8433 runs -   132.78 us/run -    35840 kB/run -  257.55 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                6560 runs -   169.05 us/run -    40960 kB/run -  231.22 GB/s
  SOFT_MAX(type=f32,ne=[1152,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                5103 runs -   202.03 us/run -    46080 kB/run -  217.67 GB/s
  SOFT_MAX(type=f32,ne=[1280,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                5248 runs -   206.87 us/run -    51200 kB/run -  236.21 GB/s
  SOFT_MAX(type=f32,ne=[1408,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4768 runs -   214.96 us/run -    56320 kB/run -  250.07 GB/s
  SOFT_MAX(type=f32,ne=[1536,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4923 runs -   216.37 us/run -    61440 kB/run -  271.05 GB/s
  SOFT_MAX(type=f32,ne=[1664,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4545 runs -   222.75 us/run -    66560 kB/run -  285.25 GB/s
  SOFT_MAX(type=f32,ne=[1792,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4690 runs -   226.99 us/run -    71680 kB/run -  301.48 GB/s
  SOFT_MAX(type=f32,ne=[1920,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4370 runs -   234.72 us/run -    76800 kB/run -  312.39 GB/s
  SOFT_MAX(type=f32,ne=[2048,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                4920 runs -   208.30 us/run -    81920 kB/run -  375.51 GB/s
  SOFT_MAX(type=f32,ne=[2176,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3860 runs -   276.20 us/run -    87040 kB/run -  300.92 GB/s
  SOFT_MAX(type=f32,ne=[2304,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3650 runs -   281.24 us/run -    92160 kB/run -  312.94 GB/s
  SOFT_MAX(type=f32,ne=[2432,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3450 runs -   294.41 us/run -    97280 kB/run -  315.57 GB/s
  SOFT_MAX(type=f32,ne=[2560,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3608 runs -   299.35 us/run -   102400 kB/run -  326.72 GB/s
  SOFT_MAX(type=f32,ne=[2688,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3443 runs -   309.37 us/run -   107520 kB/run -  331.97 GB/s
  SOFT_MAX(type=f32,ne=[2816,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3278 runs -   310.46 us/run -   112640 kB/run -  346.59 GB/s
  SOFT_MAX(type=f32,ne=[2944,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3420 runs -   318.13 us/run -   117760 kB/run -  353.64 GB/s
  SOFT_MAX(type=f32,ne=[3072,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                3288 runs -   325.97 us/run -   122880 kB/run -  360.16 GB/s
  SOFT_MAX(type=f32,ne=[3200,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2893 runs -   355.76 us/run -   128000 kB/run -  343.78 GB/s
  SOFT_MAX(type=f32,ne=[3328,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2783 runs -   368.96 us/run -   133120 kB/run -  344.76 GB/s
  SOFT_MAX(type=f32,ne=[3456,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2673 runs -   386.28 us/run -   138240 kB/run -  342.00 GB/s
  SOFT_MAX(type=f32,ne=[3584,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2820 runs -   386.72 us/run -   143360 kB/run -  354.29 GB/s
  SOFT_MAX(type=f32,ne=[3712,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2712 runs -   398.36 us/run -   148480 kB/run -  356.25 GB/s
  SOFT_MAX(type=f32,ne=[3840,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2628 runs -   404.81 us/run -   153600 kB/run -  362.69 GB/s
  SOFT_MAX(type=f32,ne=[3968,1024,5,1],mask=0,scale=1.000000,max_bias=0.000000):                2544 runs -   413.16 us/run -   158720 kB/run -  367.23 GB/s
  SOFT_MAX(type=f32,ne=[1024,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                1640 runs -   666.77 us/run -   163840 kB/run -  234.91 GB/s
  SOFT_MAX(type=f32,ne=[2048,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                1236 runs -   825.59 us/run -   327680 kB/run -  380.35 GB/s
  SOFT_MAX(type=f32,ne=[3072,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 828 runs -  1288.73 us/run -   491520 kB/run -  366.37 GB/s
  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 676 runs -  1491.34 us/run -   655360 kB/run -  423.12 GB/s
  SOFT_MAX(type=f32,ne=[5120,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 492 runs -  2035.04 us/run -   819200 kB/run -  388.58 GB/s
  SOFT_MAX(type=f32,ne=[6144,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 455 runs -  2374.11 us/run -   983040 kB/run -  400.53 GB/s
  SOFT_MAX(type=f32,ne=[7168,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 390 runs -  2760.87 us/run -  1146880 kB/run -  402.76 GB/s
  SOFT_MAX(type=f32,ne=[8192,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 338 runs -  3138.17 us/run -  1310720 kB/run -  405.98 GB/s
  SOFT_MAX(type=f32,ne=[9216,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                 299 runs -  3531.18 us/run -  1474560 kB/run -  406.90 GB/s
  SOFT_MAX(type=f32,ne=[10240,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                273 runs -  3891.35 us/run -  1638400 kB/run -  411.09 GB/s
  SOFT_MAX(type=f32,ne=[11264,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                247 runs -  4279.13 us/run -  1802240 kB/run -  412.23 GB/s
  SOFT_MAX(type=f32,ne=[12288,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                198 runs -  5103.53 us/run -  1966080 kB/run -  377.60 GB/s
  SOFT_MAX(type=f32,ne=[13312,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                176 runs -  5706.03 us/run -  2129920 kB/run -  367.11 GB/s
  SOFT_MAX(type=f32,ne=[14336,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                195 runs -  5480.89 us/run -  2293760 kB/run -  412.42 GB/s
  SOFT_MAX(type=f32,ne=[15360,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                182 runs -  5900.43 us/run -  2457600 kB/run -  411.40 GB/s
  SOFT_MAX(type=f32,ne=[16384,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                169 runs -  6307.22 us/run -  2621440 kB/run -  411.62 GB/s
  SOFT_MAX(type=f32,ne=[17408,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                156 runs -  6703.22 us/run -  2785280 kB/run -  411.51 GB/s
  SOFT_MAX(type=f32,ne=[18432,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                144 runs -  7123.97 us/run -  2949120 kB/run -  411.24 GB/s
  SOFT_MAX(type=f32,ne=[19456,4096,5,1],mask=0,scale=1.000000,max_bias=0.000000):                143 runs -  7476.99 us/run -  3112960 kB/run -  415.10 GB/s 

Restore the workgroup-size-512 case and use it for row sizes >1024.

Use unrollable loops for more iteration counts.
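
As a rough illustration of the two changes above, the host-side choice between the small and large shader variants might look like the following. This is a minimal Python sketch, not the actual ggml-vulkan code: the function names, the subgroup size of 32, and the reading of ">1024" as the row length `ne[0]` are assumptions made for illustration.

```python
import math

# Assumed values for illustration only; the real subgroup size is device-dependent.
SUBGROUP_SIZE = 32          # small workgroup, matched to the subgroup size
LARGE_WG_SIZE = 512         # the restored workgroup-size-512 variant
LARGE_WG_THRESHOLD = 1024   # rows longer than this use the large variant


def pick_workgroup_size(ncols: int) -> int:
    """Return the (hypothetical) workgroup size for a soft_max row of ncols elements."""
    return LARGE_WG_SIZE if ncols > LARGE_WG_THRESHOLD else SUBGROUP_SIZE


def iterations_per_thread(ncols: int) -> int:
    """Strided loop iterations each invocation needs to cover one row."""
    return math.ceil(ncols / pick_workgroup_size(ncols))


if __name__ == "__main__":
    for ncols in (64, 77, 256, 1024, 2048, 4096):
        print(ncols, pick_workgroup_size(ncols), iterations_per_thread(ncols))
```

Under these assumed values, a 77-element row needs 3 iterations per invocation and a 4096-element row needs 8. Small, fixed counts like these are what make a loop a candidate for unrolling.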
@0cc4m
Collaborator

0cc4m commented Nov 18, 2024

Thank you, this looks a lot better. It now nearly consistently outperforms even CUDA on my 3090.

Behaviour is a little different on the AMD Radeon Pro VII, where a number of tests run faster with the first commit than with the second. But the margin is small enough that I don't think it matters. I'll follow up with a test on a more modern AMD GPU to check how it runs on RDNA.

Intel is weird as usual, with very erratic performance. There are cases where the first commit outperforms the second significantly, and even cases where master outperforms both by a good margin. There's even a single case (1024, 4096, 5, 1) where both the first and second commits drop to single-digit performance for whatever reason. If you can see an easy pattern, we could switch by vendor, similar to the matrix multiplication shader selection, but it's not necessary for this PR.

My apologies for the wide plots:

AMD_Radeon_(TM)_Pro_VII_performance_comparison
Intel(R)_Arc(tm)_A770_Graphics_(DG2)_performance_comparison
NVIDIA_GeForce_RTX_3090_performance_comparison

@0cc4m
Collaborator

0cc4m commented Nov 18, 2024

Here are results from an AMD RX 6800 XT. They look similar to the Radeon Pro VII; the huge L3 cache seems to benefit it quite a bit until the buffers become too large and it gets limited by VRAM bandwidth.

Looking at the results, switching to the large shader at 1024 seems to be correct on Nvidia Ampere, but for AMD and Intel switching at 2048 might be better. I'm not sure if that would cause an issue with other test sizes, so for now I think it's fine as is.
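
A per-vendor switch point like the one floated here could be sketched as a small lookup table. This is hypothetical and not part of this PR: the vendor keys, the function name, and the fallback are invented for illustration, and the 2048 values for AMD and Intel are only the tentative numbers mentioned above.

```python
# Hypothetical per-vendor thresholds for switching to the large (512-wide) shader.
# Only the 1024 value reflects what this PR actually uses; 2048 is a tentative guess.
DEFAULT_THRESHOLD = 1024
VENDOR_THRESHOLDS = {
    "nvidia": 1024,
    "amd": 2048,
    "intel": 2048,
}


def use_large_shader(vendor: str, ncols: int) -> bool:
    """Return True if a row of ncols elements should use the large shader variant."""
    # Unknown vendors fall back to the threshold used in this PR.
    threshold = VENDOR_THRESHOLDS.get(vendor.lower(), DEFAULT_THRESHOLD)
    # Rows strictly longer than the threshold switch over, matching ">1024" above.
    return ncols > threshold


if __name__ == "__main__":
    for vendor in ("nvidia", "amd", "intel"):
        print(vendor, [use_large_shader(vendor, n) for n in (1024, 2048, 4096)])
```

Any such table would need to be checked against the other test sizes before use, since the thresholds here are untuned.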

Let me know if you want to change anything or if it's ready to merge.

AMD_Radeon_RX_6800_XT_performance_comparison

@jeffbolznv
Collaborator Author

Thanks for retesting. Looks like there are opportunities for more tuning on Intel, but I'd prefer to merge this as-is. I don't have Intel HW available to tune it myself.

@0cc4m 0cc4m merged commit b3e5859 into ggerganov:master Nov 19, 2024
54 checks passed