cuda : optimize argmax #10441
Conversation
Thank you. After I had already written the code, and especially after #10318, I've been thinking that I set the wrong priorities for this kernel. I think the only comment of mine that needs to be addressed is the one about undefined behavior; the rest are only suggestions.
if (val > maxval) {
    maxval = val;
    argmax = col;
}
In retrospect it probably makes more sense to do it like this; conditional statements are problematic for code optimization since they prevent the compiler from reordering instructions, but there isn't much to do in one loop iteration anyway.
I couldn't measure a meaningful difference in performance, and this should be easier to understand and maintain. Maybe on some hardware it would make a difference? I would also expect the compiler to be able to optimize simple conditionals like this, but that may be expecting too much.
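For comparison, the branchless alternative being discussed could be sketched as a small helper (hypothetical, not code from this PR):

    // sketch of a hypothetical branchless update, for comparison only
    static __device__ __forceinline__ void update_argmax(const float val, const int col, float & maxval, int & argmax) {
        const bool gt = val > maxval; // single predicate
        maxval = gt ? val : maxval;   // compiles to selects rather than a branch
        argmax = gt ? col : argmax;
    }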
ggml/src/ggml-cuda/argmax.cu (outdated)
if (warp_id == 0 && lane_id < n_warps) {
    maxval = shared_maxval[lane_id];
    argmax = shared_argmax[lane_id];
    const unsigned int mask = (1u << n_warps) - 1u;
It's probably faster to just have all threads participate in the shuffle unconditionally.
The reason for doing this is that if there are fewer than 32 warps, some slots will never be written to shared memory, so they should not be used.
My suggestion would be to just have the first warp clear the memory and then do a __syncthreads before reading again.
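A minimal sketch of that suggestion, reusing the variable names from the snippets above and assuming at most 32 warps per block:

    // sketch: warp 0 fills all 32 per-warp slots with neutral values first,
    // so slots of warps that do not exist still hold valid data
    if (warp_id == 0) {
        shared_maxval[lane_id] = -FLT_MAX;
        shared_argmax[lane_id] = -1;
    }
    __syncthreads();

    // lane 0 of every warp stores its partial result
    if (lane_id == 0) {
        shared_maxval[warp_id] = maxval;
        shared_argmax[warp_id] = argmax;
    }
    __syncthreads();

    // warp 0 can now read all 32 slots and reduce with the full warp mask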
if (warp_id == 0 && lane_id == 0) {
    dst[row] = argmax;
}
My experience is that conditional returns/continues are faster than conditional writes but it probably doesn't matter much.
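In other words, something like this (a sketch of the alternative, not a required change):

    // conditional return: the other threads exit early,
    // and the final write itself is unconditional
    if (warp_id != 0 || lane_id != 0) {
        return;
    }
    dst[row] = argmax;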
const dim3 blocks_dim(WARP_SIZE, 1, 1);
const int64_t num_blocks  = nrows;
const int64_t num_threads = std::min<int64_t>(1024, (ne00 + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE);
This is going to be efficient for 32 <= ne00 <= 1024 and ne00 >> 1024 but inefficient for 1024 < ne00 <= 4096. And in general, if you have a variable block size you should make it a template parameter.
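A hedged sketch of what that could look like (hypothetical names, not the code in this PR): the block size becomes a compile-time constant and the host picks a specialization based on ne00.

    // sketch: block size as a template parameter so the compiler can fully
    // unroll the per-thread loop and the reduction for each specialization
    template <int block_size>
    static __global__ void argmax_f32(const float * x, int32_t * dst, const int64_t ncols) {
        // ... per-row argmax using block_size threads (body omitted in this sketch) ...
    }

    // hypothetical host-side dispatch on the row size
    static void launch_argmax(const float * x, int32_t * dst, const int64_t ncols, const int64_t nrows, cudaStream_t stream) {
        const dim3 grid((unsigned int) nrows, 1, 1);
        if (ncols <= 128) {
            argmax_f32< 128><<<grid,  128, 0, stream>>>(x, dst, ncols);
        } else if (ncols <= 256) {
            argmax_f32< 256><<<grid,  256, 0, stream>>>(x, dst, ncols);
        } else if (ncols <= 512) {
            argmax_f32< 512><<<grid,  512, 0, stream>>>(x, dst, ncols);
        } else {
            argmax_f32<1024><<<grid, 1024, 0, stream>>>(x, dst, ncols);
        }
    }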
ggml/src/ggml-cuda/argmax.cu (outdated)
for (int offset = 16; offset > 0; offset >>= 1) {
    const float val = __shfl_xor_sync(mask, maxval, offset, WARP_SIZE);
    const int   col = __shfl_xor_sync(mask, argmax, offset, WARP_SIZE);
The CUDA documentation says:
"Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined."
It doesn't explicitly mention __shfl_xor_sync, but I suspect that the same hardware is used and that thus the same limitations apply.
Looking at the PTX documentation, the behavior is definitely undefined.
I don't understand what the undefined behavior is here, can you elaborate? The mask is set such that only the threads participating in the sync are used.
The PTX documentation reads:
"Note that results are undefined if a thread sources a register from an inactive thread or a thread that is not in membermask."
The problem is that even if you limit the participating threads via the mask, they are still retrieving data from threads outside the mask. You would have to dynamically change the values of offset, and in the most general case where n_warps is not a power of 2 you would need to use instructions other than __shfl_xor_sync.
Ok, thanks. It should be fixed now.
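For reference, the well-defined pattern is roughly the following, a sketch assuming the unused shared-memory slots are initialized as suggested above so that all 32 lanes of warp 0 can participate with the full mask:

    // sketch: every lane of warp 0 reads a slot (slots of missing warps hold
    // -FLT_MAX / -1), so the whole warp takes part in the shuffle
    if (warp_id == 0) {
        maxval = shared_maxval[lane_id];
        argmax = shared_argmax[lane_id];
        #pragma unroll
        for (int offset = 16; offset > 0; offset >>= 1) {
            const float val = __shfl_xor_sync(0xFFFFFFFF, maxval, offset, WARP_SIZE);
            const int   col = __shfl_xor_sync(0xFFFFFFFF, argmax, offset, WARP_SIZE);
            if (val > maxval) {
                maxval = val;
                argmax = col;
            }
        }
    }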
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {32, 1, 1, 1}));
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {100, 10, 1, 1}));
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {1024, 10, 1, 1}));
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {2000, 10, 1, 1}));
You may want to also check the case with ne01 and ne00 flipped, where whether or not the writes are coalesced makes a comparatively larger difference. But that would be the case with a very large batch size and few classes, and especially with language models that have large vocabulary sizes I think it's not an important use case.
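For illustration, such a flipped case would be along the lines of the following (hypothetical shape, not a test case from this PR):

    // hypothetical flipped shape: few classes, many rows
    test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {10, 2000, 1, 1}));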
Do you mean testing for correctness or performance? These cases are only the ones used in eval mode.
I also tested the performance with [512, 32000], and it drops to 480 GB/s (compared to 730 GB/s with [32000, 512]). There are surely more optimization opportunities, but I don't think it is worth spending more time on this at the moment.
I only meant performance. I wrote the code on master in the context of the ggml MNIST example with an input shape of {10, 1000, 1, 1}. In principle, if you have a low number of classes but a large number of datapoints, the number of writes should become significant and it would make sense to try and coalesce them (but with the code on master there are likely also issues with tail effects because the number of CUDA blocks is reduced by a factor of 32). In the first place, I should have written code with a use case like {256000, 128, 1, 1} in mind since that is going to be relevant for llama.cpp.
Co-authored-by: Johannes Gäßler <[email protected]>
static __global__ void argmax_f32(
        const float * x, int32_t * dst, const int64_t ncols, const int64_t nrows) {
    float maxval = -FLT_MAX;
    int   argmax = -1;
Looking at the code again, I think either 64 bit should be used for the ne00 dimension or there should be an assert that 32 bit is enough.
The output is int32, so it would definitely not work with ne00 larger than INT_MAX. In that case it might make more sense to add the assert to ggml_argmax instead. Other arg* functions will have the same issue.
* cuda : optimize argmax
* remove unused parameter
ggml-ci
* fixup : use full warps
ggml-ci
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <[email protected]>
* fix ub
* ggml : check ne00 <= INT32_MAX in argmax and argsort
---------
Co-authored-by: Johannes Gäßler <[email protected]>
I was curious about the CUDA implementation to see if it could be used as a reference for the Metal implementation and figured it could be optimized. It processes one row per group and uses multiple warps if the row size is big enough.
Also renamed the loop parameter of the warp shuffles to offset, since that should be more accurate.
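Concretely, the per-row scan part of that structure looks roughly like this (a sketch with names adapted from the snippets above; the actual kernel may differ in details):

    // sketch: one block per row; each thread strides over the row's columns
    const int64_t  row  = blockIdx.x;
    const float  * rowx = x + row * ncols;

    float maxval = -FLT_MAX;
    int   argmax = -1;

    for (int64_t col = threadIdx.x; col < ncols; col += blockDim.x) {
        const float val = rowx[col];
        if (val > maxval) {
            maxval = val;
            argmax = (int) col;
        }
    }
    // ... followed by the warp/block reduction discussed in the review comments above ...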