
cuda : optimize argmax #10441

Merged 6 commits into master on Nov 21, 2024
Conversation

slaren
Collaborator

@slaren slaren commented Nov 21, 2024

I was curious whether the CUDA implementation could be used as a reference for the Metal implementation and figured it could be optimized. The new kernel processes one row per CUDA block and uses multiple warps if the row size is big enough.

Also renamed the loop parameter of the warp shuffles to offset, since that is more accurate.

  Device description: NVIDIA GeForce RTX 3090 Ti
  Device memory: 24563 MB (23287 MB free)

PR:
  ARGMAX(type=f32,ne=[32,10,1,1]):            811008 runs -     1.23 us/run -        1 kB/run -    1.00 GB/s
  ARGMAX(type=f32,ne=[1024,10,1,1]):          573440 runs -     1.76 us/run -       40 kB/run -   21.66 GB/s
  ARGMAX(type=f32,ne=[32000,512,1,1]):         12075 runs -    83.35 us/run -    64002 kB/run -  732.34 GB/s

master:
  ARGMAX(type=f32,ne=[32,10,1,1]):            270336 runs -     3.71 us/run -        1 kB/run -    0.33 GB/s
  ARGMAX(type=f32,ne=[1024,10,1,1]):           24576 runs -    49.73 us/run -       40 kB/run -    0.77 GB/s
  ARGMAX(type=f32,ne=[32000,512,1,1]):           525 runs -  9097.55 us/run -    64002 kB/run -    6.71 GB/s
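For reference, the operation being benchmarked is a row-wise argmax. A minimal host-side reference (a sketch for illustration, not the CUDA kernel itself; the function name is hypothetical) that any implementation must match:

```cpp
#include <cfloat>
#include <cstdint>
#include <vector>

// Host-side reference for the benchmarked op: for each row of a row-major
// [ncols, nrows] f32 matrix, find the column index of the maximum value.
// Ties resolve to the first (lowest) column index.
std::vector<int32_t> argmax_rows(const std::vector<float> & x, int64_t ncols, int64_t nrows) {
    std::vector<int32_t> dst(nrows);
    for (int64_t row = 0; row < nrows; ++row) {
        float   maxval = -FLT_MAX;
        int32_t argmax = -1;
        for (int64_t col = 0; col < ncols; ++col) {
            const float val = x[row*ncols + col];
            if (val > maxval) {
                maxval = val;
                argmax = (int32_t) col;
            }
        }
        dst[row] = argmax;
    }
    return dst;
}
```

The CUDA kernel distributes the inner loop across the threads of one block per row and then reduces the per-thread partials.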

@github-actions github-actions bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs labels Nov 21, 2024
@JohannesGaessler JohannesGaessler self-requested a review November 21, 2024 07:56
Collaborator

@JohannesGaessler JohannesGaessler left a comment

Thank you. After I had already written the code, and especially after #10318, I've been thinking that I set the wrong priorities for this kernel. The only comment of mine that needs to be addressed is the one about undefined behavior; the rest are just suggestions.

ggml/src/ggml-cuda/argmax.cu (outdated comment, marked resolved)
Comment on lines +27 to 30
if (val > maxval) {
maxval = val;
argmax = col;
}
Collaborator

In retrospect it probably makes more sense to do it like this; conditional statements are problematic for code optimization since they prevent the compiler from reordering instructions, but there isn't much to do in one loop iteration anyway.

Collaborator Author

I couldn't measure a meaningful difference in performance, and this version should be easier to understand and maintain. Maybe on some hardware it would make a difference? I would also expect the compiler to be able to optimize simple conditionals like this, but that may be expecting too much.
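For illustration, the branchless alternative being alluded to could look like the following (a host-side sketch, not code from the PR; the helper name is hypothetical):

```cpp
#include <cfloat>

// Branchless form of the conditional update above: a select instead of a
// branch, so the compiler is free to reorder the surrounding instructions.
// Ties keep the earlier index because the comparison is strict ('>').
inline void update_argmax(float val, int col, float & maxval, int & argmax) {
    const bool gt = val > maxval;
    maxval = gt ? val : maxval;
    argmax = gt ? col : argmax;
}
```

Semantically the two forms are identical, which is why no performance difference was measurable here.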

ggml/src/ggml-cuda/argmax.cu (outdated comment, marked resolved)
if (warp_id == 0 && lane_id < n_warps) {
maxval = shared_maxval[lane_id];
argmax = shared_argmax[lane_id];
const unsigned int mask = (1u << n_warps) - 1u;
Collaborator

It's probably faster to just have all threads participate in the shuffle unconditionally.

Collaborator Author

The reason for doing this is that if there are fewer than 32 warps, some values are never written to shared memory, so they should not be read.

Collaborator

My suggestion would be to have the first warp clear the shared memory and then do a __syncthreads() before reading it again.
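A host-side emulation of that suggestion (names and container types hypothetical): padding the unwritten slots with the reduction identity lets all 32 lanes later participate in the shuffle unconditionally.

```cpp
#include <cfloat>
#include <utility>
#include <vector>

// Emulates the suggestion above: the per-warp partial results may fill fewer
// than 32 slots, so the missing slots are padded with the reduction identity
// (-FLT_MAX, -1). The final 32-wide reduction can then run unconditionally
// without ever reading an unwritten value.
std::pair<float, int> reduce_padded(std::vector<float> maxvals, std::vector<int> argmaxs) {
    maxvals.resize(32, -FLT_MAX); // identity never wins against real data
    argmaxs.resize(32, -1);
    float best   = -FLT_MAX;
    int   best_i = -1;
    for (int lane = 0; lane < 32; ++lane) {
        if (maxvals[lane] > best) {
            best   = maxvals[lane];
            best_i = argmaxs[lane];
        }
    }
    return {best, best_i};
}
```

On the GPU the padding corresponds to clearing the shared-memory slots before the warps write their partials.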

Comment on lines +63 to 65
if (warp_id == 0 && lane_id == 0) {
dst[row] = argmax;
}
Collaborator

My experience is that conditional returns/continues are faster than conditional writes, but it probably doesn't matter much here.


const dim3 blocks_dim(WARP_SIZE, 1, 1);
const int64_t num_blocks = nrows;
const int64_t num_threads = std::min<int64_t>(1024, (ne00 + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE);
Collaborator

This is going to be efficient for 32 <= ne00 <= 1024 and for ne00 >> 1024, but inefficient for 1024 < ne00 <= 4096. And in general, if you have a variable block size, you should make it a template parameter.
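The sizing rule from the snippet can be written as a small helper (hypothetical name, host-side sketch) to see where the inefficiency comes from: just past ne00 = 1024 every block already launches the full 1024 threads, so each thread loads only a handful of elements while still paying for the whole two-stage reduction.

```cpp
#include <algorithm>
#include <cstdint>

// Launch sizing from the snippet above: one CUDA block per row, with the
// thread count rounded up to a whole number of warps and capped at 1024.
// For 1024 < ne00 <= 4096 every block runs 1024 threads, but each thread
// only processes a few elements before the cross-warp reduction, so the
// reduction overhead dominates; a templated block size would avoid that.
int64_t argmax_num_threads(int64_t ne00) {
    const int64_t WARP_SIZE = 32;
    return std::min<int64_t>(1024, (ne00 + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE);
}
```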

Comment on lines 52 to 54
for (int offset = 16; offset > 0; offset >>= 1) {
const float val = __shfl_xor_sync(mask, maxval, offset, WARP_SIZE);
const int col = __shfl_xor_sync(mask, argmax, offset, WARP_SIZE);
Collaborator

The CUDA documentation says:

Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.

It doesn't explicitly mention __shfl_xor_sync, but I suspect that the same hardware is used and that the same limitations therefore apply.

Collaborator

Looking at the PTX documentation the behavior is definitely undefined.

Collaborator Author

I don't understand what the undefined behavior is here, can you elaborate? The mask is set so that only the threads participating in the sync are used.

Collaborator

The PTX documentation reads:

Note that results are undefined if a thread sources a register from an inactive thread or a thread that is not in membermask.

The problem is that even if you limit the participating threads via the mask, they still retrieve data from threads outside the mask. You would have to dynamically change the values of offset, and in the most general case, where n_warps is not a power of 2, you would need instructions other than __shfl_xor_sync.
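The problem can be illustrated on the host by enumerating which partner lanes a fixed-offset butterfly reduction reads from (plain C++, not the CUDA API; the function name is hypothetical):

```cpp
#include <vector>

// Enumerates the partner lanes that the fixed offsets 16, 8, ..., 1 of a
// butterfly (XOR) reduction would read from when only the first n_active
// lanes of a 32-lane warp hold valid data. Any partner >= n_active is a
// read from a lane outside the mask, which PTX leaves undefined. Unless
// all 32 lanes are active, the fixed offsets always produce such reads.
std::vector<int> out_of_mask_partners(int n_active) {
    std::vector<int> bad;
    for (int offset = 16; offset > 0; offset >>= 1) {
        for (int lane = 0; lane < n_active; ++lane) {
            const int partner = lane ^ offset;
            if (partner >= n_active) {
                bad.push_back(partner);
            }
        }
    }
    return bad;
}
```

For example, with only 8 active lanes the very first step (offset 16) already makes lane 0 read from lane 16, which is outside the mask.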

Collaborator Author

Ok, thanks. It should be fixed now.

Comment on lines +3464 to +3467
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {32, 1, 1, 1}));
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {100, 10, 1, 1}));
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {1024, 10, 1, 1}));
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {2000, 10, 1, 1}));
Collaborator

You may want to also check the case with ne01 and ne00 flipped, where whether or not the writes are coalesced makes a comparatively larger difference. But that would be the case with a very large batch size and few classes, and especially with language models that have large vocabulary sizes, I think it's not an important use case.

Collaborator Author

Do you mean testing for correctness or for performance? These cases are the ones used in eval mode only.

I also tested the performance with [512,32000]: it drops to 480 GB/s (compared to 730 GB/s with [32000,512]). There are surely more optimization opportunities, but I don't think it is worth spending more time on this at the moment.

Collaborator

I only meant performance. I wrote the code on master in the context of the ggml MNIST example, with an input shape of {10, 1000, 1, 1}. In principle, if you have a low number of classes but a large number of datapoints, the number of writes becomes significant and it would make sense to try to coalesce them (with the code on master there are likely also tail effects, because the number of CUDA blocks is reduced by a factor of 32). In the first place, I should have written the code with a use case like {256000, 128, 1, 1} in mind, since that is what is relevant for llama.cpp.

slaren and others added 2 commits November 21, 2024 13:32
static __global__ void argmax_f32(
const float * x, int32_t * dst, const int64_t ncols, const int64_t nrows) {
float maxval = -FLT_MAX;
int argmax = -1;
Collaborator

Looking at the code again, I think either 64 bits should be used for the ne00 dimension, or there should be an assert that 32 bits are enough.

Collaborator Author

@slaren slaren Nov 21, 2024

The output is int32, so it would definitely not work with ne00 larger than INT_MAX. In that case it might make more sense to add the assert to ggml_argmax instead; other arg* functions will have the same issue.
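The check being discussed amounts to something like the following sketch (not the actual ggml code; the function name is hypothetical):

```cpp
#include <climits>
#include <cstdint>

// The guard discussed above: argmax stores column indices as int32, so any
// ne00 beyond INT32_MAX cannot be represented in the output. Placing the
// check at the ggml_argmax level (rather than in one backend) would also
// cover argsort and the other arg* operators.
bool argmax_shape_ok(int64_t ne00) {
    return ne00 <= INT32_MAX;
}
```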

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Nov 21, 2024
@slaren slaren merged commit a5e4759 into master Nov 21, 2024
55 checks passed
@slaren slaren deleted the sl/cuda-opt-argmax branch November 21, 2024 17:18
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Nov 21, 2024
* cuda : optimize argmax

* remove unused parameter

ggml-ci

* fixup : use full warps

ggml-ci

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <[email protected]>

* fix ub

* ggml : check ne00 <= INT32_MAX in argmax and argsort

---------

Co-authored-by: Johannes Gäßler <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024