vulkan: Optimize soft_max #10301
Conversation
Large soft_max could already saturate memory, but small/medium sizes were pretty slow. The bulk of the gains for them comes from using a smaller workgroup size, and making the workgroup size match the subgroup size also makes the barriers much cheaper.

Cache some values in locals to avoid refetching/recomputing. And stamp out a few "template instantiations" so smaller cases will fully unroll.

Add a missing early return for OOB rows. This happens when there are more than 512 rows and the dispatch is 512 x H.

These sizes I benchmarked came from a stable diffusion network I was looking at a while back.
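For illustration, here is a minimal C++ rendering of the indexing behind that early return; the actual change is in the GLSL soft_max shader, and all names below are hypothetical.

```cpp
#include <cstdint>

// Hypothetical C++ mirror of the shader-side guard described above; the real
// fix lives in the GLSL soft_max shader and these names are illustrative.
// With a 512 x H dispatch, the row index is rebuilt from the 2D workgroup id,
// and any workgroup whose row lands past the last real row must bail out
// before touching memory.
void soft_max_row(uint32_t wg_id_x, uint32_t wg_id_y, uint32_t nrows) {
    const uint32_t rowx = wg_id_y * 512 + wg_id_x;
    if (rowx >= nrows) {
        return; // the previously missing early return for OOB rows
    }
    // ... per-row max / exp / sum / normalize passes follow ...
}
```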
I can confirm this significantly improves softmax performance for small and medium sizes, but I also see a regression for large sizes. On your GPU and on my AMD Radeon Pro VII it seems to be minimal, but in the 4096, 4096, 5, 1 tests on the Intel A770 and Nvidia RTX 3090 I see a big difference. Do you have an idea what causes this?
The big difference between Ampere and Ada that comes to mind is the much larger L2 cache size on Ada. I'm guessing that with the larger workgroup size there are fewer rows being processed at a time and it can hit in the smaller L2 cache. I probably need to bring back the larger block size and use it when the rows are large enough. I have an RTX 3070 I can try this out on tomorrow.
I haven't had a chance to test on the RTX 3070, but I tested a wider variety of sizes on a 4070 and was able to see some similar effects. I brought back the 512 workgroup size for larger rows and added some of the missing cases to unroll, and perf is a lot more consistent now. You can see that with the previous commit there were perf dips where cases to unroll were missing, and that for large enough rows perf fell to about 1/3 of the bandwidth limit, presumably due to not fitting in the cache. I tested with this code:
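The patch itself isn't quoted in the thread; a plausible reconstruction, assuming the test_soft_max perf cases in tests/test-backend-ops.cpp, would sweep row length and row count roughly like this:

```cpp
// Plausible reconstruction only -- the actual benchmark patch isn't quoted
// in the thread. This assumes the test_soft_max cases in
// tests/test-backend-ops.cpp (constructor arguments may differ by version)
// and sweeps row length (ne0) and row count (ne1) around the sizes
// discussed above.
for (int64_t ne0 : {16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192}) {
    for (int64_t ne1 : {16, 64, 256, 1024, 4096}) {
        test_cases.emplace_back(new test_soft_max(GGML_TYPE_F32, {ne0, ne1, 1, 1}));
    }
}
```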
Restore the workgroup size of 512 case, use it for >1024. Use unrollable loops for more iteration counts.
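As a sketch of what that selection might look like on the host side (names hypothetical; the real logic lives in ggml-vulkan.cpp):

```cpp
// Hedged sketch of the cutoff the commit message describes; the pipeline
// names are made up and the real selection lives in ggml-vulkan.cpp. Rows
// longer than 1024 go back to the restored 512-wide workgroup variant;
// everything else uses the subgroup-sized variant this PR introduced.
static vk_pipeline choose_soft_max_pipeline(uint32_t ne00 /* row length */) {
    return ne00 > 1024 ? pipeline_soft_max_f32_wg512
                       : pipeline_soft_max_f32_small;
}
```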
Thank you, this looks a lot better. It now nearly consistently outperforms even CUDA on my 3090.

Behaviour is a little different on the AMD Radeon Pro VII, where there's a bunch of tests with the first commit outperforming the second. But the margin is close enough that I don't think it matters. I'll follow up with a test on a more modern AMD GPU to check how it runs on RDNA.

Intel is weird as usual, with very erratic performance. There are cases where the first commit outperforms the second significantly, and even cases where master outperforms both by a good margin. There's even a single case (1024, 4096, 5, 1) where both the first and second commit drop to single-digit performance for whatever reason. If you can see an easy pattern, we could switch by vendor, similar to the matrix multiplication shader selection, but it's not necessary for this PR.

My apologies for the wide plots.
Here are results from an AMD RX 6800 XT. It looks similar to the Radeon Pro VII: the huge L3 cache seems to benefit it quite a bit until the buffers become too large and it gets limited by VRAM bandwidth.

Looking at the results, switching to the large shader at 1024 seems to be correct on Nvidia Ampere, but for AMD and Intel switching at 2048 might be better. I'm not sure if that would cause an issue with other test sizes, so for now it's fine as is, I think. Let me know if you want to change anything or if it's ready to merge.
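If vendor-specific tuning were added later, it could be as small as a per-vendor cutoff. This is purely hypothetical; nothing like it is in this PR:

```cpp
// Purely hypothetical follow-up, not part of this PR: a per-vendor cutoff
// for switching to the large (512-wide) soft_max shader, using the standard
// PCI vendor ids that ggml-vulkan.cpp already checks elsewhere.
static uint32_t soft_max_large_row_cutoff(uint32_t vendor_id) {
    switch (vendor_id) {
        case 0x10DE: return 1024; // NVIDIA: Ampere prefers the early switch
        case 0x1002:              // AMD: big L3 keeps the small variant fast
        case 0x8086: return 2048; // Intel: erratic, but 2048 looked better
        default:     return 1024;
    }
}
```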
Thanks for retesting. Looks like there are opportunities for more tuning on Intel, but I'd prefer to merge this as-is. I don't have Intel HW available to tune it myself.