-
Notifications
You must be signed in to change notification settings - Fork 11
Home
GPUSorting aims to bring state-of-the-art GPU sorting techniques from CUDA and make them available in portable compute shaders. All sorting algorithms included in GPUSorting utilize wave/warp/subgroup (referred to as "wave" hereon) level parallelism but are completely agnostic of wave size. Wave size specialization is entirely accomplished through runtime logic, instead of through shader compilation defines. This has a minimal impact on performance and significantly reduces the number of shader permutations. Although GPUSorting aims to be portable to any wave size supported by HLSL, [4, 128], due to hardware limitations, it has only been tested on wave sizes 4, 16, 32, and 64. You have been warned!
GPUSorting includes two sorting algorithms, both based on those found in the CUB library: DeviceRadixSort and OneSweep. The two algorithms are almost identical, except for the way that the inter-threadblock prefix sum of digit counts is performed. In DeviceRadixSort, the prefix sum is done through an older technique, "reduce-then-scan," whereas in OneSweep, it is accomplished using "chained-scan-with-decoupled-lookback." Because "chained-scan" relies on forward thread-progress guarantees, OneSweep is less portable than DeviceRadixSort, and DeviceRadixSort should be used whenever portability is a concern. Again, due to a lack of hardware, I cannot say exactly how portable OneSweep is, but as a general rule of thumb, OneSweep appears to run correctly on anything that is not mobile or WARP. Use OneSweep at your own risk; you have been warned!