Add DeviceRadixSort #82

Merged: 6 commits, Feb 19, 2024
Conversation

b0nes164 (Contributor) commented Dec 9, 2023

Add DeviceRadixSort (MIT License). Keep FidelityFX as a fallback, but modify host CPU code to use DeviceRadixSort. DeviceRadixSort still untested on AMD hardware. See issue #48.

Add DeviceRadixSort (MIT License). Keep FidelityFX as a fallback, but modify host CPU code to use DeviceRadixSort.
aras-p (Owner) commented Dec 9, 2023

Hmm, testing it out on a Mac (M1 Max): with every frame it seems to add more and more black results, as if it makes splats disappear (which would probably mean that some of the key/payload data gets overwritten with zero or something).

[Screenshot: 2023-12-09 at 17:30:59]

b0nes164 (Contributor, Author) commented Dec 10, 2023

What is the wave width of the M1? I found a bug affecting warp widths < 32.

Edit:
Reworked counting of bits less than a lane from shifting to using a mask. I suspect that DX12 and Metal handle shifting `1<<n` for `n > 31` differently, resulting in incorrect behavior in the Metal case.

Reworked counting of bits less than a lane from shifting to using a mask.
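For illustration, here is a rough sketch of what mask-based counting of ballot bits below a lane can look like for a `uint4` ballot; the function and variable names are placeholders, not the PR's exact code:

```hlsl
// Sketch only: count the set ballot bits belonging to lanes below this one.
// Building a per-word mask avoids shifting 1 << n with n >= 32, which is
// undefined and is what a shift-based version can run into.
uint CountBitsBelowLane(uint4 ballot, uint laneIndex)
{
    uint count = 0;
    [unroll]
    for (uint wavePart = 0; wavePart < 4; ++wavePart)
    {
        uint mask;
        if (laneIndex >= (wavePart + 1) * 32)
            mask = 0xFFFFFFFFu;                                // whole word lies below this lane
        else if (laneIndex <= wavePart * 32)
            mask = 0u;                                         // whole word lies at or above this lane
        else
            mask = (1u << (laneIndex - wavePart * 32)) - 1u;   // shift amount is 1..31 here
        count += countbits(ballot[wavePart] & mask);
    }
    return count;
}
```

Called with the results of `WaveActiveBallot()` and `WaveGetLaneIndex()`, this yields the number of set ballot bits in lanes with a lower index than the caller.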
aras-p (Owner) commented Dec 10, 2023

@b0nes164 the pushed fix does not help (and FWIW Apple M1 has warp width of 32). Wild guess: is the shader relying on D3D behavior of out-of-bounds buffer reads returning zeroes? Metal (and Vulkan, for that matter) does not have that guarantee, out of bounds buffer reads are undefined (and out of bounds buffer writes are too, and can hang/crash the system - whereas on D3D they are guaranteed to be ignored).

b0nes164 (Contributor, Author) commented Dec 11, 2023

No out-of-bounds hacks; input sizes that are not perfect multiples of the partition tile size are handled by a dedicated routine. In my mind, there is either an issue resulting from the cross-compilation, or there is a race condition I am not accounting for that does not present on my hardware.

Based on some light reading of the Metal language specification, it seems like there could be an issue with the Metal equivalent of WaveActiveBallot() because it returns a wrapper type rather than a uint4, but I have zero experience with Metal, cross-compiling, or how Unity specifically does the cross-compilation. I also don't have a Mac, so I can't troubleshoot the code either.

I'm still trying to think of possible fixes, but for now, this will have to be on hold.

b0nes164 (Contributor, Author) commented:

@aras-p, this might be a bit of a reach, but would it be possible to send you a standalone version of the sort to collect debug information, then send the results to me? Because LSD radix sorting performs a constant amount of work regardless of the input, the bug should reproduce in a standalone environment.

aras-p (Owner) commented Dec 27, 2023

@b0nes164 I do have a separate "testbed" project for GPU sorting codes, and I just plugged yours from this PR in there. At least on a Mac here, DeviceRadixSort works fine for smaller array sizes, and starts to have occasional failures at 10000 items and up, and "quite many" failures at sizes of several millions, i.e.

[Screenshot: 2023-12-27 at 10:21:27]

The actual errors (i.e. how many elements are not correctly sorted, and where they are located) are different from run to run though.

If you want to send me something else to test here, I'd be happy to!

b0nes164 (Contributor, Author) commented Jan 11, 2024

@aras-p apologies for the delay, but here is the debugging routine for the sort. The limited testing you did was actually very useful because it rules out issues with the inter-threadblock prefix sum. My running theory is that for whatever reason on the M1, stale values are occasionally being pulled from shared memory, causing non-catastrophic errors in the downsweep kernel. According to the HLSL specification:

> [I]n Direct3D 10 there is no synchronization of threads when writing to groupshared, so this means that each thread is limited to a single location in an array for writing.

Although we are using DX12, I am unaware of any improved guarantees on shared memory synchronization without barriers. My best guess is that on my hardware, shared memory accesses are kept coherent on a warp level, guaranteeing that fresh values from within the warp are being pulled from shared memory. Because this scenario only occurs in the "full partition" portion of the code, it is consistent with your testing results, where the bug only presents for input sizes of 7680 or greater.

The testing procedure:

  1. Attach the dispatcher to an empty game object.
  2. Attach the compute shader to the script.
  3. Run the scene; upon completion, "Test Complete" will be printed in the debug log (should take less than 30 seconds).
  4. Zip the resulting text files in the Unity project and post them back to me here.

Both files can be found under "MetalDebug" as public gists at:
https://gist.github.com/b0nes164

aras-p (Owner) commented Jan 11, 2024

> My best guess is that on my hardware, shared memory accesses are kept coherent on a warp level, guaranteeing that fresh values from within the warp are being pulled from shared memory

@b0nes164 "Something along those lines" was also my guess, i.e. some sort of coherency or memory model assumptions that happen to work out fine on your GPU architecture, but not necessarily on others.

Anyway, here are the test results I got on an M1 Max. I had to add `#pragma require wavebasic` to the compute shader, by the way, since without it Unity's shader compiler targets too old a Metal version and errors out on the wave ops.

MetalDebugOutput.zip

b0nes164 (Contributor, Author) commented:

Interesting: there was an issue with the inter-threadblock prefix sum after all, with the offending lines here:

g_scan[circularLaneShift + waveOffset] = g_scan[WaveGetLaneIndex() + waveOffset] +
WavePrefixSum(g_scan[WaveGetLaneIndex() + waveOffset]);
b_passHist[i + offset] = (WaveGetLaneIndex() ? g_scan[WaveGetLaneIndex() + waveOffset] : 0) + aggregate;

Fortunately this line can be fixed without adding an additional barrier. Although the downsweep kernel came out clean, I will be adding additional barriers as a safety measure.
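As a rough sketch of how that re-read can be avoided, assuming `circularLaneShift` rotates the lane index by one within the wave (so the second statement was meant to observe the freshly written exclusive value), the scanned value can be kept in registers; this is an illustration, not necessarily the exact committed code:

```hlsl
// Sketch only: hold the scanned value in a register so the global write does
// not re-read groupshared memory that another lane may have just overwritten.
const uint t = g_scan[WaveGetLaneIndex() + waveOffset];
const uint exclusive = WavePrefixSum(t);                  // exclusive wave scan, kept in a register
g_scan[circularLaneShift + waveOffset] = t + exclusive;   // inclusive value, circularly shifted
b_passHist[i + offset] = exclusive + aggregate;           // no dependence on the shifted write
```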

Fix the scan kernel pulling stale values from shared memory by moving operations into registers.

Fix a bug affecting wave sizes != 32 where a preprocessor macro incorrectly defined the size of the downsweep histograms.

Fix a latent (non-presenting) bug where the final barrier in the partial partition routine could exhibit undefined behaviour.
b0nes164 (Contributor, Author) commented Jan 11, 2024

Alright, the fix for several bugs has been pushed! Overall the performance didn't suffer too badly: about a ~5% reduction on extremely large inputs, and a negligible reduction on the lower end. I've been having trouble activating a Unity license on my AMD computer, so the algorithm is still untested on that front. I plan to make a headless DX12 implementation to circumvent the license issue, but it might be a couple more weeks until I get the chance to do it.

aras-p (Owner) commented Jan 11, 2024

Nice! That seems to work correctly now here. On the Bicycle 3DGS scene at "Medium" quality, total render frame time goes from 23.3 ms to 22.0 ms on an M1 Max. I'll do some more testing and merge this request sometime later (maybe once you or someone else does testing on AMD / Intel). Thanks!

if (bits == 1)
{
    for (int wavePart = 0; wavePart < waveParts; ++wavePart)
        g_localHist[key >> e_radixShift & RADIX_MASK] += countbits(waveFlags[wavePart]);


Wouldn't it be faster to do the reduction of `countbits` first and then do a single update to `g_localHist`?
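Sketched, the suggestion would look something like this (reusing the names from the hunk above; illustrative only):

```hlsl
// Reduce the ballot popcounts into a register first, then touch
// groupshared memory once per key.
uint total = 0;
for (int wavePart = 0; wavePart < waveParts; ++wavePart)
    total += countbits(waveFlags[wavePart]);
g_localHist[key >> e_radixShift & RADIX_MASK] += total;
```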

b0nes164 (Contributor, Author) replied Jan 21, 2024

It could be, but my concern in this case is the register count. Assuming the HLSL compiler produces a result similar to CUDA's, we are sitting right on the fence where registers start cutting into occupancy. Furthermore, this portion of the routine is only invoked for the last partition and has a negligible impact on the performance of the algorithm, whereas using more registers here might push the compiler to allocate more registers overall, which could severely impact the entire algorithm.

raphlinus commented:

I reviewed this for a work-in-progress survey of GPU sorting techniques. Overall it looks impressive, though I did find a couple of correctness concerns.

First, there is no wave barrier between `offset = g_localHist[...]` (line 334) and `g_localHist[...] += ...` (line 338). See gpuweb/gpuweb#4437 for a more in-depth discussion of this issue, which I believe motivates adding a subgroup (wave) barrier to WebGPU. HLSL does not have this intrinsic, though it's likely that most GPUs execute waves in lockstep, so the barrier is implicit. Even so, when the HLSL gets translated to Metal or Vulkan (both of which do have subgroup barriers), it is at least technically a data race. Using a full GroupMemoryBarrierWithGroupSync would satisfy the correctness concerns with some performance loss. On an M1 Max, on my (very similar) codebase, I observed about a 2% hit, but it may be larger on other GPUs. I'd certainly be interested in more data on this.
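Sketched, the conservative option places a full group barrier between the read and the read-modify-write; `digit` and `count` are placeholders here, not the PR's exact variables:

```hlsl
// Sketch of the conservative fix: a group-wide barrier in place of the
// wave-level barrier that HLSL lacks. Must be reached uniformly by the group.
const uint offset = g_localHist[digit];   // read the running digit total (used later)
GroupMemoryBarrierWithGroupSync();        // ensure all reads complete before any update
g_localHist[digit] += count;              // then apply this wave's contribution
```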

In an earlier draft of that survey, I complained that the wave size assigned to one compute shader may not be constant across shaders compiled on the same GPU (especially Intel), so the probe strategy is invalid. That's still true, but I think it's not a real worry, as wave size is guaranteed to be between 4 and 128 inclusive on all GPUs. This is documented in WaveGetLaneCount, and these limits are the same as Vulkan's. So I believe testing the wave size at all is a no-op.

I'm impressed that this algorithm is portable to different wave sizes, but I think this comes at a performance cost. My experiment was specialized to 32, and I was able to extract some gains:

  • wave ballot accumulation is a single uint, so no need to iterate over wavePart
  • WLMS accumulation can happen on all waves in parallel, no need to sequentialize by wave

Of course, the downside to the latter is that the histogram array needs storage proportional to digit size * number of waves. I was only able to get good results with 4 bit digits, so it's possible that doing fewer passes will outweigh the concerns here.

The other concern is of course that support for specializing by wave size is extremely spotty in GPU infrastructure; really only Vulkan 1.3 does it properly. I'm pushing for WebGPU to get this right; not sure how Unity handles it.

aras-p (Owner) commented Jan 21, 2024

> not sure how Unity handles it

Starting with Unity 2023 it seems that they made it easier to target the same shader for different wave sizes, but the docs are not exactly clear on how to use it (currently this Gaussian splatting project targets Unity 2022, because that's the current LTS version).

Starting with 2023, it looks like you can use #pragma multi_compile to compile your compute shader into several variants using the UNITY_DEVICE_SUPPORTS_WAVE_8 .. UNITY_DEVICE_SUPPORTS_WAVE_128 special defines, and then Unity will pick the right one at runtime (docs page). Before that, you could compile into several variants, but you would have to pick the right one manually, which is somewhat more cumbersome.
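As a rough sketch of what that could look like (the exact pragma spelling here is an assumption based on the keywords above; check the docs page before relying on it):

```hlsl
// Hypothetical sketch: one compiled variant per supported wave size.
#pragma multi_compile UNITY_DEVICE_SUPPORTS_WAVE_16 UNITY_DEVICE_SUPPORTS_WAVE_32 UNITY_DEVICE_SUPPORTS_WAVE_64

#if defined(UNITY_DEVICE_SUPPORTS_WAVE_16)
    #define WAVE_SIZE 16
#elif defined(UNITY_DEVICE_SUPPORTS_WAVE_32)
    #define WAVE_SIZE 32
#else   // UNITY_DEVICE_SUPPORTS_WAVE_64
    #define WAVE_SIZE 64
#endif
```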

//Each wave clears its own portion of shared memory
{
    const int waveHistEnd = (WAVE_INDEX + 1) * RADIX;
    for (int i = WaveGetLaneIndex() + WAVE_INDEX * RADIX; i < waveHistEnd; i += WaveGetLaneCount())


I'm really having trouble understanding how this will perform with wave sizes less than 32. In particular, with a wave size of 16, WAVE_INDEX will be 0..32, and the range of locations cleared will be 0..8192. But g_localHist is sized to 7680. The situation is much worse for a wave size of 8.

By my logic, the size of g_localHist should be RADIX * DS_THREADS / WaveGetLaneCount(), and I do not understand at all where this 7680 is coming from. Of course the problem with that is that WaveGetLaneCount() is not known at compile time.

b0nes164 (Contributor, Author) replied Jan 21, 2024

Yeah, no, this is definitely an issue. The partition size of 7680 was simply the largest partition tile size I could choose that would also fit into HLSL's 32 KB shared memory limit during the scattering.

It seems like multi-compiling might be the only option to cover all the partition sizes; see the comment below on why I was trying to avoid specialized shader variants.

b0nes164 (Contributor, Author) commented Jan 21, 2024

> First, there is no wave barrier between `offset = g_localHist[...]` (line 334) and `g_localHist[...] += ...` (line 338). See gpuweb/gpuweb#4437 for a more in-depth discussion of this issue, which I believe motivates adding a subgroup (wave) barrier to WebGPU. HLSL does not have this intrinsic, though it's likely that most GPUs execute waves in lockstep so the barrier is implicit.

There should definitely be a barrier there, but as you mentioned, there is no subgroup-level barrier available in HLSL. I have a mind to change that portion of the code entirely because of these issues.

> wave ballot accumulation is a single uint, so no need to iterate over wavePart
> WLMS accumulation can happen on all waves in parallel, no need to sequentialize by wave
>
> Of course, the downside to the latter is that the histogram array needs storage proportional to digit size * number of waves. I was only able to get good results with 4 bit digits, so it's possible that doing fewer passes will outweigh the concerns here.

As I mentioned in the comment, we only use sequential WLMS in the final partition, as a way of easily dealing with non-standard-size partitions. Because it is only invoked once per binning pass, the performance difference is negligible. (For example, with an input of size 2 million and a partition size of 7680, there are 261 partitions, and only one is processed sequentially.)

> In an earlier draft of that survey, I complained that the wave size assigned to one compute shader may not be constant across shaders compiled on the same GPU (especially Intel), so the probe strategy is invalid.

I did not know that, and will change this routine then. As Aras mentioned, I didn't find any documentation on Unity exposing the functionality to grab the adapter description as you would in Vulkan or DX12.

> I'm impressed that this algorithm is portable to different wave sizes, but I think this comes at a performance cost.

I thought this too at first, the primary downsides being less loop unrolling and more control logic, but after implementing it here I found the difference in performance to be negligible on my GPU. More testing is definitely required, but as Aras mentioned, the solution in Unity is essentially to maintain a specialized version of the code for every possible wave size, which hurts maintainability and increases shader compile times.

raphlinus commented:

A quick comment: I now see that I misread the code, and that the "last partition" is its own specialized thing. So I withdraw my assertion that sequentializing in this case has a meaningful performance impact.

Getting the wave sizes is a big problem across the GPU ecosystem, and WebGPU is struggling with it. My gut feeling is that a production implementation will need specialized versions when you do know the wave size, and a fallback implementation when you don't. But perhaps we will figure out how to make WLMS wave-size portable.

b0nes164 (Contributor, Author) commented:

As Raph pointed out, there are currently issues with the static allocation of shared memory, and as Aras pointed out, it does not seem like Unity 2022 has support for appropriately querying the GPU for the wave/warp/subgroup size of the device. So I will assume that #pragma multi_compile is off the table.

The constraint we are working under is that each warp needs a histogram whose size equals the number of radix digit bins, 256. As warps get smaller, we need more shared memory to accommodate the warp histograms. Shared memory has the added drawback that it must be statically allocated (at least in HLSL). Fortunately, there is a solution that should allow us to support waves down to size 8, with no changes to occupancy or thread count:

  1. We combine g_waveHist and g_localHist into one contiguous block of memory, giving us a total memory size of 32 KB.
  2. We then delay pulling the global histogram values into shared memory until after the WLMS is performed, allowing us to use the entire memory block for WLMS.
  3. Because we use a fixed partition size, we know that the total count for any given digit cannot exceed the size of the partition tile. Thus we can use uint16_t instead of uint32_t to store our digit accumulations, effectively doubling the capacity of the shared memory during WLMS.

This may be messy in HLSL, because it does not have reinterpret_cast, but it is worth a shot; a rough sketch of the packing idea follows.
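A hypothetical sketch of the packing in step 3, with illustrative names and sizes rather than the PR's actual layout: two waves share one 32-bit groupshared word, each owning 16 bits, which stays safe because no digit count can exceed the partition size.

```hlsl
#define RADIX 256
#define WAVES_PER_GROUP 16   // assumed upper bound on waves per threadgroup

// Two waves per 32-bit word: even waves use the low half, odd waves the high half.
groupshared uint g_packedHist[RADIX * WAVES_PER_GROUP / 2];

void AddDigitCount(uint digit, uint waveIndex)
{
    const uint shift = (waveIndex & 1) * 16;
    // Counts stay below the partition size (< 65536), so the low half can
    // never carry into the high half.
    InterlockedAdd(g_packedHist[digit + (waveIndex >> 1) * RADIX], 1u << shift);
}

uint ReadDigitCount(uint digit, uint waveIndex)
{
    const uint shift = (waveIndex & 1) * 16;
    return (g_packedHist[digit + (waveIndex >> 1) * RADIX] >> shift) & 0xFFFFu;
}
```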

Upsweep kernel:
Change the threadgroup size from 64 to 128 to accommodate larger minimum waves.

Fix a potential race condition caused by circular shifting during the exclusive prefix sum for wave widths greater than or equal to 16.

Add a fused Brent-Kung scan to support wave widths less than 16.

Scan kernel:
Remove wave size polling; change to fixed threadblock dispatch sizes.

Update the exclusive scan to accommodate the fixed threadblock dispatch.

Add support for wave widths less than sixteen.

Downsweep kernel:
Move loading of the pass and local histograms to later in the kernel to allow full use of the shared memory during WLMS.

Fix a potential race condition by adding a second barrier during the WLMS.

Fix a potential race condition caused by circular shifting during the exclusive prefix sum across reduced wave histograms.

Add support for wave width 8, with wave width 4 planned.

Add bitpacking to maximize shared memory utilization during WLMS for wave width 4.

Fix a potential race condition during the partial routine by adding a second barrier during WLMS.
b0nes164 (Contributor, Author) commented Jan 28, 2024

Some general notes about progress so far:

This is a preliminary commit to show the general direction of the algorithm. Each kernel now contains sections to handle different wave sizes, which is done logically by using WaveGetLaneCount(). In later versions this should be done with the compiler using UNITY_DEVICE_SUPPORTS_WAVE_XXX.

Each kernel contains a portion of code to support "big" waves, 16 and above, and "small" waves, below 16. We choose 16 as the dividing size because waves of 16 and above can easily perform the local prefix sums required for an 8-bit radix pass, $16^{2} = 256$, whereas waves of 8 and below are too small and require a more specialized approach: namely, a wave-width-based Brent-Kung scan with a fused "upsweep" and "downsweep."
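For reference, a generic (group-wide, non-fused) Brent-Kung exclusive scan looks roughly like the sketch below; the PR's version operates on wave aggregates and fuses the two phases, so this only illustrates the upsweep/downsweep structure, with illustrative names and sizes.

```hlsl
#define SCAN_SIZE 256   // assumed power-of-two scan length

groupshared uint g_scanBuf[SCAN_SIZE];

// Must be called by every thread in the group; g_scanBuf is assumed populated.
void BrentKungExclusiveScan(uint gtid)
{
    // Upsweep (reduce): build partial sums at power-of-two strides.
    for (uint stride = 1; stride < SCAN_SIZE; stride <<= 1)
    {
        GroupMemoryBarrierWithGroupSync();
        const uint idx = (gtid + 1) * stride * 2 - 1;
        if (idx < SCAN_SIZE)
            g_scanBuf[idx] += g_scanBuf[idx - stride];
    }

    // Clear the last element so the result is an exclusive scan.
    GroupMemoryBarrierWithGroupSync();
    if (gtid == 0)
        g_scanBuf[SCAN_SIZE - 1] = 0;

    // Downsweep: push the partial sums back down the tree.
    for (uint stride = SCAN_SIZE >> 1; stride > 0; stride >>= 1)
    {
        GroupMemoryBarrierWithGroupSync();
        const uint idx = (gtid + 1) * stride * 2 - 1;
        if (idx < SCAN_SIZE)
        {
            const uint t = g_scanBuf[idx - stride];
            g_scanBuf[idx - stride] = g_scanBuf[idx];
            g_scanBuf[idx] += t;
        }
    }
    GroupMemoryBarrierWithGroupSync();
}
```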

Although the code is still untested on wave widths other than 32, I have completed my DX12 implementation and am beginning testing in earnest. In particular, DX12 has access to the `[WaveSize(<numLanes>)]` attribute as part of Shader Model 6.6, which allows the user to force specific wave widths on kernels and will be extremely useful for testing.

This version of the code only supports down to a wave width of 8, but a wave width of 4 is very achievable simply by performing the WLMS serially. I will be adding this soon, and I expect this to still be a significant performance win over losing occupancy due to shared memory constraints.

I have increased the threadgroup sizes of the Upsweep and Scan kernels to 128 to make them more amenable to the maximum possible wave width. I also plan to reduce the threadgroup size of the Downsweep kernel to 256 to accommodate hardware with lower warp hosting capacity.

Upsweep Kernel:

Fix a bug in the prefix sum for wave sizes less than 16, where the aggregate was not calculated properly.

Fix a bug in the prefix sum for wave sizes less than 16, where the downsweep was off by one.

Scan Kernel:

Fix a bug in the prefix sum for wave sizes less than 16, where the aggregate was not calculated properly.

Fix a bug in the prefix sum for wave sizes less than 16, where the downsweep was off by one.

Downsweep Kernel:

Add support for wave size 4.

Split key loading and scattering into paths for wave sizes less than 16 and greater than or equal to 16.

Fix a bug where the wave flag was improperly initialized for waves less than 16.
b0nes164 (Contributor, Author) commented Jan 31, 2024

Test coverage results:

| Adapter | Wave Size | Test Results |
| --- | --- | --- |
| WARP | [4, 4] | Passing |
| Intel HD Graphics 620 | [16, 16] | Passing |
| Nvidia RTX 2080 Super | [32, 32] | Passing |
| AMD 7900 XT | allowed to vary from [32, 64] | Passing |
| AMD 7900 XT | locked to 64 | Passing |

| Algorithm version | $2^{28}$ uniform random 32-bit key-value pairs per second, DX12, 2080 Super |
| --- | --- |
| Dec 8th commit | $4.15 \times 10^{9}$ |
| Current | $4.17 \times 10^{9}$ |

Interestingly, shuffling things around and moving some values into registers increased performance, which is why throughput went up slightly despite significantly more barriers being used. Performance could be better on higher-end cards, because lowering partition sizes cuts into cache-line efficiency during the scatter, but that is unavoidable without compiler variants.

I am now confident that this version of the sort is ready to be merged.

b0nes164 (Contributor, Author) commented:

I have one last update in the works, which slightly improves performance, and refactors the code for readability. I plan to commit this in the next few days.

Removed all preprocessor macro functions.

Refactor some variable names, and cut line length to 100.

Unroll WLMS loop.

Use laneMaskLt instead of laneMaskLe for WLMS.

Remove redundant logic in scan kernel.

Fix silent bug during WLT16 exclusive prefix sum.
b0nes164 (Contributor, Author) commented:

Test coverage results:

| Adapter | Wave Size | Test Results |
| --- | --- | --- |
| WARP | [4, 4] | Passing |
| Intel HD Graphics 620 | [16, 16] | Passing |
| Nvidia RTX 2080 Super | [32, 32] | Passing |
| AMD 7900 XT | allowed to vary from [32, 64] | Passing |
| AMD 7900 XT | locked to 64 | Passing |

| Algorithm version | $2^{28}$ uniform random 32-bit key-value pairs per second, DX12 standalone, 2080 Super |
| --- | --- |
| Dec 8th commit | $4.15 \times 10^{9}$ |
| Jan 30th commit | $4.17 \times 10^{9}$ |
| Current | $4.44 \times 10^{9}$ |

aras-p (Owner) commented Feb 19, 2024

@b0nes164 very nice! Do you have some sort of "high level outline" of the algorithm used here? Like, is this similar to OneSweep at this point, or different in some way?

aras-p merged commit a1febce into aras-p:main on Feb 19, 2024
b0nes164 (Contributor, Author) commented Feb 19, 2024

@aras-p

> Do you have some sort of "high level outline" of the algorithm used here?

Not yet, but it is on my to do list.

> [I]s this similar to OneSweep at this point, or different in some way?

Yes, this is almost exactly the same as OneSweep except for one key difference: Reduce-Then-Scan (RTS) is used in place of Chained-Scan-with-Decoupled-Lookback (CSDL) to perform the inter-threadblock prefix sum over digit counts.

During a counting sort, each key is serially dependent on the counts of preceding keys, and on a GPU these key counts somehow have to be communicated between the threadblocks during the sort. RTS and CSDL are two approaches to solving this problem. The issue is that, in my experience, CSDL does not play nicely outside of CUDA. This is because Nvidia and AMD have different ways of performing the inter-threadblock communication that CSDL requires. According to Raph, Nvidia relies on forward thread progress guarantees, whereas AMD uses something called the Global Data Share (see section 1.2.2.2 of the RDNA 3 ISA), which is dedicated low-latency but globally coherent device-level memory. To the best of my knowledge, neither of these hardware-specific features is exposed in the public D3D12 API, which makes CSDL and OneSweep impractical.

That being said, I know from talking to a D3D12 dev that the Global Data Share is exposed internally, and is used for what I'm guessing is D3D12's mesh shader implementation, where CSDL would be used to compact culled primitives.

Also apologies, but I have one last update for DeviceRadixSort to fix a silent bug that passed my testing.

jacemiller commented:

Thank you to aras-p and b0nes164 for this effort!
