Linux Debug errors with CUDA 11.0 #569

ptheywood · 2021-06-23T08:31:32Z

CUDA 11.0 Debug builds on linux are reporting errors at runtime.

This is not an issue in 11.1 or 11.3 (unsure on 11.2 right now).

This occurs for the spatial 3D boids example, unsure if others are effected (I assume so).

This has been confirmed to occur on 2 separate machines (mavericks, blackmass).

/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/CUDAErrorChecking.h(37): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/gpu/CUDAAgent.cu(263): misaligned address
terminate called after throwing an instance of 'CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/CUDAErrorChecking.h(37): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/gpu/CUDAAgent.cu(263): misaligned address

The text was updated successfully, but these errors were encountered:

ptheywood · 2021-07-13T17:06:41Z

This occurs during validateIDCollisions, in CUDA 11.0 only, for sufficiently large populations (for circles_spatial3d, somewhere between 4000 and 4300 on a titan v?).

It is caught by the first cuda error check after a call to cub::DeviceReduce::Sum, so it may just be a cub issue? or an nvcc issue?

CUB in CUDA 11.0 should be 1.9.9, 11.1 is 1.9.10, while we are explicitly using 1.10.0 for the improved CMake support.

According to cuda-memcheck, it's Invalid __global__ write of size 16 in

cub::DeviceReduceKernel<cub::DeviceReducePolicy<unsigned int, unsigned int, int, cub::Sum>::Policy600, unsigned int*, unsigned int*, int, cub::Sum>(unsigned int, int, cub::Sum, cub::GridEvenShare<int>, cub::DeviceReducePolicy<unsigned int, unsigned int, int, cub::Sum>::Policy600)

Via cuda-gdb, this occurs at agent_reduce.cuh:259, which is during a vectorised read.

This continues to occur if i bump the Thrust/CUB version to 1.13.0 via cmake.

vec_items is the variable being written to. It is a reinterpret casted <VectorT*> from an InputT[ITEMS_PER_THREAD] which is a local memory / register array subject to the compiler. So feels like a compiler bug? (especially as it is not an issue in more recent CUDA versions?

CUDA 10.0 and 10.2 don't experience this issue. So calling it a CUDA 11.0 specific bug.

ptheywood added the bug label Jun 23, 2021

ptheywood mentioned this issue Jul 13, 2021

Known issue, cub error checking and cub version bump. #592

Merged

ptheywood added the wontfix label Jul 13, 2021

ptheywood closed this as completed in #592 Jul 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Linux Debug errors with CUDA 11.0 #569

Linux Debug errors with CUDA 11.0 #569

ptheywood commented Jun 23, 2021 •

edited

Loading

ptheywood commented Jul 13, 2021 •

edited

Loading

Linux Debug errors with CUDA 11.0 #569

Linux Debug errors with CUDA 11.0 #569

Comments

ptheywood commented Jun 23, 2021 • edited Loading

ptheywood commented Jul 13, 2021 • edited Loading

ptheywood commented Jun 23, 2021 •

edited

Loading

ptheywood commented Jul 13, 2021 •

edited

Loading