Multi Dimensional Statically Sized Arrays
// C-style
template <class T, int rank, unsigned D0,
unsigned D1,
unsigned D2,
unsigned D3>
SArray< T , rank , D0 [, D1 , D2 , D3] >();
// Fortran-style
template <class T, int rank, int L0, int U0,
int L1, int U1,
int L2, int U2,
int L3, int U3>
FSArray< T , rank , SB<L0,U0> [, SB<L1,U1> , SB<L2,U2> , SB<L3,U3>] >();
There are times when you want to have a small, low-overhead array that's just created on the local stack rather than allocated on the heap. E.g.,
- You want an array that's thread-private inside a parallel_for
- You want a small read-only array to be put into "constant" memory on GPUs in kernel launches
SArray (C-style) and FSArray (Fortran-style) are stack-allocated, statically sized objects. The idea is for them to be "small" (roughly on the order of kilobytes at most), though the user can do what they wish. The reason is that large arrays on the stack can overflow the stack in threaded code and can make code sanitizers like Valgrind struggle.
Data is accessed via the parentheses operator, just like dynamically allocated multi-dimensional Array objects. These also share the get_dimensions(), get_lbounds(), get_ubounds(), and data() methods with the dynamically allocated Array objects. Further, SArray and FSArray objects can be printed with std::cout and std::cerr.
Unlike Array objects, SArray and FSArray objects are always deep copied. Therefore, a = b copies the data itself if both are statically sized array objects. Also, since this data lives on the stack of whatever context it is created in, these objects don't really have a memory space. If you create one on the host and then use it in a device kernel, it is implicitly copied by value to the device in its entirety and then lives on the device. Its lifetime is tied rigidly to the scope in which it is created, just like a Fortran "automatic" array.
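The value semantics described above can be illustrated without YAKL at all. The following is a minimal sketch, assuming a hypothetical stand-in type named MiniStackArray that simply wraps a fixed-size C array; it is not YAKL's SArray, and the helper function names are made up for this illustration. It shows that assignment copies the elements and that a by-value lambda capture carries its own copy of the data.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a statically sized array; this is NOT YAKL's SArray.
template <class T, std::size_t N>
struct MiniStackArray {
  T data_[N];  // storage lives inside the object itself, so copying the object copies the data
  T       &operator()(std::size_t i)       { return data_[i]; }
  T const &operator()(std::size_t i) const { return data_[i]; }
};

// Returns true if "a = b" produced an independent copy of the data.
inline bool copy_is_deep() {
  MiniStackArray<float,3> a{};  // zero-initialized
  MiniStackArray<float,3> b{};
  b(0) = 1.f;  b(1) = 2.f;  b(2) = 3.f;
  a = b;        // compiler-generated assignment copies every element
  b(0) = 99.f;  // changing b afterwards must not affect a
  return a(0) == 1.f && a(1) == 2.f && a(2) == 3.f;
}

// Returns the value seen by a closure that captured the array by value.
inline float captured_value() {
  MiniStackArray<float,3> a{};
  a(1) = 2.f;
  auto kernel = [=]() { return a(1); };  // by-value capture copies the whole array into the closure
  a(1) = -5.f;                           // a later change to the original is invisible to the closure
  return kernel();
}
```

Calling copy_is_deep() yields true and captured_value() yields 2, because both the assignment and the capture duplicate the underlying storage rather than sharing a pointer to it.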
C-style SArray objects always have dimension bounds from 0,...,N-1 if N is passed as a template dimension size parameter.
using yakl::SArray;
int constexpr ord = 3;
// Create a 2-D stack array
SArray<float,2,ord,ord> recon_stencil_to_coefs;
for (int j=0; j < ord; j++) {
  for (int i=0; i < ord; i++) {
    recon_stencil_to_coefs(j,i) = ...;
  }
}
// In the following kernel, recon_stencil_to_coefs is copied by value due to the
// lambda expression. Therefore, in CUDA and HIP, it exists in GPU "constant" memory
// because it was passed by parameter.
// Further, "stencil" is on the kernel stack and each thread has its own copy of it
parallel_for( Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i) {
  SArray<float,1,ord> stencil;
  // Load stencil from larger array
  // Matrix multiply recon matrix by stencil to get coefs
});
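The "matrix multiply" step left as a comment in the kernel above could look roughly like the following. This is only a sketch under stated assumptions: Mini1D and Mini2D are hypothetical stand-ins written for this example (not YAKL's SArray), and the row-major index mapping is an assumption of the sketch rather than something this page specifies. All indices run from 0 to ord-1, matching the C-style bounds described above.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; these are NOT YAKL's SArray.
template <class T, unsigned D0>
struct Mini1D {
  T data_[D0];
  T       &operator()(int i)       { return data_[i]; }
  T const &operator()(int i) const { return data_[i]; }
};

template <class T, unsigned D0, unsigned D1>
struct Mini2D {
  T data_[D0*D1];
  // Row-major mapping (an assumption of this sketch): the right-most index varies fastest
  T       &operator()(int j, int i)       { return data_[j*D1 + i]; }
  T const &operator()(int j, int i) const { return data_[j*D1 + i]; }
};

// coefs = mat * stencil, with every index running from 0 to ord-1
template <class T, unsigned ord>
Mini1D<T,ord> matvec(Mini2D<T,ord,ord> const &mat, Mini1D<T,ord> const &stencil) {
  Mini1D<T,ord> coefs{};  // zero-initialized accumulator on the stack
  for (int j = 0; j < (int) ord; j++) {
    for (int i = 0; i < (int) ord; i++) {
      coefs(j) += mat(j,i) * stencil(i);
    }
  }
  return coefs;  // returned by value, so the caller gets its own copy
}

// Small demonstration: multiply 2*identity by (1,2,3) and sum the result.
inline float matvec_demo() {
  constexpr unsigned ord = 3;
  Mini2D<float,ord,ord> mat{};
  for (int j = 0; j < (int) ord; j++) { mat(j,j) = 2.f; }
  Mini1D<float,ord> stencil{};
  stencil(0) = 1.f;  stencil(1) = 2.f;  stencil(2) = 3.f;
  Mini1D<float,ord> coefs = matvec(mat, stencil);
  return coefs(0) + coefs(1) + coefs(2);
}
```

Because every object here has a compile-time size, nothing is allocated on the heap, which is the property that makes this pattern suitable for per-thread work inside a kernel.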
Fortran-style FSArray objects always have dimension bounds from 1,...,N if N is passed as a template dimension size parameter. This lower bound can be changed via the "Static Bounds" or SB class. E.g., FSArray<float,2,SB<-2,2>,10> arr; is the equivalent of the Fortran automatic array declaration real :: arr(-2:2,10). The SB<lower,upper> syntax is admittedly kind of ugly, but C++ limitations require it instead of a more desirable {lower,upper} syntax.
using yakl::FSArray;
using yakl::SB;
int constexpr ord = 3;
int constexpr hs = (ord-1)/2;
// Create a 2-D stack array
FSArray<float,2,ord,ord> recon_stencil_to_coefs;
for (int j=1; j <= ord; j++) {
  for (int i=1; i <= ord; i++) {
    recon_stencil_to_coefs(j,i) = ...;
  }
}
// In the following kernel, recon_stencil_to_coefs is copied by value due to the
// lambda expression. Therefore, in CUDA and HIP, it exists in GPU "constant" memory
// because it was passed by parameter.
// Further, "stencil" is on the kernel stack and each thread has its own copy of it
parallel_for( Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i) {
  FSArray<float,1,SB<-hs,hs>> stencil;
  // Load stencil from larger array
  // Matrix multiply recon matrix by stencil to get coefs
});
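To make the SB<lower,upper> bounds used above more concrete, here is a minimal sketch of how inclusive compile-time bounds can be turned into a storage offset. MiniBoundedArray is a hypothetical one-dimensional type invented for this example; it is not YAKL's FSArray or SB, and YAKL's actual implementation may differ.

```cpp
#include <cassert>

// Hypothetical 1-D sketch of inclusive static bounds; NOT YAKL's FSArray or SB.
template <class T, int L, int U>
struct MiniBoundedArray {
  static_assert(U >= L, "upper bound must not be below lower bound");
  static constexpr int size = U - L + 1;  // inclusive bounds, like Fortran's L:U
  T data_[size];
  // Shift the user-facing index by the lower bound to get the storage offset
  T       &operator()(int i)       { return data_[i - L]; }
  T const &operator()(int i) const { return data_[i - L]; }
};

// Bounds -2:2 hold five elements, mirroring real :: arr(-2:2) in Fortran
static_assert(MiniBoundedArray<float,-2,2>::size == 5, "inclusive bounds give U-L+1 elements");

// Fill a stencil indexed from -hs to hs and return the sum of its entries.
inline float sum_stencil() {
  constexpr int hs = 1;
  MiniBoundedArray<float,-hs,hs> stencil{};
  for (int i = -hs; i <= hs; i++) { stencil(i) = float(i + 2); }  // stores 1, 2, 3
  return stencil(-1) + stencil(0) + stencil(1);
}
```

Indexing a halo-centered stencil from -hs to hs in this way keeps the loop bounds symmetric around the cell of interest, which is the main convenience the Fortran-style bounds provide in the kernel above.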