
copyto! does not support CPU SubArrays #422

Open
ww1g11 opened this issue Oct 18, 2019 · 5 comments

Comments

@ww1g11
ww1g11 commented Oct 18, 2019

Describe the bug
`copyto!` does not work for a SubArray when scalar indexing is disallowed.

To Reproduce
The Minimal Working Example (MWE) for this bug:

using CuArrays
CuArrays.allowscalar(false)
N = 100
u_c = rand(N, 2);
u_d = CuArrays.CuArray(ones(N));
u_v = view(u_c, :, 1);
copyto!(u_d, u_v)  # works
copyto!(u_v, u_d)  # does not work
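For reference, one workaround that avoids the failing call is to stage the device array through a plain CPU array first (this sketch assumes the MWE above; `Array(u_d)` performs a single bulk device-to-host copy):

```julia
# Workaround sketch, assuming the MWE above: materialize the device array on
# the host, then copy host-to-host into the view. Both steps are bulk copies,
# so no scalar GPU indexing is triggered.
copyto!(u_v, Array(u_d))
```

The cost is one extra host allocation per transfer, but it sidesteps the missing SubArray method entirely.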

Environment details (please complete this section)
Details on Julia:

Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)

Status `~/.julia/environments/v1.1/Project.toml`
  [3a865a2d] CuArrays v1.2.1
@david-macmahon

I can see why avoiding scalar indexing for copyto! is non-trivial with views in general, but it would be great to have copyto! avoid scalar indexing for contiguous views. One of SubArray's type parameters (L) indicates whether the SubArray supports "fast linear indexing". I don't know whether that is exactly the same as being strictly contiguous in memory (i.e. in a DMA sense), but it would be nice if copyto! could somehow be a bit more discerning about when to fall back to scalar indexing for views.
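The two properties are in fact different, which can be illustrated in plain CPU Julia (no GPU needed): the `L` parameter only promises a fast linear-indexing *formula*, whereas the unexported `Base.iscontiguous` checks actual memory contiguity:

```julia
A = rand(4, 4)

v1 = view(A, :, 2)      # a full column: unit stride, contiguous
v2 = view(A, 2, :)      # a row: linear formula with stride 4, NOT contiguous
v3 = view(A, 1:2:3, 1)  # strided range: fast-linear, NOT contiguous

for v in (v1, v2, v3)
    L = typeof(v).parameters[end]       # the "fast linear indexing" flag
    println((L, Base.iscontiguous(v)))  # L is true for all three,
end                                     # but only v1 is contiguous
```

So dispatching on `L == true` alone would over-match; a contiguity check like `Base.iscontiguous` (unexported, but present in Base) is the stricter test a DMA-style copy would need.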

@maleadt
Member

maleadt commented Aug 6, 2022

The problem is that we don't want to generalize our copy routines to support all kinds of CPU arrays, because doing so introduces a whole bunch of ambiguities (we tried in the past, #284). At the same time, we don't want to duplicate the copy routines in GPUArrays (

# memory copying

function Base.copy!(dst::AbstractGPUVector, src::AbstractGPUVector)
    axes(dst) == axes(src) || throw(ArgumentError(
        "arrays must have the same axes for `copy!`. consider using `copyto!` instead"))
    copyto!(dst, src)
end

## basic linear copies of identically-typed memory

# expects the GPU array type to have linear `copyto!` methods (i.e. accepting an integer
# offset and length) from and to CPU arrays and between GPU arrays.

for (D, S) in ((AnyGPUArray, Array),
               (Array, AnyGPUArray),
               (AnyGPUArray, AnyGPUArray))
    @eval begin
        function Base.copyto!(dest::$D{<:Any, N}, rdest::UnitRange,
                              src::$S{<:Any, N}, ssrc::UnitRange) where {N}
            drange = CartesianIndices((rdest,))
            srange = CartesianIndices((ssrc,))
            copyto!(dest, drange, src, srange)
        end

        function Base.copyto!(dest::$D, d_range::CartesianIndices{1},
                              src::$S, s_range::CartesianIndices{1})
            len = length(d_range)
            if length(s_range) != len
                throw(ArgumentError("Copy range needs same length. Found: dest: $len, src: $(length(s_range))"))
            end
            len == 0 && return dest
            d_offset = first(d_range)[1]
            s_offset = first(s_range)[1]
            copyto!(dest, d_offset, src, s_offset, len)
        end

        Base.copyto!(dest::$D, src::$S) = copyto!(dest, 1, src, 1, length(src))
    end
end

# kernel-based variant for copying between wrapped GPU arrays

function linear_copy_kernel!(ctx::AbstractKernelContext, dest, dstart, src, sstart, n)
    i = linear_index(ctx)-1
    if i < n
        @inbounds dest[dstart+i] = src[sstart+i]
    end
    return
end

function Base.copyto!(dest::AnyGPUArray, dstart::Integer,
                      src::AnyGPUArray, sstart::Integer, n::Integer)
    n == 0 && return dest
    n < 0 && throw(ArgumentError(string("tried to copy n=", n, " elements, but n should be nonnegative")))
    destinds, srcinds = LinearIndices(dest), LinearIndices(src)
    (checkbounds(Bool, destinds, dstart) && checkbounds(Bool, destinds, dstart+n-1)) || throw(BoundsError(dest, dstart:dstart+n-1))
    (checkbounds(Bool, srcinds, sstart)  && checkbounds(Bool, srcinds, sstart+n-1))  || throw(BoundsError(src, sstart:sstart+n-1))
    gpu_call(linear_copy_kernel!,
             dest, dstart, src, sstart, n;
             elements=n)
    return dest
end

# variants that materialize the GPU wrapper before copying from or to the CPU

function Base.copyto!(dest::Array, dstart::Integer,
                      src::WrappedGPUArray, sstart::Integer, n::Integer)
    n == 0 && return dest
    temp = similar(parent(src), n)
    copyto!(temp, 1, src, sstart, n)
    copyto!(dest, dstart, temp, 1, n)
    return dest
end

function Base.copyto!(dest::WrappedGPUArray, dstart::Integer,
                      src::Array, sstart::Integer, n::Integer)
    n == 0 && return dest
    temp = similar(parent(dest), n)
    copyto!(temp, 1, src, sstart, n)
    copyto!(dest, dstart, temp, 1, n)
    return dest
end

# variants that convert values on the CPU when there's a type mismatch
#
# we prefer to convert on the CPU where there's typically more memory / less memory pressure
# to quickly perform these very lightweight conversions

function Base.copyto!(dest::Array{T}, dstart::Integer,
                      src::AnyGPUArray{U}, sstart::Integer,
                      n::Integer) where {T,U}
    n == 0 && return dest
    temp = Vector{U}(undef, n)
    copyto!(temp, 1, src, sstart, n)
    copyto!(dest, dstart, temp, 1, n)
    return dest
end

function Base.copyto!(dest::AnyGPUArray{T}, dstart::Integer,
                      src::Array{U}, sstart::Integer, n::Integer) where {T,U}
    n == 0 && return dest
    temp = Vector{T}(undef, n)
    copyto!(temp, 1, src, sstart, n)
    copyto!(dest, dstart, temp, 1, n)
    return dest
end

## generalized blocks of heterogeneous memory

function cartesian_copy_kernel!(ctx::AbstractKernelContext, dest, dest_offsets, src, src_offsets, shape, length)
    i = linear_index(ctx)
    if i <= length
        idx = CartesianIndices(shape)[i]
        @inbounds dest[idx + dest_offsets] = src[idx + src_offsets]
    end
    return
end

function Base.copyto!(dest::AnyGPUArray{<:Any, N}, destcrange::CartesianIndices{N},
                      src::AnyGPUArray{<:Any, N}, srccrange::CartesianIndices{N}) where {N}
    shape = size(destcrange)
    if shape != size(srccrange)
        throw(ArgumentError("Ranges don't match their size. Found: $shape, $(size(srccrange))"))
    end
    len = length(destcrange)
    len == 0 && return dest

    dest_offsets = first(destcrange) - oneunit(CartesianIndex{N})
    src_offsets = first(srccrange) - oneunit(CartesianIndex{N})
    gpu_call(cartesian_copy_kernel!,
             dest, dest_offsets, src, src_offsets, shape, len;
             elements=len)
    dest
end

for (dstTyp, srcTyp) in (AbstractGPUArray=>Array, Array=>AbstractGPUArray)
    @eval function Base.copyto!(dst::$dstTyp{T,N}, dstrange::CartesianIndices{N},
                                src::$srcTyp{T,N}, srcrange::CartesianIndices{N}) where {T,N}
        isempty(dstrange) && return dst
        if size(dstrange) != size(srcrange)
            throw(ArgumentError("source and destination must have same size (got $(size(srcrange)) and $(size(dstrange)))"))
        end

        # figure out how many dimensions of the Cartesian ranges map onto contiguous memory
        # in both source and destination. we will copy these one by one as linear ranges.
        contiguous_dims = 1
        for dim in 2:N
            # a slice is broken up if the previous dimension didn't cover the entire range
            if axes(src, dim-1) == axes(srcrange, dim-1) &&
               axes(dst, dim-1) == axes(dstrange, dim-1)
                contiguous_dims = dim
            else
                break
            end
        end

        m = prod(size(dstrange)[1:contiguous_dims])      # inner, contiguous length
        n = prod(size(dstrange)[contiguous_dims+1:end])  # outer non-contiguous length
        @assert m*n == length(srcrange) == length(dstrange)

        # copy linear slices
        for i in 1:m:m*n
            srcoff = LinearIndices(src)[srcrange[i]]
            dstoff = LinearIndices(dst)[dstrange[i]]
            # TODO: Use asynchronous memory copies
            copyto!(dst, dstoff, src, srcoff, m)
        end

        dst
    end
end
), nor require every back-end to provide a method for both Array and a specifically-typed contiguous CPU-based SubArray.

@maleadt maleadt changed the title Copyto! does not work for subarray when scalar get index is disallowed copyto! does not support CPU SubArrays Aug 6, 2022
@maleadt maleadt transferred this issue from JuliaGPU/CUDA.jl Aug 6, 2022
@david-macmahon

I completely agree about not wanting to support all kinds of CPU arrays, but IMHO memory-contiguous views seem like an important enough subset of CPU arrays to be worthy of support.

The use case where I first encountered this limitation was where I wanted to have B (where B >= 2) MxN matrices on the CPU, so that while one is being transferred to the GPU the other(s) can still be populated on the CPU side. I first implemented this as a single MxNxB array on the CPU and made MxN views of each slice along the B dimension. But I soon discovered that copyto! doesn't work with views (i.e. SubArrays). The workaround for this case is to make a vector (or tuple) of B separate MxN Arrays instead. This results in more calls to Mem.pin and more-but-smaller DMA mappings, which seems less efficient, but I haven't had any problems with that in practice.
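For concreteness, the two buffer layouts described above look like this (plain CPU Julia; the shapes are illustrative):

```julia
M, N, B = 4, 3, 2

# Layout 1: one M×N×B block, with one view per slice. Transfers to/from
# these views are exactly what hits the copyto!/SubArray limitation.
block  = zeros(M, N, B)
slices = [view(block, :, :, b) for b in 1:B]

# Layout 2 (the workaround): B separate M×N Arrays. Each is a plain Array,
# so copyto! works, at the cost of B separate pinned allocations / DMA maps.
bufs = [zeros(M, N) for _ in 1:B]
```

Note that each slice in layout 1 *is* memory-contiguous, which is what makes the missing method frustrating: a single bulk copy per slice would be possible in principle.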

A more problematic use case is where the GPU produces MxN parts of a larger MxBN output array. I would like to be able to use copyto! to copy each MxN part directly into its place in the larger MxBN output array in CPU memory, but AFAIK this is not currently possible. The workaround is to have B different MxN output arrays, but treating those as a larger MxBN array is cumbersome at best.

It seems like there must be some way to have a copyto! method that takes a SubArray{T,N,Array,I,true} and a CuArray and performs a DMA copy if the SubArray is memory contiguous, or falls back to scalar indexing if the SubArray is not memory contiguous. Maybe SubArray{T,N,Array,I,true} is already guaranteed to be memory contiguous?
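A hedged sketch of what such a method could look like (names are hypothetical; this is not the GPUArrays implementation). `SubArray{T,N,Array,I,true}` alone is not guaranteed contiguous: `L == true` only promises a fast linear-indexing formula (e.g. `view(A, 2, :)` has `L == true` but stride `size(A, 1)`). Base's own unexported alias `FastContiguousSubArray`, a SubArray whose first index is a unit range and whose `L` is true, is what guarantees contiguity:

```julia
# Hypothetical sketch, not the GPUArrays implementation: dispatch on a
# contiguous CPU view and forward to the existing linear copyto! on its
# parent Array, which a back-end could service with a single DMA copy.
const ContiguousSubVector{T} = Base.FastContiguousSubArray{T,1,<:Array}

function copyto_contig!(dest::ContiguousSubVector, src::AbstractVector)
    # Recover the view's linear offset into its parent...
    firstidx = map(first, parentindices(dest))
    offset   = LinearIndices(parent(dest))[firstidx...]
    # ...and do one linear copy into the parent's memory.
    copyto!(parent(dest), offset, src, 1, length(src))
    return dest
end
```

With `src` a GPU array, the inner call would route to the linear Array/GPU `copyto!` methods quoted earlier in this thread; non-contiguous views would still need a fallback (a staging copy or scalar indexing).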

@albertomercurio
Contributor

Hello,

Do you have any news about this issue? I think it would be very useful for a lot of applications.

@maleadt
Member

maleadt commented Feb 21, 2024

Nobody seems to be working on this, so no, there have been no updates.
