
copyto! does not support CPU SubArrays #422

Open
ww1g11 opened this issue Oct 18, 2019 · 5 comments

Comments

@ww1g11
ww1g11 commented Oct 18, 2019

Describe the bug
`copyto!` does not work for a SubArray when scalar indexing is disallowed.

To Reproduce
The Minimal Working Example (MWE) for this bug:

using CuArrays
CuArrays.allowscalar(false)
N = 100
u_c = rand(N, 2);
u_d = CuArrays.CuArray(ones(N));
u_v = view(u_c, :, 1);
copyto!(u_d, u_v)  # works
copyto!(u_v, u_d)  # does not work
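For reference, one workaround that avoids the failing call is to stage the device array through a plain CPU array first (this sketch assumes the MWE above; `Array(u_d)` performs a single bulk device-to-host copy):

```julia
# Workaround sketch, assuming the MWE above: materialize the device array on
# the host, then copy host-to-host into the view. Both steps are bulk copies,
# so no scalar GPU indexing is triggered.
copyto!(u_v, Array(u_d))
```

The cost is one extra host allocation per transfer, but it sidesteps the missing SubArray method entirely.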

Environment details (please complete this section)
Details on Julia:

Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)

Status `~/.julia/environments/v1.1/Project.toml`
  [3a865a2d] CuArrays v1.2.1
@david-macmahon

I can see why avoiding scalar indexing for copyto! is non-trivial with views in general, but it would be great to have copyto! avoid scalar indexing for contiguous views. One of SubArray's type parameters (L) indicates whether the SubArray supports "fast linear indexing". I don't know whether that is exactly the same as being strictly contiguous in memory (i.e. in a DMA sense), but it would be nice if copyto! could somehow be a bit more discerning about when to fall back to scalar indexing for views.
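The two properties are in fact different, which can be illustrated in plain CPU Julia (no GPU needed): the `L` parameter only promises a fast linear-indexing *formula*, whereas the unexported `Base.iscontiguous` checks actual memory contiguity:

```julia
A = rand(4, 4)

v1 = view(A, :, 2)      # a full column: unit stride, contiguous
v2 = view(A, 2, :)      # a row: linear formula with stride 4, NOT contiguous
v3 = view(A, 1:2:3, 1)  # strided range: fast-linear, NOT contiguous

for v in (v1, v2, v3)
    L = typeof(v).parameters[end]       # the "fast linear indexing" flag
    println((L, Base.iscontiguous(v)))  # L is true for all three,
end                                     # but only v1 is contiguous
```

So dispatching on `L == true` alone would over-match; a contiguity check like `Base.iscontiguous` (unexported, but present in Base) is the stricter test a DMA-style copy would need.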

@maleadt
Member

maleadt commented Aug 6, 2022

The problem is that we don't want to generalize our copy routines to support all kinds of CPU arrays, because doing so introduces a whole bunch of ambiguities (we tried in the past, #284). At the same time, we don't want to duplicate the copy routines in GPUArrays (

# memory copying

function Base.copy!(dst::AbstractGPUVector, src::AbstractGPUVector)
    axes(dst) == axes(src) || throw(ArgumentError(
        "arrays must have the same axes for `copy!`. consider using `copyto!` instead"))
    copyto!(dst, src)
end

## basic linear copies of identically-typed memory

# expects the GPU array type to have linear `copyto!` methods (i.e. accepting an integer
# offset and length) from and to CPU arrays and between GPU arrays.

for (D, S) in ((AnyGPUArray, Array),
               (Array, AnyGPUArray),
               (AnyGPUArray, AnyGPUArray))
    @eval begin
        function Base.copyto!(dest::$D{<:Any, N}, rdest::UnitRange,
                              src::$S{<:Any, N}, ssrc::UnitRange) where {N}
            drange = CartesianIndices((rdest,))
            srange = CartesianIndices((ssrc,))
            copyto!(dest, drange, src, srange)
        end

        function Base.copyto!(dest::$D, d_range::CartesianIndices{1},
                              src::$S, s_range::CartesianIndices{1})
            len = length(d_range)
            if length(s_range) != len
                throw(ArgumentError("Copy range needs same length. Found: dest: $len, src: $(length(s_range))"))
            end
            len == 0 && return dest
            d_offset = first(d_range)[1]
            s_offset = first(s_range)[1]
            copyto!(dest, d_offset, src, s_offset, len)
        end

        Base.copyto!(dest::$D, src::$S) = copyto!(dest, 1, src, 1, length(src))
    end
end

# kernel-based variant for copying between wrapped GPU arrays

function linear_copy_kernel!(ctx::AbstractKernelContext, dest, dstart, src, sstart, n)
    i = linear_index(ctx)-1
    if i < n
        @inbounds dest[dstart+i] = src[sstart+i]
    end
    return
end

function Base.copyto!(dest::AnyGPUArray, dstart::Integer,
                      src::AnyGPUArray, sstart::Integer, n::Integer)
    n == 0 && return dest
    n < 0 && throw(ArgumentError(string("tried to copy n=", n, " elements, but n should be nonnegative")))
    destinds, srcinds = LinearIndices(dest), LinearIndices(src)
    (checkbounds(Bool, destinds, dstart) && checkbounds(Bool, destinds, dstart+n-1)) || throw(BoundsError(dest, dstart:dstart+n-1))
    (checkbounds(Bool, srcinds, sstart)  && checkbounds(Bool, srcinds, sstart+n-1))  || throw(BoundsError(src, sstart:sstart+n-1))
    gpu_call(linear_copy_kernel!,
             dest, dstart, src, sstart, n;
             elements=n)
    return dest
end

# variants that materialize the GPU wrapper before copying from or to the CPU

function Base.copyto!(dest::Array, dstart::Integer,
                      src::WrappedGPUArray, sstart::Integer, n::Integer)
    n == 0 && return dest
    temp = similar(parent(src), n)
    copyto!(temp, 1, src, sstart, n)
    copyto!(dest, dstart, temp, 1, n)
    return dest
end

function Base.copyto!(dest::WrappedGPUArray, dstart::Integer,
                      src::Array, sstart::Integer, n::Integer)
    n == 0 && return dest
    temp = similar(parent(dest), n)
    copyto!(temp, 1, src, sstart, n)
    copyto!(dest, dstart, temp, 1, n)
    return dest
end

# variants that convert values on the CPU when there's a type mismatch
#
# we prefer to convert on the CPU where there's typically more memory / less memory pressure
# to quickly perform these very lightweight conversions

function Base.copyto!(dest::Array{T}, dstart::Integer,
                      src::AnyGPUArray{U}, sstart::Integer,
                      n::Integer) where {T,U}
    n == 0 && return dest
    temp = Vector{U}(undef, n)
    copyto!(temp, 1, src, sstart, n)
    copyto!(dest, dstart, temp, 1, n)
    return dest
end

function Base.copyto!(dest::AnyGPUArray{T}, dstart::Integer,
                      src::Array{U}, sstart::Integer, n::Integer) where {T,U}
    n == 0 && return dest
    temp = Vector{T}(undef, n)
    copyto!(temp, 1, src, sstart, n)
    copyto!(dest, dstart, temp, 1, n)
    return dest
end

## generalized blocks of heterogeneous memory

function cartesian_copy_kernel!(ctx::AbstractKernelContext, dest, dest_offsets, src, src_offsets, shape, length)
    i = linear_index(ctx)
    if i <= length
        idx = CartesianIndices(shape)[i]
        @inbounds dest[idx + dest_offsets] = src[idx + src_offsets]
    end
    return
end

function Base.copyto!(dest::AnyGPUArray{<:Any, N}, destcrange::CartesianIndices{N},
                      src::AnyGPUArray{<:Any, N}, srccrange::CartesianIndices{N}) where {N}
    shape = size(destcrange)
    if shape != size(srccrange)
        throw(ArgumentError("Ranges don't match their size. Found: $shape, $(size(srccrange))"))
    end
    len = length(destcrange)
    len == 0 && return dest

    dest_offsets = first(destcrange) - oneunit(CartesianIndex{N})
    src_offsets = first(srccrange) - oneunit(CartesianIndex{N})
    gpu_call(cartesian_copy_kernel!,
             dest, dest_offsets, src, src_offsets, shape, len;
             elements=len)
    dest
end

for (dstTyp, srcTyp) in (AbstractGPUArray=>Array, Array=>AbstractGPUArray)
    @eval function Base.copyto!(dst::$dstTyp{T,N}, dstrange::CartesianIndices{N},
                                src::$srcTyp{T,N}, srcrange::CartesianIndices{N}) where {T,N}
        isempty(dstrange) && return dst
        if size(dstrange) != size(srcrange)
            throw(ArgumentError("source and destination must have same size (got $(size(srcrange)) and $(size(dstrange)))"))
        end

        # figure out how many dimensions of the Cartesian ranges map onto contiguous memory
        # in both source and destination. we will copy these one by one as linear ranges.
        contiguous_dims = 1
        for dim in 2:N
            # a slice is broken up if the previous dimension didn't cover the entire range
            if axes(src, dim-1) == axes(srcrange, dim-1) &&
               axes(dst, dim-1) == axes(dstrange, dim-1)
                contiguous_dims = dim
            else
                break
            end
        end

        m = prod(size(dstrange)[1:contiguous_dims])      # inner, contiguous length
        n = prod(size(dstrange)[contiguous_dims+1:end])  # outer non-contiguous length
        @assert m*n == length(srcrange) == length(dstrange)

        # copy linear slices
        for i in 1:m:m*n
            srcoff = LinearIndices(src)[srcrange[i]]
            dstoff = LinearIndices(dst)[dstrange[i]]
            # TODO: Use asynchronous memory copies
            copyto!(dst, dstoff, src, srcoff, m)
        end

        dst
    end
end
), nor require every back-end to provide a method for both Array and a specifically-typed contiguous CPU-based SubArray.

@maleadt maleadt changed the title Copyto! does not work for subarray when scalar get index is disallowed copyto! does not support CPU SubArrays Aug 6, 2022
@maleadt maleadt transferred this issue from JuliaGPU/CUDA.jl Aug 6, 2022
@david-macmahon

I completely agree about not wanting to support all kinds of CPU arrays, but IMHO memory-contiguous views seem like an important enough subset of CPU arrays to be worthy of support.

The use case where I first encountered this limitation was where I wanted to have B (where B >= 2) MxN matrices on the CPU, so that while one is being transferred to the GPU the other(s) can still be populated on the CPU side. I first implemented this as a single MxNxB array on the CPU and made MxN views of each slice along the B dimension. But I soon discovered that copyto! doesn't work with views (i.e. SubArrays). The workaround for this case is to make a vector (or tuple) of B separate MxN Arrays instead. This results in more calls to Mem.pin and more-but-smaller DMA mappings, which seems less efficient, but I haven't had any problems with that in practice.
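For concreteness, the two buffer layouts described above look like this (plain CPU Julia; the shapes are illustrative):

```julia
M, N, B = 4, 3, 2

# Layout 1: one M×N×B block, with one view per slice. Transfers to/from
# these views are exactly what hits the copyto!/SubArray limitation.
block  = zeros(M, N, B)
slices = [view(block, :, :, b) for b in 1:B]

# Layout 2 (the workaround): B separate M×N Arrays. Each is a plain Array,
# so copyto! works, at the cost of B separate pinned allocations / DMA maps.
bufs = [zeros(M, N) for _ in 1:B]
```

Note that each slice in layout 1 *is* memory-contiguous, which is what makes the missing method frustrating: a single bulk copy per slice would be possible in principle.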

A more problematic use case is where the GPU produces MxN parts of a larger MxBN output array. I would like to be able to use copyto! to copy each MxN part directly into its place in the larger MxBN output array in CPU memory, but AFAIK this is not currently possible. The workaround is to have B different MxN output arrays, but treating those as a larger MxBN array is cumbersome at best.

It seems like there must be some way to have a copyto! method that takes a SubArray{T,N,Array,I,true} and a CuArray and performs a DMA copy if the SubArray is memory contiguous, or falls back to scalar indexing if the SubArray is not memory contiguous. Maybe SubArray{T,N,Array,I,true} is already guaranteed to be memory contiguous?
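A hedged sketch of what such a method could look like (names are hypothetical; this is not the GPUArrays implementation). `SubArray{T,N,Array,I,true}` alone is not guaranteed contiguous: `L == true` only promises a fast linear-indexing formula (e.g. `view(A, 2, :)` has `L == true` but stride `size(A, 1)`). Base's own unexported alias `FastContiguousSubArray`, a SubArray whose first index is a unit range and whose `L` is true, is what guarantees contiguity:

```julia
# Hypothetical sketch, not the GPUArrays implementation: dispatch on a
# contiguous CPU view and forward to the existing linear copyto! on its
# parent Array, which a back-end could service with a single DMA copy.
const ContiguousSubVector{T} = Base.FastContiguousSubArray{T,1,<:Array}

function copyto_contig!(dest::ContiguousSubVector, src::AbstractVector)
    # Recover the view's linear offset into its parent...
    firstidx = map(first, parentindices(dest))
    offset   = LinearIndices(parent(dest))[firstidx...]
    # ...and do one linear copy into the parent's memory.
    copyto!(parent(dest), offset, src, 1, length(src))
    return dest
end
```

With `src` a GPU array, the inner call would route to the linear Array/GPU `copyto!` methods quoted earlier in this thread; non-contiguous views would still need a fallback (a staging copy or scalar indexing).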

@albertomercurio
Contributor

Hello,

Do you have any news about this issue? I think it would be very useful for a lot of applications.

@maleadt
Member

maleadt commented Feb 21, 2024

Nobody seems to be working on this, so no, there have been no updates.
