Adding many-dimensional arrays is slow #20163

amilsted · 2017-01-20T21:03:05Z

Hi,

I often want to add together two arrays with N small dimensions, where e.g. N=18 and each dimension has size 2.

N = 18
a = rand(2^N)
ar = reshape(a, ntuple(x->2, N))

Initially, I was unknowingly adding ReshapedArray objects with these dimensions, which is very slow, at least done naively using +:

v = view(a, 1:2^N)
@time v + v #a few ms for N=18 
vr = reshape(v, ntuple(x->2, N))
@time vr + vr #around 30s(!) for N=18

I see a similar slowdown with just Arrays if I try to use broadcasting.

@time a .+ a #a few ms
@time ar .+ ar #around 30s again!

Could these just be extreme cases of the following simple case?:

@time a + a #a few hundred ns
@time ar + ar #around 65ms (around 100 times slower)

I thought it might be due to linear indexing not being used, buy my own attempt to use it

function myplus(x, y)
    res = zeros(x)
    @inbounds for j in 1:length(x)
        res[j] = x[j] + y[j]
    end
end
@time myplus(ar,ar) #around 40 ms

was not much better, and obviously skips some safety checks.

This is with

Julia Version 0.6.0-dev.1230
Commit ce6a0c3 (2016-11-10 21:09 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU @ 2.60GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, sandybridge)

Julia 0.5.0 performs similarly, except that a .+ a and a + a switch speeds, with a .+ a being faster (hundreds of ns) on 0.5.0.

The text was updated successfully, but these errors were encountered:

timholy · 2017-01-20T21:42:50Z

The core problem here is that julia's type-inference machinery basically punts when functions have more than 15 arguments due to this constant. There have been attempts to set it much higher (#15080), but that causes problems elsewhere.

Your linear-indexing trick should work, though (EDIT: for regular arrays, not for ReshapedArrays, I didn't catch that part initially). You can manually check that they are of compatible size with a line like

    @boundscheck indices(x) == indices(y) || throw(DimensionMismatch("x and y must have the same indices, got $(indices(x)) and $(indices(y))"))

You should also return res from the function. I would expect that to be extremely fast (for Arrays, not ReshapedArrays). ~~Are you sure you're warming up first? The performance tips page suggests some ways to diagnose problems.~~ Without help from inference, I am not sure whether it's even possible to achieve good performance for ReshapedArrays.

amilsted · 2017-01-20T22:45:47Z

I see. Interesting! But, as a workaround, couldn't the add function for Arrays just treat the fast-linear-indexing case specially, without involving any functions with many arguments? Does that slow the low-dim cases down too much?

Anyway, I tried your suggestions and found something interesting:

function myplus1(x, y)
    @boundscheck length(x) == length(y) || throw(DimensionMismatch("x and y must have the same length, got $(length(x)) and $(length(y))"))
    res = zeros(promote_type(eltype(x), eltype(y)), size(x))
    @inbounds @simd for j in 1:length(x)
        res[j] = x[j] + y[j]
    end
    res
end

function myplus2(x, y)
    @boundscheck length(x) == length(y) || throw(DimensionMismatch("x and y must have the same length, got $(length(x)) and $(length(y))"))
    res = zeros(promote_type(eltype(x), eltype(y)), length(x))
    @inbounds @simd for j in 1:length(x)
        res[j] = x[j] + y[j]
    end
    reshape(res, size(x))
end

@time a + a #0.000511 seconds (6 allocations: 2.000 MB)
@time res = myplus1(ar,ar) #0.035502 seconds (1.05 M allocations: 21.988 MB, 6.36% gc time)
@time res = myplus1(a,a) #0.000647 seconds (6 allocations: 2.000 MB)
@time res = myplus2(ar,ar) #0.000676 seconds (18 allocations: 2.002 MB)
@time res = myplus2(a,a) #0.000647 seconds (6 allocations: 2.000 MB)

These numbers are about the lowest I could get on this machine by repeating a number of times. So indeed, linear indexing does seem to fix things, but not for the destination array! If res has many dimensions I see the same behaviour as I did in my original report. If it is instead a vector, which I reshape afterwards, I see similar performance to the 1-dimensional case.

PS: Julia 0.5.0 is the same. Also, I deliberately relaxed the bounds checking to check the linear sizes only.

mbauman · 2017-01-20T22:54:32Z

Yes, that's another symptom of the same problem. The issue is that inference gives up trying to figure out what type res will have since it is calling zeros with a very large tuple. You can see this with @code_warntype myplus1(ar,ar). That means that it can no longer optimize those setindex! calls within the loop, adding lots of costly overhead to each iteration.

amilsted · 2017-01-20T23:19:53Z

Oh dear. Yes, I see. Well I guess I know what to watch out for now! Is there an obvious reason why + on arrays doesn't do roughly what myplus2() does if possible? Is it mainly so people like me keep complaining and someone eventually fixes the underlying problem? 😉

timholy · 2017-01-20T23:28:00Z

Is it mainly so people like me keep complaining and someone eventually fixes the underlying problem? 😉

Definitely 😄.

The main reason is that there are many AbstractArrays that don't have efficient linear indexing. (Your ReshapedArrays above are often a good example; most SubArrays created from view are another.) We generally prefer to have only one implementation of an operation if possible, and so we tend to write the code as generically as possible. EDIT: the real trick of Julia is that often you get that generality without paying a price in terms of performance, if you organize your API and dispatch hierarchy carefully.

You may be winning the prize for the highest array dimension yet reported in what I'm assuming is "real code." (Congratulations!) So you're challenging the system in new and exciting ways.

amilsted · 2017-01-20T23:39:28Z

😄 It is indeed real code! The task is sparse diagonalization of a d^N * d^N matrix which is a sum of "local" terms (matrices that are e.g. kron(eye(d), eye(d), ...., eye(d), something_else, eye(d), ..., eye(d))). This is one way of doing efficient "Exact Diagonalization" of quantum many-body systems. Operations on vectors such as those above is part of a matrix-vector multiplication function fed to eigs() via @Jutho 's LinearMaps. Each dimension in the (d,d,d,...,d)-shaped vector represents a physical particle.

timholy · 2017-01-21T00:04:54Z

Interesting. Addressing this very generally is not a simple problem, but there is conceivably a way forward. Very briefly, the engine that underlies much of Julia's array handling is the processing of tuples---the compiler understands a lot about tuples so there is really quite an impressive amount of "computation" that can be done at compile-time (rather than run-time) if you implement operations in terms of tuples. There are a couple of different "dialects" of tuple processing, for example to add all the elements in a tuple you could have

foo1(t) = _foo1(0, t...)
_foo1(output, a, b...) = _foo1(output+a, b...)
_foo1(output) = output

versus

foo2(t) = _foo2(0, t)
_foo2(output, a) = _foo2(output+a[1], Base.tail(a))
_foo2(output, ::Tuple{}) = output

The first blows up the number of arguments, but the second (in principle) does not. The reason I say "in principle" is that it relies on Base.tail, and the implementation of tail does blow up the number of arguments. However, it might be possible to come up with a set of primitive operations and "whitelist" those as far as MAX_TUPLETYPE_LEN goes. That's what I mean when I say there may be a way forward.

The rub is that getindex and several other core functions require argument-splatting, so I don't think our current API is quite up to this kind of change.

tknopp · 2017-01-21T07:07:07Z

@amilsted can't you just reshape the array locally to a vector before adding them?

amilsted · 2017-01-21T21:11:11Z

Hi @tknopp. Yes, that would seem to be one work-around strategy. Since reshape() is cheap, I can define wrappers that do reshaping before and after the operations I want to perform.

It's not an insurmountable problem by any means, it just caught be by surprise. I think I remember reading about the large-tuples limitation back when I tried Julia for the first time, but I had not noticed that this was the problem here, perhaps because I rarely display or type out the the size tuples of these arrays. Is getting Julia to warn about crossing the tuple-length threshold an option?

tknopp · 2017-01-21T22:19:12Z

You are pushing the limits of Julias type system. So the best thing seems to be making oneself familiar with the limits and introduced simple workarounds. Regarding the warning there is the @code_warntype that could be helpful.

amilsted · 2017-01-21T22:30:06Z

Yes, I should use @code_warntype more, and working around the limitations is of course what I will do. However, I was imagining a printed WARNING that would alert me, and presumably others, to the problem when it occurs. Since performance already becomes bad if the MAX_TUPLETYPE_LEN limit is crossed, I am wondering whether emitting a warning would be useful to people without having a significant negative impact. For me the benefit of time saved in diagnosing the performance issue is clear.

Jutho · 2018-01-10T11:57:47Z

@timholy, if I remember correctly you indeed increased max_tuple_length at some point to 42. I guess problems where mainly increased compilation time?

Currently it's 15, which I assume is just another ad hoc choice which is sufficiently high for most purposes. However, the example @amilsted points out here, seems to point towards a more objective criterion. Given that tuples occur often in the context of Arrays. Given that most people use arrays with Float64 entries, I guess an array is considered large on todays desktops/laptops if has a total memory size of say a gigabyte, being 2^27 elements. Now the largest dimensionality that array can have, assuming that nobody wants a dimension of size 1, is 27, i.e. every dimension has size 2. This does appear in practice, indeed, in the context of joint probability distributions for 27 bits, or of quantum wave functions for 27 qubits. Would something like 25 <= max_tuple_length <=30 be a more motivated choice that is still feasible, or would it already cause the same problems as 42?

timholy · 2018-01-10T13:23:47Z

Very insightful, @Jutho, to come up with a principled way to set these parameters. That seems very compelling to me. I'll have to defer to @vtjnash, @JeffBezanson, or others to say what's feasible.

oscardssmith · 2018-01-10T15:38:18Z

Also, could it be set to a different value for repl use to make it more responsive? That would likely mitigate some of the compilation time problems.

andyferris · 2018-01-11T20:48:38Z

Is this still slow on v0.7? What about: #22370 (comment)?

Jutho · 2018-01-11T21:51:35Z

Thanks for the reference @andyferris, that is indeed amazing. It will be a long time before we can run exact diagonalization of a quantum spin system with 9001 spins :-).

brenhinkeller · 2022-11-19T21:12:07Z

The current (1.8.3) situation seems better

a = rand(2^N);
v = view(a, 1:2^N);
ar = reshape(a, ntuple(x->2, N));
vr = reshape(v, ntuple(x->2, N));
using BenchmarkTools

The reshapes are still slow, whether of views or otherwise -- but not 30s slow, and everything else looks pretty consistent:

julia> @btime $a + $a;
  38.916 μs (2 allocations: 2.00 MiB)

julia> @btime $a .+ $a;
  38.875 μs (2 allocations: 2.00 MiB)

julia> @btime $v + $v;
  39.417 μs (2 allocations: 2.00 MiB)

julia> @btime $v .+ $v;
  38.917 μs (2 allocations: 2.00 MiB)

julia> @btime $ar + $ar;
  3.210 ms (3 allocations: 2.00 MiB)

julia> @btime $ar .+ $ar;
  3.256 ms (3 allocations: 2.00 MiB)

julia> @btime $vr + $vr;
  2.935 ms (3 allocations: 2.00 MiB)

julia> @btime $vr .+ $vr;
  2.922 ms (3 allocations: 2.00 MiB)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores

Granted, the reshapes are still 85x slower than they could be.. That's anectodally a well-known issue, but I don't see another open issue adressing it, so maybe that could become the focus here henceforth

ararslan added arrays [a, r, r, a, y, s] performance Must go faster labels Jan 20, 2017

andyferris mentioned this issue Jun 3, 2018

Bump tuple inference length cutoff from 16 to 32 #27398

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding many-dimensional arrays is slow #20163

Adding many-dimensional arrays is slow #20163

amilsted commented Jan 20, 2017

timholy commented Jan 20, 2017 •

edited

Loading

amilsted commented Jan 20, 2017

mbauman commented Jan 20, 2017

amilsted commented Jan 20, 2017

timholy commented Jan 20, 2017 •

edited

Loading

amilsted commented Jan 20, 2017 •

edited

Loading

timholy commented Jan 21, 2017 •

edited

Loading

tknopp commented Jan 21, 2017

amilsted commented Jan 21, 2017

tknopp commented Jan 21, 2017

amilsted commented Jan 21, 2017

Jutho commented Jan 10, 2018

timholy commented Jan 10, 2018

oscardssmith commented Jan 10, 2018

andyferris commented Jan 11, 2018

Jutho commented Jan 11, 2018

brenhinkeller commented Nov 19, 2022

Adding many-dimensional arrays is slow #20163

Adding many-dimensional arrays is slow #20163

Comments

amilsted commented Jan 20, 2017

timholy commented Jan 20, 2017 • edited Loading

amilsted commented Jan 20, 2017

mbauman commented Jan 20, 2017

amilsted commented Jan 20, 2017

timholy commented Jan 20, 2017 • edited Loading

amilsted commented Jan 20, 2017 • edited Loading

timholy commented Jan 21, 2017 • edited Loading

tknopp commented Jan 21, 2017

amilsted commented Jan 21, 2017

tknopp commented Jan 21, 2017

amilsted commented Jan 21, 2017

Jutho commented Jan 10, 2018

timholy commented Jan 10, 2018

oscardssmith commented Jan 10, 2018

andyferris commented Jan 11, 2018

Jutho commented Jan 11, 2018

brenhinkeller commented Nov 19, 2022

timholy commented Jan 20, 2017 •

edited

Loading

timholy commented Jan 20, 2017 •

edited

Loading

amilsted commented Jan 20, 2017 •

edited

Loading

timholy commented Jan 21, 2017 •

edited

Loading