Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't broadcast out-of-place #235

Closed
ChrisRackauckas opened this issue Jan 5, 2018 · 7 comments
Closed

Don't broadcast out-of-place #235

ChrisRackauckas opened this issue Jan 5, 2018 · 7 comments

Comments

@ChrisRackauckas
Copy link
Member

ChrisRackauckas commented Jan 5, 2018

It seems that broadcast is harmful to most uses of the out-of-place format. It drastically slows down things like static arrays and ArrayFire arrays that are made to be used out-of-place. Since we should prioritize these for this path, it makes more sense for this path to be non-broadcasting. The reason we put broadcasting there in the first place was for fusion, but fusion is only required if one thinks that temporary allocations are bad for their type, which is opposite of the assumption that the out-of-place branch should be making. If one needs broadcasting, they should be using the route for mutables anyways.

This reverts some hard work by @devmotion, we probably should've checked more into this first. I'm sorry about that but still appreciate your hard work. But the net result is from

779.570 μs (807 allocations: 28.69 KiB)

to

19.320 μs (487 allocations: 19.94 KiB)

so it's huge for

using StaticArrays, OrdinaryDiffEq
using BenchmarkTools

α=1
β=1
f(t,u) = α*u
g(t,u) = β*u
dt = 1//2^(4)
tspan = (0.0,1.0)
u0 = @SVector [2.0,3.0]
prob = ODEProblem(f,u0,(0.0,1.0))

and puts us on similar grounds with BridgeDiffEq.jl which was made specifically for fixed timestep static arrays, so this is the "true optimization" which we were missing. (16.979 μs (698 allocations: 20.98 KiB) for Bridge. Most of the difference in the profiling is now just building the integrator type and other small setup bits which are a constant factor). It also evades #106 for out-of-place which will cleanup our code quite a bit. I'll get to work on getting this package-wide changed.

@ChrisRackauckas ChrisRackauckas changed the title Dpm Don't broadcast out-of-place Jan 5, 2018
@ChrisRackauckas
Copy link
Member Author

@rveltz this should give similar speedups on ArrayFire arrays. Do you have any test equations you'd like us to run?

@ChrisRackauckas
Copy link
Member Author

BTW @Datseris, this makes static array paths finally faster.

f2(t,u,du) = (du.=α*u)
g2(t,u,du) = (du.=β*u)
u0 = [2.0,3.0]
prob = ODEProblem(f2,u0,(0.0,1.0))
@btime sol1 = solve(prob,BS3(),dt=dt,adaptive=false)

gives

24.005 μs (281 allocations: 20.55 KiB)

@Datseris
Copy link
Contributor

Datseris commented Jan 5, 2018

Let me know when this is operational in a stable version and then I will benchmark with "real" multidimensional systems. The difference of 5 μs is not big enough at the moment to make it compelling, plus the in-place version is always scalable: the performance will be "nice" regardless of dimensionallity. Can the same be said about static arrays?

Normally you would use static arrays and expect to get orders of magnitude better performance for small dimensionality. Here it is not even close to a single order of magnitude and the dimension is a mere 2, the lowest possible.

P.S.: I was always thinking that there is so much crazy amount of work put into supporting both in-place and out of place versions, but never really understood why,... Why even bother to support the out-of-place?

@ChrisRackauckas
Copy link
Member Author

the performance will be "nice" regardless of dimensionallity. Can the same be said about static arrays?

No, static arrays will not scale. That's their limitation.

Normally you would use static arrays and expect to get orders of magnitude better performance for small dimensionality.

orders? No. Maybe an order of magnitude, but if that's true then either the mutable code could be faster or it's using a static array fast path. Static array fast paths, like the fast QR, is so nice when you get it though!

Why even bother to support the out-of-place?

Well, it's really not that much work since it's already written. Probably <5% of my time is spent working on them. Also, they are easier to get working, so when I am developing a new algorithm I always write the out of place version first, so why not keep it? They are more efficient when applicable, and they can be easier for users to do some things efficiently. For example, you don't have to use reinit with out-of-place because there's essentially no cache variables. It also makes weird stuff like resizing easier if it doesn't need to be applied to caches. So it's a nice touch that users who are interested in small systems can use to get maybe and extra 2x-3x out in the best of cases.

@Datseris
Copy link
Contributor

Datseris commented Jan 5, 2018

Fair enough.

I do have a comment about the speedup. You can indeed get more than one order in the case of 2 and 3 dimensions, this is what I realized when working with discrete systems. Of course at my use case that was because of the absurdly fast QR method native for 2 and 3D static matrices, which made my function crazy fast.

But its probably fair to say that expecting more than one order in a differential equation setting is asking too much.

EDIT: Let me know though when this is in a stable release, and I will test it just so that I can help the ecosystem.

@ChrisRackauckas
Copy link
Member Author

ChrisRackauckas commented Jan 5, 2018

But its probably fair to say that expecting more than one order in a differential equation setting is asking too much.

Yes, for a non-stiff diffeq solver it's literally the difference of

@. tmp = a1*k1 + k2*k2

with vectors vs

tmp = a1*k1 + a2*k2

with static arrays, and both are quite optimized in these cases. It's the "harder parts", which would only show up in stiff solvers (because of linear solvers, that difference in the QR is huge!) or in the user's f, which would make a difference. If your f is much faster with static arrays, you'll see it with the diffeq code as well. The tests I posted had a trivial f though.

@ChrisRackauckas
Copy link
Member Author

Done here and StochasticDiffEq.jl. Wasn't bad at all 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants