
MultiScaleModels Profiling #9

Closed
ChrisRackauckas opened this issue Jan 21, 2017 · 7 comments

@ChrisRackauckas (Member)

I optimized the ODE solvers around these types a bit. As a test case, I took a linear system on the embryo from the tests. The embryo has 4 layers and a total length of 13, so it uses very small leaf nodes. This means that, with a cheap problem and a deep hierarchy, it should accentuate any differences and give an upper bound on the cost of using a MultiScaleModel.

I then benchmarked the same problem solved with MultiScaleModels against solving it with the state as a regular length-13 contiguous array:

prob = ODEProblem(f,em,(0.0,1500.0))
@benchmark sol1 = solve(prob,Tsit5(),save_timeseries=false)
prob = ODEProblem(f,em[:],(0.0,1500.0))
@benchmark sol2 = solve(prob,Tsit5(),save_timeseries=false)

to get the results:

BenchmarkTools.Trial: 
  memory estimate:  230.28 kb
  allocs estimate:  10920
  --------------
  minimum time:     16.469 ms (0.00% GC)
  median time:      16.942 ms (0.00% GC)
  mean time:        17.033 ms (0.46% GC)
  maximum time:     26.385 ms (33.77% GC)
  --------------
  samples:          294
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

BenchmarkTools.Trial: 
  memory estimate:  196.37 kb
  allocs estimate:  10409
  --------------
  minimum time:     7.500 ms (0.00% GC)
  median time:      7.846 ms (0.00% GC)
  mean time:        7.907 ms (0.51% GC)
  maximum time:     13.730 ms (42.01% GC)
  --------------
  samples:          632
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This puts the upper bound for the cost of using MultiScaleModels at just over 2x. But this is without saving: saving the timeseries is more costly with a MultiScaleModel. Measuring the cost with saving at every step:

prob = ODEProblem(f,em,(0.0,1500.0))
@benchmark sol1 = solve(prob,Tsit5())
prob = ODEProblem(f,em[:],(0.0,1500.0))
@benchmark sol2 = solve(prob,Tsit5())
BenchmarkTools.Trial: 
  memory estimate:  3.20 mb
  allocs estimate:  43891
  --------------
  minimum time:     26.454 ms (0.00% GC)
  median time:      27.738 ms (0.00% GC)
  mean time:        28.962 ms (4.21% GC)
  maximum time:     39.190 ms (27.02% GC)
  --------------
  samples:          173
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

BenchmarkTools.Trial: 
  memory estimate:  1.52 mb
  allocs estimate:  21060
  --------------
  minimum time:     9.414 ms (0.00% GC)
  median time:      9.910 ms (0.00% GC)
  mean time:        10.252 ms (3.11% GC)
  maximum time:     18.033 ms (35.72% GC)
  --------------
  samples:          488
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This puts the upper bound at around 3x, so the maximal cost of the abstraction is about 2x-3x for ODEs. It's actually lower for SDEs and DDEs because more of the calculation in those domains is spent on interpolation and noise generation, which can mostly avoid the costs of MultiScaleModels. So the final product is one where the abstraction cost is less than the performance difference between OrdinaryDiffEq and other packages, meaning that using MultiScaleModels in OrdinaryDiffEq should still be slightly faster than using other packages with contiguous arrays. I think this counts as well optimized, so my goal will now be to make this all compatible with the solvers for stiff equations, since that will make a large impact.

@zahachtah I think you might be interested in these results.

@ChrisRackauckas (Member, Author)

v0.6 DiffEq got a lot faster while this stayed the same:

prob = ODEProblem(f,em,(0.0,1500.0))
@benchmark sol1 = solve(prob,Tsit5(),save_everystep=false)
prob = ODEProblem(f,em[:],(0.0,1500.0))
@benchmark sol2 = solve(prob,Tsit5(),save_everystep=false)


BenchmarkTools.Trial: 
  memory estimate:  24.44 KiB
  allocs estimate:  323
  --------------
  minimum time:     393.737 μs (0.00% GC)
  median time:      432.086 μs (0.00% GC)
  mean time:        452.956 μs (0.88% GC)
  maximum time:     4.695 ms (89.11% GC)
  --------------
  samples:          10000
  evals/sample:     1


BenchmarkTools.Trial: 
  memory estimate:  103.53 KiB
  allocs estimate:  1774
  --------------
  minimum time:     27.596 ms (0.00% GC)
  median time:      28.951 ms (0.00% GC)
  mean time:        28.998 ms (0.00% GC)
  maximum time:     32.709 ms (0.00% GC)
  --------------
  samples:          173
  evals/sample:     1

prob = ODEProblem(f,em,(0.0,1500.0))
@benchmark sol1 = solve(prob,Tsit5())
prob = ODEProblem(f,em[:],(0.0,1500.0))
@benchmark sol2 = solve(prob,Tsit5())

BenchmarkTools.Trial: 
  memory estimate:  13.31 MiB
  allocs estimate:  123635
  --------------
  minimum time:     93.224 ms (0.00% GC)
  median time:      102.820 ms (6.55% GC)
  mean time:        100.983 ms (4.83% GC)
  maximum time:     110.067 ms (12.34% GC)
  --------------
  samples:          50
  evals/sample:     1


BenchmarkTools.Trial: 
  memory estimate:  1.21 MiB
  allocs estimate:  6499
  --------------
  minimum time:     939.406 μs (0.00% GC)
  median time:      1.068 ms (0.00% GC)
  mean time:        1.208 ms (7.40% GC)
  maximum time:     4.706 ms (52.20% GC)
  --------------
  samples:          4123
  evals/sample:     1

I don't know what to say about that. It could probably use some optimizations. Or the "full broadcast" integrators would probably do well, since this is using the indexing versions right now.

The major change that could have caused this is that the "non-user-facing cache variables" now also match the type that the user gives. Before, they were transformed into contiguous arrays since they were not shown to the user. That had some weird side effects, though; for example, it slows down "add_daughter" types of events. So it's somewhat a wash... broadcast will likely be the savior here.

@zahachtah

Do you anticipate that 0.7 or 1.0 will improve the 2-3x slower performance?

@ChrisRackauckas (Member, Author)

It'll fix it. The detail is that linear indexing is slow because it has to do a binary search, but the built-in broadcast is fast. A v0.6 bug prevented broadcast in most ODE/SDE methods: SciML/OrdinaryDiffEq.jl#106

However, this bug was fixed in Julia v0.7: JuliaLang/julia#22255 . So in the v0.7 upgrade we will be making all of those algorithms internally use broadcasting, which will get rid of the slow indexing that MultiScaleArrays is hitting (it also has other nice side effects: for example, all of the RK methods will be GPU-compatible!).
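The indexing cost can be sketched with a toy nested container (hypothetical names and types; MultiScaleArrays.jl's real hierarchy is richer): linear indexing must first search for the node that owns an element, while a structure-aware traversal walks the tree directly.

```julia
# Toy one-level hierarchy standing in for a MultiScaleArray
# (hypothetical names; MultiScaleArrays.jl's real types differ).
struct ToyModel
    nodes::Vector{Vector{Float64}}
    ends::Vector{Int}                 # cumulative node lengths
end
ToyModel(nodes) = ToyModel(nodes, cumsum(length.(nodes)))

# Linear indexing must first locate the owning node -- a binary search:
function linindex(m::ToyModel, i::Int)
    node = searchsortedfirst(m.ends, i)
    offset = node == 1 ? 0 : m.ends[node - 1]
    return m.nodes[node][i - offset]
end

# A structure-aware in-place map walks the tree directly, with no
# per-element search -- this is what broadcast-based internals exploit:
structmap!(f, m::ToyModel) = (foreach(v -> map!(f, v, v), m.nodes); m)
```

Here every `linindex` call pays for a `searchsortedfirst`, which is the per-element cost that broadcasting avoids.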

With ArrayPartitions from RecursiveArrayTools.jl we found that grouping small broadcasts can actually be faster than contiguous loops because of smart cache handling, so there is a chance this can do extremely well. One issue this library will still have is that copying a MultiScaleArray is more expensive than copying a standard Array, so if you have a lot of saving going on, that will be more expensive. But all together, I wouldn't be surprised if MultiScaleArrays ends up a small cost (1.3x?) or just a wash (<1.1x performance loss). The 2x-3x shouldn't exist after these changes. I am very very very happy that compiler change happened in Base.
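The copying cost mentioned above can be illustrated with plain nested arrays (a toy stand-in, not the actual MultiScaleArray layout): a nested state needs one allocation per node when saved, while a flat contiguous state copies with a single allocation.

```julia
# Copying a nested state allocates once per inner node...
nested = [[1.0, 2.0], [3.0, 4.0, 5.0]]
nested_copy = [copy(v) for v in nested]   # one allocation per inner array

# ...while a flat, contiguous state copies with a single allocation:
flat = [1.0, 2.0, 3.0, 4.0, 5.0]
flat_copy = copy(flat)

# Either way the saved copy must be independent of the live state:
nested_copy[1][1] = 99.0                  # does not touch `nested`
```

With per-step saving, the per-node allocations are paid at every accepted step, which is why saving is where the nested layout costs the most.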

@zahachtah

Oh, that's very cool! I may at times have very large arrays (10,000+ elements); do you think threading the ODE solve will be efficient in 0.7 as well?

@ChrisRackauckas (Member, Author)

Yes, for arrays of that size it would be a good idea. I want to create a package which makes broadcast multithreaded via a wrapper. That's pretty simple on v0.7, so I was waiting on that as the solution.
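A minimal sketch of such a wrapper, using only Base threading on a flat array (hypothetical function name; the proposed package would hook this into broadcast itself):

```julia
# Minimal multithreaded elementwise map, standing in for the proposed
# broadcast wrapper (a sketch, not a real package API).
function tmap!(f, dest::AbstractVector, src::AbstractVector)
    Threads.@threads for i in eachindex(dest, src)
        @inbounds dest[i] = f(src[i])
    end
    return dest
end

u = collect(1.0:5.0)
du = similar(u)
tmap!(x -> 2x, du, u)   # du is now [2.0, 4.0, 6.0, 8.0, 10.0]
```

Threading only pays off when each chunk has enough work to amortize task startup, which is why it targets the 10,000+ element case rather than the small leaf nodes above.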

@zahachtah commented Jul 17, 2018 via email

@ChrisRackauckas (Member, Author)

Or just have tuple types for holding the nodes: #26 . TypeSortedCollections.jl will work similarly, and ArrayPartition works as well. These all require indexing with literals or broadcasting to get type-stable operations, though.
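The tuple idea can be sketched as follows (a toy illustration, not the actual design in #26): heterogeneous node types held in a tuple stay type-stable as long as they are accessed with literal indices or mapped/broadcast over.

```julia
# Heterogeneous node types held in a tuple (toy illustration):
nodes = ([1.0, 2.0], Int16[3, 4, 5])

# A literal index is type-stable: the compiler knows nodes[1] is a
# Vector{Float64} and nodes[2] is a Vector{Int16}.
first_node = nodes[1]

# `map` over a tuple unrolls per element, so it is type-stable too:
doubled = map(n -> n .* 2, nodes)

# In contrast, nodes[i] with a runtime `i` infers to a Union type,
# which is why literal indexing or broadcasting is required.
```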
