

Problems with scaling pmap beyond 6-7 cores #11354

Closed
jasonmorton opened this issue May 19, 2015 · 4 comments
Labels
parallelism Parallel or distributed computation performance Must go faster

Comments

@jasonmorton

I'm working with a 16-core / 32-thread machine with 32 GB of RAM that presents to Ubuntu as 32 cores. I'm trying to understand how to get the best performance for embarrassingly parallel tasks, using a batch of SVDs computed in parallel as an example. The scaling seems to be perfect (6.6 seconds regardless of the number of SVDs) until about 7 or 8 simultaneous SVDs, at which point the time starts to creep up, scaling roughly linearly although with high variance, up to 22 seconds for 16 and 47 seconds for 31.

I can confirm (by watching htop) that the number of processors in use equals the number being pmapped over, so I don't think OpenBLAS multithreading is the issue. Memory usage stays low. Any guess as to what is going on? I'm using the generic Linux binary julia-79599ada44. I don't think there should be any sending of the matrices between processes, but perhaps that is the issue.

Probably I am missing something obvious.
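For reference, a minimal sketch (using the Julia 0.3/0.4-era API that this thread uses; current Julia spells it `LinearAlgebra.BLAS.set_num_threads`) that rules out BLAS threading explicitly by forcing single-threaded OpenBLAS on every process:

```julia
# Force OpenBLAS to one thread on the master and all workers, so each
# pmap task occupies exactly one core (Julia 0.3/0.4-era API).
@everywhere blas_set_num_threads(1)

# In this era, svd returns a (U, S, V) tuple; [2] is the vector of
# singular values.
@time pmap(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
```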

**** with nprocs = 16 ****
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 22.350466328 seconds (12292776 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 91.135322511 seconds (10269056672 bytes allocated, 2.57% gc time)

**** with nprocs = 31 ****

perfect scaling up to this point (a 6x speedup)

@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:6])
elapsed time: 6.720786336 seconds (159168 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:6])
elapsed time: 34.146665292 seconds (3847940044 bytes allocated, 2.46% gc time)
#4.5x speedup

@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 19.819358972 seconds (391056 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 90.688842475 seconds (10260844684 bytes allocated, 2.36% gc time)
#3.69x speedup

@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:nprocs()])
elapsed time: 47.411315342 seconds (738616 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:nprocs()])
elapsed time: 175.308752879 seconds (19880206220 bytes allocated, 2.34% gc time)

@tkelman tkelman added the parallelism Parallel or distributed computation label May 19, 2015
@simonster
Member

Same as #10427?

@simonster simonster added the performance Must go faster label May 19, 2015
@ViralBShah
Member

@amitmurthy showed me that even if you just run multiple independent Julia processes from bash, not communicating with each other at all, you get the same slowdown. So perhaps this has nothing to do with Julia itself.

@amitmurthy
Contributor

Try

for i in `seq 1 n`
do
    julia -e "blas_set_num_threads(1); svd(rand(100,100)); sleep(1.0); @time [svd(rand(1000,1000))[2][1] for i in 1:10]" &
done

replacing n in `seq 1 n` with the values 1, 2, 4, 8, etc.

On my 4-core, 8-thread laptop, I get 5.9, 7.4, 11.5, and 32.7 seconds for n = 1, 2, 4, and 8 respectively. Ideally I would expect roughly the same 5.9 seconds for 1, 2, and 4 parallel runs, since there are 4 physical cores, i.e. ignoring hyperthreading.

I suspect L1/L2 cache contention as the cause of the slowdown. Note that these are all independent Julia processes running concurrently; Julia's parallel infrastructure is not used here at all.
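One way to probe the contention hypothesis further (not something tried in this thread; the assumption that CPUs 0..n-1 are distinct physical cores should be checked with `lscpu -e` first) is to pin each process to its own core with `taskset`, so hyperthread siblings are never shared:

```shell
#!/bin/sh
# Pin each independent Julia instance to its own CPU with taskset.
# Assumes CPUs 0..n-1 map to distinct physical cores; verify the
# core-id layout with `lscpu -e` before trusting the numbers.
n=4
for i in $(seq 0 $((n - 1)))
do
    taskset -c "$i" julia -e "blas_set_num_threads(1); @time [svd(rand(1000,1000))[2][1] for i in 1:10]" &
done
wait
```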

@jasonmorton
Author

Yes, this is it. I get the same slowdown with bash, so I think you are right and it is just cache contention. If I change the workload to computing 1000 SVDs of 100x100 matrices instead, the speedup is much better, around 16.5x with 31 workers. I need to think much more carefully about cache behavior in my application. Thanks.
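For reference, a sketch of the smaller-matrix variant described above, in the same era syntax as the timings earlier in the thread:

```julia
# Same idea, but with many small problems per task: 1000 SVDs of
# 100x100 matrices instead of 10 SVDs of 1000x1000 matrices. A
# 100x100 Float64 matrix is about 80 KB, so each process's working
# set fits in its own cache instead of contending for shared L1/L2
# with a hyperthread sibling.
@time pmap(x -> [svd(rand(100,100))[2][1] for i in 1:1000], [i for i in 1:31])
```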
