

Problems with scaling pmap beyond 6-7 cores #11354

Closed
jasonmorton opened this issue May 19, 2015 · 4 comments
Labels
parallelism Parallel or distributed computation performance Must go faster

Comments

@jasonmorton

I'm working with a 16-core / 32-thread machine with 32 GB of RAM that presents to Ubuntu as 32 cores. I'm trying to understand how to get the best performance for embarrassingly parallel tasks, using a batch of SVDs computed in parallel as an example. The scaling seems to be perfect (6.6 seconds regardless of the number of SVDs) until about 7 or 8 simultaneous SVDs, at which point the time starts to creep up, scaling roughly linearly although with high variance, up to 22 seconds for 16 and 47 seconds for 31.

I can confirm (by watching htop) that the number of processors in use equals the number being pmapped over, so I don't think OpenBLAS multithreading is the issue. Memory usage stays low. Any guess as to what is going on? I'm using the generic Linux binary julia-79599ada44. I don't think there should be any sending of the matrices between processes, but perhaps that is the issue.

Probably I am missing something obvious.
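For reference, a minimal sketch (using the Julia 0.3/0.4-era API that this thread uses; current Julia spells it `LinearAlgebra.BLAS.set_num_threads`) that rules out BLAS threading explicitly by forcing single-threaded OpenBLAS on every process:

```julia
# Force OpenBLAS to one thread on the master and all workers, so each
# pmap task occupies exactly one core (Julia 0.3/0.4-era API).
@everywhere blas_set_num_threads(1)

# In this era, svd returns a (U, S, V) tuple; [2] is the vector of
# singular values.
@time pmap(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
```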

**** with nprocs = 16 ****
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 22.350466328 seconds (12292776 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 91.135322511 seconds (10269056672 bytes allocated, 2.57% gc time)

**** with nprocs = 31 ****

perfect scaling up to this point (a 6x speedup)

@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:6])
elapsed time: 6.720786336 seconds (159168 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:6])
elapsed time: 34.146665292 seconds (3847940044 bytes allocated, 2.46% gc time)
#4.5x speedup

@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 19.819358972 seconds (391056 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 90.688842475 seconds (10260844684 bytes allocated, 2.36% gc time)
#3.69x speedup

@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:nprocs()])
elapsed time: 47.411315342 seconds (738616 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:nprocs()])
elapsed time: 175.308752879 seconds (19880206220 bytes allocated, 2.34% gc time)

@tkelman tkelman added the parallelism Parallel or distributed computation label May 19, 2015
@simonster
Member

Same as #10427?

@simonster simonster added the performance Must go faster label May 19, 2015
@ViralBShah
Member

@amitmurthy showed me that even if you just run multiple independent Julia processes from bash, not communicating with each other at all, you get the same slowdown. So perhaps this has nothing to do with Julia itself.

@amitmurthy
Contributor

Try

for i in `seq 1 n`
do
    julia -e "blas_set_num_threads(1); svd(rand(100,100)); sleep(1.0); @time [svd(rand(1000,1000))[2][1] for i in 1:10]" &
done

replacing n in `seq 1 n` with the values 1, 2, 4, 8, etc.

On my 4-core, 8-thread laptop, I get 5.9, 7.4, 11.5, and 32.7 seconds for n = 1, 2, 4, and 8 respectively. Ideally I would expect roughly the same 5.9 seconds for 1, 2, and 4 parallel runs, since there are 4 physical cores, i.e. ignoring hyperthreading.

I suspect L1/L2 cache contention as the cause of the slowdown. Note that these are all independent Julia processes running concurrently; Julia's parallel infrastructure is not used here at all.
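One way to probe the contention hypothesis further (not something tried in this thread; the assumption that CPUs 0..n-1 are distinct physical cores should be checked with `lscpu -e` first) is to pin each process to its own core with `taskset`, so hyperthread siblings are never shared:

```shell
#!/bin/sh
# Pin each independent Julia instance to its own CPU with taskset.
# Assumes CPUs 0..n-1 map to distinct physical cores; verify the
# core-id layout with `lscpu -e` before trusting the numbers.
n=4
for i in $(seq 0 $((n - 1)))
do
    taskset -c "$i" julia -e "blas_set_num_threads(1); @time [svd(rand(1000,1000))[2][1] for i in 1:10]" &
done
wait
```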

@jasonmorton
Author

Yes, this is it. I get the same slowdown with bash, so I think you are right and it is just cache contention. If I change the workload to computing 1000 SVDs of 100x100 matrices instead, the speedup is much better, around 16.5x with 31 workers. I need to think much more carefully about cache behavior in my application. Thanks.
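For reference, a sketch of the smaller-matrix variant described above, in the same era syntax as the timings earlier in the thread:

```julia
# Same idea, but with many small problems per task: 1000 SVDs of
# 100x100 matrices instead of 10 SVDs of 1000x1000 matrices. A
# 100x100 Float64 matrix is about 80 KB, so each process's working
# set fits in its own cache instead of contending for shared L1/L2
# with a hyperthread sibling.
@time pmap(x -> [svd(rand(100,100))[2][1] for i in 1:1000], [i for i in 1:31])
```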
