I'm working with a 16-core / 32-thread machine with 32 GB RAM that presents to Ubuntu as 32 cores, and I'm trying to understand how to get the best performance for embarrassingly parallel tasks. As an example, I want to compute a bunch of SVDs in parallel. The scaling is essentially perfect (6.6 seconds regardless of the number of SVDs) up to about 7 or 8 simultaneous SVDs, at which point the time starts to creep up, growing roughly linearly (with high variance) to 22 seconds for 16 and 47 seconds for 31.
Watching htop, I can confirm that the number of processors in use equals the number being pmapped over, so I don't think OpenBLAS multithreading is the issue. Memory usage stays low. Any guess as to what is going on? I'm using the generic Linux binary julia-79599ada44. There shouldn't be any sending of the matrices between processes, but perhaps that is the issue.
Probably I am missing something obvious.
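For context, the timings below assume a setup roughly like the following (a sketch, not the exact code; the explicit single-threaded OpenBLAS call is there just to rule out BLAS threading, and uses the same blas_set_num_threads name as the bash snippet further down):

```julia
# Rough setup sketch (Julia 0.3-era syntax): add worker processes and force
# single-threaded OpenBLAS on each, so the only parallelism comes from pmap.
addprocs(16)                          # 16 for the first block below, 31 for the second
@everywhere blas_set_num_threads(1)   # same call as in the bash reproduction below
```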
**** with nprocs = 16 ****

@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 22.350466328 seconds (12292776 bytes allocated)

@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 91.135322511 seconds (10269056672 bytes allocated, 2.57% gc time)

**** with nprocs = 31 ****

perfect scaling until here (at 6x speedup)
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:6])
elapsed time: 6.720786336 seconds (159168 bytes allocated)

@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:6])
elapsed time: 34.146665292 seconds (3847940044 bytes allocated, 2.46% gc time)
# 4.5x speedup

@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 19.819358972 seconds (391056 bytes allocated)

@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:16])
elapsed time: 90.688842475 seconds (10260844684 bytes allocated, 2.36% gc time)
# 3.69x speedup

@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:nprocs()])
elapsed time: 47.411315342 seconds (738616 bytes allocated)

@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10],[i for i in 1:nprocs()])
elapsed time: 175.308752879 seconds (19880206220 bytes allocated, 2.34% gc time)
@amitmurthy showed me that even if you just run multiple independent Julia processes from bash, with no communication between them, you get the same slowdown. Perhaps this has nothing to do with Julia.
for i in `seq 1 n`
do
julia -e "blas_set_num_threads(1); svd(rand(100,100)); sleep(1.0); @time [svd(rand(1000,1000))[2][1] for i in 1:10]" &
done
replacing n in seq 1 n with the values 1, 2, 4, 8, etc.
On my 4-core / 8-thread laptop, I get 5.9, 7.4, 11.5, and 32.7 seconds for n = 1, 2, 4, and 8 respectively. Ideally I would expect roughly the same 5.9 seconds for 1, 2, and 4 parallel runs, since there are 4 physical cores (i.e., ignoring hyperthreading).
I suspect L1/L2 cache contention as the cause of the slowdown. Note that these are all independent Julia processes running concurrently; the Julia parallel infrastructure is not used here.
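A quick back-of-the-envelope check of the working set supports this (the cache sizes in the comments below are assumptions for a typical Intel chip; the real SVD footprint is larger because of U, V, and LAPACK workspace):

```julia
# Lower-bound working-set estimate per matrix, assuming Float64 elements.
matrix_bytes(n) = n^2 * sizeof(Float64)

matrix_bytes(1000) / 2^20   # ~7.6 MiB -- much larger than a per-core L2 (often ~256 KiB)
matrix_bytes(100)  / 2^10   # ~78 KiB  -- fits in L2, and many fit in a shared L3
```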
Yes, this is it. I get the same slowdown with bash, so I think you are right and it is just cache contention. If I change the workload to 1000 SVDs of 100x100 matrices, the speedup is much better, around 16.5x at 31 processes. I need to think much more carefully about cache in my application. Thanks.
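For reference, the variant that scaled well is along these lines (a sketch of the shape of the call, not the exact code from my application):

```julia
# Same pattern as above, but 1000 SVDs of 100x100 matrices per task instead of
# ten 1000x1000 SVDs; each matrix is only ~78 KiB, so the working set stays
# cache-resident and the speedup holds up to ~16.5x with 31 workers.
@time pmap(x->[svd(rand(100,100))[2][1] for i in 1:1000], [i for i in 1:31])
```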