[openblas] Build for high numbers of threads (4096) by default instead of 32 #3667
Conversation
Core counts are going up a lot nowadays - I suggest we use 128 max threads by default.

Note that this is only updating the common.jl file, which will not trigger the building of the libraries. That will get picked up in the next run. I wanted to open this PR to have discussion on the defaults and prepare for bumping things up whenever we are ready.
What is the consequence on startup time? |
Presumably it will also use more memory. What is a reliable way to test both? Startup time can probably be timed directly; is there a better way to look at real memory usage than just eyeballing it? |
Maybe:

```
# sudo apt install valgrind -y
valgrind --trace-children=yes --tool=massif julia -e "a=rand(5_000_000); BigFloat.(a)"
ms_print massif.out.* | less
```
|
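For the startup-time half of the question, here is a minimal sketch (not from the thread) that times repeated bare launches from Julia and reports the minimum, which tends to be the least noisy statistic:

```julia
# Time N cold launches of a bare julia process and summarize.
# Base.julia_cmd() reproduces the current julia invocation.
using Statistics

times = [@elapsed run(`$(Base.julia_cmd()) --startup-file=no -e '0'`) for _ in 1:20]
println("min = ", minimum(times), " s, median = ", median(times), " s")
```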
I am doing the timing above. With 8 threads, it is 0.24s. When I replace openblas with the version from OpenBLASHighCoreCounts_jll, I see no meaningful change in startup time or memory usage, but the machine I am testing on only goes to 40 cores. |
@staticfloat @chriselrod Any thoughts here? |
Losing 50ms is rather significant, IMO. I'd like to better understand why we're losing all that time. |
I think 50ms is not too bad given the performance boost (only on systems with large numbers of cores). As a user, I want Julia to start with as many threads as cores. I thought we lose time because of the openblas per-thread buffer allocation that happens when OpenBLAS is loaded. We could try to estimate the true number of cores and possibly use half as many threads (avoiding allocating one openblas thread per hyperthread), but I don't know how reliably one can do such a thing. |
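A minimal sketch of that heuristic, assuming 2-way SMT (which is not universally true); the names below are illustrative, not from the thread:

```julia
# Crude estimate of physical cores, assuming each core exposes two hyperthreads.
estimated_physical_cores = max(1, Sys.CPU_THREADS ÷ 2)

# Cap the OpenBLAS thread count at that estimate (and at the compiled-in maximum of 32).
blas_threads = min(estimated_physical_cores, 32)
```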
The impact on pure Julia startup time is on the order of 5ms on arctic1 (which is now several years old). |
I propose merging this, and using 512 for the HighCoreCounts version. Thoughts? |
After talking it over with Viral, I think we can come to a happy compromise:
```julia
ENV["OPENBLAS_NUM_THREADS"] = get(ENV, "OPENBLAS_NUM_THREADS", string(min(Sys.CPU_THREADS, 32)))
```

We should probably put that early in Julia's startup, before OpenBLAS is loaded. This way, we get a bounded default on big machines while still respecting a user-provided OPENBLAS_NUM_THREADS.
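For illustration, a hedged sketch of how that fallback behaves; the helper name is hypothetical, not part of the proposal:

```julia
# Hypothetical helper mirroring the proposed logic: honor an explicit
# OPENBLAS_NUM_THREADS, otherwise default to min(CPU threads, 32).
function openblas_thread_default(env::AbstractDict = ENV,
                                 cpu_threads::Integer = Sys.CPU_THREADS)
    return get(env, "OPENBLAS_NUM_THREADS", string(min(cpu_threads, 32)))
end

# On a 64-thread machine with no user setting, this yields "32";
# with OPENBLAS_NUM_THREADS=8 already set, it returns "8" unchanged.
openblas_thread_default(Dict{String,String}(), 64)                # "32"
openblas_thread_default(Dict("OPENBLAS_NUM_THREADS" => "8"), 64)  # "8"
```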
I've looked at |
We may want to merge JuliaLang/julia#42473 first, but I think these environment variables need to get set before we load OpenBLAS. |
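A minimal sketch of that ordering constraint, assuming you demonstrate it by launching a fresh process so the variable is visible before libopenblas initializes:

```julia
# libopenblas only consults OPENBLAS_NUM_THREADS when it initializes, so the
# variable must already be in the environment of the julia process at startup.
cmd = `$(Base.julia_cmd()) --startup-file=no -e 'using LinearAlgebra; @show BLAS.get_num_threads()'`
run(addenv(cmd, "OPENBLAS_NUM_THREADS" => "8"))  # should report 8 threads (hardware permitting)
```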
I think that sounds reasonable. How did you time the difference in startup times?

```
time OPENBLAS_NUM_THREADS=1 julia --startup=no -e '0'
```

seems too noisy for me to notice a difference. |
We should absolutely do physical core counts if we can get them reliably. Note you may often want to do |
I actually use
|
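The specific tool used above was not captured in the thread; a hedged sketch of one way to get physical core counts, assuming the Hwloc.jl package is available:

```julia
# Query physical cores (not hyperthreads) via hwloc, then derive a BLAS
# thread count from that. Hwloc.jl is an assumption here, not necessarily
# what the commenter uses.
using Hwloc

nphys = Hwloc.num_physical_cores()
ENV["OPENBLAS_NUM_THREADS"] = string(min(nphys, 32))
```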
I still don't see a marked difference. Minimum time seems slightly worse, while median and mean improve:

```
> OPENBLAS_NUM_THREADS=1 julia --startup=no -e 'using BenchmarkTools, LinearAlgebra; @show(BLAS.get_num_threads()); display(@benchmark run(`$(Base.julia_cmd()) --startup=no -e0`))'
BLAS.get_num_threads() = 1
BenchmarkTools.Trial: 42 samples with 1 evaluation.
 Range (min … max):  108.952 ms … 124.790 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     119.121 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   119.068 ms ±   4.407 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁    ▁ ▁       ▄     ▁    ▁ ▁  ▁▁ ▁     █
  ▆▁▁▁▁▆▁▆▁█▁▁▁▆▁▁▆▁▁▁▁▁▁▁█▆▁▁▁▁▁▁▁▆██▆█▆█▁█▁▁▆▁▁▆▆▁▁██▆▁█▆█▆▆▆ ▁
  109 ms          Histogram: frequency by time          125 ms <

 Memory estimate: 4.00 KiB, allocs estimate: 87.

> OPENBLAS_NUM_THREADS=1 julia --startup=no -e 'using BenchmarkTools, LinearAlgebra; @show(BLAS.get_num_threads()); display(@benchmark run(`$(Base.julia_cmd()) --startup=no -e0`))'
BLAS.get_num_threads() = 1
BenchmarkTools.Trial: 43 samples with 1 evaluation.
 Range (min … max):  108.923 ms … 129.074 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     118.254 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   117.794 ms ±   4.956 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█ ▃  █▃ ▃ █ █ ▃ ▃ █ █
  ▇▁██▁▁▁█▁▁▁▇▁▁▁▁▇▁██▁▁█▁▇▁█▁█▇▁█▇█▁▇▁▁▇▇▁█▁▇▁█▁▇▇▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  109 ms          Histogram: frequency by time          129 ms <

 Memory estimate: 4.00 KiB, allocs estimate: 87.

> OPENBLAS_NUM_THREADS=28 julia --startup=no -e 'using BenchmarkTools, LinearAlgebra; @show(BLAS.get_num_threads()); display(@benchmark run(`$(Base.julia_cmd()) --startup=no -e0`))'
BLAS.get_num_threads() = 28
BenchmarkTools.Trial: 44 samples with 1 evaluation.
 Range (min … max):  111.792 ms … 124.431 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.549 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   114.528 ms ±   2.479 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃ █ ▆▃ ▁ ▃
  ▇█▁▁▄█▄██▄█▁█▄▁▁▁▁▄▁▁▄▁▄▄▁▁▇▇▇▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  112 ms          Histogram: frequency by time          124 ms <

 Memory estimate: 4.00 KiB, allocs estimate: 87.

> OPENBLAS_NUM_THREADS=28 julia --startup=no -e 'using BenchmarkTools, LinearAlgebra; @show(BLAS.get_num_threads()); display(@benchmark run(`$(Base.julia_cmd()) --startup=no -e0`))'
BLAS.get_num_threads() = 28
BenchmarkTools.Trial: 44 samples with 1 evaluation.
 Range (min … max):  111.519 ms … 120.237 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.767 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   114.196 ms ±   2.124 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█ ▅
  █▅▁█▅█▅█▅▁██▁▁▁▅█▅█▅▁▁█▅▅▁▁▅▁▁▁▁▅▅▁▁▁▁█▅▁▅▁▁▁▁▁▁▁▅▁▁▁▁▁▁▅▁▁▁▅ ▁
  112 ms          Histogram: frequency by time          120 ms <

 Memory estimate: 4.00 KiB, allocs estimate: 87.
```

This is on a 14 core / 28 thread CPU. |
This looks very similar to what I was seeing; you're seeing a 3ms increase in minimum time (which, IMO, is the only thing we should be paying attention to, as all other variation is most likely due to other processes adding noise). If you had a 128 core machine, you'd probably see an increase closer to 10ms. |
So, how do we feel about merging this with 4096 max threads? :-) |
I would like someone to do a quick sanity check on Windows and macOS to see what happens if you build OpenBLAS with 4096 cores. It's not too hard; just build Julia via |
We'll certainly get mac testers, but who might be able to do this for Windows? |
I checked on Windows; it's about the same as on Linux. |
@staticfloat can you check the startup time impact when you use as many threads as cores? OpenBLAS by default will launch as many threads as hyperthreads, which may be much worse for startup time due to oversubscription. |
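As a quick sanity check (a sketch, not from the thread), one can compare the hyperthread count Julia sees with the thread count OpenBLAS actually launched:

```julia
using LinearAlgebra

@show Sys.CPU_THREADS          # logical CPUs (hyperthreads)
@show BLAS.get_num_threads()   # threads OpenBLAS launched (assuming no OPENBLAS_NUM_THREADS override)
```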
On mac with the current master, there's some makefile breakage:
Found a way forward, but also filed: |
I'm satisfied that this is reasonable now. Let's merge this, and then work on the Julia-side thread choice logic. |
Note that this only changes common.jl. We need a separate PR to actually build a new OpenBLAS with this, after which we do the version bump on julia master. Should we keep the OpenBLASHighCoreCounts build around, or get rid of it? |
[openblas] Build for high numbers of threads (4096) by default instead of 32 (JuliaPackaging#3667)

* [openblas] Build for 4096 threads by default

We're building with an extremely high thread limit, then limiting in Julia according to our hardware introspection.