[openblas] Build for high numbers of threads (4096) by default instead of 32 #3667

Merged: 2 commits, Oct 7, 2021


ViralBShah
Member

@ViralBShah ViralBShah commented Sep 30, 2021

Core counts are going up a lot nowadays - I suggest we use 128 max threads by default.

Note that this only updates the common.jl file, which will not trigger a rebuild of the libraries. The change will get picked up in the next run. I wanted to open this PR to discuss the defaults and prepare for bumping things up whenever we are ready.

@ViralBShah ViralBShah changed the title [openblas] Build for 128 threads by default [openblas] Build for 128 threads by default instead of 32 Sep 30, 2021
@vchuravy
Member

vchuravy commented Oct 1, 2021

What is the consequence on startup time?

@ViralBShah
Member Author

ViralBShah commented Oct 1, 2021

Presumably it will also use more memory. What is a reliable way to test both? Startup time can probably be measured with:

time ./julia -e exit

Is there a better way to look at real memory usage than just eyeballing htop?

@inkydragon
Contributor

inkydragon commented Oct 1, 2021

Maybe /usr/bin/time -v julia -e "a=rand(5_000_000); BigFloat.(a)"

Output
    Command being timed: "julia -e a=rand(5_000_000); BigFloat.(a)"
    User time (seconds): 0.99
    System time (seconds): 0.56
    Percent of CPU this job got: 142%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.09
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 828132
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 32552
    Voluntary context switches: 7
    Involuntary context switches: 16
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

or

# sudo apt install valgrind -y
valgrind --trace-children=yes --tool=massif  julia -e "a=rand(5_000_000); BigFloat.(a)"
ms_print massif.out.* | less
Output
--------------------------------------------------------------------------------
Command:            julia -e a=rand(5_000_000); BigFloat.(a)
Massif arguments:   (none)
ms_print arguments: massif.out.2674
--------------------------------------------------------------------------------


    MB
175.7^                                                                       @
     |        @:::::::@::::::::::::@:::::::::@:::::::::::##:::::::::::::@::::@
     |        @:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |        @:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |        @:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |     @::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |     @::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |     @::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |     @::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |     @::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |   @:@::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |   @:@::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |   @:@::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |   @:@::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |   @:@::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |   @:@::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |   @:@::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |   @:@::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |   @:@::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
     |   @:@::@:: : ::@:  ::::::   @:::::: : @:::        # :: :: :::::::@::::@
   0 +----------------------------------------------------------------------->Gi
     0                                                                   6.243

Number of snapshots: 64
 Detailed snapshots: [3, 5, 8, 16, 24, 33, 37 (peak), 53, 63]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1    113,336,545        3,633,576        3,607,086        26,490            0
  2    219,235,835        3,649,968        3,623,470        26,498            0
  3    324,720,416       97,585,728       97,427,879       157,849            0
99.84% (97,427,879B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->81.98% (80,000,000B) 0x55F76B7: jl_profile_init (signal-handling.c:294)
| ->81.98% (80,000,000B) 0xC0ADEFA: julia_init_55051.clone_1 (Profile.jl:60)
|   ->81.98% (80,000,000B) 0xC0ADCE3: julia___init___55048.clone_1 (in /home/woclass/packages/julias/julia-1.6/lib/julia/sys.so)
|     ->81.98% (80,000,000B) 0xC0ADCF5: jfptr___init___55049.clone_1 (in /home/woclass/packages/julias/julia-1.6/lib/julia/sys.so)
|       ->81.98% (80,000,000B) 0x5599459: _jl_invoke (gf.c:2237)
|         ->81.98% (80,000,000B) 0x5599459: jl_apply_generic (gf.c:2419)
|           ->81.98% (80,000,000B) 0x55D21C5: jl_apply (julia.h:1703)
|             ->81.98% (80,000,000B) 0x55D21C5: jl_module_run_initializer (toplevel.c:72)
|               ->81.98% (80,000,000B) 0x55B9ADB: _julia_init (init.c:794)
|                 ->81.98% (80,000,000B) 0x55F5A70: repl_entrypoint (jlapi.c:696)
|                   ->81.98% (80,000,000B) 0x4007A8: main (loader_exe.c:51)
|
....

@ViralBShah
Member Author

ViralBShah commented Oct 1, 2021

I am doing time ./julia -e "using LinearAlgebra; @show BLAS.get_num_threads(); exit" on Julia master (which is built for 32 max openblas threads).

With 8 threads, it is 0.24s
With 32 threads, it is 0.29s

When I replace openblas with the version from OpenBLASHighCoreCounts_jll, I see no meaningful change in startup time or memory usage, but the machine I am testing on only goes to 40 cores.

@ViralBShah
Member Author

ViralBShah commented Oct 1, 2021

@staticfloat @chriselrod Any thoughts here?

@staticfloat
Member

Losing 50ms is rather significant, IMO. I'd like to better understand why we're losing all that time.

@ViralBShah
Member Author

ViralBShah commented Oct 2, 2021

I think 50ms is not too bad given the performance boost (only on systems with large numbers of cores). As a user, I want Julia to start with as many threads as cores. I thought we lose time because of the openblas per-thread buffer allocation that happens when you dlopen it. @vtjnash probably knows this quite well.

We could try to estimate the true number of physical cores and use that many threads, i.e. typically half the hyperthread count (avoiding the allocation of one OpenBLAS thread per hyperthread), but I don't know how reliably one can do such a thing.
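As a rough illustration of the estimate discussed above, here is a minimal shell sketch that counts physical cores on Linux by de-duplicating (core, socket) pairs reported by lscpu. The helper name physical_cores is hypothetical and not part of this PR, and the approach assumes util-linux's lscpu is installed; how well this works in containers or on macOS and Windows is exactly the open reliability question.

```shell
#!/bin/sh
# Hypothetical helper (not from this PR): estimate the physical core count.
# Assumes Linux with util-linux `lscpu`; falls back to the logical CPU count.
physical_cores() {
    local n=""
    if command -v lscpu >/dev/null 2>&1; then
        # One line per logical CPU as "core,socket"; unique pairs = physical cores.
        n="$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')"
    fi
    # If lscpu is missing or returned nothing useful, use the logical CPU count.
    if [ -z "$n" ] || [ "$n" -lt 1 ]; then
        n="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
    fi
    echo "$n"
}

echo "estimated physical cores: $(physical_cores)"
```

The result could then be exported as OPENBLAS_NUM_THREADS before starting Julia, instead of letting OpenBLAS default to one thread per hyperthread.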

@ViralBShah
Member Author

The impact on pure Julia startup time is on the order of 5ms on arctic1 (which is now several years old).

@ViralBShah
Member Author

I propose merging this, and using 512 for the HighCoreCounts version. Thoughts?

@staticfloat
Member

After talking it over with Viral, I think we can come to a happy compromise:

  1. Investigate whether building OpenBLAS with high max core counts is feasible on all platforms, and if it is, get rid of OpenBLASHighCoreCounts_jll and just have a single OpenBLAS_jll that has its max CPU count set to something very high like 4096.

  2. Re-instate a Julia __init__() limit that does something like:

ENV["OPENBLAS_NUM_THREADS"] = get(ENV, "OPENBLAS_NUM_THREADS", string(min(Sys.CPU_THREADS, 32)))

We should probably put that in LinearAlgebra instead of Base.
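For readers who want the same behavior outside Julia, the one-liner above can be approximated in a launcher script. This is only a shell analogue under stated assumptions, not the Julia-side implementation: default_blas_threads is a hypothetical helper name, 32 is the cap from the proposal, and getconf _NPROCESSORS_ONLN stands in for Sys.CPU_THREADS.

```shell
#!/bin/sh
# Hypothetical helper: print the default OpenBLAS thread count for a machine
# with $1 logical CPUs, capped at $2 (defaults to 32, the cap proposed above).
default_blas_threads() {
    local ncpu="$1" cap="${2:-32}"
    if [ "$ncpu" -lt "$cap" ]; then
        echo "$ncpu"
    else
        echo "$cap"
    fi
}

# Logical CPU count; assumed stand-in for Julia's Sys.CPU_THREADS.
ncpu="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"

# Like get(ENV, key, default): keep any value the user already set
# (note "-" rather than ":-", so an explicitly empty value is kept too).
export OPENBLAS_NUM_THREADS="${OPENBLAS_NUM_THREADS-$(default_blas_threads "$ncpu" 32)}"
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```

Because the variable is only filled in when unset, a user can still run with 128 threads by exporting OPENBLAS_NUM_THREADS=128 beforehand.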

This way, we get:

  1. Reasonable performance by default (32 cores is enough for most machines).
  2. Reduced latency increase on manycore machines.
  3. The ability to override Julia's default limit and run with 128 cores if you really want to.

I've looked at Sys.maxrss() with a variety of different OPENBLAS_NUM_THREADS settings, and it doesn't appear to change that much (~1MB difference between 1 and 128 threads) so I don't think memory usage is an issue when setting extremely high core counts.

@ViralBShah
Member Author

We may want to merge JuliaLang/julia#42473 first - but I think these environment variables need to get set before we dlopen openblas. Is there a chance it gets dlopened earlier?

@ViralBShah ViralBShah changed the title [openblas] Build for 128 threads by default instead of 32 [openblas] Build for high numbers of threads (4096) by default instead of 32 Oct 5, 2021
@chriselrod

I think that sounds reasonable.
Eventually we should use physical core counts instead of threads (JuliaLang/LinearAlgebra.jl#671).

How did you time the difference in startup times?

time OPENBLAS_NUM_THREADS=1 julia --startup=no -e '0'

seems too noisy for me to notice a difference.

@ViralBShah
Member Author

ViralBShah commented Oct 5, 2021

We should absolutely use physical core counts if we can get them reliably. Note that you may often want to use /usr/bin/time for the timing instead.

@staticfloat
Member

I actually use @btime from BenchmarkTools, e.g.

@btime run(`./julia -e0`)

@chriselrod

chriselrod commented Oct 5, 2021

I still don't see a marked difference. Minimum time seems slightly worse, while median and mean improve:

> OPENBLAS_NUM_THREADS=1 julia --startup=no -e 'using BenchmarkTools, LinearAlgebra; @show(BLAS.get_num_threads()); display(@benchmark run(`$(Base.julia_cmd()) --startup=no -e0`))'
BLAS.get_num_threads() = 1
BenchmarkTools.Trial: 42 samples with 1 evaluation.
 Range (min … max):  108.952 ms … 124.790 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     119.121 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   119.068 ms ±   4.407 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁              ▁         ▁▄ ▁ ▁ ▁         ▁▁  ▁ █
  ▆▁▁▁▁▆▁▆▁█▁▁▁▆▁▁▆▁▁▁▁▁▁▁█▆▁▁▁▁▁▁▁▆██▆█▆█▁█▁▁▆▁▁▆▆▁▁██▆▁█▆█▆▆▆ ▁
  109 ms           Histogram: frequency by time          125 ms <

 Memory estimate: 4.00 KiB, allocs estimate: 87.

> OPENBLAS_NUM_THREADS=1 julia --startup=no -e 'using BenchmarkTools, LinearAlgebra; @show(BLAS.get_num_threads()); display(@benchmark run(`$(Base.julia_cmd()) --startup=no -e0`))'
BLAS.get_num_threads() = 1
BenchmarkTools.Trial: 43 samples with 1 evaluation.
 Range (min … max):  108.923 ms … 129.074 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     118.254 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   117.794 ms ±   4.956 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃█   ▃          █▃  ▃   █ █  ▃ ▃       █   █
  ▇▁██▁▁▁█▁▁▁▇▁▁▁▁▇▁██▁▁█▁▇▁█▁█▇▁█▇█▁▇▁▁▇▇▁█▁▇▁█▁▇▇▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  109 ms           Histogram: frequency by time          129 ms <

 Memory estimate: 4.00 KiB, allocs estimate: 87.

> OPENBLAS_NUM_THREADS=28 julia --startup=no -e 'using BenchmarkTools, LinearAlgebra; @show(BLAS.get_num_threads()); display(@benchmark run(`$(Base.julia_cmd()) --startup=no -e0`))'
BLAS.get_num_threads() = 28
BenchmarkTools.Trial: 44 samples with 1 evaluation.
 Range (min … max):  111.792 ms … 124.431 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.549 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   114.528 ms ±   2.479 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃   █ ▆▃ ▁ ▃
  ▇█▁▁▄█▄██▄█▁█▄▁▁▁▁▄▁▁▄▁▄▄▁▁▇▇▇▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  112 ms           Histogram: frequency by time          124 ms <

 Memory estimate: 4.00 KiB, allocs estimate: 87.

> OPENBLAS_NUM_THREADS=28 julia --startup=no -e 'using BenchmarkTools, LinearAlgebra; @show(BLAS.get_num_threads()); display(@benchmark run(`$(Base.julia_cmd()) --startup=no -e0`))'
BLAS.get_num_threads() = 28
BenchmarkTools.Trial: 44 samples with 1 evaluation.
 Range (min … max):  111.519 ms … 120.237 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.767 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   114.196 ms ±   2.124 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▅█      ▅
  █▅▁█▅█▅█▅▁██▁▁▁▅█▅█▅▁▁█▅▅▁▁▅▁▁▁▁▅▅▁▁▁▁█▅▁▅▁▁▁▁▁▁▁▅▁▁▁▁▁▁▅▁▁▁▅ ▁
  112 ms           Histogram: frequency by time          120 ms <

 Memory estimate: 4.00 KiB, allocs estimate: 87.

This is on a 14 core / 28 thread CPU.

@staticfloat
Member

This looks very similar to what I was seeing; you're seeing a 3ms increase in minimum time (which, IMO, is the only thing we should be paying attention to, as all other variation is most likely due to other processes adding noise). If you had a 128 core machine, you'd probably see an increase closer to 10ms.
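The minimum-of-N comparison described here can also be done without BenchmarkTools. The following is a minimal sketch, assuming GNU date (for the %N nanosecond format; on macOS this would need gdate from coreutils). The helper name min_time_ms and the julia paths in the usage comments are hypothetical, and the measurement includes the small overhead of spawning date itself.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print the minimum
# wall-clock time in milliseconds. Assumes GNU `date` (supports %N).
min_time_ms() {
    local runs="$1" best="" start end elapsed i=0
    shift
    while [ "$i" -lt "$runs" ]; do
        start="$(date +%s%N)"
        "$@" >/dev/null 2>&1
        end="$(date +%s%N)"
        elapsed=$(( (end - start) / 1000000 ))
        # Keep the smallest sample; larger ones are mostly scheduling noise.
        if [ -z "$best" ] || [ "$elapsed" -lt "$best" ]; then
            best="$elapsed"
        fi
        i=$((i + 1))
    done
    echo "$best"
}

# Usage (hypothetical paths), comparing two thread settings:
#   min_time_ms 40 env OPENBLAS_NUM_THREADS=1  ./julia --startup-file=no -e0
#   min_time_ms 40 env OPENBLAS_NUM_THREADS=28 ./julia --startup-file=no -e0
```

Reporting only the minimum follows the reasoning above: other processes can only make a run slower, so the fastest sample is the closest to the true cost.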

@ViralBShah
Member Author

So, how do we feel about merging this with 4096 max threads? :-)

@staticfloat
Member

I would like someone to do a quick sanity check on Windows and macOS to see what happens if you build OpenBLAS with 4096 cores. It's not too hard: just build Julia via make USE_BINARYBUILDER_OPENBLAS=0 after having changed these lines to set the limit to 4096. Then compare the memory usage there with that of a normal Julia build (startup time would be good to know as well).

@ViralBShah
Member Author

We'll certainly get mac testers, but who might be able to do this for Windows?

@staticfloat
Member

I checked on Windows; it's about the same as on Linux.

@ViralBShah
Member Author

ViralBShah commented Oct 6, 2021

@staticfloat can you check startup time impact when you use as many threads as cores? OpenBLAS by default will launch as many threads as hyperthreads which may be much worse for startup time impact due to oversubscribing.

@ViralBShah
Member Author

ViralBShah commented Oct 6, 2021

On mac with the current master, there's some makefile breakage:

➜  julia git:(master) ✗ make USE_BINARYBUILDER_OPENBLAS=0 VERBOSE=1
make[1]: *** No rule to make target `scratch/objconv/build-compiled', needed by `scratch/openblas-d909f9f3d4fc4ccff36d69f178558df154ba1002/build-compiled'.  Stop.
make: *** [julia-deps] Error 2

Found a way forward, but also filed:
JuliaLang/julia#42519

@staticfloat
Member

I'm satisfied that this is reasonable now. Let's merge this, and then work on the Julia-side thread choice logic.

@staticfloat staticfloat merged commit 433dc26 into master Oct 7, 2021
@staticfloat staticfloat deleted the vs/openblas-threads branch October 7, 2021 22:23
@ViralBShah
Member Author

Note that this only changes common.jl. We need a separate PR to actually build a new OpenBLAS with this, after which we do the version bump on julia master. Should we keep the openblas high core counts around - or get rid of it?

simeonschaub pushed a commit to simeonschaub/Yggdrasil that referenced this pull request Feb 23, 2022
…d of 32 (JuliaPackaging#3667)

* [openblas] Build for 4096 threads by default

We're building with an extremely high thread limit, then limiting in Julia according to our hardware introspection
5 participants