BLAS threads should default to physical not logical core count? #671
Interestingly, on Linux I checked our benchmarking server, and it's correct there, so this may be an issue with Windows.
This doesn't seem like a Windows-only problem. My Linux laptop has four physical cores.
That shows? It seems RecursiveFactorization.jl is only faster when the BLAS is not well-tuned, but I don't understand how what you show is related, @macd. Could you show the 8-thread version with MKL, to see whether MKL is also not smart when the setting is at the logical count?
That explains a lot. Thanks
Ahh yes, so it looks like it's just smart and doesn't actually run more threads, or something like that. Since OpenBLAS doesn't do that, this would account for a good chunk of the difference given the default settings.
Can we easily detect the number of actual cores versus logical cores? We renamed the old `Sys.CPU_CORES` to `Sys.CPU_THREADS` precisely because the value it reports is the number of logical threads, not physical cores.
Some possibilities for figuring out the number of physical cores: we may be able to borrow something from https://github.com/m-j-w/CpuId.jl
Another possible library for figuring out the number of cores was also suggested by @RoyiAvital.
Just as a note, we already do a bunch of cpuid stuff to support picking functions optimized for a feature set at runtime. Perhaps extending that to pick out the number of cores, instead of bundling a whole other library, is more "clean".
I'm not very familiar with things this close to hardware, but IIUC CPUID is not the full story, and I suppose you need to run the instruction on each core/socket. Also, I noticed that there is an issue tracked in libuv (libuv/libuv#1967) for adding support for detecting physical cores. Meanwhile, I wrote a simple helper Python script to launch Julia with appropriate thread settings.
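That script was elided here; as a stand-in, a minimal sketch of the same idea in Julia (the value 4 is a placeholder for this machine's physical core count, not something detected automatically):

```julia
# Launch a fresh Julia process with OPENBLAS_NUM_THREADS pinned, so that
# OpenBLAS starts with a sane thread count. The "4" is a placeholder.
run(addenv(`julia`, "OPENBLAS_NUM_THREADS" => "4"))
```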
It seems MKL limits the maximum number of threads to the number of physical cores (here an i9-9900K with 8 cores):

```julia
julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
8

julia> ccall((:mkl_set_num_threads, "libmkl_rt"), Cvoid, (Ptr{Int32},), Ref(Int32(16)))

julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
8

julia> ccall((:mkl_set_num_threads, "libmkl_rt"), Cvoid, (Ptr{Int32},), Ref(Int32(4)))

julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
4
```
Is this stuff done automatically in Julia 1.5?
Not that I'm aware of. No one ever figured out how to get the right number of threads.
I am sure this has already been suggested, but what if there were a temporary solution of just taking half the current number, since that would be the right calculation in most cases? Even if AMD processors count logical cores differently than Intel (maybe they don't?), it seems better to choose a default which matches the higher market share for now, so that MKL vs. OpenBLAS speed is more comparable. Others can always increase the number of cores if they wish.
I think it just has to be handled via whatever API the OS gives. For example, on Linux, you'd want to make it cgroup-aware rather than using the hardware information directly.
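The example that originally followed didn't survive extraction; here is a rough sketch of the simplest (non-cgroup-aware) version of that idea, parsing the topology Linux exposes in sysfs. This is a reconstruction, not the original code:

```julia
# Count physical cores on Linux by collecting the unique
# (package, core) pairs from /sys/devices/system/cpu/*/topology.
function physical_core_count()
    base = "/sys/devices/system/cpu"
    cores = Set{Tuple{String,String}}()
    for d in filter(x -> occursin(r"^cpu\d+$", x), readdir(base))
        topo = joinpath(base, d, "topology")
        isdir(topo) || continue  # offline CPUs may lack topology entries
        pkg  = String(strip(read(joinpath(topo, "physical_package_id"), String)))
        core = String(strip(read(joinpath(topo, "core_id"), String)))
        push!(cores, (pkg, core))
    end
    return length(cores)
end
```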
We (probably) don't want to make all of the CpuId package a dependency of Julia, so in order to use that, we would have to extract a minimal piece of it that lets us get the number of cores. Short of that, defaulting to half the number of CPU threads is probably better than saturating the cores. Of course, there are computations other than BLAS, and they generally do benefit from hyperthreads, so maybe not. And if anyone has gone to the trouble of disabling hyperthreading, they're suddenly going to see their BLAS-intensive Julia code get half the FLOPS, so not great.
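For concreteness, that heuristic is a one-liner (a sketch, with exactly the caveats above):

```julia
# Default BLAS threads to half the logical CPUs: right on typical 2-way
# SMT machines, but halves throughput where hyperthreading is disabled.
using LinearAlgebra
BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2))
```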
Having libuv support this would be nice, but that's a feature-request issue with no PR to implement it, whereas we already have Julia code that uses CPUID to detect the number of cores and threads, so it seems better to do that for now. If libuv ever adds the feature, we can always switch to using that instead.
@KristofferC's observation that we already do a bunch of CPU detection stuff in C is actually the most promising avenue, imo: write C code that figures out the right number of CPU cores and threads, expose that number as a C global or C function, and set the relevant Julia globals from it. So someone "just" needs to write the C code that does this now.
My point was that CPUID does not give us the correct information. I think it's better to parse what the OS reports.
How does it not agree?
Is there a technical necessity for a hard upper limit (I don't know the technical depths here), or could the user be allowed to set any `OPENBLAS_NUM_THREADS` value they like, in the future?
It is a build-time option for OpenBLAS, and we try to be conservative because a default that is too high leads to significant memory consumption. I don't know whether recent versions have fixed this, and whether we can set the number to something big, like 256.
Ah, I see! Grrr, why oh why does OpenBLAS need to know that at build time... :-)
Relevant: https://github.com/xianyi/OpenBLAS/blob/develop/USAGE.md

While eventually we should integrate OpenBLAS threading with Julia's, I wonder if we can solve the current issue by moving to OpenMP. That would have the benefit of also playing nicely with the existing OpenMP-compiled libraries in BinaryBuilder.
Which thread scheduler to use does not seem to have anything to do with buffer allocation. That should be fixed separately.
I wonder if BLAS actually does some code generation based on the max number of threads? I'd guess it's not a compile-time option on a whim.
No code generation. AFAICT it uses stack buffers for cheap and reentrant allocation. The easiest first step to try would be to just use VLAs (variable-length arrays) where supported.
With Julia 1.6, I assume this is still an issue? Is there a current idiot-proof best practice to tell users to copy/paste into their code to pick a better default value?
```julia
import Hwloc
ncores = Hwloc.num_physical_cores()
```
@ChrisRackauckas Didn't you sometimes find that even a lower value was typically better, or is my memory foggy?
That works. The issue is that Hwloc isn't what Julia uses. @ViralBShah's response was: shouldn't we just ship Hwloc with Julia then? @StefanKarpinski's response is that it's pretty heavyweight, so it would be good to pull out just what we need. And that's where the issue has sat since. Someone just needs to put the work in.
OpenBLAS doesn't always ramp up well, especially for operations other than gemm (like LU), so if your matrices aren't that large it's often faster to set it to a single thread. MKL doesn't have this problem, while BLIS is pretty extreme about it even for gemm.

For x86, we could also strip the code from CpuId.jl:

```julia
#
# TODO:
#   The following llvmcall routines fail when being inlined!
#   Hence the @noinline.
#

# Low-level cpuid call, taking eax=leaf and ecx=subleaf,
# returning eax, ebx, ecx, edx as NTuple{4,UInt32}
@noinline function cpuid_llvm(leaf::UInt32, subleaf::UInt32)
    Base.llvmcall("""
        ; leaf = %0, subleaf = %1, %2 is some label
        ; call 'cpuid' with arguments loaded into registers EAX = leaf, ECX = subleaf
        %3 = tail call { i32, i32, i32, i32 } asm sideeffect "cpuid",
            "={ax},={bx},={cx},={dx},{ax},{cx},~{dirflag},~{fpsr},~{flags}"
            (i32 %0, i32 %1) #2
        ; retrieve the result values and convert to vector [4 x i32]
        %4 = extractvalue { i32, i32, i32, i32 } %3, 0
        %5 = extractvalue { i32, i32, i32, i32 } %3, 1
        %6 = extractvalue { i32, i32, i32, i32 } %3, 2
        %7 = extractvalue { i32, i32, i32, i32 } %3, 3
        ; return the values as a new tuple
        %8  = insertvalue [4 x i32] undef, i32 %4, 0
        %9  = insertvalue [4 x i32] %8,    i32 %5, 1
        %10 = insertvalue [4 x i32] %9,    i32 %6, 2
        %11 = insertvalue [4 x i32] %10,   i32 %7, 3
        ret [4 x i32] %11
        """
        # llvmcall requires actual types, rather than the usual (...) tuple
        , NTuple{4,UInt32}, Tuple{UInt32,UInt32}, leaf, subleaf)
end

function cpuid(leaf = 0, subleaf = 0)
    # for some reason, we need a dedicated local
    # variable of UInt32 for llvmcall to succeed
    l, s = UInt32(leaf), UInt32(subleaf)
    cpuid_llvm(l, s)::NTuple{4,UInt32}
end

@inline function hasleaf(leaf::UInt32)::Bool
    eax, ebx, ecx, edx = cpuid(leaf & 0xffff_0000)
    eax >= leaf
end

function cpucores()::Int
    leaf = 0x0000_000b
    hasleaf(leaf) || return zero(UInt32)
    # The number of cores reported by cpuid is actually already the total
    # number of cores at that level, including all of the lower levels.
    # Thus, we need to find the highest level... which is 0x02 == "Core"
    # in ecx[15:08] per the specs, and divide it by the number of
    # 0x01 == "SMT" logical cores.
    sl = zero(UInt32)
    nc = zero(UInt32)   # "Core" count
    nl = zero(UInt32)   # "SMT" count
    while true
        # ebx[15:0] must be non-zero according to the manual
        eax, ebx, ecx, edx = cpuid(leaf, sl)
        ebx & 0xffff == 0x0000 && break
        sl += one(UInt32)
        lt = ((ecx >> 8) & 0xff) & 0x03
        # Let's assume for now there's only one valid entry for each level
        lt == 0x01 && (nl = ebx & 0xffff; continue)
        lt == 0x02 && (nc = ebx & 0xffff; continue)
        # others are invalid and shouldn't be considered
    end
    return iszero(nc) ?  # no cores detected? then maybe it's AMD
        # AMD
        ((cpuid(0x8000_0008)[3] & 0x00ff) + 1) :
        # Intel: we need nonzero values of nc and nl
        (iszero(nl) ? nc : nc ÷ nl)
end
```

This works:

```julia
julia> cpucores()
4

julia> versioninfo()  # the 1165G7 has 4 cores
Julia Version 1.7.0-DEV.581
Commit d524f21917* (2021-02-19 22:06 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8
```

I think Hwloc is clever about only reporting the number of cores a VM is given (IIRC; I haven't looked at it closely). Hwloc's x86 code is probably more rigorous and better tested than the above.
I'm marking this for triage so that we can discuss whether @chriselrod's implementation here is something we can go ahead with. This would be very nice to finally fix, since the current default is a rather unfortunate performance trap. While we're at it, perhaps we can discuss whether Julia should default to having as many threads as cores.
A good default that would also suit larger/shared systems might be nthreads equal to the number of physical cores in a single NUMA domain. Not sure how easy that is to get cross-platform without additional libs, though.
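A sketch of that suggestion using Hwloc.jl (assuming its `num_physical_cores` and `num_numa_nodes` queries; again, not something Julia itself ships):

```julia
# Physical cores per NUMA domain as a thread-count default.
import Hwloc
ncores   = Hwloc.num_physical_cores()
ndomains = max(1, Hwloc.num_numa_nodes())
default_nthreads = max(1, ncores ÷ ndomains)
```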
Triage says 👍 but we would like to implement this inside `processor.cpp`.
Where did this end up in 1.7? Do we still need custom logic to choose processors?
JuliaLang/julia#55574 makes this better (but perhaps not a fix for using only physical cores).
Hi, just to follow up: is this issue solved? Does BLAS default to the physical core count as of Julia 1.11.1? And is the physical core count truly better than the logical count? I just ran some tests on my computer (CPU: i7-7700): `Sys.CPU_THREADS` returns 8 and `BLAS.get_num_threads()` returns 4. For my test code the default is indeed faster than forcing `BLAS.set_num_threads(8)`. I'm wondering if this result can be trusted. Thanks!
Technically no, as Julia itself still has no clue about the number of physical cores.

Without reference to the workload you have in mind, this question doesn't make sense. I'm assuming you're thinking of a heavily compute-bound workload, where trying to use more threads than physical cores is indeed problematic because it oversubscribes the resources. For a much lighter workload, using more threads than physical cores shouldn't be a problem; in some cases it could even be beneficial.

That's simply because, except on aarch64-darwin, by default Julia blindly sets the number of BLAS threads to half of the total available threads, which is normally `Sys.CPU_THREADS` unless you constrain Julia to use fewer threads with affinity settings: https://github.com/JuliaLang/julia/blob/be0ce9dbf0597c1ff50fc73f3d197a19708c4cd3/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L834 But as I said above, there's still no notion of physical vs. logical threads.
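In other words, the relationship is easy to check (the values shown are for the i7-7700 above; they will differ per machine):

```julia
# Julia's logical CPU count, and the BLAS thread default derived from it.
using LinearAlgebra
@show Sys.CPU_THREADS          # logical CPUs, e.g. 8 on an i7-7700
@show BLAS.get_num_threads()   # half of that by default, e.g. 4
```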
Does Hwloc give an accurate notion of the number of threads?
Yes, but as Stefan said above, it's always problematic to add new binary libraries as Julia dependencies, partly because we're stuck with that version forever for a given Julia version: needing hwloc at startup would make updating it as a stdlib even more cumbersome than stdlib updates already are.
On an i7-8550U, OpenBLAS defaults to 8 threads. I was comparing to RecursiveFactorization.jl, and the performance looked like the following (benchmark plots for 1, 4, and 8 threads appeared here).
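A hedged sketch of the kind of measurement being described (not the original benchmark; the 512×512 size and the use of `lu` are illustrative choices):

```julia
# Time an LU factorization at different OpenBLAS thread counts.
using LinearAlgebra, BenchmarkTools
A = rand(512, 512)
for n in (1, 4, 8)
    BLAS.set_num_threads(n)
    t = @belapsed lu($A)
    println(n, " thread(s): ", round(t * 1e3, digits = 2), " ms")
end
```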
Conclusion: the default that Julia chooses, 8 threads, is the worst, with 1 thread doing better. But using the number of physical cores, 4, is best.
So there have been a lot of issues on Discourse and in Slack #gripes where the takeaway was essentially "setting BLAS threads to 1 is better than the default!", but it looks like that's because the default should be the number of physical, not logical, threads. I am actually very surprised it's not set that way, so I was wondering why, and also where this default is set (I couldn't find it).