BLAS threads should default to physical not logical core count? #671

Open · ChrisRackauckas opened this issue Sep 28, 2019 · 55 comments
Labels: multithreading (Base.Threads and related functionality)

@ChrisRackauckas (Member) commented Sep 28, 2019

On an i7-8550u, OpenBLAS is defaulting to 8 threads. I was comparing against RecursiveFactorization.jl and saw performance like this:

using BenchmarkTools
import LinearAlgebra, RecursiveFactorization

# Query OpenBLAS's current thread count, then pin it (here to 4).
ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
LinearAlgebra.BLAS.set_num_threads(4)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

# Approximate flop count of the LU factorization of an m×n matrix.
luflop(m, n) = n^3÷3 - n÷3 + m*n^2
luflop(n) = luflop(n, n)

bas_gflops = Float64[]
rec_gflops = Float64[]
ns = 50:50:800
for n in ns
    A = rand(n, n)
    bt = @belapsed LinearAlgebra.lu!($(copy(A)))
    rt = @belapsed RecursiveFactorization.lu!($(copy(A)))
    push!(bas_gflops, luflop(n)/bt/1e9)
    push!(rec_gflops, luflop(n)/rt/1e9)
end

using Plots
plt = plot(ns, bas_gflops, legend=:bottomright, lab="OpenBLAS", title="LU Factorization Benchmark", marker=:auto)
plot!(plt, ns, rec_gflops, lab="RecursiveFactorization", marker=:auto)
xaxis!(plt, "size (N x N)")
yaxis!(plt, "GFLOPS")

1 thread:

[Plot: LU factorization benchmark, 1 BLAS thread]

4 threads:

[Plot: LU factorization benchmark, 4 BLAS threads]

8 threads:

[Plot: LU factorization benchmark, 8 BLAS threads]

Conclusion: the default that Julia chooses, 8 threads, is the worst; even 1 thread does better. Using the number of physical cores, 4, is best.

So there have been a lot of threads on Discourse and in Slack #gripes saying essentially "setting BLAS threads to 1 is better than the default!", but it looks like that's because the default should be the number of physical cores, not logical threads. I am actually very surprised it's not set that way, so I'm wondering why, and also where this default is set (I couldn't find it).

@ChrisRackauckas (Member Author)

Interestingly, I checked our Linux benchmarking server and saw:

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
stepping        : 0
microcode       : 0x3d
cpu MHz         : 2394.455
cache size      : 35840 KB
physical id     : 30
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 30
initial apicid  : 30
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat spec_ctrl intel_stibp flush_l1d arch_capabilities
bogomips        : 4788.91
clflush size    : 64
cache_alignment : 64
address sizes   : 42 bits physical, 48 bits virtual
power management:

[crackauckas@pumas ~]$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0 (2019-08-20)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
8

and it's correct there, so this may be a Windows-specific issue.

@YingboMa

This doesn't seem like a Windows-only problem.

julia> versioninfo(verbose=true)
Julia Version 1.3.0-rc2.0
Commit a04936e3e0 (2019-09-12 19:49 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 19.04
  uname: Linux 5.0.0-29-generic #31-Ubuntu SMP Thu Sep 12 13:05:32 UTC 2019 x86_64 x86_64
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz:
              speed         user         nice          sys         idle          irq
       #1  2700 MHz     384385 s       2674 s     128341 s    7458186 s          0 s
       #2  2700 MHz     381396 s       3631 s     125158 s    1466384 s          0 s
       #3  2700 MHz     304236 s       3488 s      99122 s    1470836 s          0 s
       #4  2700 MHz     421380 s       4702 s     137860 s    1454895 s          0 s
       #5  2700 MHz     192528 s       2218 s      83784 s    1472282 s          0 s
       #6  2700 MHz     267616 s       1661 s      82347 s    1470275 s          0 s
       #7  2700 MHz     309143 s       1778 s      98877 s    1478079 s          0 s
       #8  2700 MHz     287978 s       3788 s      85621 s    1470172 s          0 s

  Memory: 15.530269622802734 GB (4015.9765625 MB free)
  Uptime: 238530.0 sec
  Load Avg:  0.39208984375  0.30615234375  0.16015625
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  HOME = /home/scheme
  OPTFLAGS = -load=/home/scheme/julia/usr/lib/libjulia.so
  PATH = /home/scheme/.cargo/bin:/home/scheme/Downloads/VSCode-linux-x64/code-insiders:/sbin:/home/scheme/.cargo/bin:/home/scheme/Downloads/VSCode-linux-x64/code-insiders:/sbin:/home/scheme/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
  TERM = screen-256color
  WINDOWPATH = 2

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
8

My Linux laptop has four physical cores.

@macd commented Sep 28, 2019

Here is a comparison of the 4-thread case on an Intel i7-6770HQ (Skull Canyon NUC) using Intel MKL:

[Plot: 4-thread LU benchmark including MKL]

@ChrisRackauckas (Member Author) commented Sep 28, 2019

That shows RecursiveFactorization is only faster when the BLAS is not well-tuned, but I don't understand how what you show relates to this issue, @macd. Could you show the 8-thread version of MKL, to see whether MKL is also not smart when the setting is at the logical count?

@macd commented Sep 28, 2019

Here are the MKL results as a function of the number of threads on the Skull Canyon NUC, which has 4 cores and 8 hardware threads. There is clearly a stumble with 8 threads around n = 500, but it is still better than 4. Apparently Intel knows how to use the hyperthreads more effectively.

[Plot: MKL results by thread count]

@ChrisRackauckas (Member Author)

That explains a lot. Thanks

@macd commented Sep 29, 2019

So I have egg all over my face on this one. I just looked at the code in the window I left open this morning and saw that I mistakenly used the loop index, rather than the indexed value, as the number of threads. The graph above is therefore for 1, 2, 3, and 4 threads. The following graph shows no significant difference between 4 and 8 threads. (I never was an early-morning person.)

[Plot: corrected MKL results by thread count]

@ChrisRackauckas (Member Author)

Ah yes, so it looks like MKL is just smart and doesn't actually run more threads than is useful. Since OpenBLAS doesn't do that, this would account for a good chunk of the difference under the default settings.

@JeffBezanson JeffBezanson added the multithreading Base.Threads and related functionality label Oct 1, 2019
@StefanKarpinski (Member)

Can we easily detect the number of physical cores versus logical cores? Fortunately, we renamed the old Sys.CPU_CORES to Sys.CPU_THREADS, so the name Sys.CPU_CORES is available for that value, but I'm not sure how best to compute it.

@chriselrod (Contributor) commented Oct 1, 2019

Is lscpu reliably available on Linux machines?
It correctly identified 10 cores per socket and 2 threads per core here:

> lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              20
On-line CPU(s) list: 0-19
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
Stepping:            4
CPU MHz:             3980.712
CPU max MHz:         4500.0000
CPU min MHz:         1200.0000
BogoMIPS:            6600.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K
NUMA node0 CPU(s):   0-19
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

> lscpu | grep "Core(s) per socket"
Core(s) per socket:  10
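
For reference, a minimal sketch of wrapping this in Julia (Linux-only; assumes lscpu is on PATH; the function name is illustrative, not an existing API):

# Count unique (core, socket) pairs in lscpu's machine-readable output.
function physical_cores_from_lscpu()
    out = read(`lscpu -p=CORE,SOCKET`, String)
    rows = (l for l in split(out, '\n') if !isempty(l) && !startswith(l, "#"))
    return length(Set(rows))
end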

@StefanKarpinski (Member) commented Oct 1, 2019

Some possibilities for figuring out the number of physical cores:

We may be able to borrow something from https://github.com/m-j-w/CpuId.jl

@StefanKarpinski (Member)

Another possible library for figuring out the number of cores (from @RoyiAvital):

@KristofferC (Member)

Just as a note, we already do a bunch of cpuid work to support picking functions optimized for a feature set at runtime, e.g.

https://github.com/JuliaLang/julia/blob/cf544e6afab38351eeac049af1956153e24ec99c/src/processor_x86.cpp#L5-L28

Perhaps extending that to pick out the number of cores, instead of bundling a whole other library, would be cleaner.

@tkf (Member) commented Jan 11, 2020

I'm not very familiar with things this close to the hardware, but IIUC CPUID is not the full story: I suppose you need to run the instruction on each core/socket. Does Julia have a facility to do that on all supported platforms?

Also, I noticed there is an issue tracked in libuv (libuv/libuv#1967) for adding support for physical cores to uv_cpu_info (which is called via Sys.cpu_info). I guess it'd be ideal if this were supported by libuv.

BTW, it looks like uv_cpu_info just consults the OS's API (e.g., /proc/cpuinfo on Linux):
https://github.com/JuliaLang/libuv/blob/julia-uv2-1.29.1/src/unix/linux-core.c
https://github.com/JuliaLang/libuv/blob/julia-uv2-1.29.1/src/unix/darwin.c
https://github.com/JuliaLang/libuv/blob/julia-uv2-1.29.1/src/win/util.c
Maybe another option is to call such APIs directly from Julia? For example, parsing /proc/cpuinfo sounds very easy.
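
A minimal sketch of that approach (Linux-only; assumes the "physical id" and "core id" fields are present, which is not guaranteed on every architecture; the function name is illustrative):

function cores_from_proc_cpuinfo()
    seen = Set{Tuple{String,String}}()   # unique (physical id, core id) pairs
    phys = ""
    for line in eachline("/proc/cpuinfo")
        startswith(line, "physical id") && (phys = strip(last(split(line, ':'))))
        startswith(line, "core id") && push!(seen, (phys, strip(last(split(line, ':')))))
    end
    return isempty(seen) ? Sys.CPU_THREADS : length(seen)
end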

FYI: meanwhile, I wrote a simple helper Python script to launch Julia with an appropriate JULIA_CPU_THREADS: https://github.com/tkf/julia-t. It does the equivalent of lscpu -p | grep -v '^#' | sort --unique --field-separator=, --key=2 | wc -l to count physical cores. It's useful together with Distributed.addprocs.

@KristofferC (Member)

It seems MKL caps the maximum number of threads at the number of physical cores (here an i9-9900K with 8 cores):

julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
8

julia> ccall((:mkl_set_num_threads, "libmkl_rt"), Cvoid, (Ptr{Int32},), Ref(Int32(16)))

julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
8

julia> ccall((:mkl_set_num_threads, "libmkl_rt"), Cvoid, (Ptr{Int32},), Ref(Int32(4)))

julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
4
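
For contrast, OpenBLAS applies no such physical-core clamp; a request for more threads than cores is honored up to its build-time maximum. A hedged check (expected behavior, not a recorded session):

julia> using LinearAlgebra

julia> BLAS.set_num_threads(16)

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())  # expect 16, not a clamp to 8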

@jlperla commented May 24, 2020

Is this stuff done automatically in Julia 1.5?

@StefanKarpinski (Member)

Not that I'm aware of. No one ever figured out how to get the right number of threads.

@jlperla commented May 26, 2020

I am sure this has already been suggested, but what if there were a temporary solution of just taking half the current number, since that would be the right calculation in most cases?

Even if AMD processors count logical cores differently than Intel (maybe they don't?), it seems better for now to choose a default matching the higher market share, so that MKL vs. OpenBLAS speed is more comparable. Others can always increase the number of threads if they wish.
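
The proposed stopgap as a one-liner, for reference (a sketch of the suggestion above):

using LinearAlgebra
BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2))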

@tkf (Member) commented May 26, 2020

> No one ever figured out how to get the right number of threads.

I think it just has to be handled via whatever API the OS provides. For example, on Linux you'd want to make it cgroup-aware rather than using the hardware information directly.

$ sudo docker run --cpuset-cpus=0-2 -i -t --rm julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.1 (2020-04-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(@v1.4) pkg> add CpuId
...

julia> using CpuId
[ Info: Precompiling CpuId [adafc99b-e345-5852-983c-f28acb93d879]

julia> cpucores()
4

julia> cputhreads()
8

shell> cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-2

julia> Sys.CPU_THREADS
8

Here, launching with julia -t 3 (so that Threads.nthreads() == 3) sounds like the best approach.

Parsing /proc and /sys sounds easy to do for Linux. I'm not sure about other OSes, though.

(Edit: fix the example to use --cpuset-cpus=0-2 instead of --cpuset-cpus=0-3)
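
A sketch of reading the cpuset from Julia (cgroup v1 layout, as in the session above; cgroup v2 exposes /sys/fs/cgroup/cpuset.cpus.effective instead; the function name is illustrative):

function allowed_cpus(path = "/sys/fs/cgroup/cpuset/cpuset.cpus")
    isfile(path) || return Sys.CPU_THREADS
    n = 0
    for part in split(strip(read(path, String)), ',')   # e.g. "0-2" or "0,4-7"
        ends = split(part, '-')
        n += length(ends) == 2 ? parse(Int, ends[2]) - parse(Int, ends[1]) + 1 : 1
    end
    return n
end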

@StefanKarpinski (Member)

We (probably) don't want to make all of the CpuId package a dependency of Julia, so in order to use that, we would have to extract a minimal piece of it that lets us get the number of cores and use that.

Short of that, defaulting to half the number of CPU threads is probably better than saturating the cores. Of course, there are computations other than BLAS, and those generally do benefit from hyperthreads, so maybe not. And if anyone has gone to the trouble of disabling hyperthreading, they'll suddenly see their BLAS-intensive Julia code get half the FLOPS, which is not great.

@StefanKarpinski (Member)

Having libuv support this would be nice, but that's a feature-request issue with no PR to implement it, whereas we already have Julia code that uses CPUID to detect the number of cores and threads, so it seems better to do that for now. If libuv ever adds the feature, we can always switch to using that instead.

@StefanKarpinski (Member)

@KristofferC's observation that we already do a bunch of CPU detection in C is actually the most promising avenue, imo: write C code that figures out the right number of CPU cores and threads, expose that number as a C global or function, and set the relevant Julia globals from it. So someone "just" needs to write C code that does this now.

@tkf (Member) commented May 27, 2020

My point was that CPUID does not give us the correct information. I think it's better to parse /proc and/or /sys, at least for Linux.

@StefanKarpinski (Member)

How does it not agree? cpucores() == 4, and cat /sys/fs/cgroup/cpuset/cpuset.cpus showed CPUs with IDs 0, 1, 2, and 3 (before the example above was edited to 0-2).

@oschulz commented Jun 21, 2020

Is there a technical necessity for a hard upper limit (I don't know the technical depths here), or could the user be allowed to set any OPENBLAS_NUM_THREADS value they like in the future?

@ViralBShah (Member)

It is a build-time option for OpenBLAS, and we try to be conservative because a default that is too high leads to significant memory consumption. I don't know whether recent versions have fixed this, and whether we can set the number to something big, like 256.

@oschulz commented Jun 21, 2020

Ah, I see! Grrr, why oh why does OpenBLAS need to know that at build time ... :-)

@ViralBShah (Member)

Relevant: https://github.com/xianyi/OpenBLAS/blob/develop/USAGE.md

While eventually we should integrate OpenBLAS threading with Julia's, I wonder if we can solve the current issue by moving to OpenMP. That would have the benefit of also playing nicely with the existing OpenMP-compiled libraries in BB.

@yuyichao

Which thread scheduler is used does not seem to have anything to do with buffer allocation. That should be fixed separately.

@oschulz commented Jun 22, 2020

I wonder if BLAS actually does some code generation based on the maximum number of threads? I assume it isn't a compile-time option on a whim.

@yuyichao

No code generation. AFAICT it uses stack buffers for cheap, reentrant allocation. The easiest first step would be to use VLAs (variable-length arrays) where supported.

@jlperla commented Feb 22, 2021

With Julia 1.6, I assume this is still an issue? Is there a current idiot-proof best practice we can tell users to copy/paste into their code to pick a better default value?

@chriselrod (Contributor)

> With Julia 1.6, I assume this is still an issue? Is there a current idiot-proof best practice we can tell users to copy/paste into their code to pick a better default value?

import Hwloc
ncores = Hwloc.num_physical_cores()
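
And then, presumably (completing the snippet; this line is not part of the original answer):

using LinearAlgebra
BLAS.set_num_threads(ncores)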

@jlperla commented Feb 22, 2021

@ChrisRackauckas didn't you sometimes find that even a lower value was typically better, or is my memory foggy?

@ChrisRackauckas (Member Author)

That works. The issue is that Hwloc isn't what Julia itself uses. @ViralBShah's response was: shouldn't we just ship Hwloc with Julia, then? @StefanKarpinski's response was that it's pretty heavyweight, so it would be good to pull out just what we need. And that's where the issue has sat since. Someone just needs to put the work in.

@chriselrod (Contributor) commented Feb 23, 2021

OpenBLAS doesn't always ramp threads up well, especially for operations other than gemm (like LU), so if your matrices aren't that large it's often faster to set it to a single thread. MKL doesn't have this problem, while the ramp-up problem is pretty extreme in BLIS even for gemm.
Asymptotically in matrix size, you want 1 thread per physical core.

For x86, we could also strip the code from CpuId.jl:

#
#   TODO:
#   The following llvmcall routines fail when being inlined!
#   Hence the @noinline.
#

# Low level cpuid call, taking eax=leaf and ecx=subleaf,
# returning eax, ebx, ecx, edx as NTuple(4,UInt32)
@noinline function cpuid_llvm(leaf::UInt32, subleaf::UInt32)
    Base.llvmcall("""
        ; leaf = %0, subleaf = %1, %2 is some label
        ; call 'cpuid' with arguments loaded into registers EAX = leaf, ECX = subleaf
        %3 = tail call { i32, i32, i32, i32 } asm sideeffect "cpuid",
            "={ax},={bx},={cx},={dx},{ax},{cx},~{dirflag},~{fpsr},~{flags}"
            (i32 %0, i32 %1) #2
        ; retrieve the result values and convert to vector [4 x i32]
        %4 = extractvalue { i32, i32, i32, i32 } %3, 0
        %5 = extractvalue { i32, i32, i32, i32 } %3, 1
        %6 = extractvalue { i32, i32, i32, i32 } %3, 2
        %7 = extractvalue { i32, i32, i32, i32 } %3, 3
        ; return the values as a new tuple
        %8  = insertvalue [4 x i32] undef, i32 %4, 0
        %9  = insertvalue [4 x i32]   %8 , i32 %5, 1
        %10 = insertvalue [4 x i32]   %9 , i32 %6, 2
        %11 = insertvalue [4 x i32]  %10 , i32 %7, 3
        ret [4 x i32] %11
    """
    # llvmcall requires actual types, rather than the usual (...) tuple
    , NTuple{4,UInt32}, Tuple{UInt32,UInt32}, leaf, subleaf)
end
function cpuid(leaf=0, subleaf=0)
    # for some reason, we need a dedicated local
    # variable of UInt32 for llvmcall to succeed
    l, s = UInt32(leaf), UInt32(subleaf)
    cpuid_llvm(l, s) ::NTuple{4,UInt32}
end

@inline function hasleaf(leaf::UInt32) ::Bool
    eax, ebx, ecx, edx = cpuid(leaf & 0xffff_0000)
    eax >= leaf
end
function cpucores() ::Int

    leaf = 0x0000_000b
    hasleaf(leaf) || return zero(UInt32)

    # The number of cores reported by cpuid is actually already the total
    # number of cores at that level, including all of the lower levels.
    # Thus, we need to find the highest level...which is 0x02 == "Core"
    # on ecx[15:08] per the specs, and divide it by the number of
    # 0x01 == "SMT" logical cores.

    sl = zero(UInt32)
    nc = zero(UInt32) # "Core" count
    nl = zero(UInt32) # "SMT" count
    while (true)
        # ebx[15:0] must be non-zero according to manual
        eax, ebx, ecx, edx = cpuid(leaf, sl)
        ebx & 0xffff == 0x0000 && break
        sl += one(UInt32)
        lt = ((ecx >> 8) & 0xff) & 0x03
        # Let's assume for now there's only one valid entry for each level
        lt == 0x01 && ( nl = ebx & 0xffff; continue )
        lt == 0x02 && ( nc = ebx & 0xffff; continue )
        # others are invalid and shouldn't be considered..
    end

    return iszero(nc) ? # no cores detected? then maybe its AMD?
        # AMD
        ((cpuid(0x8000_0008)[3] & 0x00ff)+1) :
        # Intel, we need nonzero values of nc and nl
        (iszero(nl) ? nc : nc ÷ nl)
end

This works:

julia> cpucores()
4

julia> versioninfo() # the 1165G7 has 4 cores
Julia Version 1.7.0-DEV.581
Commit d524f21917* (2021-02-19 22:06 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8

I think Hwloc is clever about only reporting the number of cores a VM is given (IIRC Hwloc.num_physical_cores() returns 2 on GitHub CI, even though those servers have many more cores).

I haven't looked at it closely, but here is Hwloc's x86 code. Probably more rigorous and better tested than the above.
The trick to being a cross-platform library like Hwloc is to have different code for every platform.

@StefanKarpinski (Member)

I'm marking this for triage so that we can discuss whether @chriselrod's implementation here is something we can go ahead with. This would be very nice to finally fix, as the current default is a rather unfortunate performance trap. While we're at it, perhaps we can also discuss whether Julia should default to having as many threads as cores.

@oschulz commented May 4, 2021

A good default that would also suit larger/shared systems might be to set the default nthreads equal to the number of physical cores in a single NUMA domain. I'm not sure how easy that is to get cross-platform without additional libraries, though.
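
A sketch of that default, assuming Hwloc.jl's num_physical_cores and num_numa_nodes API:

import Hwloc
# Physical cores per NUMA domain (guard against a reported count of zero).
default_nthreads = Hwloc.num_physical_cores() ÷ max(1, Hwloc.num_numa_nodes())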

@JeffBezanson (Member)

Triage says 👍 but we would like to implement this inside processor.cpp.

@jlperla commented Sep 21, 2021

Where did this end up in 1.7? Do we still need custom logic to choose processors?

@ViralBShah (Member)

JuliaLang/julia#55574 makes this better (but perhaps not a fix for using only physical cores).

@ImVictorCheng

Hi, just to follow up: is this issue solved? Does BLAS default to the physical core count as of Julia 1.11.1? And is the physical core count truly better than the logical count?

I just ran some tests on my computer (CPU: i7-7700): Sys.CPU_THREADS returns 8 and BLAS.get_num_threads() returns 4. For my test code the default is indeed faster than forcing BLAS.set_num_threads(8). I'm wondering if this result can be trusted. Thanks!

@giordano (Contributor)

> Hi, just to follow up: is this issue solved? Does BLAS default to the physical core count as of Julia 1.11.1?

Technically no, as Julia itself still has no clue about the number of physical cores.

> And is the physical core count truly better than the logical count?

Without reference to the workload you have in mind, this question doesn't make sense. I'm assuming you're thinking of a heavily compute-bound workload, where trying to use more threads than physical cores is indeed problematic because it oversubscribes the resources. For a much lighter workload, using more threads than physical cores shouldn't be a problem; in some cases it could even be beneficial.

> Sys.CPU_THREADS returns 8 and BLAS.get_num_threads() returns 4

That's simply because, except on aarch64-darwin, Julia blindly defaults the number of BLAS threads to half of the total available threads, which is normally Sys.CPU_THREADS unless you constrain Julia to fewer threads with affinity settings: https://github.com/JuliaLang/julia/blob/be0ce9dbf0597c1ff50fc73f3d197a19708c4cd3/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L834 But as I said above, there's still no notion of physical vs. logical threads.
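
In other words, roughly (a sketch of the described logic, not the verbatim stdlib code, and omitting the aarch64-darwin special case):

default_blas_threads = max(1, Sys.CPU_THREADS ÷ 2)  # 4 on the i7-7700 above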

@ViralBShah (Member)

Does Hwloc give an accurate notion of the number of threads?

@giordano (Contributor)

Yes, but as Stefan said above, it's always problematic to add new binary libraries as Julia dependencies, partly because we're stuck with that version forever for a given Julia version: needing hwloc at startup would make updating it as a stdlib even more cumbersome than stdlib updates already are.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024