BLAS threads should default to physical not logical core count? #671
Interestingly, on Linux I checked our benchmarking server, and it's correct there, so this may be an issue with Windows.
This doesn't seem like a Windows-only problem. My Linux laptop has four physical cores.
That shows? It seems RecursiveFactorization.jl is only faster when the BLAS is not well-tuned, but I don't understand how what you show is related, @macd. Could you show the 8-thread version with MKL, to see whether MKL is also not smart when the setting is at the logical count?
That explains a lot. Thanks
Ahh yes, so it looks like it's just smart and doesn't actually run more threads, or something like that. Since OpenBLAS doesn't do that, this would account for a good chunk of the difference given the default settings.
Can we easily detect the number of actual cores versus logical cores? We renamed the old `Sys.CPU_CORES` to `Sys.CPU_THREADS` precisely because the value it reports is the number of logical threads, not physical cores.
Some possibilities for figuring out the number of physical cores: we may be able to borrow something from https://github.com/m-j-w/CpuId.jl
Another possible library for figuring out the number of cores was also suggested by @RoyiAvital.
Just as a note, we already do a bunch of cpuid stuff to support picking functions optimized for a feature set at runtime. Perhaps extending that to pick out the number of cores, instead of bundling a whole other library, is more "clean".
I'm not very familiar with things this close to hardware, but IIUC CPUID is not the full story, and I suppose you need to run the instruction on each core/socket. Also, I noticed that there is an issue tracked in libuv (libuv/libuv#1967) for adding support for detecting physical cores. Meanwhile, I wrote a simple helper Python script to launch Julia with appropriate thread settings.
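That script was elided here; as a stand-in, a minimal sketch of the same idea in Julia (the value 4 is a placeholder for this machine's physical core count, not something detected automatically):

```julia
# Launch a fresh Julia process with OPENBLAS_NUM_THREADS pinned, so that
# OpenBLAS starts with a sane thread count. The "4" is a placeholder.
run(addenv(`julia`, "OPENBLAS_NUM_THREADS" => "4"))
```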
It seems MKL limits the maximum number of threads to the number of physical cores (here an i9-9900K with 8 cores):

```julia
julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
8

julia> ccall((:mkl_set_num_threads, "libmkl_rt"), Cvoid, (Ptr{Int32},), Ref(Int32(16)))

julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
8

julia> ccall((:mkl_set_num_threads, "libmkl_rt"), Cvoid, (Ptr{Int32},), Ref(Int32(4)))

julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
4
```
Is this stuff done automatically in Julia 1.5?
Not that I'm aware of. No one ever figured out how to get the right number of threads.
I am sure this has already been suggested, but what if there were a temporary solution of just taking half the current number, since that would be the right calculation in most cases? Even if AMD processors count logical cores differently than Intel (maybe they don't?), it seems better to choose a default which matches the higher market share for now, so that MKL vs. OpenBLAS speed is more comparable. Others can always increase the number of cores if they wish.
I think it just has to be handled via whatever API the OS gives. For example, on Linux, you'd want to make it cgroup-aware rather than using the hardware information directly.
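The example that originally followed didn't survive extraction; here is a rough sketch of the simplest (non-cgroup-aware) version of that idea, parsing the topology Linux exposes in sysfs. This is a reconstruction, not the original code:

```julia
# Count physical cores on Linux by collecting the unique
# (package, core) pairs from /sys/devices/system/cpu/*/topology.
function physical_core_count()
    base = "/sys/devices/system/cpu"
    cores = Set{Tuple{String,String}}()
    for d in filter(x -> occursin(r"^cpu\d+$", x), readdir(base))
        topo = joinpath(base, d, "topology")
        isdir(topo) || continue  # offline CPUs may lack topology entries
        pkg  = String(strip(read(joinpath(topo, "physical_package_id"), String)))
        core = String(strip(read(joinpath(topo, "core_id"), String)))
        push!(cores, (pkg, core))
    end
    return length(cores)
end
```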
We (probably) don't want to make all of the CpuId package a dependency of Julia, so in order to use that, we would have to extract a minimal piece of it that lets us get the number of cores. Short of that, defaulting to half the number of CPU threads is probably better than saturating the cores. Of course, there are computations other than BLAS, and they generally do benefit from hyperthreads, so maybe not. And if anyone has gone to the trouble of disabling hyperthreading, they're suddenly going to see their BLAS-intensive Julia code get half the FLOPS, so not great.
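For concreteness, that heuristic is a one-liner (a sketch, with exactly the caveats above):

```julia
# Default BLAS threads to half the logical CPUs: right on typical 2-way
# SMT machines, but halves throughput where hyperthreading is disabled.
using LinearAlgebra
BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2))
```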
Having libuv support this would be nice, but that's a feature-request issue with no PR to implement it, whereas we already have Julia code that uses CPUID to detect the number of cores and threads, so it seems better to do that for now. If libuv ever adds the feature, we can always switch to using that instead.
@KristofferC's observation that we already do a bunch of CPU detection stuff in C is actually the most promising avenue, imo: write C code that figures out the right number of CPU cores and threads, expose that number as a C global or C function, and set the relevant Julia globals from it. So someone "just" needs to write the C code that does this now.
My point was that CPUID does not give us the correct information. I think it's better to parse what the OS reports.
How does it not agree?
Is there a technical necessity for a hard upper limit (I don't know the technical depths here), or could the user be allowed to set any `OPENBLAS_NUM_THREADS` value they like, in the future?
It is a build-time option for OpenBLAS, and we try to be conservative because a default that is too high leads to significant memory consumption. I don't know whether recent versions have fixed this, and whether we can set the number to something big, like 256.
Ah, I see! Grrr, why oh why does OpenBLAS need to know that at build time... :-)
Relevant: https://github.com/xianyi/OpenBLAS/blob/develop/USAGE.md

While eventually we should integrate OpenBLAS threading with Julia's, I wonder if we can solve the current issue by moving to OpenMP. That would have the benefit of also playing nicely with the existing OpenMP-compiled libraries in BinaryBuilder.
Which thread scheduler to use does not seem to have anything to do with buffer allocation. That should be fixed separately.
I wonder if BLAS actually does some code generation based on the max number of threads? I'd guess it's not a compile-time option on a whim.
No code generation. AFAICT it uses stack buffers for cheap and reentrant allocation. The easiest first step to try would be to just use VLAs (variable-length arrays) where supported.
With Julia 1.6, I assume this is still an issue? Is there a current idiot-proof best practice to tell users to copy/paste into their code to pick a better default value?
```julia
import Hwloc
ncores = Hwloc.num_physical_cores()
```
@ChrisRackauckas Didn't you sometimes find that even a lower value was typically better, or is my memory foggy?
That works. The issue is that Hwloc isn't what Julia uses. @ViralBShah's response was: shouldn't we just ship Hwloc with Julia then? @StefanKarpinski's response is that it's pretty heavyweight, so it would be good to pull out just what we need. And that's where the issue has sat since. Someone just needs to put the work in.
OpenBLAS doesn't always ramp up well, especially for operations other than gemm (like LU), so if your matrices aren't that large it's often faster to set it to a single thread. MKL doesn't have this problem, while BLIS is pretty extreme about it even for gemm.

For x86, we could also strip the code from CpuId.jl:

```julia
#
# TODO:
#   The following llvmcall routines fail when being inlined!
#   Hence the @noinline.
#

# Low-level cpuid call, taking eax=leaf and ecx=subleaf,
# returning eax, ebx, ecx, edx as NTuple{4,UInt32}
@noinline function cpuid_llvm(leaf::UInt32, subleaf::UInt32)
    Base.llvmcall("""
        ; leaf = %0, subleaf = %1, %2 is some label
        ; call 'cpuid' with arguments loaded into registers EAX = leaf, ECX = subleaf
        %3 = tail call { i32, i32, i32, i32 } asm sideeffect "cpuid",
            "={ax},={bx},={cx},={dx},{ax},{cx},~{dirflag},~{fpsr},~{flags}"
            (i32 %0, i32 %1) #2
        ; retrieve the result values and convert to vector [4 x i32]
        %4 = extractvalue { i32, i32, i32, i32 } %3, 0
        %5 = extractvalue { i32, i32, i32, i32 } %3, 1
        %6 = extractvalue { i32, i32, i32, i32 } %3, 2
        %7 = extractvalue { i32, i32, i32, i32 } %3, 3
        ; return the values as a new tuple
        %8  = insertvalue [4 x i32] undef, i32 %4, 0
        %9  = insertvalue [4 x i32] %8,    i32 %5, 1
        %10 = insertvalue [4 x i32] %9,    i32 %6, 2
        %11 = insertvalue [4 x i32] %10,   i32 %7, 3
        ret [4 x i32] %11
        """
        # llvmcall requires actual types, rather than the usual (...) tuple
        , NTuple{4,UInt32}, Tuple{UInt32,UInt32}, leaf, subleaf)
end

function cpuid(leaf = 0, subleaf = 0)
    # for some reason, we need a dedicated local
    # variable of UInt32 for llvmcall to succeed
    l, s = UInt32(leaf), UInt32(subleaf)
    cpuid_llvm(l, s)::NTuple{4,UInt32}
end

@inline function hasleaf(leaf::UInt32)::Bool
    eax, ebx, ecx, edx = cpuid(leaf & 0xffff_0000)
    eax >= leaf
end

function cpucores()::Int
    leaf = 0x0000_000b
    hasleaf(leaf) || return zero(UInt32)
    # The number of cores reported by cpuid is actually already the total
    # number of cores at that level, including all of the lower levels.
    # Thus, we need to find the highest level... which is 0x02 == "Core"
    # in ecx[15:08] per the specs, and divide it by the number of
    # 0x01 == "SMT" logical cores.
    sl = zero(UInt32)
    nc = zero(UInt32)   # "Core" count
    nl = zero(UInt32)   # "SMT" count
    while true
        # ebx[15:0] must be non-zero according to the manual
        eax, ebx, ecx, edx = cpuid(leaf, sl)
        ebx & 0xffff == 0x0000 && break
        sl += one(UInt32)
        lt = ((ecx >> 8) & 0xff) & 0x03
        # Let's assume for now there's only one valid entry for each level
        lt == 0x01 && (nl = ebx & 0xffff; continue)
        lt == 0x02 && (nc = ebx & 0xffff; continue)
        # others are invalid and shouldn't be considered
    end
    return iszero(nc) ?  # no cores detected? then maybe it's AMD
        # AMD
        ((cpuid(0x8000_0008)[3] & 0x00ff) + 1) :
        # Intel: we need nonzero values of nc and nl
        (iszero(nl) ? nc : nc ÷ nl)
end
```

This works:

```julia
julia> cpucores()
4

julia> versioninfo()  # the 1165G7 has 4 cores
Julia Version 1.7.0-DEV.581
Commit d524f21917* (2021-02-19 22:06 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8
```

I think Hwloc is clever about only reporting the number of cores a VM is given (IIRC; I haven't looked at it closely). Hwloc's x86 code is probably more rigorous and better tested than the above.
I'm marking this for triage so that we can discuss whether @chriselrod's implementation here is something we can go ahead with. This would be very nice to finally fix, since the current default is a rather unfortunate performance trap. While we're at it, perhaps we can discuss whether Julia should default to having as many threads as cores.
A good default that would also suit larger/shared systems might be nthreads equal to the number of physical cores in a single NUMA domain. Not sure how easy that is to get cross-platform without additional libs, though.
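A sketch of that suggestion using Hwloc.jl (assuming its `num_physical_cores` and `num_numa_nodes` queries; again, not something Julia itself ships):

```julia
# Physical cores per NUMA domain as a thread-count default.
import Hwloc
ncores   = Hwloc.num_physical_cores()
ndomains = max(1, Hwloc.num_numa_nodes())
default_nthreads = max(1, ncores ÷ ndomains)
```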
Triage says 👍 but we would like to implement this inside `processor.cpp`.
Where did this end up in 1.7? Do we still need custom logic to choose processors?
JuliaLang/julia#55574 makes this better (but perhaps not a fix for using only physical cores).
Hi, just to follow up: is this issue solved? Does BLAS default to the physical core count as of Julia 1.11.1? And is the physical core count truly better than the logical count? I just ran some tests on my computer (CPU: i7-7700): `Sys.CPU_THREADS` returns 8 and `BLAS.get_num_threads()` returns 4. For my test code the default is indeed faster than forcing `BLAS.set_num_threads(8)`. I'm wondering if this result can be trusted. Thanks!
Technically no, as Julia itself still has no clue about the number of physical cores.

Without reference to the workload you have in mind, this question doesn't make sense. I'm assuming you're thinking of a heavily compute-bound workload, where trying to use more threads than physical cores is indeed problematic because it oversubscribes the resources. For a much lighter workload, using more threads than physical cores shouldn't be a problem; in some cases it could even be beneficial.

That's simply because, except on aarch64-darwin, by default Julia blindly sets the number of BLAS threads to half of the total available threads, which is normally `Sys.CPU_THREADS` unless you constrain Julia to use fewer threads with affinity settings: https://github.com/JuliaLang/julia/blob/be0ce9dbf0597c1ff50fc73f3d197a19708c4cd3/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L834 But as I said above, there's still no notion of physical vs. logical threads.
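In other words, the relationship is easy to check (the values shown are for the i7-7700 above; they will differ per machine):

```julia
# Julia's logical CPU count, and the BLAS thread default derived from it.
using LinearAlgebra
@show Sys.CPU_THREADS          # logical CPUs, e.g. 8 on an i7-7700
@show BLAS.get_num_threads()   # half of that by default, e.g. 4
```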
Does Hwloc give an accurate notion of the number of threads?
Yes, but as Stefan said above, it's always problematic to add new binary libraries as Julia dependencies, partly because we're stuck with that version forever for a given Julia version: needing hwloc at startup would make updating it as a stdlib even more cumbersome than stdlib updates already are.
On an i7-8550U, OpenBLAS defaults to 8 threads. I was comparing to RecursiveFactorization.jl, and the performance looked like the following (benchmark plots for 1, 4, and 8 threads appeared here).
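A hedged sketch of the kind of measurement being described (not the original benchmark; the 512×512 size and the use of `lu` are illustrative choices):

```julia
# Time an LU factorization at different OpenBLAS thread counts.
using LinearAlgebra, BenchmarkTools
A = rand(512, 512)
for n in (1, 4, 8)
    BLAS.set_num_threads(n)
    t = @belapsed lu($A)
    println(n, " thread(s): ", round(t * 1e3, digits = 2), " ms")
end
```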
Conclusion: the default that Julia chooses, 8 threads, is the worst, with 1 thread doing better. But using the number of physical cores, 4, is best.
So there have been a lot of issues on Discourse and in Slack #gripes where the takeaway was essentially "setting BLAS threads to 1 is better than the default!", but it looks like that's because the default should be the number of physical, not logical, threads. I am actually very surprised it's not set that way, so I was wondering why, and also where this default is set (I couldn't find it).