[OpenBLAS] Build the BFloat16 kernels in OpenBLAS #7202
BTW, I was just using GCC 11.1 as an experiment to see if it would get me over the hump of getting things to build. That is probably what is preventing the building of the libgfortran3 and libgfortran4 variants.
For the bfloat16 functions, GCC 10+ is needed, but GCC 11 added the Sapphire Rapids support, so compiling with GCC 11 gives both. The downside is that BinaryBuilder can only pair GCC 8+ with libgfortran5, so the x86_64 targets with libgfortran3 or libgfortran4 probably won't be able to have the bfloat16 components (they need older GCC versions). The PowerPC build failure is because of GCC 11 (the test assumed exactly GCC 10, not anything greater), but it is a simple patch (sent upstream in OpenMathLib/OpenBLAS#4193). The remaining issues are now clang-based: one with AppleClang and one with the flang compiler for the memsan version.
(Actually, cooperlake is new in GCC 10, so we need at least GCC 10 to build the bfloat16 interface parts.)
Would it make sense to keep building the old libgfortran{3,4} builds with the older compilers without bfloat16 support, but add it to the libgfortran5 version? Or perhaps we should just build a separate library (say OpenBLASWithExt) with the extensions and fewer platforms. We also have OpenBLAS with high core counts, which is not widely used at all, and could also include the extensions and other experimental features.
Rebased to pick up the new 0.3.24, which allows the three current bfloat-specific patches to be dropped. Let's see if any of the other platforms start working now (I expect gfortran3 and 4 to still be broken, though).
Building with gfortran3 and gfortran4 will require some build system tweaks, so I have asked upstream about it here: OpenMathLib/OpenBLAS#4422. Hopefully they can work out the best way to get the generic kernels for the newer targets on these older compilers.
*sigh* The aarch64 Apple builds are still failing, but at least the others are passing. @giordano I believe you had some SVE-related failures previously, do these look familiar? I think these are living in the bfloat16 kernels, so maybe the upstream fix for your SVE problem was incomplete last time?
My problem was only with the use of something else; the error you're getting now is different. Relevant lines:

```c
#if !defined(__ARM_FEATURE_SVE)
#error "SVE support not enabled"
#else
```

so the problem is that `__ARM_FEATURE_SVE` is not defined. That looks plausible:

```console
$ julia --compile=min -e 'using BinaryBuilderBase; BinaryBuilderBase.runshell(Platform("aarch64", "macos"); preferred_llvm_version=v"13", lock_microarchitecture=false)'
sandbox:${WORKSPACE} # cc -march=armv8.2-a -dM -E - < /dev/null | grep __ARM_FEATURE_SVE
sandbox:${WORKSPACE} # cc -march=armv8.2-a+sve -dM -E - < /dev/null | grep __ARM_FEATURE_SVE
#define __ARM_FEATURE_SVE 1
```

According to https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html, SVE is mandatory only from `armv9-a` on; for `armv8.x-a` it must be enabled explicitly with the `+sve` feature modifier.
I don't know if these are the lines being hit, but https://github.com/OpenMathLib/OpenBLAS/blob/d6a5174e9c50b9f68db96d7d7818f92cdfb4e7ec/Makefile.arm64#L138-L142 looks suspicious when targeting macOS.
I've sent an issue report upstream about the SVE kernels and the NeoverseN2 target: OpenMathLib/OpenBLAS#4449. Looking at the issue report that prompted the fallback to the armv8.2-a option, OpenMathLib/OpenBLAS#3739, I guess it would also be fixed by bumping the GCC version to 11.4 or newer, but there is definitely an issue to fix upstream about including the kernels when they shouldn't be. As for the new Windows failure: that is an assembler problem I haven't dived into yet, but it looks like the assembler isn't happy with some part of the bfloat16 kernel assembly.
And I've sent the Windows assembler errors upstream now: OpenMathLib/OpenBLAS#4450.
🥳 All platforms are building now 🎉. Right now I have it turned on only for the OpenBLAS build script, not the OpenBLAS32 one. Does it make sense to enable it in the OpenBLAS32 build script as well?
Thanks for the heroic effort of tracking down all the issues!
Not sure, maybe yes for consistency?
While we are here, I verified the patch. With a local build of commit OpenMathLib/OpenBLAS@d6a5174 (the latest on that branch), the problem no longer reproduces. So it appears the bug reported at OpenMathLib/OpenBLAS#2998 is indeed fixed upstream, and we don't need to carry the patch around.
I was actually going to do a follow-on PR to backport the patch mentioned in #7661 to our builds so that we can maintain better history of these changes, so we can drop that patch at that time.
Sounds good. In the meantime, I verified the neoversen1 kernel for nrm2 is way faster than the generic one.
Apple Silicon users (those CPUs now use the neoversen1 kernels by default) who compute lots of norms are going to love this improvement. Edit: new benchmarks, this time on an M1 (instead of neoversev2).
bfloat16 on the CPU? Oooh, nice. I'd love to add it as a user-defined type to GraphBLAS; I just haven't figured out how to get those operations to work on the CPU. GraphBLAS doesn't yet call any of the BLAS, no dgemms or whatever. It could do so for, say, GrB_mxm when all the matrices are dense and the semiring is (plus,times) for double, single, double complex, or single complex. I could do that for bfloat16 too, if it's available.
@imciner2 Let's also incorporate the patch for GEMV thresholds that will speed up UMFPACK into this PR.
I guess we could, but my plan was to merge this one and then do the patch backporting/removal for both 0.3.23 and 0.3.26 in another PR to get better git history. Since we squash-merge PRs, that would be a lot of patch changes going in/out at once, so for future reference I think it is better to keep them separate.
Right - but the problem is that every merge publishes hundreds of megs of binaries for download, which would be nice to avoid. If we make a clean PR here, we can choose not to squash the merge so that we can retain the history. |
* Add upstream patch to fix compilation with non-AVX512bf processors. Taken from upstream, can be removed in 0.3.27.
* Add patch to compile with SVE on aarch64 NeoverseN2, used by Apple systems. No upstream fix yet.
The way CI works here requires squash-merging, as we diff against the previous commit to determine what to rebuild. But I'm sympathetic to the idea of avoiding intermediate builds, which likely won't have much use, if possible.
* Remove fallback to generic aarch kernels on neoverse. This was fixed before 0.3.20, so the optimized kernels should be fine to use now.
* Backport threshold increase for gemv multithreading to improve performance. This is included in 0.3.27.
Ok, I have backported the patch and also removed the neoverse fallback. All that's left is the threading number update. |
The sanitize build is still failing. Should we turn it off?
I simply used LLVM 13 for the msan build.
@KristofferC @oscardssmith Note that we should use the new OpenBLAS binaries from here for 1.10.1. |
This also

* drops a patch (`deps/patches/neoverse-generic-kernels.patch`) not needed anymore for an [old bug](OpenMathLib/OpenBLAS#2998) fixed upstream in OpenBLAS. This results in ~5x speedup in the computation of `BLAS.nrm2` (and hence `LinearAlgebra.norm` for vectors longer than `LinearAlgebra.NRM2_CUTOFF` (== 32) elements) when the neoversen1 kernels are used, e.g. by default on all Apple Silicon CPUs
* adds a regression test for the above bug
* updates other patches when building openblas from source

Corresponding PR in Yggdrasil: JuliaPackaging/Yggdrasil#7202.
* [OpenBLAS] Build the BFloat16 kernels
* Add upstream patch to fix compilation with non-AVX512bf processors. Taken from upstream, can be removed in 0.3.27.
* Add patch to compile with SVE on aarch64 NeoverseN2, used by Apple systems. No upstream fix yet.
* [OpenBLAS] Update patches to improve performance
* Remove fallback to generic aarch kernels on neoverse. This was fixed before 0.3.20, so the optimized kernels should be fine to use now.
* Backport threshold increase for gemv multithreading to improve performance. This is included in 0.3.27.
* [[email protected]] Compile with llvm 13 for msan build

---------

Co-authored-by: Viral B. Shah <[email protected]>
Co-authored-by: Mosè Giordano <[email protected]>
Continuation of #7168. This now includes a patch that fixes detection and building on x86_64 architectures, which is based on the upstream PR OpenMathLib/OpenBLAS#4192.
The aarch64 build failures need a bit more investigation, since it looks like there might be naming differences between the Apple toolchains and the GNU toolchains used for Linux that are causing the kernels to not compile properly.