[OpenBLAS] Build the BFloat16 kernels in OpenBLAS #7202
BTW, I was just using GCC 11.1 as an experiment to see if it would get me over the hump of getting things to build. That is probably what is preventing the building of the libgfortran3 and libgfortran4 variants.
For the bfloat16 functions, GCC 10+ is needed, but GCC 11 added the Sapphire Rapids support, so compiling with GCC 11 gives both. The downside is that BinaryBuilder can only pair GCC 8+ with libgfortran5, so the x86_64 targets with libgfortran3 or libgfortran4 probably won't be able to have the bfloat16 components (they need older GCC versions). The PowerPC build failure is because of GCC 11 (the test assumed exactly GCC 10, not anything greater), but it is a simple patch (sent upstream in OpenMathLib/OpenBLAS#4193). The remaining issues are now clang-based: one with AppleClang and one with the flang compiler for the memsan version.
(Actually, cooperlake is new in GCC 10, so we need at least GCC 10 to build the bfloat16 interface parts.)
Would it make sense to keep building the old libgfortran{3,4} builds with the older compilers without bfloat16 support, but add it to the libgfortran5 version? Or perhaps we should just build a separate library (say OpenBLASWithExt) with the extensions and fewer platforms. We also have OpenBLAS with high core counts, which is not widely used at all, and could also include the extensions and other experimental features.
Rebased to pick up the new 0.3.24, which allows the three current bfloat-specific patches to be dropped. Let's see if any of the other platforms start working now (I expect gfortran3 and 4 to still be broken, though).
Building with gfortran3 and gfortran4 will require some build system tweaks, so I have asked upstream about it here: OpenMathLib/OpenBLAS#4422. Hopefully they can work out the best way to get the generic kernels for the newer targets on these older compilers.
*sigh* The aarch64 Apple builds are still failing, but at least the others are passing. @giordano I believe you had some SVE-related failures previously, do these look familiar? I think these are living in the bfloat16 kernels, so maybe the upstream fix for your SVE problem was incomplete last time?
My problem was only with the use of something else; the error you're getting now is different. Relevant lines:

```c
#if !defined(__ARM_FEATURE_SVE)
#error "SVE support not enabled"
#else
```

so the problem is that `__ARM_FEATURE_SVE` is not defined. That looks plausible:

```console
$ julia --compile=min -e 'using BinaryBuilderBase; BinaryBuilderBase.runshell(Platform("aarch64", "macos"); preferred_llvm_version=v"13", lock_microarchitecture=false)'
sandbox:${WORKSPACE} # cc -march=armv8.2-a -dM -E - < /dev/null | grep __ARM_FEATURE_SVE
sandbox:${WORKSPACE} # cc -march=armv8.2-a+sve -dM -E - < /dev/null | grep __ARM_FEATURE_SVE
#define __ARM_FEATURE_SVE 1
```

According to https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html, SVE is mandatory only from `armv9-a` on; for `armv8.x-a` it must be enabled explicitly with the `+sve` feature modifier.
I don't know if these are the lines being hit, but https://github.com/OpenMathLib/OpenBLAS/blob/d6a5174e9c50b9f68db96d7d7818f92cdfb4e7ec/Makefile.arm64#L138-L142 looks suspicious when targeting macOS.
I've sent an issue report upstream about the SVE kernels and the NeoverseN2 target: OpenMathLib/OpenBLAS#4449. Looking at the issue report that prompted the fallback to the armv8.2-a option, OpenMathLib/OpenBLAS#3739, I guess it would also be fixed by bumping the GCC version to 11.4 or newer, but there is definitely an issue to fix upstream about including the kernels when they shouldn't be. As for the new Windows failure: that is an assembler problem I haven't dived into yet, but it looks like the assembler isn't happy with some part of the bfloat16 kernel assembly.
And I've sent the Windows assembler errors upstream now: OpenMathLib/OpenBLAS#4450.
🥳 All platforms are building now 🎉. Right now I have it turned on only for the OpenBLAS build script, not the OpenBLAS32 one. Does it make sense to enable it in the OpenBLAS32 build script as well?
Thanks for the heroic effort of tracking down all the issues!
Not sure, maybe yes for consistency?
While we are here, I verified the patch. With a local build of commit OpenMathLib/OpenBLAS@d6a5174 (the latest on that branch), the problem no longer reproduces. So it appears the bug reported at OpenMathLib/OpenBLAS#2998 is indeed fixed upstream, and we don't need to carry the patch around.
I was actually going to do a follow-on PR to backport the patch mentioned in #7661 to our builds so that we can maintain better history of these changes, so we can drop that patch at that time.
Sounds good. In the meantime, I verified the neoversen1 kernel for nrm2 is way faster than the generic one.
Apple Silicon users (those CPUs now use the neoversen1 kernels by default) who compute lots of norms are going to love this improvement. Edit: new benchmarks, this time on an M1 (instead of neoversev2).
bfloat16 on the CPU? Oooh, nice. I'd love to add it as a user-defined type to GraphBLAS; I just haven't figured out how to get those operations to work on the CPU. GraphBLAS doesn't yet call any of the BLAS, no dgemms or whatever. It could do so for, say, GrB_mxm when all the matrices are dense and the semiring is (plus,times) for double, single, double complex, or single complex. I could do that for bfloat16 too, if it's available.
@imciner2 Let's also incorporate the patch for GEMV thresholds that will speed up UMFPACK into this PR.
I guess we could, but my plan was to merge this one and then do the patch backporting/removal for both 0.3.23 and 0.3.26 in another PR to get better git history. Since we squash-merge PRs, that would be a lot of patch changes going in/out at once, so for future reference I think it is better to keep them separate.
Right - but the problem is that every merge publishes hundreds of megs of binaries for download, which would be nice to avoid. If we make a clean PR here, we can choose not to squash the merge so that we can retain the history. |
* Add upstream patch to fix compilation with non-AVX512bf processors. Taken from upstream, can be removed in 0.3.27.
* Add patch to compile with SVE on aarch64 NeoverseN2, used by Apple systems. No upstream fix yet.
The way CI works here requires squash-merging, as we diff against the previous commit to determine what to rebuild. But I'm sympathetic to the idea of avoiding intermediate builds, which likely won't have much use, if possible.
* Remove fallback to generic aarch kernels on neoverse. This was fixed before 0.3.20, so the optimized kernels should be fine to use now.
* Backport threshold increase for gemv multithreading to improve performance. This is included in 0.3.27.
Ok, I have backported the patch and also removed the neoverse fallback. All that's left is the threading number update. |
The sanitize build is still failing. Should we turn it off?
I simply used LLVM 13 for the msan build.
@KristofferC @oscardssmith Note that we should use the new OpenBLAS binaries from here for 1.10.1. |
This also

* drops a patch (`deps/patches/neoverse-generic-kernels.patch`) not needed anymore for an [old bug](OpenMathLib/OpenBLAS#2998) fixed upstream in OpenBLAS. This results in ~5x speedup in the computation of `BLAS.nrm2` (and hence `LinearAlgebra.norm` for vectors longer than `LinearAlgebra.NRM2_CUTOFF` (== 32) elements) when the neoversen1 kernels are used, e.g. by default on all Apple Silicon CPUs
* adds a regression test for the above bug
* updates other patches when building openblas from source

Corresponding PR in Yggdrasil: JuliaPackaging/Yggdrasil#7202.
* [OpenBLAS] Build the BFloat16 kernels
* Add upstream patch to fix compilation with non-AVX512bf processors. Taken from upstream, can be removed in 0.3.27.
* Add patch to compile with SVE on aarch64 NeoverseN2, used by Apple systems. No upstream fix yet.
* [OpenBLAS] Update patches to improve performance
* Remove fallback to generic aarch kernels on neoverse. This was fixed before 0.3.20, so the optimized kernels should be fine to use now.
* Backport threshold increase for gemv multithreading to improve performance. This is included in 0.3.27.
* [[email protected]] Compile with llvm 13 for msan build

---------

Co-authored-by: Viral B. Shah <[email protected]>
Co-authored-by: Mosè Giordano <[email protected]>
Continuation of #7168. This now includes a patch that fixes detection and building on x86_64 architectures, which is based on the upstream PR OpenMathLib/OpenBLAS#4192.
The aarch64 build failures need a bit more investigation, since it looks like there might be naming differences between the Apple toolchains and the GNU toolchains used for Linux that are causing the kernels to not compile properly.