Test fails on macOS Sequoia (ARM M1) #1073

andyfaff · 2024-11-13T22:56:05Z

Description

I cloned the repository to my machine (macOS Sequoia, M1, Xcode 16.1), which has the following compilers:

(dev3) teapot:lapack andrew$ gcc --version
Apple clang version 16.0.0 (clang-1600.0.26.3)
Target: arm64-apple-darwin24.0.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
(dev3) teapot:lapack andrew$ gfortran --version
GNU Fortran (Homebrew GCC 14.2.0_1) 14.2.0

brew install gfortran
cp make.inc.example make.inc
make -j8 all

The summary of the test suite (full log here):

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	1569648		0	(0.000%)	0	(0.000%)	
DOUBLE PRECISION	1563140		4	(0.000%)	4	(0.000%)	
COMPLEX          	1029730		0	(0.000%)	0	(0.000%)	
COMPLEX16         	1029705		1	(0.000%)	0	(0.000%)	

--> ALL PRECISIONS	5192223		5	(0.000%)	4	(0.000%)

On a related note, are there guidelines on how to direct the make file to use the macOS Accelerate for BLAS instead of the inbuilt librefblas?

Checklist

I've included a minimal example to reproduce the issue
I'd be willing to make a PR to solve this issue

andyfaff · 2024-11-14T01:01:20Z

git bisect locates the failing COMPLEX16 test at 2a87758 (as seen on my hardware/OS combo).

andyfaff · 2024-11-14T01:19:05Z

I believe the four failing double precision calculations are also seen in CI. Bisecting on the output of make -j8 all pins them down to roughly 3838c85 or 0791c22.

From examining CI it doesn't look like these numerical outputs are used to pass/fail the test suite though.
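If one wanted CI to react to these counts, a small filter over the summary could do it. The sketch below is a hypothetical helper (check_summary is not part of the repository) and assumes the ALL PRECISIONS line keeps the column layout shown in the summary above: total, numerical errors, percentage, other errors, percentage.

```shell
# check_summary FILE
# Return non-zero if the "--> ALL PRECISIONS" line in FILE reports any
# numerical or other error. Hypothetical helper, not part of LAPACK.
check_summary() {
    awk '
        $1 == "-->" && $2 == "ALL" && $3 == "PRECISIONS" {
            found = 1
            numerical = $5   # count in the "numerical error" column
            other = $7       # count in the "other error" column
        }
        END {
            if (!found) { print "no summary line found"; exit 2 }
            printf "numerical=%d other=%d\n", numerical, other
            exit ((numerical + other > 0) ? 1 : 0)
        }
    ' "$1"
}

# Usage (build.log is a placeholder file name):
#   make -j8 all 2>&1 | tee build.log
#   check_summary build.log || echo "testing summary reports errors"
```

Exit status 1 means the summary reports errors; 2 means no summary line was found, so a truncated log is not mistaken for a clean run.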

langou · 2024-11-14T03:17:11Z

Argh. Thanks for the report. Shoo. That's going to be a tough one to debug. We'll have a look.
@jprhyne

langou · 2024-11-14T03:20:27Z

> On a related note, are there guidelines on how to direct the make file to use the macOS Accelerate for BLAS instead of the inbuilt librefblas?

In make.inc,

lapack/make.inc.example

Line 77 in 9128e20

BLASLIB = $(TOPSRCDIR)/librefblas.a

you should be able to change BLASLIB to whatever you want. So, for example, you should be able to link with "macOS Accelerate for BLAS". I assume cmake has a similar option.
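A minimal sketch of that edit, under assumptions not confirmed in this thread: that Apple's linker flag -framework Accelerate is an acceptable value for BLASLIB, and that nothing else in make.inc needs to change. The helper name use_accelerate is made up for illustration.

```shell
# use_accelerate FILE
# Rewrite the BLASLIB line of a make.inc-style file so that it links against
# Accelerate instead of librefblas.a. Hypothetical helper, not part of LAPACK.
use_accelerate() {
    # Write to a temporary file first: in-place editing with sed -i is not
    # portable between the BSD sed on macOS and GNU sed.
    sed -e 's|^BLASLIB *=.*|BLASLIB = -framework Accelerate|' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Usage, from the top of the source tree:
#   cp make.inc.example make.inc
#   use_accelerate make.inc
#   grep '^BLASLIB' make.inc
```

Since make gives command-line variable assignments precedence over ordinary assignments in makefiles, running make BLASLIB="-framework Accelerate" should have the same effect without touching make.inc.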

martin-frbg · 2024-11-14T17:32:37Z

Curiously, when I reproduced this on the M1 in the GCC Compile Farm, I got an additional numerical and other error in REAL (SGS failure with INFO=9 from SGGES3), and a single numerical error in COMPLEX instead of the one in COMPLEX16. (I used the Homebrew gfortran 14.2.0_1 as well, but the system clang there is 14.0.0: only the Xcode 14.2 command-line tools are installed.) Our friendly neighborhood "minor testing failures" at work again?

martin-frbg · 2024-11-14T18:31:47Z

I also note that the "rough bisect" pointed to the new NRM2 routines, which are also implicated in the SVD (dgesdd) divergence errors discussed in #672.

andyfaff · 2024-11-15T00:29:58Z

The reason I started running these tests is that SciPy uses Accelerate as the underlying BLAS. We've just noticed that our test suite has failures on macOS 15 that did not occur on macOS 14. With macOS 14 --> 15 Apple updated Accelerate. Part of the change was a bump in LAPACK from 3.9.1 to 3.11.

I've given Apple some feedback that we're experiencing problems, but the reproducer currently just shows how the SciPy test suite fails. I was trying to find another way of demonstrating that there was an issue, short of coming up with a specific Fortran program. I therefore thought I would build this project against Accelerate, run the test suite, and, if there were failures, point the Apple engineers towards something they may find easier to digest. I haven't succeeded in building and testing this project against Accelerate, but I did get as far as running the suite as-is. This highlighted the issues, so I thought I'd report them.

jprhyne · 2024-11-16T00:24:00Z

Hey @andyfaff, sorry for the late response! I gave this a quick look and I am unable to reproduce the errors you are experiencing. I have tried on my AMD Linux machine with a Ryzen 5 7640U processor. In addition, I tried in a macOS VM built from a recovery image using an Intel CPU. I have attached a screenshot of the same versions of gcc and gfortran installed with brew, with no failures.

Unfortunately, I don't have access to an ARM M1 Mac, but I do have access to a Raspberry Pi. I'll try to recreate the issue on that machine and, if I can, I'll post an update.

Looking at your log, it seems that the errors are in the routines
DGEES1
DGEESX1
ZGGEV (This one I am less sure about reading the error)

Based on https://netlib.org/lapack/installation.hints, I am unsure what these failures could mean. Is this what @martin-frbg was talking about?

I appreciate any extra insight!

andyfaff · 2024-11-16T06:17:34Z

They're visible on macOS Sequoia (os=15.1), with Xcode 16.1. One way for you to see the errors would be in CI, but it will require macos-latest and the latest Xcode. Four of the numerical failures are already visible in the project's CI.

martin-frbg · 2024-11-16T11:27:32Z

The errors are not reproducible on x86_64 or in emulation, but they (and/or similar ones) are on actual M1 hardware. That's why I wrote that grumpy response about known accuracy issues in the testsuite.
I haven't gotten around to checking if the problems go away after putting back the "old" NRM2 code yet

martin-frbg · 2024-11-16T11:47:45Z

@jprhyne see also https://netlib.org/lapack/faq#_how_do_i_interpret_lapack_testing_failures (and I think one of the older issues about failed tests had a detailed explanation why some tests will always report a huge absolute error when they fail)

martin-frbg · 2024-11-16T19:14:18Z

Restoring the "old" dnrm2.f does fix the 4+4 errors in double precision, and restoring scnrm2.f removes the single one I got in COMPLEX (rather than COMPLEX16 as reported above). I'm still seeing a single error in SHS with Matrix order= 5, type=18, seed=1471,2745,3835,1213, result 11 is 28.04 that is obviously unrelated.

angsch · 2024-11-30T15:23:49Z

Thank you so much, @martin-frbg, for having narrowed down the bug. This is fantastic work.

In the dgesdd bug, the matrix is so big that it is hard to understand what is going on. This case, however, is much smaller and we may have a chance of understanding the convergence failure:

 *** Error code from DGEES =    6

https://gist.github.com/andyfaff/ff02543e7ec9561b28d8a7c6702d43a0#file-testing_results-txt-L1404

Do you think you could just print the matrices? I hope that getting every step in gehrd is enough. I unfortunately don't have a machine to reproduce the error.

Also, is there a difference if you change the compiler optimization level? I assume that you used the default -O2. Does the problem surface with a lower optimization level?

Edit: hseqr -> gehrd

andyfaff · 2024-12-01T01:08:07Z

@Developer-Ecosystem-Engineering you may be interested in this parallel issue that I experienced after opening the one in scipy.

Developer-Ecosystem-Engineering · 2024-12-02T17:19:32Z

Thanks @andyfaff. We don't pull in BLAS at the moment, this is unlikely an issue on our end.

> Restoring the "old" dnrm2.f does fix the 4+4 errors in double precision, and restoring scnrm2.f removes the single one I got in COMPLEX (rather than COMPLEX16 as reported above). I'm still seeing a single error in SHS with Matrix order= 5, type=18, seed=1471,2745,3835,1213, result 11 is 28.04 that is obviously unrelated.

^

martin-frbg · 2024-12-03T14:36:31Z

@angsch sorry, forgot to add that - errors reduce to a single failure in DOUBLE at or below -O1

 Matrix order=    6, type=21, seed=3618, 381,2331,3777, result  7 is 4.504D+15
 DGS drivers:      1 out of   1555 tests failed to pass the threshold
 *** Error code from DDRGES =    9

angsch · 2024-12-03T17:44:56Z

@martin-frbg Sorry for asking one more time: Did you reduce all files to -O1 or only the nrm2 algorithm? What compiler is this? gfortran or flang?

martin-frbg · 2024-12-03T18:08:39Z

Ah, sorry again - that is with -O1 applied globally, and using gfortran-14 like in the original report. I can modify the makefile to build just dnrm2 with -O1 of course - got derailed while attempting to trim down the output from DGEHRD to just the problematic case. (Frustratingly, reducing the selection of matrix dimensions to test in ded.in makes the error disappear for values that error out when used in sequence.)

martin-frbg · 2024-12-03T19:14:28Z

So I get the same reduction in error count (to the one in DDRGES3) with only nrm2 compiled at -O1. (Well, actually I cheated and compiled all .f90 files in BLAS/src at reduced optimization, but I doubt the others matter here.)

martin-frbg · 2024-12-03T22:54:56Z

The decisive difference between -O1 and -O2 compilation of dnrm2.f90 turns out to be -fexpensive-optimizations, which is probably not very helpful.
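To see which individual passes separate two optimization levels, GCC drivers can print their optimizer settings with -Q --help=optimizers. The sketch below compares two such listings. newly_enabled is a hypothetical helper name, and it assumes each line of a listing has the flag in the first column and [enabled] or [disabled] in the second. GCC's documentation notes that not all optimizations are controlled directly by a flag, so the resulting list may not cover every difference between the two levels.

```shell
# newly_enabled LOW HIGH
# Print the flags marked [enabled] in listing HIGH but not in listing LOW.
# Hypothetical helper for comparing "-Q --help=optimizers" listings.
newly_enabled() {
    awk '$2 == "[enabled]" { print $1 }' "$1" | sort > "$1.on"
    awk '$2 == "[enabled]" { print $1 }' "$2" | sort > "$2.on"
    # comm -13 keeps only the lines unique to the second (sorted) file.
    comm -13 "$1.on" "$2.on"
    rm -f "$1.on" "$2.on"
}

# Usage (needs gfortran on PATH):
#   gfortran -Q --help=optimizers -O1 > O1.txt
#   gfortran -Q --help=optimizers -O2 > O2.txt
#   newly_enabled O1.txt O2.txt
# A single candidate can then be switched off for one file, for example:
#   gfortran -O2 -fno-expensive-optimizations -c dnrm2.f90
```

Disabling one candidate at a time at -O2 and rerunning the failing tests is one way to narrow the difference down to a single flag.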

andyfaff added the Type: Bug label Nov 13, 2024
