Test fails on macOS Sequoia (ARM M1) #1073

Open
1 of 2 tasks
andyfaff opened this issue Nov 13, 2024 · 20 comments

Comments

@andyfaff

andyfaff commented Nov 13, 2024

Description

I cloned the repository to my machine (macOS Sequoia, M1, Xcode 16.1) and built it with the toolchain and commands below:

(dev3) teapot:lapack andrew$ gcc --version
Apple clang version 16.0.0 (clang-1600.0.26.3)
Target: arm64-apple-darwin24.0.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
(dev3) teapot:lapack andrew$ gfortran --version
GNU Fortran (Homebrew GCC 14.2.0_1) 14.2.0
brew install gfortran
cp make.inc.example make.inc
make -j8 all

The summary of the test suite (full log here):

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	1569648		0	(0.000%)	0	(0.000%)	
DOUBLE PRECISION	1563140		4	(0.000%)	4	(0.000%)	
COMPLEX          	1029730		0	(0.000%)	0	(0.000%)	
COMPLEX16         	1029705		1	(0.000%)	0	(0.000%)	

--> ALL PRECISIONS	5192223		5	(0.000%)	4	(0.000%)
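
For anyone comparing runs across machines or compiler flags, the summary above can be reduced to per-precision error counts with a short script. This is a minimal sketch, not part of the LAPACK tooling: the name `parse_summary` is hypothetical, and the row layout (label, tests run, then two "count (percent%)" pairs) is assumed from the output quoted in this issue.

```python
import re

# One summary row: optional "--> " prefix, a precision label, the number of
# tests run, then two "count (percent%)" pairs. Layout assumed from this issue.
ROW = re.compile(
    r"^(?:-->\s*)?(?P<name>[A-Z][A-Z0-9 ]*?)\s+(?P<run>\d+)\s+"
    r"(?P<num>\d+)\s+\([\d.]+%\)\s+(?P<other>\d+)\s+\([\d.]+%\)$"
)


def parse_summary(text):
    """Map each precision label to (tests run, numerical errors, other errors)."""
    results = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # header and separator lines simply do not match
            results[match["name"]] = (
                int(match["run"]), int(match["num"]), int(match["other"]))
    return results


# The rows reported in this issue, copied from the summary above.
SAMPLE = (
    "SUMMARY             \tnb test run \tnumerical error   \tother error  \n"
    "REAL             \t1569648\t\t0\t(0.000%)\t0\t(0.000%)\t\n"
    "DOUBLE PRECISION\t1563140\t\t4\t(0.000%)\t4\t(0.000%)\t\n"
    "COMPLEX          \t1029730\t\t0\t(0.000%)\t0\t(0.000%)\t\n"
    "COMPLEX16         \t1029705\t\t1\t(0.000%)\t0\t(0.000%)\t\n"
    "\n"
    "--> ALL PRECISIONS\t5192223\t\t5\t(0.000%)\t4\t(0.000%)\n"
)

if __name__ == "__main__":
    for name, (run, num, other) in parse_summary(SAMPLE).items():
        print(f"{name}: {num} numerical, {other} other, out of {run}")
```

Keeping the counts as a dictionary makes it easy to diff two runs, for example one built at -O2 and one at -O1.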

On a related note, are there guidelines on how to direct the make file to use the macOS Accelerate for BLAS instead of the inbuilt librefblas?

Checklist

  • I've included a minimal example to reproduce the issue
  • I'd be willing to make a PR to solve this issue
@andyfaff
Author

Bisecting with git traces the failing COMPLEX16 test to 2a87758 (as seen on my hardware/OS combo).

@andyfaff
Author

andyfaff commented Nov 14, 2024

I believe the four failing double precision calculations are also seen in CI. Bisecting on the output of make -j8 all pins them down to roughly 3838c85 or 0791c22.

From examining CI it doesn't look like these numerical outputs are used to pass/fail the test suite though.

@langou
Contributor

langou commented Nov 14, 2024

Argh. Thanks for the report. Shoo. That's going to be a tough one to debug. We'll have a look.
@jprhyne

@langou
Contributor

langou commented Nov 14, 2024

On a related note, are there guidelines on how to direct the make file to use the macOS Accelerate for BLAS instead of the inbuilt librefblas?

In make.inc,

BLASLIB = $(TOPSRCDIR)/librefblas.a

you should be able to change BLASLIB to whatever you want. So, for example, you should be able to link with "macOS Accelerate for BLAS". I assume cmake has a similar option.
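
As an illustration of the BLASLIB change described above, here is a minimal sketch that rewrites the assignment in a copy of make.inc. The helper name `set_blaslib` is hypothetical. The value `-framework Accelerate` is the usual linker flag for Apple's Accelerate framework, but using it here is an assumption: nobody in this thread confirms that the LAPACK test suite builds and runs against it.

```python
import re

# Matches a whole "BLASLIB = ..." line and keeps the left-hand side in group 1.
BLASLIB_LINE = re.compile(r"^([ \t]*BLASLIB[ \t]*=[ \t]*).*$", re.MULTILINE)


def set_blaslib(make_inc_text, new_value):
    """Return make.inc text with the BLASLIB assignment replaced by new_value."""
    # A function replacement avoids backslash handling in the new value.
    new_text, count = BLASLIB_LINE.subn(
        lambda m: m.group(1) + new_value, make_inc_text)
    if count == 0:
        raise ValueError("no BLASLIB assignment found in make.inc text")
    return new_text


if __name__ == "__main__":
    original = "BLASLIB = $(TOPSRCDIR)/librefblas.a\n"
    # Assumed value; adjust to whatever BLAS you actually want to link.
    print(set_blaslib(original, "-framework Accelerate"), end="")
```

Editing the line by hand works just as well; a script is only worthwhile when the same tree is rebuilt against several BLAS libraries.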

@martin-frbg
Collaborator

Curiously, I got an additional numerical & other error in REAL (SGS failure with INFO=9 from SGGES3), and a single numerical error in COMPLEX instead of the one in COMPLEX16, when I reproduced it on the M1 in the GCC Compile Farm. (That machine also uses the Homebrew gfortran 14.2.0_1, but the system clang is at 14.0.0, since only the Xcode 14.2 command-line tools are installed.) Our friendly neighborhood "minor testing failures" at work again?

@martin-frbg
Collaborator

I also note that the "rough bisect" pointed to the new NRM2 routines, which are also implicated in the SVD (dgesdd) divergence errors discussed in #672.

@andyfaff
Author

The reason I started running these tests is that SciPy uses Accelerate as the underlying BLAS. We've just noticed that our test suite has failures on macOS 15 that did not occur on macOS 14. With macOS 14 --> 15 Apple updated Accelerate. Part of the change was a bump in LAPACK from 3.9.1 to 3.11.

I've given Apple some feedback that we're experiencing problems, but the reproducer currently just shows how the SciPy test suite fails. I was trying to find another way of demonstrating the issue, short of coming up with a specific Fortran program. I therefore thought to try building this project against Accelerate and running the test suite, so that if there were failures I could point the Apple engineers towards something they may find easier to digest. I haven't succeeded in building and testing this project against Accelerate, but I did get as far as running the suite as-is. This highlighted the issues, so I thought I'd report them.

@jprhyne
Contributor

jprhyne commented Nov 16, 2024

Hey @andyfaff, sorry for the late response! I gave this a quick look and I am unable to reproduce the errors you are experiencing. I tried on my AMD Linux machine with a Ryzen 5 7640U processor, and also in a macOS VM built from a recovery image on an Intel CPU. I have attached a screenshot showing the same versions of gcc and gfortran installed with brew and no failures.
[Screenshot: latestscreenshot]
Unfortunately, I don't have access to an ARM M1 Mac, but I do have access to a Raspberry Pi. I'll try to recreate the issue on that machine, and if I can, I'll post an update.

Looking at your log, it seems that the errors are in the routines
DGEES1
DGEESX1
ZGGEV (This one I am less sure about reading the error)

Based on https://netlib.org/lapack/installation.hints, I am unsure what these failures could mean. Is this what @martin-frbg was talking about?

I appreciate any extra insight!

@andyfaff
Author

andyfaff commented Nov 16, 2024

They're visible on macOS Sequoia (OS 15.1) with Xcode 16.1. One way for you to see the errors would be in CI, but that would require macos-latest and the latest Xcode. Four of the numerical failures are already visible in the project's CI.

@martin-frbg
Collaborator

The errors are not reproducible on x86_64 or in emulation, but they (and/or similar ones) are reproducible on actual M1 hardware. That's why I wrote that grumpy response about known accuracy issues in the testsuite.
I haven't gotten around to checking yet whether the problems go away after putting back the "old" NRM2 code.

@martin-frbg
Collaborator

@jprhyne see also https://netlib.org/lapack/faq#_how_do_i_interpret_lapack_testing_failures (and I think one of the older issues about failed tests had a detailed explanation why some tests will always report a huge absolute error when they fail)

@martin-frbg
Collaborator

Restoring the "old" dnrm2.f does fix the 4+4 errors in double precision, and restoring scnrm2.f removes the single one I got in COMPLEX (rather than COMPLEX16 as reported above). I'm still seeing a single error in SHS with Matrix order= 5, type=18, seed=1471,2745,3835,1213, result 11 is 28.04 that is obviously unrelated.
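
For readers wondering why a 2-norm routine is numerically delicate in the first place, the sketch below contrasts a naive sum of squares with a simple scaled version. It is only a generic illustration of the overflow problem that NRM2 implementations guard against. It is neither the old dnrm2.f nor the new dnrm2.f90 algorithm, and it does not reproduce the M1 failures discussed here; the function names are hypothetical.

```python
import math


def naive_norm(values):
    """2-norm via a plain sum of squares; the squares can overflow to inf."""
    return math.sqrt(sum(v * v for v in values))


def scaled_norm(values):
    """2-norm with every entry divided by the largest magnitude first."""
    scale = max((abs(v) for v in values), default=0.0)
    if scale == 0.0:
        return 0.0  # empty input or all zeros
    # Every ratio has magnitude <= 1, so the squares cannot overflow.
    return scale * math.sqrt(sum((v / scale) * (v / scale) for v in values))


if __name__ == "__main__":
    big = [1e200, 1e200]
    print(naive_norm(big))   # the squares overflow, so this is inf
    print(scaled_norm(big))  # finite, close to sqrt(2) * 1e200
```

Because a careful implementation involves extra scaling and accumulation steps like these, its result can depend on exactly how the compiler orders and rounds the arithmetic, which is consistent with (though not proof of) the optimization-level sensitivity reported in this thread.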

@angsch
Collaborator

angsch commented Nov 30, 2024

Thank you so much, @martin-frbg, for having narrowed down the bug. This is fantastic work.

In the dgesdd bug, the matrix is so big that it is hard to understand what is going on. This case, however, is much smaller and we may have a chance of understanding the convergence failure.

 *** Error code from DGEES =    6

https://gist.github.com/andyfaff/ff02543e7ec9561b28d8a7c6702d43a0#file-testing_results-txt-L1404

Do you think you could just print the matrices? I hope that getting every step in gehrd is enough. I unfortunately don't have a machine to reproduce the error.

Also, is there a difference if you change the compiler optimization level? I assume that you used the default -O2. Does the problem surface with a lower optimization level?

Edit: hseqr -> gehrd

@andyfaff
Author

andyfaff commented Dec 1, 2024

@Developer-Ecosystem-Engineering you may be interested in this parallel issue that I experienced after opening the one in scipy.

@Developer-Ecosystem-Engineering

Thanks @andyfaff. We don't pull in BLAS at the moment, so this is unlikely to be an issue on our end.

Restoring the "old" dnrm2.f does fix the 4+4 errors in double precision, and restoring scnrm2.f removes the single one I got in COMPLEX (rather than COMPLEX16 as reported above). I'm still seeing a single error in SHS with Matrix order= 5, type=18, seed=1471,2745,3835,1213, result 11 is 28.04 that is obviously unrelated.

^

@martin-frbg
Collaborator

@angsch sorry, forgot to add that - errors reduce to a single failure in DOUBLE at or below -O1

 Matrix order=    6, type=21, seed=3618, 381,2331,3777, result  7 is 4.504D+15
 DGS drivers:      1 out of   1555 tests failed to pass the threshold
 *** Error code from DDRGES =    9

@angsch
Collaborator

angsch commented Dec 3, 2024

@martin-frbg Sorry for asking one more time: Did you reduce all files to -O1 or only the nrm2 algorithm? What compiler is this? gfortran or flang?

@martin-frbg
Collaborator

Ah, sorry again - that is with -O1 applied globally, and using gfortran-14 like in the original report. I can modify the makefile to build just dnrm2 with -O1 of course - got derailed while attempting to trim down the output from DGEHRD to just the problematic case. (Frustratingly, reducing the selection of matrix dimensions to test in ded.in makes the error disappear for values that error out when used in sequence.)

@martin-frbg
Collaborator

So: same reduction in error count (to the one in DDRGES3) with only nrm2 compiled at -O1. (Well, actually I cheated and compiled all .f90 files in BLAS/src at reduced optimization, but I doubt the others matter here.)

@martin-frbg
Collaborator

The decisive difference between -O1 and -O2 compilation of dnrm2.f90 turns out to be -fexpensive-optimizations, which is probably not very helpful.
