xshseqr and xdhseqr fail with FPE if run in parallel #69
Thanks for the bug report! More details about this bug:
Lines 1087 to 1088 in de3919e
*
      WRITE(*,*) "NWIN = ", NWIN, ", ILO = ", ILO, ", LIHI = ",
     $   LIHI, ", I = ", I
*
      IF( FLOPS.NE.0 .AND.
     $    ( FLOPS*100 ) / ( 2*NWIN*NWIN ) .GE. MMULT ) THEN
*
Result of the tests:
$ mpiexec -n 2 xshseqr
[...]
NWIN = 4 , ILO = 73 , LIHI = 76 , I = 73
NWIN = 4 , ILO = 73 , LIHI = 76 , I = 73
NWIN = 4 , ILO = 73 , LIHI = 76 , I = 73
NWIN = 4 , ILO = 73 , LIHI = 74 , I = 73
NWIN = 19 , ILO = 31 , LIHI = 49 , I = 31
NWIN = 19 , ILO = 31 , LIHI = 49 , I = 31
NWIN = 19 , ILO = 31 , LIHI = 49 , I = 31
NWIN = 19 , ILO = 31 , LIHI = 44 , I = 31
NWIN = 0 , ILO = 45 , LIHI = 51 , I = 52
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7fd360623d21 in ???
#1 0x7fd360622ef5 in ???
#2 0x7fd36045408f in ???
at /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3 0x7fd362e1ecb6 in pstrord_
at ../SRC/pstrord.f:1090
#4 0x7fd362e4acc5 in pslaqr3_
at ../SRC/pslaqr3.f:880
#5 0x7fd362e340da in pslaqr0_
at ../SRC/pslaqr0.f:598
#6 0x7fd362e30dbf in pshseqr_
at ../SRC/pshseqr.f:441
#7 0x558c3509e9ac in pshseqrdriver
at ../TESTING/EIG/pshseqrdriver.f:413
#8 0x558c3509f8bd in main
at ../TESTING/EIG/pshseqrdriver.f:565
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node weslleyp-XPS-15-9510 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
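For reference, the last line printed before the crash has NWIN = 0, so the divisor 2*NWIN*NWIN in the test at pstrord.f:1090 appears to be zero, which would explain the SIGFPE on an integer division. Below is a minimal stand-alone sketch of one way to guard that test. It is not the ScaLAPACK source and not necessarily the right fix: the program name GUARD, the flag USEMM, and the values assigned to FLOPS and MMULT are made up for illustration, and skipping the branch when NWIN is zero is an assumption about the intended behaviour.

```fortran
      PROGRAM GUARD
*     Hypothetical stand-alone sketch, not the ScaLAPACK source.
      INTEGER            FLOPS, NWIN, MMULT
      LOGICAL            USEMM
*     Assumed values; NWIN = 0 mirrors the last line printed
*     before the SIGFPE in the log above.
      FLOPS = 100
      NWIN = 0
      MMULT = 50
*     Test NWIN before dividing. Fortran does not guarantee
*     short-circuit evaluation of .AND., so the division sits in
*     a nested IF that is only reached when NWIN is positive.
      USEMM = .FALSE.
      IF( FLOPS.NE.0 .AND. NWIN.GT.0 ) THEN
         IF( ( FLOPS*100 ) / ( 2*NWIN*NWIN ) .GE. MMULT )
     $      USEMM = .TRUE.
      END IF
      WRITE(*,*) 'USEMM = ', USEMM
      END
```

Built as fixed-form source (for example with gfortran), this should report USEMM as false instead of trapping. Whether an empty window should be skipped or should never occur in the first place is a separate question for pstrord itself.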
Minor update to my last message: All tests still pass in GitHub Actions, see https://github.com/Reference-ScaLAPACK/scalapack/actions/runs/2735265869. Test
So GitHub Actions is not actually using multiple CPU cores?
I think it is. #71 enforces mapping by core, and I think this information (https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources) is accurate, i.e., we have 2 cores available per runner.
In current master, two tests fail if run in parallel:
Both tests pass fine with -n 1. I tested on two machines with differing compilers and MPI versions (4.1.1 and 1.10.7). I observe weirdly long runtimes (hundreds of seconds) for some 2.2.0 tests when run inside the pkgsrc build framework, but they do succeed eventually. These FPEs are more definite.