xshseqr and xdhseqr fail with FPE if run in parallel #69

Open
drhpc opened this issue Jul 25, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@drhpc

drhpc commented Jul 25, 2022

In current master, two tests fail if run in parallel:

69/70 Testing: xshseqr
69/70 Test: xshseqr
Command: "/sw/env/gcc-10.3.0/openmpi/4.1.1/bin/mpiexec" "-n" "2" "./xshseqr"
Directory: /home/rrztest/src/scalapack/TESTING
"xshseqr" start time: Jul 25 20:04 CEST
Output:
----------------------------------------------------------

 ScaLAPACK Test for PSHSEQR

 epsilon   =    5.96046448E-08
 threshold =    30.0000000    

 Residual and Orthogonality Residual computed by:

 Residual      =  || T - Q^T*A*Q ||_F / ( ||A||_F * eps * sqrt(N) )

 Orthogonality =  MAX( || I - Q^T*Q ||_F, || I - Q*Q^T ||_F ) /  (eps * N)

 Test passes if both residuals are less then threshold

    N  NB    P    Q  QR Time  CHECK
----- --- ---- ---- -------- ------

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x151fa27c93ff in ???
#1  0x151fa455124f in pstrord_
        at /home/rrztest/src/scalapack/SRC/pstrord.f:1087
#2  0x151fa457a300 in pslaqr3_
        at /home/rrztest/src/scalapack/SRC/pslaqr3.f:880
#3  0x151fa4565178 in pslaqr0_
        at /home/rrztest/src/scalapack/SRC/pslaqr0.f:598
#4  0x151fa456209d in pshseqr_
        at /home/rrztest/src/scalapack/SRC/pshseqr.f:441
#5  0x4036cf in pshseqrdriver
        at /home/rrztest/src/scalapack/TESTING/EIG/pshseqrdriver.f:413
#6  0x404427 in main
        at /home/rrztest/src/scalapack/TESTING/EIG/pshseqrdriver.f:565
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node node002 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
<end of output>
Test time =   2.91 sec
----------------------------------------------------------
Test Failed.
"xshseqr" end time: Jul 25 20:04 CEST
"xshseqr" time elapsed: 00:00:02
----------------------------------------------------------

70/70 Testing: xdhseqr
70/70 Test: xdhseqr
Command: "/sw/env/gcc-10.3.0/openmpi/4.1.1/bin/mpiexec" "-n" "2" "./xdhseqr"
Directory: /home/rrztest/src/scalapack/TESTING
"xdhseqr" start time: Jul 25 20:04 CEST
Output:
----------------------------------------------------------

 ScaLAPACK Test for PDHSEQR

 epsilon   =    1.1102230246251565E-016
 threshold =    30.000000000000000     

 Residual and Orthogonality Residual computed by:

 Residual      =  || T - Q^T*A*Q ||_F / ( ||A||_F * eps * sqrt(N) )

 Orthogonality =  MAX( || I - Q^T*Q ||_F, || I - Q*Q^T ||_F ) /  (eps * N)

 Test passes if both residuals are less then threshold

    N  NB    P    Q  QR Time  CHECK
----- --- ---- ---- -------- ------

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x1488be0113ff in ???
#1  0x1488bff4ebae in pdtrord_
        at /home/rrztest/src/scalapack/SRC/pdtrord.f:1087
#2  0x1488bff77f2f in pdlaqr3_
        at /home/rrztest/src/scalapack/SRC/pdlaqr3.f:878
#3  0x1488bff62d2b in pdlaqr0_
        at /home/rrztest/src/scalapack/SRC/pdlaqr0.f:598
#4  0x1488bff5fc1d in pdhseqr_
        at /home/rrztest/src/scalapack/SRC/pdhseqr.f:441
#5  0x4036e2 in pdhseqrdriver
        at /home/rrztest/src/scalapack/TESTING/EIG/pdhseqrdriver.f:412
#6  0x404445 in main
        at /home/rrztest/src/scalapack/TESTING/EIG/pdhseqrdriver.f:564
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node node002 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
<end of output>
Test time =   2.70 sec
----------------------------------------------------------
Test Failed.
"xdhseqr" end time: Jul 25 20:04 CEST
"xdhseqr" time elapsed: 00:00:02
----------------------------------------------------------

End testing: Jul 25 20:04 CEST

Both tests pass fine with -n 1. I tested on two machines with differing compilers and MPI versions (4.1.1 and 1.10.7).

I also see unusually long runtimes (hundreds of seconds) for some 2.2.0 tests when run inside the pkgsrc build framework, but those do succeed eventually. The FPEs, by contrast, are clear-cut failures.

@weslleyspereira weslleyspereira added the bug Something isn't working label Jul 25, 2022
@weslleyspereira
Collaborator

Thanks for the bug report! More details about this bug:

  • It was not detected because some tests were disabled in the GitHub Actions workflow. They are now enabled, see 782e739. (My bad, I shouldn't commit directly to the repository. To avoid that, I have just enabled the rule "Require a pull request before merging".)
  • The code breaks at

scalapack/SRC/pstrord.f

Lines 1087 to 1088 in de3919e

               IF( FLOPS.NE.0 .AND.
     $              ( FLOPS*100 ) / ( 2*NWIN*NWIN ) .GE. MMULT ) THEN

  • It breaks because NWIN = LIHI - I + 1 becomes 0 during the test, so the integer division by 2*NWIN*NWIN divides by zero and raises SIGFPE.

  • On my Linux machine, I added the following prints for debugging purposes:

*
               WRITE(*,*) "NWIN = ", NWIN, ", ILO = ", ILO, ", LIHI = ",
     $            LIHI, ", I = ", I
*
               IF( FLOPS.NE.0 .AND.
     $              ( FLOPS*100 ) / ( 2*NWIN*NWIN ) .GE. MMULT ) THEN
*

Result of the tests:

$ mpiexec -n 2 xshseqr
[...]
 NWIN =            4 , ILO =           73 , LIHI =           76 , I =           73
 NWIN =            4 , ILO =           73 , LIHI =           76 , I =           73
 NWIN =            4 , ILO =           73 , LIHI =           76 , I =           73
 NWIN =            4 , ILO =           73 , LIHI =           74 , I =           73
 NWIN =           19 , ILO =           31 , LIHI =           49 , I =           31
 NWIN =           19 , ILO =           31 , LIHI =           49 , I =           31
 NWIN =           19 , ILO =           31 , LIHI =           49 , I =           31
 NWIN =           19 , ILO =           31 , LIHI =           44 , I =           31
 NWIN =            0 , ILO =           45 , LIHI =           51 , I =           52

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7fd360623d21 in ???
#1  0x7fd360622ef5 in ???
#2  0x7fd36045408f in ???
        at /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3  0x7fd362e1ecb6 in pstrord_
        at ../SRC/pstrord.f:1090
#4  0x7fd362e4acc5 in pslaqr3_
        at ../SRC/pslaqr3.f:880
#5  0x7fd362e340da in pslaqr0_
        at ../SRC/pslaqr0.f:598
#6  0x7fd362e30dbf in pshseqr_
        at ../SRC/pshseqr.f:441
#7  0x558c3509e9ac in pshseqrdriver
        at ../TESTING/EIG/pshseqrdriver.f:413
#8  0x558c3509f8bd in main
        at ../TESTING/EIG/pshseqrdriver.f:565
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node weslleyp-XPS-15-9510 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
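
The failing case can be reproduced arithmetically from the debug output above. The following is a minimal sketch in Python (not the library's language), using three (LIHI, I) pairs copied from the printed lines. The function names, the guard in `should_use_matmul`, and any FLOPS/MMULT values passed to it are hypothetical illustrations, not the ScaLAPACK fix.

```python
def window_size(lihi: int, i: int) -> int:
    """NWIN as defined in the thread: NWIN = LIHI - I + 1."""
    return lihi - i + 1


def should_use_matmul(flops: int, nwin: int, mmult: int) -> bool:
    """Hypothetical guarded form of the FLOPS-ratio test.

    Returns False for an empty window instead of dividing by zero.
    This guard is an illustration only, not the ScaLAPACK fix.
    """
    if flops == 0 or nwin <= 0:
        return False
    # Floor division matches Fortran INTEGER division for non-negative operands.
    return (flops * 100) // (2 * nwin * nwin) >= mmult


# (LIHI, I) pairs taken from the debug output in the thread.
for lihi, i in [(76, 73), (49, 31), (51, 52)]:
    nwin = window_size(lihi, i)
    print(f"LIHI={lihi} I={i} -> NWIN={nwin}, denominator={2 * nwin * nwin}")
```

The last pair (LIHI = 51, I = 52) gives NWIN = 0 and a zero denominator, which matches the final debug line before the crash. Whether an empty window should be skipped or should never reach this point is a question for the maintainers; the guard here only shows where the division becomes unsafe.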

@weslleyspereira
Collaborator

weslleyspereira commented Jul 25, 2022

Minor update to my last message: All tests still pass in GitHub Actions, see https://github.com/Reference-ScaLAPACK/scalapack/actions/runs/2735265869.

Test xshseqr is still failing on my personal machine.

@drhpc
Author

drhpc commented Jul 26, 2022

So GitHub Actions are not actually using multiple CPU cores?

@weslleyspereira
Collaborator

So GitHub Actions are not actually using multiple CPU cores?

I think they are. #71 enforces mapping by cores, and I think this information (https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources) is accurate, i.e., we have 2 cores available per runner.
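
One way to check the core count directly on a runner is a small script run as a CI step. This is a sketch that assumes Python is available in the CI image; `usable_cores` is a hypothetical helper name, and it reports only what the operating system exposes to the process, not how mpiexec binds ranks.

```python
import os


def usable_cores() -> int:
    """Number of CPU cores the current process is allowed to run on."""
    # os.sched_getaffinity exists on Linux only; fall back to the total count elsewhere.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


print(f"usable cores: {usable_cores()}")
```

If this reports fewer than 2 on a runner, `mpiexec -n 2` would be oversubscribing a single core there.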


2 participants