
Slow performance of function dsyev and dsyevx (not fully parallelized) #4758

Open
ajz34 opened this issue Jun 19, 2024 · 3 comments
Labels
LAPACK issue: Deficiency in code imported from Reference-LAPACK

Comments

ajz34 commented Jun 19, 2024

Hello developers!

I found that the functions dsyev and dsyevx do not seem to be fully parallelized when

  • compiled with gcc (11.4 / 12.3)
  • run on AMD CPUs (16 cores @ Ryzen 7945HX laptop / 2 physical cores @ EPYC 7763 on GitHub Actions)
  • make options TARGET=ZEN USE_64BITINT=1 DYNAMIC_ARCH=1 NO_CBLAS=0 NO_LAPACK=0 NO_LAPACKE=0 NO_AFFINITY=1 USE_OPENMP=1

Preliminary testing on an Intel CPU also appears to show a similar problem.

I'm not sure whether this is a problem with the make configuration, or whether OpenBLAS currently does not implement parallel versions of dsyev and dsyevx.
Hoping to hear any thoughts or advice, and thanks in advance!


I guess that dsyevr and dsyevd could be better replacements for dsyev: dsyevd is the fastest but consumes more memory, while dsyevr uses a much smaller temporary workspace (a call sketch follows below).
Additionally, as a programmer not very familiar with low-level BLAS/LAPACK, I wonder whether it is common to use dsyevr and dsyevd as eigensolvers instead of dsyev. If so, this may not be such an important issue.
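
A minimal sketch of what switching drivers looks like at the LAPACKE level (assuming the LAPACKE interface is built, as with NO_LAPACKE=0 above; the matrix contents and sizes here are arbitrary and only for illustration):

```c
/* Illustrative only: dsyevd used as a drop-in replacement for dsyev.
 * Both drivers share the same LAPACKE argument list; only the routine
 * name changes.  Link against OpenBLAS (e.g. -lopenblas). */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void)
{
    const lapack_int n = 2048;
    double *a = malloc((size_t)n * n * sizeof(double));
    double *w = malloc((size_t)n * sizeof(double));   /* eigenvalues */

    /* Fill an arbitrary symmetric test matrix. */
    for (lapack_int i = 0; i < n; ++i)
        for (lapack_int j = 0; j < n; ++j)
            a[i * n + j] = 1.0 / (double)(i + j + 1);

    /* QR-based driver discussed in this issue:
     *   LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', n, a, n, w);
     * Divide-and-conquer driver with the same call signature: */
    lapack_int info = LAPACKE_dsyevd(LAPACK_ROW_MAJOR, 'V', 'U', n, a, n, w);

    printf("info = %lld, lowest eigenvalue = %g\n", (long long)info, w[0]);
    free(a);
    free(w);
    return 0;
}
```

Using lapack_int rather than plain int keeps the sketch compatible with the USE_64BITINT=1 build option, where the LAPACK integer type is 64-bit.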


Benchmark results (16 cores @ Ryzen 7945HX)

  • CPU ratio refers to (CPU time) / (elapsed time)
  • eigenproblem of a 2048 x 2048 matrix, eigenvectors required, filled with values from 1 to 2048
| function | OpenBLAS 0.3.27 elapsed time | OpenBLAS 0.3.27 CPU ratio | MKL 2024.1 elapsed time | MKL 2024.1 CPU ratio |
|----------|------------------------------|---------------------------|-------------------------|----------------------|
| dsyev    | 4861.5 msec                  | 1.65                      | 588.7 msec              | 15.75                |
| dsyevd   | 392.2 msec                   | 13.69                     | 225.9 msec              | 15.42                |
| dsyevr   | 805.6 msec                   | 6.30                      | 698.6 msec              | 4.69                 |
| dsyevx   | 3969.8 msec                  | 1.82                      | 542.2 msec              | 15.76                |

A reproduction of this issue can be found in GitHub Actions CI (2 physical cores @ EPYC 7763): https://github.com/ajz34/issue_openblas_dsyev/actions/runs/9578584638/job/26409147609.

For the scripts used on the 16-core Ryzen 7945HX, also see https://github.com/ajz34/issue_openblas_dsyev/tree/16-cores-Ryzen-7945HX.
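
The exact timing code lives in the scripts linked above; just to make the "CPU ratio" metric concrete, a rough sketch of how it can be measured (wall time via omp_get_wtime, process CPU time via clock; the benchmark helper and run callback here are hypothetical, not taken from the actual scripts):

```c
/* Sketch of the "CPU ratio" metric: process CPU time divided by wall time.
 * benchmark() and the run callback are hypothetical helpers, not from the
 * linked scripts; timer choices are illustrative.  Compile with -fopenmp. */
#include <stdio.h>
#include <time.h>
#include <omp.h>

void benchmark(const char *name, void (*run)(void))
{
    double  wall0 = omp_get_wtime();   /* wall-clock (elapsed) time */
    clock_t cpu0  = clock();           /* CPU time summed over all threads */

    run();                             /* e.g. a LAPACKE_dsyev call */

    double elapsed = omp_get_wtime() - wall0;
    double cpu     = (double)(clock() - cpu0) / CLOCKS_PER_SEC;

    printf("%s: elapsed %.1f msec, CPU ratio %.2f\n",
           name, 1000.0 * elapsed, cpu / elapsed);
}
```

In the table above, a ratio close to the core count (as MKL reaches for dsyev/dsyevx) indicates the routine keeps all threads busy, while a ratio near 1–2 suggests it runs mostly serially apart from a few threaded BLAS calls.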

martin-frbg (Collaborator) commented Jun 19, 2024

The LAPACK included in OpenBLAS is almost completely copied from the reference implementation, also known as "netlib" LAPACK (https://github.com/Reference-LAPACK/lapack), which is not optimized for speed (and not parallelized, except for a few functions that can use OpenMP parallelism if available). Only a handful of functions (such as getrf/potrf) have been reimplemented in OpenBLAS; for everything else, the only performance advantage over the reference implementation comes from using the optimized BLAS functions.
And yes, I would expect that one "normally" prefers DSYEVD or DSYEVR over the original QR algorithm (see e.g. https://scicomp.stackexchange.com/questions/11827/flop-counts-for-lapack-symmetric-eigenvalue-routines-dsyev-dsyevd-dsyevx-and-d).
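
For completeness, a sketch of a DSYEVR call at the LAPACKE level is given below, since its argument list is longer than that of dsyev/dsyevd (range selection, tolerance, eigenvector support array); the parameter choices are illustrative only, not taken from the benchmark scripts above.

```c
/* Illustrative LAPACKE_dsyevr call computing all eigenpairs (range 'A').
 * Parameter choices are examples only; abstol = 0.0 requests the
 * driver's default tolerance. */
#include <stdlib.h>
#include <lapacke.h>

lapack_int all_eigenpairs(lapack_int n,
                          double *a,   /* n x n symmetric input, row-major */
                          double *w,   /* output eigenvalues, length n     */
                          double *z)   /* output eigenvectors, n x n       */
{
    lapack_int  m = 0;   /* number of eigenvalues found (n for range 'A') */
    lapack_int *isuppz = malloc(2 * (size_t)n * sizeof(lapack_int));

    lapack_int info = LAPACKE_dsyevr(LAPACK_ROW_MAJOR, 'V', 'A', 'U',
                                     n, a, n,
                                     0.0, 0.0,  /* vl, vu: ignored for 'A' */
                                     0, 0,      /* il, iu: ignored for 'A' */
                                     0.0,       /* abstol: default         */
                                     &m, w, z, n, isuppz);
    free(isuppz);
    return info;
}
```

With range 'A' all eigenpairs are computed, so vl/vu and il/iu are ignored; dsyevr can also return only a selected index or value range, which is where it saves the most work compared with dsyev/dsyevd.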

martin-frbg added the LAPACK issue label Jun 19, 2024
ajz34 (Author) commented Jun 19, 2024

@martin-frbg
I searched for dsyev in the issue list and found only a few issues discussing this topic. But I hadn't expected that this also happens to zheev, in the issues tagged with LAPACK issue 😹
And thanks for the explanation! 😄

martin-frbg (Collaborator) commented

Well, it might be a useful coincidence if the bottleneck in DSYEV turned out to be the (D)LASR function too, but I have not checked.
