
QUDA Quick Start Guide


Installation

As of QUDA 0.8 the preferred build system is CMake, and CMake is required for builds of version 0.9 and later. QUDA 1.0 requires C++11 support, and recent versions of the develop branch raise this requirement to C++14, which implies a requirement of gcc >= 5.

Note that QUDA requires a sufficiently recent version of CMake. If your system does not have a recent enough version installed or available as a module, you can download precompiled binaries or sources from the CMake download page and install them in your home directory.

Here we describe just a quick start, assuming you clone the latest development version from GitHub.

  1. Clone the repository: git clone https://github.com/lattice/quda.git
  2. Create a build directory: mkdir build; cd build
  3. Run cmake: cmake ../quda
  4. Select options with ccmake . This shows all options, which you can modify; a short description of each option is included as well.
  5. Run make using a parallel build: make -j 16. In general, the number of parallel build processes should match or slightly exceed the number of cores on your system (the full sequence is collected below).
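
Putting the steps together, a complete build session might look like the following sketch (the clone location and the -j value are illustrative):

    git clone https://github.com/lattice/quda.git
    mkdir build
    cd build
    cmake ../quda
    ccmake .     # optional: review and adjust the build options
    make -j 16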

Make sure you use the correct architecture for your GPUs in step 3. The default architecture is sm_70, but you may want to specify a different architecture, such as -DQUDA_GPU_ARCH=sm_60 for a Pascal GPU or -DQUDA_GPU_ARCH=sm_80 for an A100. More details can be found at QUDA-Build-With-CMake.
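
For example, to configure for an A100, the cmake invocation in step 3 might look like the following, with all other options left at their defaults:

    cmake ../quda -DQUDA_GPU_ARCH=sm_80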

Internal Tests

QUDA includes a number of internal tests, whose primary goals are correctness and performance testing. At the time of writing, the list of tests stands as follows (an example invocation is sketched after the list):

  • dslash_test: wilson, clover, twisted mass, twisted clover, domain wall, mobius
  • staggered_dslash_test: staggered and improved staggered
  • invert_test: solver test for Wilson-like fermions
  • staggered_invert_test: solver test for staggered-like fermions
  • blas_test: test all blas functions for performance and correctness
  • eigensolve_test: test for Lanczos and Arnoldi eigensolver with Wilson-like fermions
  • staggered_eigensolve_test: test for Lanczos and Arnoldi eigensolver with staggered-like fermions
  • fermion_force_test: test for asqtad fermion force computation (deprecated)
  • gauge_force_test: test for gauge force computation
  • hisq_force_paths_test: HISQ force derivative computation test
  • hisq_unitarize_force_test: HISQ force unitarize test
  • llfat_test: Gauge link fattening for HISQ / asqtad fermions
  • su3_test: Test of SU(3) reconstruction used in dslash_test
  • unitarize_link_test: Test of unitarization used when constructing improved links
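
After a build, the test binaries are found under tests/ in the build directory. As a sketch, a typical invocation might look like the lines below; the exact set of command-line flags varies between QUDA versions, so treat --dslash-type here as illustrative and consult --help for the authoritative list:

    cd build/tests
    ./dslash_test --help                 # list the options supported by this build
    ./dslash_test --dslash-type wilson   # illustrative: exercise the Wilson dslash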

Using the Library

Include the header file include/quda.h in your application, link against lib/libquda.a, and study tests/invert_test.cpp (for Wilson, clover, twisted-mass, or domain wall fermions) or tests/staggered_invert_test.cpp (for asqtad/HISQ fermions) for examples of the solver interface. The various solver options are enumerated in include/enum_quda.h.
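
As a sketch, compiling and linking a minimal application against a QUDA installation might look like the line below, assuming hypothetical QUDA_HOME and CUDA_HOME locations; depending on your build options, additional libraries (e.g., MPI) may also be required:

    # illustrative link line; paths and extra libraries depend on your build
    g++ my_app.cpp -I$QUDA_HOME/include -L$QUDA_HOME/lib -lquda -L$CUDA_HOME/lib64 -lcudart -o my_app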

Kernel Autotuning

QUDA uses runtime autotuning to maximize the performance of each kernel on a given GPU. This brings better performance portability across GPU architectures as well as across different lattice volumes, parameters, etc. The tuning process takes some time and will generally slow things down the first time a given kernel is called during a run. To avoid this one-time overhead in subsequent runs (using the same action, solver, lattice volume, etc.), the optimal parameters are cached to disk. For this to work, the QUDA_RESOURCE_PATH environment variable must be set, pointing to a writeable directory. Note that since the tuned parameters are hardware-specific, this "resource directory" should not be shared between jobs running on different systems (e.g., two clusters with different GPUs installed); attempting to use parameters tuned for one card on a different card may lead to unexpected errors. In addition, QUDA will refuse to run with an outdated tunecache, since using parameters tuned for an older version of QUDA may result in undefined behavior. The tunecache.tsv file is dumped at the end of the run in the location specified by the QUDA_RESOURCE_PATH environment variable. If this is not specified, then autotuning is cached only within the scope of the run and lost when the job ends.
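
For example, a job script might persist the tunecache like this (the directory name is illustrative; it only needs to be writeable and specific to one system):

    mkdir -p $HOME/quda-tunecache
    export QUDA_RESOURCE_PATH=$HOME/quda-tunecache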

Debugging

QUDA has two specific debugging modes, HOST_DEBUG and DEVICE_DEBUG, both selected at configure time (see the sketch after this list):

  • HOST_DEBUG compiles all host code using the -g flag and ensures that all CUDA error reporting is done synchronously (e.g., the GPU and CPU are synchronized prior to fetching the error state). For most debugging, HOST_DEBUG is all that should be needed, since most bugs tend to be in CPU code. There is a noticeable performance impact from enabling HOST_DEBUG, at the 20-50% level, with the penalty being greater at smaller local volumes.
  • DEVICE_DEBUG compiles all GPU kernels using the -G flag. This provides accurate line reporting in cuda-gdb and cuda-memcheck. There is a huge performance penalty from enabling this, at the 100x level.
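
As a sketch, assuming these modes map to the CMAKE_BUILD_TYPE values used by recent CMake-based builds (confirm the exact names for your version with ccmake), the configure step might look like:

    cmake ../quda -DCMAKE_BUILD_TYPE=HOSTDEBUG     # host-side debugging
    cmake ../quda -DCMAKE_BUILD_TYPE=DEVICEDEBUG   # device-side debugging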

Installation of old QUDA versions (pre 0.8)

DEPRECATED: Installation using configure (autoconf) for QUDA 0.7.x
