Feature/neighborlist on host ligand #545

badisa · 2021-12-08T01:26:36Z

This PR extends our Neighborlist to support construct from two sets of coordinates. This will be necessary for eventuating host-ligand interactions.

The performance is unchanged from what is currently on master.

Current Benchmarks on A4000

CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances  Average (ns)  Minimum (ns)  Maximum (ns)  StdDev (ns)                                                  Name                                                
 -------  ---------------  ---------  ------------  ------------  ------------  -----------  ----------------------------------------------------------------------------------------------------
    64.6        1,808,224         12     150,685.3         7,680       294,208    136,109.5  void k_find_blocks_with_ixns<double>(int, const double *, const double *, const double *, const dou…
    20.6          575,296         12      47,941.3         5,024       101,120     41,874.3  void k_find_blocks_with_ixns<float>(int, const double *, const double *, const double *, const doub…
     8.1          227,422         12      18,951.8        16,864        22,528      2,351.4  void k_find_block_bounds<double>(int, int, int, const double *, const double *, double *, double *) 
     4.3          121,057         12      10,088.1         7,104        13,600      3,051.8  void k_find_block_bounds<float>(int, int, int, const double *, const double *, double *, double *)  
     2.4           67,230         24       2,801.3         2,592         3,040        158.5  k_compact_trim_atoms(int, int, unsigned int *, unsigned int *, int *, unsigned int *)   

dhfr-apo: N=23558 speed: 216.82ns/day dt: 1.5fs (ran 100000 steps in 59.78s)
dhfr-apo-barostat-interval-25: N=23558 speed: 193.24ns/day dt: 1.5fs (ran 100000 steps in 67.07s)
dhfr-apo-hmr-barostat-interval-25: N=23558 speed: 319.22ns/day dt: 2.5fs (ran 100000 steps in 67.67s)
building a protein system with 1758 protein atoms and 7047 water atoms
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
hif2a-apo: N=8805 speed: 478.39ns/day dt: 1.5fs (ran 100000 steps in 27.10s)
hif2a-apo-barostat-interval-25: N=8805 speed: 418.24ns/day dt: 1.5fs (ran 100000 steps in 30.99s)
hif2a-rbfe-with-du-dp: N=8840 speed: 360.66ns/day dt: 1.5fs (ran 100000 steps in 35.94s)
hif2a-rbfe-du-dl-interval-0: N=8840 speed: 362.89ns/day dt: 1.5fs (ran 100000 steps in 35.72s)
hif2a-rbfe-du-dl-interval-1: N=8840 speed: 350.11ns/day dt: 1.5fs (ran 100000 steps in 37.02s)
hif2a-rbfe-du-dl-interval-5: N=8840 speed: 358.71ns/day dt: 1.5fs (ran 100000 steps in 36.13s)
solvent-apo: N=6282 speed: 631.52ns/day dt: 1.5fs (ran 100000 steps in 20.53s)
solvent-apo-barostat-interval-25: N=6282 speed: 579.31ns/day dt: 1.5fs (ran 100000 steps in 22.38s)
solvent-rbfe-with-du-dp: N=6317 speed: 469.03ns/day dt: 1.5fs (ran 100000 steps in 27.64s)
solvent-rbfe-du-dl-interval-0: N=6317 speed: 471.42ns/day dt: 1.5fs (ran 100000 steps in 27.50s)
solvent-rbfe-du-dl-interval-1: N=6317 speed: 457.64ns/day dt: 1.5fs (ran 100000 steps in 28.32s)
solvent-rbfe-du-dl-interval-5: N=6317 speed: 467.04ns/day dt: 1.5fs (ran 100000 steps in 27.75s)

The changes in the PR

CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances  Average (ns)  Minimum (ns)  Maximum (ns)  StdDev (ns)                                                  Name                                                
 -------  ---------------  ---------  ------------  ------------  ------------  -----------  ----------------------------------------------------------------------------------------------------
    65.8        1,840,324         12     153,360.3         7,616       299,393    138,820.0  void k_find_blocks_with_ixns<double, (bool)1>(int, int, const double *, const double *, const doubl…
    19.6          549,059         12      45,754.9         4,448        96,545     40,152.3  void k_find_blocks_with_ixns<float, (bool)1>(int, int, const double *, const double *, const double…
     8.1          225,792         12      18,816.0        16,800        21,792      2,310.7  void k_find_block_bounds<double>(int, int, int, const double *, const double *, double *, double *) 
     4.0          113,089         12       9,424.1         6,752        12,832      2,874.9  void k_find_block_bounds<float>(int, int, int, const double *, const double *, double *, double *)  
     2.4           66,528         24       2,772.0         2,496         4,000        337.9  k_compact_trim_atoms(int, int, unsigned int *, unsigned int *, int *, unsigned int *) 


dhfr-apo: N=23558 speed: 216.85ns/day dt: 1.5fs (ran 100000 steps in 59.77s)
dhfr-apo-barostat-interval-25: N=23558 speed: 194.38ns/day dt: 1.5fs (ran 100000 steps in 66.68s)
dhfr-apo-hmr-barostat-interval-25: N=23558 speed: 319.71ns/day dt: 2.5fs (ran 100000 steps in 67.57s)
building a protein system with 1758 protein atoms and 7047 water atoms
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
hif2a-apo: N=8805 speed: 476.93ns/day dt: 1.5fs (ran 100000 steps in 27.18s)
hif2a-apo-barostat-interval-25: N=8805 speed: 418.21ns/day dt: 1.5fs (ran 100000 steps in 30.99s)
hif2a-rbfe-with-du-dp: N=8840 speed: 360.70ns/day dt: 1.5fs (ran 100000 steps in 35.93s)
hif2a-rbfe-du-dl-interval-0: N=8840 speed: 361.52ns/day dt: 1.5fs (ran 100000 steps in 35.85s)
hif2a-rbfe-du-dl-interval-1: N=8840 speed: 350.61ns/day dt: 1.5fs (ran 100000 steps in 36.97s)
hif2a-rbfe-du-dl-interval-5: N=8840 speed: 358.75ns/day dt: 1.5fs (ran 100000 steps in 36.13s)
solvent-apo: N=6282 speed: 631.58ns/day dt: 1.5fs (ran 100000 steps in 20.52s)
solvent-apo-barostat-interval-25: N=6282 speed: 581.07ns/day dt: 1.5fs (ran 100000 steps in 22.31s)
solvent-rbfe-with-du-dp: N=6317 speed: 469.30ns/day dt: 1.5fs (ran 100000 steps in 27.62s)
solvent-rbfe-du-dl-interval-0: N=6317 speed: 472.75ns/day dt: 1.5fs (ran 100000 steps in 27.42s)
solvent-rbfe-du-dl-interval-1: N=6317 speed: 456.27ns/day dt: 1.5fs (ran 100000 steps in 28.41s)
solvent-rbfe-du-dl-interval-5: N=6317 speed: 464.34ns/day dt: 1.5fs (ran 100000 steps in 27.92s)

badisa · 2021-12-08T01:44:56Z

timemachine/cpp/src/kernels/k_neighborlist.cuh

@@ -303,12 +309,12 @@ void __global__ k_find_blocks_with_ixns(
        atom_box_dy = max(static_cast<RealType>(0.0), fabs(atom_box_dy) - col_bb_ext_y);
        atom_box_dz = max(static_cast<RealType>(0.0), fabs(atom_box_dz) - col_bb_ext_z);

+        bool check_column_atoms =
+            atom_i_idx < K &&


This additional check is important, without this check you get junk interactions and the test fails.. Previously we were getting lucky with positions beyond the number of atoms we have.

badisa · 2021-12-08T01:45:42Z

timemachine/cpp/src/math_utils.cuh

@@ -1,9 +1,11 @@
 #pragma once

-__device__ __host__ double sign(double a) { return (a > 0) - (a < 0); }
+int __forceinline__ ceil_divide(int x, int y) { return (x + y - 1) / y; }


I would like this to be more widely used, but for now I just used it for the code that relate to these changes.

Would it be okay if we kept the old sign code as well? Not sure why it was deleted.

Its only used by the GBSA potential which isn't currently compiled. But makes sense to keep it if we don't also remove the GBSA potential.

Okay, these are really just jvp rules for autodiff. Let's keep them for now since they may be useful later on?

tests/test_nblist.py

timemachine/cpp/src/wrap_kernels.cpp

mcwitt

Took a first pass and left some surface-level comments!

timemachine/cpp/src/neighborlist.cu

mcwitt · 2021-12-08T18:35:16Z

timemachine/cpp/src/neighborlist.cu


    cudaDeviceSynchronize();
-    const int B = this->B(); //(N+32-1)/32;
+    const int tpb = 32;


Not specific to this PR, but the definition tpb = 32 seems replicated in a lot of places - would it make sense to declare as a global (or at least file-level) constant?

We definitely should and its made worse that in some places tpb is used as a substitute for block size rather than warpsize

mcwitt · 2021-12-08T18:38:47Z

timemachine/cpp/src/math_utils.cuh

-__device__ __host__ float sign(Surreal<float> a) { return (a.real > 0) - (a.real < 0); }
+// __device__ __host__ double sign(Surreal<double> a) { return (a.real > 0) - (a.real < 0); }
+
+// __device__ __host__ float sign(Surreal<float> a) { return (a.real > 0) - (a.real < 0); }


was commenting these out intentional? Would it make sense to just remove them if unused?

timemachine/cpp/src/neighborlist.cu

mcwitt · 2021-12-08T18:46:18Z

timemachine/cpp/src/neighborlist.cu

+    const int N,                // Number of atoms in column
+    const int K,                // Number of atoms in row


This naming feels a bit strange to me, since I'm used to N being the total number. Is this conventional, or could it make sense to give a different name to the number of atoms in a column? (it's possible this will make more sense to me once I have a better understanding of what's going on here)

I find that N is just a size of 'something', I can change them more broadly

I think I also got tripped by this while reading the PR, maybe we can rename N to NR and K to NC (num_row and num_col atoms)?

maxentile

Looks great in my initial testing!

And can yield a big speedup in cases where we might want compute a batch of host-ligand neighborlists on stored snapshots (30x speed-up on the test system, from ~70 snapshots / second to ~2000 snapshots / second: https://gist.github.com/maxentile/fbaf654a7f21c05adce3b04145c43866 )

tests/test_nblist.py

proteneer · 2021-12-10T16:20:12Z

timemachine/cpp/src/neighborlist.cu

-    // assert(N == N_);
+std::vector<std::vector<int>>
+Neighborlist<RealType>::get_nblist_host(int N, const double *h_coords, const double *h_box, const double cutoff) {
+    // Call into the impl that takes colum and row coords


proteneer · 2021-12-10T16:23:19Z

timemachine/cpp/src/neighborlist.cu

@@ -8,23 +8,38 @@

 namespace timemachine {

-template <typename RealType> Neighborlist<RealType>::Neighborlist(int N) : N_(N) {
+template <typename RealType>
+Neighborlist<RealType>::Neighborlist(const int N, const int K) : N_(N), K_(K), compute_full_matrix_(K > 0) {


Should we change compute_full_matrix_ into a function? i.e. if K_ == 0 compute_full_matrix returns True (as opposed to storing it as a separate variable that may get out of sync during a later refactor)

proteneer · 2021-12-10T16:37:20Z

Looks great in my initial testing!

And can yield a big speedup in cases where we might want compute a batch of host-ligand neighborlists on stored snapshots (30x speed-up on the test system, from ~70 snapshots / second to ~2000 snapshots / second: https://gist.github.com/maxentile/fbaf654a7f21c05adce3b04145c43866 )

This probably is also the bare minimum speed-up as well, as there just isn't enough work to saturate the GPU for something like this to proceed one ligand at a time. In fact, I think this code as-as can probably support the batch use case (without resorting to streams) where we can aggregate row atoms so they represent multiple conformations of the same ligand, and we emit a set of tiles from this neighborlist that can compute ligand-host energies for multiple-conformations all at once.

proteneer · 2021-12-10T16:40:31Z

@badisa Can we add a test case where we use the neighborlist to compute the U_HH interactions. Let NR be the number of row atoms, NC be the number of column atoms, and NT the total number of atoms in the complete system. Set NR to some number < NT, and NC to zero. I think right now we only test the case where NR < NT, NC > 0 (U_HL), and NR = NT, and NC = 0 (U original), but would be good to also test NR < NT, NC = 0 (U_HH)

badisa · 2021-12-10T16:59:47Z

Can we add a test case where we use the neighborlist to compute the U_HH interactions?

@proteneer How is this different from the previous test that computes the ixns of a water box? You would just not include the ligand in the coordinates passed as the columns. https://github.com/proteneer/timemachine/blob/master/tests/test_nblist.py#L69

proteneer

generally LGTM, save for some minor comments. Would like to see the test mentioned on using it for U_HH but probably isn't a blocker for merge.

proteneer · 2021-12-10T17:01:23Z

timemachine/cpp/src/neighborlist.cu

+    }
+    if (D != 3) {
+        throw std::runtime_error("D != 3");
+    }



Can we add a consistency check that if K==0 then col_ptr is null, and if K>0, then col_ptr is not null?

proteneer · 2021-12-10T17:06:18Z

Can we add a test case where we use the neighborlist to compute the U_HH interactions?

@proteneer How is this different from the previous test that computes the ixns of a water box? You would just not include the ligand in the coordinates passed as the columns. https://github.com/proteneer/timemachine/blob/master/tests/test_nblist.py#L69

Hm never mind I think I see how you'd probably use this in a nonbonded.cu later on! So we'd simply rely on the caller to prepare the sizes of the input arrays directly.

proteneer · 2021-12-10T17:06:45Z

generally LGTM, save for some minor comments. Would like to see the test mentioned on using it for U_HH but probably isn't a blocker for merge.

Never mind - test for U_HH isn't needed. This PR is great!

maxentile · 2021-12-10T18:16:05Z

This probably is also the bare minimum speed-up as well, as there just isn't enough work to saturate the GPU for something like this to proceed one ligand at a time. In fact, I think this code as-as can probably support the batch use case (without resorting to streams) where we can aggregate row atoms so they represent multiple conformations of the same ligand, and we emit a set of tiles from this neighborlist that can compute ligand-host energies for multiple-conformations all at once.

Agreed! The speed-up from later applying this batch optimization should be substantial.

Just to have a baseline, here are my timings for the Jax reference implementation (computing all host-ligand distances in double precision on the CPU using vmap(distance_on_pairs), for batch sizes 1 through 1000): https://gist.github.com/maxentile/9f8fdc330bf948e588fc8c90177d8e61 .

* Functions aren't called

* Important to compute nonbonded energies of host-ligand

@jfass

* Thanks to @jfass for noticing that we could get rid of the python loop

badisa requested review from mcwitt, proteneer and maxentile December 8, 2021 01:26

badisa commented Dec 8, 2021

View reviewed changes

maxentile reviewed Dec 8, 2021

View reviewed changes

tests/test_nblist.py Outdated Show resolved Hide resolved

tests/test_nblist.py Show resolved Hide resolved

maxentile reviewed Dec 8, 2021

View reviewed changes

timemachine/cpp/src/wrap_kernels.cpp Outdated Show resolved Hide resolved

mcwitt reviewed Dec 8, 2021

View reviewed changes

maxentile reviewed Dec 8, 2021

View reviewed changes

tests/test_nblist.py Outdated Show resolved Hide resolved

proteneer reviewed Dec 10, 2021

View reviewed changes

proteneer approved these changes Dec 10, 2021

View reviewed changes

maxentile approved these changes Dec 13, 2021

View reviewed changes

badisa added 3 commits December 13, 2021 09:27

Removes unused code in neighborlist

c19691b

* Functions aren't called

Adds support for computing NBlist of Host/Ligand

60cdc85

* Important to compute nonbonded energies of host-ligand

Vectorizes delta_r box centering

cd815ab

* Thanks to @jfass for noticing that we could get rid of the python loop

badisa force-pushed the feature/neighborlist-on-host-ligand branch from f786505 to cd815ab Compare December 13, 2021 16:29

badisa merged commit 5f9d022 into master Dec 13, 2021

badisa deleted the feature/neighborlist-on-host-ligand branch December 13, 2021 17:05

mcwitt mentioned this pull request Jan 24, 2022

Add NonbondedInteractionGroup potential, rename existing nonbonded potentials #578

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/neighborlist on host ligand #545

Feature/neighborlist on host ligand #545

badisa commented Dec 8, 2021

badisa Dec 8, 2021

proteneer Dec 10, 2021

badisa Dec 8, 2021

proteneer Dec 10, 2021

badisa Dec 10, 2021

proteneer Dec 10, 2021

mcwitt left a comment

mcwitt Dec 8, 2021

badisa Dec 8, 2021

mcwitt Dec 8, 2021

mcwitt Dec 8, 2021

badisa Dec 8, 2021

proteneer Dec 10, 2021

maxentile left a comment

proteneer Dec 10, 2021

proteneer Dec 10, 2021

proteneer commented Dec 10, 2021 •

edited

Loading

proteneer commented Dec 10, 2021 •

edited

Loading

badisa commented Dec 10, 2021 •

edited

Loading

proteneer left a comment

proteneer Dec 10, 2021

proteneer commented Dec 10, 2021

proteneer commented Dec 10, 2021

maxentile commented Dec 10, 2021

		const int N, // Number of atoms in column
		const int K, // Number of atoms in row

Feature/neighborlist on host ligand #545

Feature/neighborlist on host ligand #545

Conversation

badisa commented Dec 8, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mcwitt left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

maxentile left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

proteneer commented Dec 10, 2021 • edited Loading

proteneer commented Dec 10, 2021 • edited Loading

badisa commented Dec 10, 2021 • edited Loading

proteneer left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

proteneer commented Dec 10, 2021

proteneer commented Dec 10, 2021

maxentile commented Dec 10, 2021

proteneer commented Dec 10, 2021 •

edited

Loading

proteneer commented Dec 10, 2021 •

edited

Loading

badisa commented Dec 10, 2021 •

edited

Loading