Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/neighborlist on host ligand #545

Merged
merged 3 commits into from
Dec 13, 2021

Conversation

badisa
Copy link
Collaborator

@badisa badisa commented Dec 8, 2021

This PR extends our Neighborlist to support construct from two sets of coordinates. This will be necessary for eventuating host-ligand interactions.

The performance is unchanged from what is currently on master.

Current Benchmarks on A4000

CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances  Average (ns)  Minimum (ns)  Maximum (ns)  StdDev (ns)                                                  Name                                                
 -------  ---------------  ---------  ------------  ------------  ------------  -----------  ----------------------------------------------------------------------------------------------------
    64.6        1,808,224         12     150,685.3         7,680       294,208    136,109.5  void k_find_blocks_with_ixns<double>(int, const double *, const double *, const double *, const dou…
    20.6          575,296         12      47,941.3         5,024       101,120     41,874.3  void k_find_blocks_with_ixns<float>(int, const double *, const double *, const double *, const doub…
     8.1          227,422         12      18,951.8        16,864        22,528      2,351.4  void k_find_block_bounds<double>(int, int, int, const double *, const double *, double *, double *) 
     4.3          121,057         12      10,088.1         7,104        13,600      3,051.8  void k_find_block_bounds<float>(int, int, int, const double *, const double *, double *, double *)  
     2.4           67,230         24       2,801.3         2,592         3,040        158.5  k_compact_trim_atoms(int, int, unsigned int *, unsigned int *, int *, unsigned int *)   

dhfr-apo: N=23558 speed: 216.82ns/day dt: 1.5fs (ran 100000 steps in 59.78s)
dhfr-apo-barostat-interval-25: N=23558 speed: 193.24ns/day dt: 1.5fs (ran 100000 steps in 67.07s)
dhfr-apo-hmr-barostat-interval-25: N=23558 speed: 319.22ns/day dt: 2.5fs (ran 100000 steps in 67.67s)
building a protein system with 1758 protein atoms and 7047 water atoms
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
hif2a-apo: N=8805 speed: 478.39ns/day dt: 1.5fs (ran 100000 steps in 27.10s)
hif2a-apo-barostat-interval-25: N=8805 speed: 418.24ns/day dt: 1.5fs (ran 100000 steps in 30.99s)
hif2a-rbfe-with-du-dp: N=8840 speed: 360.66ns/day dt: 1.5fs (ran 100000 steps in 35.94s)
hif2a-rbfe-du-dl-interval-0: N=8840 speed: 362.89ns/day dt: 1.5fs (ran 100000 steps in 35.72s)
hif2a-rbfe-du-dl-interval-1: N=8840 speed: 350.11ns/day dt: 1.5fs (ran 100000 steps in 37.02s)
hif2a-rbfe-du-dl-interval-5: N=8840 speed: 358.71ns/day dt: 1.5fs (ran 100000 steps in 36.13s)
solvent-apo: N=6282 speed: 631.52ns/day dt: 1.5fs (ran 100000 steps in 20.53s)
solvent-apo-barostat-interval-25: N=6282 speed: 579.31ns/day dt: 1.5fs (ran 100000 steps in 22.38s)
solvent-rbfe-with-du-dp: N=6317 speed: 469.03ns/day dt: 1.5fs (ran 100000 steps in 27.64s)
solvent-rbfe-du-dl-interval-0: N=6317 speed: 471.42ns/day dt: 1.5fs (ran 100000 steps in 27.50s)
solvent-rbfe-du-dl-interval-1: N=6317 speed: 457.64ns/day dt: 1.5fs (ran 100000 steps in 28.32s)
solvent-rbfe-du-dl-interval-5: N=6317 speed: 467.04ns/day dt: 1.5fs (ran 100000 steps in 27.75s)

The changes in the PR

CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances  Average (ns)  Minimum (ns)  Maximum (ns)  StdDev (ns)                                                  Name                                                
 -------  ---------------  ---------  ------------  ------------  ------------  -----------  ----------------------------------------------------------------------------------------------------
    65.8        1,840,324         12     153,360.3         7,616       299,393    138,820.0  void k_find_blocks_with_ixns<double, (bool)1>(int, int, const double *, const double *, const doubl…
    19.6          549,059         12      45,754.9         4,448        96,545     40,152.3  void k_find_blocks_with_ixns<float, (bool)1>(int, int, const double *, const double *, const double…
     8.1          225,792         12      18,816.0        16,800        21,792      2,310.7  void k_find_block_bounds<double>(int, int, int, const double *, const double *, double *, double *) 
     4.0          113,089         12       9,424.1         6,752        12,832      2,874.9  void k_find_block_bounds<float>(int, int, int, const double *, const double *, double *, double *)  
     2.4           66,528         24       2,772.0         2,496         4,000        337.9  k_compact_trim_atoms(int, int, unsigned int *, unsigned int *, int *, unsigned int *) 


dhfr-apo: N=23558 speed: 216.85ns/day dt: 1.5fs (ran 100000 steps in 59.77s)
dhfr-apo-barostat-interval-25: N=23558 speed: 194.38ns/day dt: 1.5fs (ran 100000 steps in 66.68s)
dhfr-apo-hmr-barostat-interval-25: N=23558 speed: 319.71ns/day dt: 2.5fs (ran 100000 steps in 67.57s)
building a protein system with 1758 protein atoms and 7047 water atoms
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
hif2a-apo: N=8805 speed: 476.93ns/day dt: 1.5fs (ran 100000 steps in 27.18s)
hif2a-apo-barostat-interval-25: N=8805 speed: 418.21ns/day dt: 1.5fs (ran 100000 steps in 30.99s)
hif2a-rbfe-with-du-dp: N=8840 speed: 360.70ns/day dt: 1.5fs (ran 100000 steps in 35.93s)
hif2a-rbfe-du-dl-interval-0: N=8840 speed: 361.52ns/day dt: 1.5fs (ran 100000 steps in 35.85s)
hif2a-rbfe-du-dl-interval-1: N=8840 speed: 350.61ns/day dt: 1.5fs (ran 100000 steps in 36.97s)
hif2a-rbfe-du-dl-interval-5: N=8840 speed: 358.75ns/day dt: 1.5fs (ran 100000 steps in 36.13s)
solvent-apo: N=6282 speed: 631.58ns/day dt: 1.5fs (ran 100000 steps in 20.52s)
solvent-apo-barostat-interval-25: N=6282 speed: 581.07ns/day dt: 1.5fs (ran 100000 steps in 22.31s)
solvent-rbfe-with-du-dp: N=6317 speed: 469.30ns/day dt: 1.5fs (ran 100000 steps in 27.62s)
solvent-rbfe-du-dl-interval-0: N=6317 speed: 472.75ns/day dt: 1.5fs (ran 100000 steps in 27.42s)
solvent-rbfe-du-dl-interval-1: N=6317 speed: 456.27ns/day dt: 1.5fs (ran 100000 steps in 28.41s)
solvent-rbfe-du-dl-interval-5: N=6317 speed: 464.34ns/day dt: 1.5fs (ran 100000 steps in 27.92s)

@@ -303,12 +309,12 @@ void __global__ k_find_blocks_with_ixns(
atom_box_dy = max(static_cast<RealType>(0.0), fabs(atom_box_dy) - col_bb_ext_y);
atom_box_dz = max(static_cast<RealType>(0.0), fabs(atom_box_dz) - col_bb_ext_z);

bool check_column_atoms =
atom_i_idx < K &&
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This additional check is important, without this check you get junk interactions and the test fails.. Previously we were getting lucky with positions beyond the number of atoms we have.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!

@@ -1,9 +1,11 @@
#pragma once

__device__ __host__ double sign(double a) { return (a > 0) - (a < 0); }
int __forceinline__ ceil_divide(int x, int y) { return (x + y - 1) / y; }
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like this to be more widely used, but for now I just used it for the code that relate to these changes.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be okay if we kept the old sign code as well? Not sure why it was deleted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its only used by the GBSA potential which isn't currently compiled. But makes sense to keep it if we don't also remove the GBSA potential.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, these are really just jvp rules for autodiff. Let's keep them for now since they may be useful later on?

tests/test_nblist.py Outdated Show resolved Hide resolved
tests/test_nblist.py Show resolved Hide resolved
Copy link
Collaborator

@mcwitt mcwitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a first pass and left some surface-level comments!

timemachine/cpp/src/neighborlist.cu Outdated Show resolved Hide resolved
timemachine/cpp/src/neighborlist.cu Outdated Show resolved Hide resolved
timemachine/cpp/src/neighborlist.cu Outdated Show resolved Hide resolved

cudaDeviceSynchronize();
const int B = this->B(); //(N+32-1)/32;
const int tpb = 32;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not specific to this PR, but the definition tpb = 32 seems replicated in a lot of places - would it make sense to declare as a global (or at least file-level) constant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely should and its made worse that in some places tpb is used as a substitute for block size rather than warpsize

__device__ __host__ float sign(Surreal<float> a) { return (a.real > 0) - (a.real < 0); }
// __device__ __host__ double sign(Surreal<double> a) { return (a.real > 0) - (a.real < 0); }

// __device__ __host__ float sign(Surreal<float> a) { return (a.real > 0) - (a.real < 0); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was commenting these out intentional? Would it make sense to just remove them if unused?

timemachine/cpp/src/neighborlist.cu Outdated Show resolved Hide resolved
Comment on lines 223 to 224
const int N, // Number of atoms in column
const int K, // Number of atoms in row
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming feels a bit strange to me, since I'm used to N being the total number. Is this conventional, or could it make sense to give a different name to the number of atoms in a column? (it's possible this will make more sense to me once I have a better understanding of what's going on here)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that N is just a size of 'something', I can change them more broadly

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I also got tripped by this while reading the PR, maybe we can rename N to NR and K to NC (num_row and num_col atoms)?

Copy link
Collaborator

@maxentile maxentile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great in my initial testing!

And can yield a big speedup in cases where we might want compute a batch of host-ligand neighborlists on stored snapshots (30x speed-up on the test system, from ~70 snapshots / second to ~2000 snapshots / second: https://gist.github.com/maxentile/fbaf654a7f21c05adce3b04145c43866 )

tests/test_nblist.py Outdated Show resolved Hide resolved
// assert(N == N_);
std::vector<std::vector<int>>
Neighborlist<RealType>::get_nblist_host(int N, const double *h_coords, const double *h_box, const double cutoff) {
// Call into the impl that takes colum and row coords
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo colum

@@ -8,23 +8,38 @@

namespace timemachine {

template <typename RealType> Neighborlist<RealType>::Neighborlist(int N) : N_(N) {
template <typename RealType>
Neighborlist<RealType>::Neighborlist(const int N, const int K) : N_(N), K_(K), compute_full_matrix_(K > 0) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change compute_full_matrix_ into a function? i.e. if K_ == 0 compute_full_matrix returns True (as opposed to storing it as a separate variable that may get out of sync during a later refactor)

@proteneer
Copy link
Owner

proteneer commented Dec 10, 2021

Looks great in my initial testing!

And can yield a big speedup in cases where we might want compute a batch of host-ligand neighborlists on stored snapshots (30x speed-up on the test system, from ~70 snapshots / second to ~2000 snapshots / second: https://gist.github.com/maxentile/fbaf654a7f21c05adce3b04145c43866 )

This probably is also the bare minimum speed-up as well, as there just isn't enough work to saturate the GPU for something like this to proceed one ligand at a time. In fact, I think this code as-as can probably support the batch use case (without resorting to streams) where we can aggregate row atoms so they represent multiple conformations of the same ligand, and we emit a set of tiles from this neighborlist that can compute ligand-host energies for multiple-conformations all at once.

@proteneer
Copy link
Owner

proteneer commented Dec 10, 2021

@badisa Can we add a test case where we use the neighborlist to compute the U_HH interactions. Let NR be the number of row atoms, NC be the number of column atoms, and NT the total number of atoms in the complete system. Set NR to some number < NT, and NC to zero. I think right now we only test the case where NR < NT, NC > 0 (U_HL), and NR = NT, and NC = 0 (U original), but would be good to also test NR < NT, NC = 0 (U_HH)

@badisa
Copy link
Collaborator Author

badisa commented Dec 10, 2021

Can we add a test case where we use the neighborlist to compute the U_HH interactions?

@proteneer How is this different from the previous test that computes the ixns of a water box? You would just not include the ligand in the coordinates passed as the columns. https://github.com/proteneer/timemachine/blob/master/tests/test_nblist.py#L69

Copy link
Owner

@proteneer proteneer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally LGTM, save for some minor comments. Would like to see the test mentioned on using it for U_HH but probably isn't a blocker for merge.

}
if (D != 3) {
throw std::runtime_error("D != 3");
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a consistency check that if K==0 then col_ptr is null, and if K>0, then col_ptr is not null?

@proteneer
Copy link
Owner

Can we add a test case where we use the neighborlist to compute the U_HH interactions?

@proteneer How is this different from the previous test that computes the ixns of a water box? You would just not include the ligand in the coordinates passed as the columns. https://github.com/proteneer/timemachine/blob/master/tests/test_nblist.py#L69

Hm never mind I think I see how you'd probably use this in a nonbonded.cu later on! So we'd simply rely on the caller to prepare the sizes of the input arrays directly.

@proteneer
Copy link
Owner

generally LGTM, save for some minor comments. Would like to see the test mentioned on using it for U_HH but probably isn't a blocker for merge.

Never mind - test for U_HH isn't needed. This PR is great!

@maxentile
Copy link
Collaborator

This probably is also the bare minimum speed-up as well, as there just isn't enough work to saturate the GPU for something like this to proceed one ligand at a time. In fact, I think this code as-as can probably support the batch use case (without resorting to streams) where we can aggregate row atoms so they represent multiple conformations of the same ligand, and we emit a set of tiles from this neighborlist that can compute ligand-host energies for multiple-conformations all at once.

Agreed! The speed-up from later applying this batch optimization should be substantial.

Just to have a baseline, here are my timings for the Jax reference implementation (computing all host-ligand distances in double precision on the CPU using vmap(distance_on_pairs), for batch sizes 1 through 1000): https://gist.github.com/maxentile/9f8fdc330bf948e588fc8c90177d8e61 .

* Functions aren't called
* Important to compute nonbonded energies of host-ligand
* Thanks to @jfass for noticing that we could get rid of
  the python loop
@badisa badisa force-pushed the feature/neighborlist-on-host-ligand branch from f786505 to cd815ab Compare December 13, 2021 16:29
@badisa badisa merged commit 5f9d022 into master Dec 13, 2021
@badisa badisa deleted the feature/neighborlist-on-host-ligand branch December 13, 2021 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants