Update getting started guide, example script #68

Merged
merged 7 commits into from
Oct 17, 2024
Changes from 4 commits
158 changes: 131 additions & 27 deletions README.md
@@ -1,63 +1,167 @@
YAMMBS
======
# YAMMBS

Yet Another Molecular Mechanics Benchmarking Suite (YAMMBS, pronounced like "yams") is a tool for
benchmarking force fields.

YAMMBS is currently developed for internal use at Open Force Field. It is not currently recommended for external use. No guarantees are made about the stability of the API or the accuracy of any results.

# Getting started
YAMMBS is currently developed for internal use at Open Force Field. It is not currently recommended for external use. No guarantees are made about the stability of the API or the accuracy of any results. Feedback and contributions are welcome [on GitHub](https://github.com/openforcefield/yammbs).

## Installation

Use the file `./devtools/conda-envs/dev.yaml` and also install `yammbs` with something like `python -m pip install -e .`.

## Data sources
## Getting started

See the file `run.py` for a start-to-finish example. Note that the pattern in the script

```python
from multiprocessing import freeze_support

def main():
    # Your code here

if __name__ == "__main__":
    freeze_support()
    main()
```

must be used for Python's `multiprocessing` module to behave well.

### Data sources

It is assumed that the input molecules are stored in an `openff-qcsubmit` model like `OptimizationResultCollection` or YAMMBS's own input models.

It is assumed that the input molecules are stored in an `openff-qcsubmit` model like `OptimizationResultCollection`.
### Preparing an input dataset

## Key API points
YAMMBS relies on QCSubmit to provide datasets from QCArchive. See [their docs](https://docs.openforcefield.org/projects/qcsubmit/en/stable/), particularly the [dataset retrieval example](https://docs.openforcefield.org/projects/qcsubmit/en/stable/examples/retrieving-results.html), for more.

See the file `run.py` for a start-to-finish example.
Currently only optimization datasets (`OptimizationResultCollection` in QCSubmit) are supported.

Load a molecule dataset into the used representation:
First, retrieve a dataset from QCArchive:

```python
from yammbs import MoleculeStore
from qcportal import PortalClient

from openff.qcsubmit.results import OptimizationResultCollection

store = MoleculeStore.from_qcsubmit_collection(
collection=my_collection,
database_name="my_database.sqlite",

client = PortalClient("https://api.qcarchive.molssi.org:443", cache_dir=".")

season1_dataset = OptimizationResultCollection.from_server(
client=client,
datasets="OpenFF Industry Benchmark Season 1 v1.1",
spec_name="default",
)
```

Run MM optimizations of all molecules using a particular force field
After retrieving it, and after applying any filters to remove problematic records, you can dump it to disk to avoid pulling all of the data down from the server again.

```python
store.optimize_mm(force_field="openff-2.1.0.offxml")
with open("qcsubmit.json", "w") as f:
    f.write(season1_dataset.json())
```

Run DDE (or RMSD, TFD, etc.) analyses and save to results disk:
Once an `OptimizationResultCollection` is in memory, either by pulling it down from QCArchive or loading it from disk, convert it to a "YAMMBS input" model using the API:

```python
ddes = store.get_dde(force_field="openff-2.1.0.offxml")
ddes.to_csv(f"{force_field}-dde.csv")
from yammbs.inputs import QCArchiveDataset


season1_dataset = OptimizationResultCollection.parse_raw(open("qcsubmit.json").read())

dataset = QCArchiveDataset.from_qcsubmit_collection(season1_dataset)

with open("input.json", "w") as f:
    f.write(dataset.model_dump_json())
```

Note that the pattern in the script
This input model (`QCArchiveDataset`) stores the minimum amount of information needed to use these QM geometries as reference structures. The dataset has fields for tagging the name and model version, but mostly stores a list of structures. Each QM-optimized structure is stored as a `QCArchiveMolecule` object which stores:

* (Mapped) SMILES which can be used to regenerate the `openff.toolkit.Molecule` and similar objects
* QM-optimized geometry
* Final energy from QM optimization
* An ID uniquely defining this structure within the datasets
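
As a rough picture of what each record holds, the four fields listed above could be sketched with a plain dataclass. The class name, attribute names, and values below are illustrative assumptions, not the actual YAMMBS model definition (which is a Pydantic model):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceStructure:
    """Hypothetical stand-in for one QM-optimized structure."""

    mapped_smiles: str  # (mapped) SMILES used to rebuild a molecule object
    coordinates: list[list[float]]  # QM-optimized geometry, one [x, y, z] per atom
    final_energy: float  # final energy from the QM optimization
    record_id: int  # ID that is unique within the dataset


# Made-up values, only to show the shape of a record
example = ReferenceStructure(
    mapped_smiles="[H:1][H:2]",
    coordinates=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
    final_energy=-1.17,
    record_id=37016854,
)
```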

If running many benchmarks, we recommend using this file as a starting point.

Note: This JSON file ("input.json") is written from a different model than the JSON file written from QCSubmit; the two are not interchangeable.

Note: Both QCSubmit and YAMMBS rely on Pydantic for model validation and serialization. Even though both are packaged against Pydantic V2, YAMMBS uses the V2 API and (as of October 2024) QCSubmit still uses the V1 API. Usage like the above should work fine; only esoteric use cases (in particular, defining a new model that has both YAMMBS and QCSubmit models as fields) are expected to be unsupported.

### Run a benchmark

With the input prepared, create a `MoleculeStore` object:

```python
from multiprocessing import freeze_support
from yammbs import MoleculeStore

def main():
# Your code here
store = MoleculeStore.from_qcarchive_dataset(dataset)
```

if __name__ == "__main__":
freeze_support()
main()
This object is the focal point of running benchmarks; it stores the inputs (QM structures), runs minimizations with force field(s) of interest, stores the results (MM structures), and provides helper methods for use in analysis.

Run MM optimizations of all molecules with one or more force fields using `optimize_mm`:

```python
store.optimize_mm(force_field="openff-2.1.0.offxml")

# Can also iterate over multiple force fields, and use more processors
for force_field in [
    "openff-1.0.0.offxml",
    "openff-1.3.0.offxml",
    "openff-2.0.0.offxml",
    "openff-2.1.0.offxml",
    "openff-2.2.1.offxml",
    "gaff-2.11",
    "de-force-1.0.1.offxml",
]:
    store.optimize_mm(force_field=force_field, n_processes=16)
```

must be used for Python's `multiprocessing` module to behave well.
This method short-circuits (i.e. does not run minimizations) if a force field's results are already stored. For example, the Sage 2.1 (`openff-2.1.0.offxml`) optimizations in the loop above will be skipped because they were already run.
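
The short-circuit behavior can be pictured with a toy example. This is only a sketch of the idea, assuming results are keyed by force field name; `ToyStore` is a hypothetical class and not how `MoleculeStore` is implemented:

```python
class ToyStore:
    """Hypothetical stand-in for a results store; not the YAMMBS implementation."""

    def __init__(self):
        self._results = {}  # force field name -> stored results
        self.runs = 0  # counts how many times minimizations actually ran

    def optimize_mm(self, force_field):
        # Short-circuit: results for this force field are already stored
        if force_field in self._results:
            return
        self.runs += 1
        self._results[force_field] = f"minimized with {force_field}"


toy_store = ToyStore()
toy_store.optimize_mm("openff-2.1.0.offxml")

# The second request for openff-2.1.0 returns early without re-running
for force_field in ["openff-2.0.0.offxml", "openff-2.1.0.offxml"]:
    toy_store.optimize_mm(force_field)
```

Here `toy_store.runs` ends up counting one run per distinct force field, however many times each one is requested.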

There are "output" models that mirror the input models, storing MM-minimized geometries without needing to re-load or re-optimize the QM geometries. These can again be saved to disk as JSON:

```python
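# model_dump_json() returns a JSON string; it does not take a file path
with open("output.json", "w") as f:
    f.write(store.get_outputs().model_dump_json())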
```

Summary metrics (including DDE, RMSD, TFD, and internal coordinate RMSDs) are available separately (in order to reduce file size when only summary statistics, and not whole molecular descriptions and geometries, are sought):

```python
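# model_dump_json() returns a JSON string; it does not take a file path
with open("metrics.json", "w") as f:
    f.write(store.get_metrics().model_dump_json())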
```

The basic structure of the metrics is a hierarchical dictionary. It is keyed by force field tag (e.g. "openff-2.2.1"), each mapping onto a dict of per-molecule summary metrics. Each of these dicts is keyed by QCArchive ID (the same ID used to distinguish structures in the input and output models), mapping onto a dict with string keys and float values that stores the actual metrics (e.g. the DDE, RMSD, etc. of this particular structure optimized with the force field used as its high-level key). Access to these data is similar in memory (on the Pydantic models) and on disk (in JSON). Visually:

```json
{
  "metrics": {
    "openff-1.0.0": {
      "37016854": {
        "dde": 0.5890449032115157,
        "rmsd": 0.011969891530473157,
        "tfd": 0.001592046369769131,
        "icrmsd": {
          "Bond": 0.0033974261816308144,
          "Angle": 0.9483605366613115,
          "Dihedral": 1.353163675708829,
          "Improper": 0.2922040744956022,
        },
      },
      "37016855": {"this molecule's metrics ..."},
    },
    "openff-2.0.0": {
      "37016855": {"this force field's molecules ..."},
    }
  }
}
```

This data can be transformed for plotting, summary statistics, etc. which compare the metrics of each force field (for this molecule dataset).
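
As one possible starting point, the sketch below averages a scalar metric over all molecules for each force field. It uses made-up numbers in the same nested layout as the JSON example above; the helper name `mean_metric` is an assumption for illustration and is not part of YAMMBS:

```python
from statistics import mean

# Made-up numbers in the same nested layout as the example above
metrics = {
    "openff-1.0.0": {
        "37016854": {"dde": 0.59, "rmsd": 0.012, "tfd": 0.0016},
        "37016855": {"dde": -0.21, "rmsd": 0.030, "tfd": 0.0040},
    },
    "openff-2.0.0": {
        "37016854": {"dde": 0.10, "rmsd": 0.010, "tfd": 0.0010},
        "37016855": {"dde": -0.30, "rmsd": 0.020, "tfd": 0.0030},
    },
}


def mean_metric(metrics, name):
    """Average one scalar metric over all molecules, per force field."""
    return {
        force_field: mean(per_molecule[name] for per_molecule in molecules.values())
        for force_field, molecules in metrics.items()
    }


mean_rmsd = mean_metric(metrics, "rmsd")
```

When starting from a file like `metrics.json`, the nested dictionary would sit under the top-level `"metrics"` key of the parsed JSON, as in the layout shown above.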

## Custom analyses

See `examples.ipynb` for some examples of interacting with benchmarking results and a starting point for custom analyses.

### License
