openforcefield · mattwthompson · Oct 17, 2024 · Oct 11, 2024 · Oct 11, 2024 · Oct 11, 2024
diff --git a/README.md b/README.md
@@ -1,63 +1,167 @@
-YAMMBS
-======
+# YAMMBS
 
 Yet Another Molecular Mechanics Benchmarking Suite (YAMMBS, pronounced like "yams") is a tool for
 benchmarking force fields.
 
-YAMMBS is currently developed for internal use at Open Force Field. It is not currently recommended for external use. No guarantees are made about the stability of the API or the accuracy of any results.
-
-# Getting started
+YAMMBS is currently developed for internal use at Open Force Field. It is not currently recommended for external use. No guarantees are made about the stability of the API or the accuracy of any results. Feedback and contributions are welcome [on GitHub](https://github.com/openforcefield/yammbs).
 
 ## Installation
 
 Use the file `./devtools/conda-envs/dev.yaml` and also install `yammbs` with something like `python -m pip install -e .`.
 
-## Data sources
+## Getting started
+
+See the file `run.py` for a start-to-finish example. Note that the pattern in the script
+
+```python
+from multiprocessing import freeze_support
+
+def main():
+    # Your code here
+
+if __name__ == "__main__":
+    freeze_support()
+    main()
+```
+
+must be used for Python's `multiprocessing` module to behave well.
+
+### Data sources
+
+It is assumed that the input molecules are stored in an `openff-qcsubmit` model like `OptimizationResultCollection` or in one of YAMMBS's own input models.
 
-It is assumed that the input molecules are stored in a `openff-qcsubmit` model like `OptimizationResultCollection`.
+### Preparing an input dataset
 
-## Key API points
+YAMMBS relies on QCSubmit to provide datasets from QCArchive. See [the QCSubmit docs](https://docs.openforcefield.org/projects/qcsubmit/en/stable/), particularly the [dataset retrieval example](https://docs.openforcefield.org/projects/qcsubmit/en/stable/examples/retrieving-results.html), for more.
 
-See the file `run.py` for a start-to-finish example.
+Currently only optimization datasets (`OptimizationResultCollection` in QCSubmit) are supported.
 
-Load a molecule dataset into the used representation:
+First, retrieve a dataset from QCArchive:
 
 ```python
-from yammbs import MoleculeStore
+from qcportal import PortalClient
+
+from openff.qcsubmit.results import OptimizationResultCollection
 
-store = MoleculeStore.from_qcsubmit_collection(
-    collection=my_collection,
-    database_name="my_database.sqlite",
+
+client = PortalClient("https://api.qcarchive.molssi.org:443", cache_dir=".")
+
+season1_dataset = OptimizationResultCollection.from_server(
+    client=client,
+    datasets="OpenFF Industry Benchmark Season 1 v1.1",
+    spec_name="default",
 )
 ```
 
-Run MM optimizations of all molecules using a particular force field
+After retrieving the dataset, and after applying any filters to remove problematic records, you can dump it to disk to avoid pulling all of the data down from the server again.
 
 ```python
-store.optimize_mm(force_field="openff-2.1.0.offxml")
+with open("qcsubmit.json", "w") as f:
+    f.write(season1_dataset.json())
 ```
 
-Run DDE (or RMSD, TFD, etc.) analyses and save to results disk:
+Once an `OptimizationResultCollection` is in memory, either by pulling it down from QCArchive or loading it from disk, convert it to a "YAMMBS input" model using the API:
 
 ```python
-ddes = store.get_dde(force_field="openff-2.1.0.offxml")
-ddes.to_csv(f"{force_field}-dde.csv")
+from yammbs.inputs import QCArchiveDataset
+
+
+season1_dataset = OptimizationResultCollection.parse_file("qcsubmit.json")
+
+dataset = QCArchiveDataset.from_qcsubmit_collection(season1_dataset)
+
+with open("input.json", "w") as f:
+    f.write(dataset.model_dump_json())
 ```
 
-Note that the pattern in the script
+This input model (`QCArchiveDataset`) stores the minimum amount of information needed to use these QM geometries as reference structures. The dataset has fields for tagging the name and model version, but mostly stores a list of structures. Each QM-optimized structure is stored as a `QCArchiveMolecule` object which stores:
+
+* (Mapped) SMILES which can be used to regenerate the `openff.toolkit.Molecule` and similar objects
+* QM-optimized geometry
+* Final energy from QM optimization
+* An ID uniquely defining this structure within the datasets
+
+If running many benchmarks, we recommend using this file as a starting point.
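As a rough illustration of the four per-structure fields listed above, the sketch below uses a plain dataclass as a stand-in. `ToyMolecule`, its field names, and all values are hypothetical placeholders chosen for this example; they are not the real `QCArchiveMolecule` schema, which may use different names, types, and units.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToyMolecule:
    """Hypothetical stand-in for one stored structure; not the YAMMBS model."""

    qcarchive_id: int    # ID distinguishing this structure within the dataset
    mapped_smiles: str   # mapped SMILES, enough to rebuild a molecule object
    coordinates: list    # QM-optimized geometry, one [x, y, z] per atom
    final_energy: float  # final energy from the QM optimization


# Placeholder values for a water-like molecule; not real QCArchive data.
record = ToyMolecule(
    qcarchive_id=37016854,
    mapped_smiles="[H:2][O:1][H:3]",
    coordinates=[[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
    final_energy=-76.4,
)

# Round-trip through JSON, analogous to writing and re-reading "input.json".
serialized = json.dumps(asdict(record))
round_tripped = ToyMolecule(**json.loads(serialized))
```

Because only these few fields are kept, a file like this stays far smaller than the full QCSubmit record it was derived from.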
+
+Note: This JSON file ("input.json") is written from a different model than the JSON file written from QCSubmit; the two are not interchangeable.
+
+Note: Both QCSubmit and YAMMBS rely on Pydantic for model validation and serialization. Even though both use V2 in packaging, YAMMBS uses the V2 API and (as of October 2024) QCSubmit still uses the V1 API. Usage like above should work fine; only esoteric use cases (in particular, defining a new model that has both YAMMBS and QCSubmit models as fields) are expected to be unsupported.
+
+### Run a benchmark
+
+With the input prepared, create a `MoleculeStore` object:
 
 ```python
-from multiprocessing import freeze_support
+from yammbs import MoleculeStore
 
-def main():
-    # Your code here
+store = MoleculeStore.from_qcarchive_dataset(dataset)
+```
 
-if __name__ == "__main__":
-    freeze_support()
-    main()
+This object is the focal point of running benchmarks; it stores the inputs (QM structures), runs minimizations with force field(s) of interest, stores the results (MM structures), and provides helper methods for use in analysis.
+
+Run MM optimizations of all molecules with one or more force fields using `optimize_mm`:
+
+```python
+store.optimize_mm(force_field="openff-2.1.0.offxml")
+
+# can also iterate over multiple force fields, and use more processors
+for force_field in [
+    "openff-1.0.0.offxml",
+    "openff-1.3.0.offxml",
+    "openff-2.0.0.offxml",
+    "openff-2.1.0.offxml",
+    "openff-2.2.1.offxml",
+    "gaff-2.11",
+    "de-force-1.0.1.offxml",
+]:
+    store.optimize_mm(force_field=force_field, n_processes=16)
 ```
 
-must be used for Python's `multiprocessing` module to behave well.
+This method short-circuits (i.e. does not run minimizations) if a force field's results are already stored. For example, the Sage 2.1 (`openff-2.1.0.offxml`) optimizations in the loop above will be skipped because they were already run by the first call.
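The skip-if-already-stored behavior can be pictured with a toy class. This is a minimal sketch of the general pattern only: `ToyStore` and its attributes are hypothetical and do not reflect how `MoleculeStore` is actually implemented.

```python
class ToyStore:
    """Hypothetical stand-in for a results store; not the YAMMBS API."""

    def __init__(self):
        self._results = {}  # force field name -> stored result
        self.runs = 0       # number of (expensive) optimizations actually run

    def optimize_mm(self, force_field: str) -> None:
        if force_field in self._results:
            return  # results already stored: skip the expensive work
        self.runs += 1
        self._results[force_field] = f"minimized with {force_field}"


store = ToyStore()
store.optimize_mm("openff-2.1.0.offxml")

# The second request for openff-2.1.0 is a no-op; only openff-2.0.0 runs.
for force_field in ["openff-2.0.0.offxml", "openff-2.1.0.offxml"]:
    store.optimize_mm(force_field)
```

Under this pattern, re-running a benchmark script after adding one new force field only costs the minimizations for that new force field.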
+
+There are "output" models that mirror the input models, storing MM-minimized geometries without needing to re-load or re-optimize the QM geometries. These can again be saved to disk as JSON:
+
+```python
+with open("output.json", "w") as f:
+    f.write(store.get_outputs().model_dump_json())
+```
+
+Summary metrics (including DDE, RMSD, TFD, and internal coordinate RMSDs) are available separately (in order to reduce file size when only summary statistics, and not whole molecular descriptions and geometries, are sought):
+
+```python
+with open("metrics.json", "w") as f:
+    f.write(store.get_metrics().model_dump_json())
+```
+
+The basic structure of the metrics is a hierarchical dictionary. It is keyed by force field tag (e.g. "openff-2.2.1"), each mapping onto a dict of per-molecule summary metrics. Each of these dicts is keyed by QCArchive ID (the same ID used to distinguish structures in the input and output models), mapping onto a dict of string-float pairs that store the actual metrics (e.g. the DDE, RMSD, etc. of this particular structure optimized with the force field used as its high-level key). Access to these data is similar in memory (on the Pydantic models) and on disk (in JSON). Visually:
+
+```json
+{
+    "metrics": {
+        "openff-1.0.0": {
+            "37016854": {
+                "dde": 0.5890449032115157,
+                "rmsd": 0.011969891530473157,
+                "tfd": 0.001592046369769131,
+                "icrmsd": {
+                    "Bond": 0.0033974261816308144,
+                    "Angle": 0.9483605366613115,
+                    "Dihedral": 1.353163675708829,
+                    "Improper": 0.2922040744956022,
+                },
+            },
+            "37016855": {"this molecule's metrics ..."},
+        },
+        "openff-2.0.0": {
+            "37016855": {"this force field's molecules ..."},
+        }
+    }
+}
+```
+
+This data can be transformed for plotting, summary statistics, etc. which compare the metrics of each force field (for this molecule dataset).
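As one possible transformation, the sketch below averages a single scalar metric per force field over a dictionary shaped like the structure shown above. The `summarize` helper is hypothetical (not part of YAMMBS), and apart from the first structure's values, which are copied from the example, the numbers are made up for illustration.

```python
from statistics import mean

# Toy data in the same nested shape as "metrics.json":
# force field tag -> QCArchive ID -> metric name -> value.
metrics = {
    "metrics": {
        "openff-1.0.0": {
            "37016854": {
                "dde": 0.5890449032115157,
                "rmsd": 0.011969891530473157,
                "tfd": 0.001592046369769131,
            },
            "37016855": {"dde": -0.25, "rmsd": 0.02, "tfd": 0.003},
        },
        "openff-2.0.0": {
            "37016855": {"dde": 0.1, "rmsd": 0.015, "tfd": 0.002},
        },
    }
}


def summarize(metrics: dict, key: str) -> dict:
    """Return the mean of one scalar metric for each force field."""
    return {
        force_field: mean(per_molecule[key] for per_molecule in molecules.values())
        for force_field, molecules in metrics["metrics"].items()
    }


mean_rmsd = summarize(metrics, "rmsd")
```

With a real file, the same function could be applied to the result of `json.load` on "metrics.json", assuming the on-disk layout matches the structure shown above. Nested entries such as `icrmsd` would need their own handling, since they are dicts rather than single floats.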
+
+## Custom analyses
+
+See `examples.ipynb` for some examples of interacting with benchmarking results and a starting point for custom analyses.
 
 ### License