Estimating Uncertainty in Machine Learning Models for Drug Discovery

Project Details

Title: Estimating Uncertainty in Machine Learning Models for Drug Discovery
Type: MSc dissertation
Author: George Batchkala, https://www.linkedin.com/in/george-batchkala/
Supervisor: Professor Garrett M. Morris, [email protected]
Institution: University of Oxford
Department: Department of Statistics, 24-29 St Giles', Oxford, OX1 3LB
Project's dates: June 1st, 2020 - September 14th, 2020
Data: MoleculeNet, Physical Chemistry Datasets (http://moleculenet.ai/datasets-1)
GitHub repository: https://github.com/GeorgeBatch/moleculenet

This repository contains all code, results, and plots I produced while completing my MSc dissertation. The pdf file with the full dissertation will be uploaded after it gets marked and I officially complete my degree.

Abstract

"My model says that I had just found an ultimate drug. Can I trust it?"

In this work, I explore ways of quantifying the confidence of machine learning models used in drug discovery. To do this, I start by exploring methods to predict physicochemical properties of drugs and drug-like molecules that are crucial to drug discovery. I first attempt to reproduce and improve upon a subset of results concerning a drug's solubility in water, taken from a popular benchmark set called "MoleculeNet". Using XGBoost, which in the era of Deep Neural Networks is already classified as a "conventional" machine learning method, I show that I am able to achieve state-of-the-art results. After that, I explore Gaussian Processes and the Infinitesimal Jackknife for Random Forests and their associated uncertainty estimates. Finally, I attempt to understand whether the confidence of a model's prediction can be used to answer a similar but more general question: "How do we know when to trust our models?" The answer depends on the model. We can trust Gaussian Processes when they are confident, but the confidence estimates from Random Forests do not give us any assurance.

Related work

This work is mostly based on four papers:

"MoleculeNet: A Benchmark for Molecular Machine Learning" by Wu et al.;
"Learning From the Ligand: Using Ligand-Based Features to Improve Binding Affinity Prediction" by Boyles et al.;
"The Photoswitch Dataset: A Molecular Machine Learning Benchmark for the Advancement of Synthetic Chemistry" by Thawani et al.; and
"Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife" by Wager et al.

Aims

In this dissertation I aim to achieve three primary goals:

Reproduce a subset of solubility-related prediction results from the MoleculeNet benchmarking paper;
Improve upon the reproduced results; and
Use uncertainty estimation methods with the best-performing models to get single prediction uncertainty estimates to evaluate and compare these methods.

Data

I used the MoleculeNet dataset which accompanies the MoleculeNet benchmarking paper, and in particular, I focused on the Physical Chemistry datasets: ESOL, FreeSolv, and Lipophilicity. The MoleculeNet datasets are widely used to validate machine learning models used to estimate a particular property directly from small molecules including drug-like compounds.

The Physical Chemistry datasets can be downloaded from the MoleculeNet benchmark dataset collection (http://moleculenet.ai/datasets-1).

Models

I use the following four models for the regression task of physicochemical property prediction:

Obtaining Confidence Intervals

I obtained per-prediction confidence intervals with:

Gaussian Processes (notes, chapter 7, section 7.2)
Bias-corrected Infinitesimal Jackknife estimate for Random Forests (paper)
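
As a rough illustration of the second method, the sketch below computes the infinitesimal jackknife variance and its bias-corrected version for a single test point, following the formulas in Wager et al. It is a minimal pure-Python sketch and not the code used in this project (which relied on forestci). The function name and the toy inputs are hypothetical, and the covariance is taken as the plain empirical covariance over trees.

```python
def infinitesimal_jackknife_variance(inbag, tree_preds):
    """Return (V_IJ, bias-corrected V_IJ) for one test point.

    inbag[b][i]   -- how many times training sample i was drawn for tree b
    tree_preds[b] -- prediction of tree b at the test point
    (Both inputs are illustrative; a real forest would supply them.)
    """
    n_trees = len(tree_preds)
    n_samples = len(inbag[0])

    mean_pred = sum(tree_preds) / n_trees
    centred_preds = [t - mean_pred for t in tree_preds]

    # V_IJ = sum over training samples of Cov_b(N_bi, t_b)^2
    v_ij = 0.0
    for i in range(n_samples):
        mean_count = sum(inbag[b][i] for b in range(n_trees)) / n_trees
        cov_i = sum(
            (inbag[b][i] - mean_count) * centred_preds[b]
            for b in range(n_trees)
        ) / n_trees
        v_ij += cov_i ** 2

    # Monte Carlo bias correction: (n / B^2) * sum_b (t_b - mean)^2
    bias = n_samples / n_trees ** 2 * sum(c ** 2 for c in centred_preds)
    return v_ij, v_ij - bias


# Toy example: 4 trees, 3 training samples (made-up numbers).
toy_inbag = [[2, 0, 1], [1, 1, 1], [0, 2, 1], [1, 2, 0]]
toy_preds = [1.0, 2.0, 3.0, 2.0]
v_ij, v_ij_unbiased = infinitesimal_jackknife_variance(toy_inbag, toy_preds)
print(v_ij, v_ij_unbiased)
```

With few trees the corrected estimate can become negative, so in practice the number of trees usually needs to be large relative to the training set size.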

Implementation

All the data preparation, experiments, and visualisations were done in Python.

To convert molecules from their SMILES string representations to either Molecular Descriptors or Extended-Connectivity Fingerprints, I used the open-source cheminformatics software, RDKit (GitHub).

Wu et al. suggest using their Python library, DeepChem (GitHub), to reproduce the results. I decided not to use it, since its user API only gives high-level access, while I wanted more control over the implementation. To obtain comparable results, I decided to use the tools on which the DeepChem library is built.

For most of the machine learning pipeline, I used Scikit-Learn (GitHub) for preprocessing, splitting, modelling, prediction, and validation. To obtain the confidence intervals for Random Forests, I used the forestci (GitHub) extension for Scikit-Learn. The implementation of a custom Tanimoto (Jaccard) kernel for Gaussian Process Regression and all the following GP experiments were performed with GPflow (GitHub).
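
To make the Tanimoto (Jaccard) kernel concrete, here is a minimal pure-Python sketch of the similarity it is based on, T(x, y) = <x, y> / (|x|^2 + |y|^2 - <x, y>), applied to binary fingerprints. This is not the GPflow implementation used in the project; the function names, the `variance` scaling parameter, and the choice to return 0 for two all-zero vectors are assumptions made for illustration.

```python
def tanimoto_similarity(x, y):
    """Tanimoto similarity between two equal-length fingerprint vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # Assumption: two all-zero fingerprints are given similarity 0.
    return dot / denom if denom > 0 else 0.0


def tanimoto_kernel_matrix(X, Y, variance=1.0):
    """Kernel matrix K[i][j] = variance * T(X[i], Y[j])."""
    return [[variance * tanimoto_similarity(x, y) for y in Y] for x in X]


# Toy 4-bit fingerprints (made-up values).
fingerprints = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
for row in tanimoto_kernel_matrix(fingerprints, fingerprints):
    print(row)
```

In a Gaussian Process library such a function would typically be wrapped in a custom kernel class operating on tensors; that wrapper is not shown here.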

Set-up

In this section I outline the set-up steps required to start reproducing my results. It covers the following stages:

Directory set-up;
Creating an environment with conda;
Data preparation; and
Creation of features.

Environment

In the root (moleculenet) directory create a project environment from the environment.yml file using:

>>> conda env create -f environment.yml

The environment's name is batch-msc, and we activate it using:

>>> conda activate batch-msc

Conda environments make managing Python library dependencies and reproducing research much easier. Another reason why we use conda is that some packages, e.g. RDKit: Open-Source Cheminformatics Software, are not available via pip install.

Data preparation

This section covers two data preparation stages: standardising input files and producing the features.

Standardise Names

To automate the process of working with three different datasets (ESOL, FreeSolv, and Lipophilicity), we standardise the column names from the original CSV files and store the results in new CSV files.

We need to get hold of ID/Name, SMILES string representation, and measured label value for each of the compounds in all of the three datasets. To do this, run the following commands in the ~/scripts/ directory:

>>> python get_original_id_smiles_labels_lipophilicity.py
>>> python get_original_id_smiles_labels_esol.py
>>> python get_original_id_smiles_labels_freesolv.py

The resulting files are saved in the ~/data/ directory:

esol_original_IdSmilesLabels.csv, esol_original_extra_features.csv
freesolv_original_IdSmilesLabels.csv
lipophilicity_original_IdSmilesLabels.csv

Note: the original file for the ESOL dataset also contained extra features which we also save here.
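
The standardisation step can be pictured roughly as follows. This is a minimal sketch using only the standard library, not the contents of the scripts above. The source column names, the standardised names (`id`, `smiles`, `labels`), and the sample rows are all hypothetical.

```python
import csv
import io


def standardise_columns(src, dst, column_map):
    """Copy selected columns from src to dst under standardised names.

    column_map maps original column names to the new names; columns not
    listed in column_map are dropped.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(column_map.values()))
    writer.writeheader()
    for row in reader:
        writer.writerow({new: row[old] for old, new in column_map.items()})


# Hypothetical raw file with dataset-specific headers and made-up values.
raw = io.StringIO(
    "Compound ID,measured value,extra feature,smiles string\n"
    "mol-1,-0.77,1,CCO\n"
    "mol-2,-2.10,0,c1ccccc1\n"
)
standardised = io.StringIO()
standardise_columns(
    raw,
    standardised,
    {"Compound ID": "id", "smiles string": "smiles", "measured value": "labels"},
)
print(standardised.getvalue())
```

Using one mapping per dataset keeps the downstream code identical for all three datasets, since every file ends up with the same three column names.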

Compute and Store Features

We show how to produce the features and store them in CSV files.

From the SMILES string representations of the molecules in all three datasets, we compute Extended-Connectivity Fingerprints and RDKit Molecular Descriptors to use as features. We do this once at the very beginning, so the features never need to be recomputed later.

Note that we produce four different versions of extended-connectivity fingerprints:

ECFP_4 hashed with 1024 bits
ECFP_6 hashed with 1024 bits
ECFP_4 hashed with 2048 bits
ECFP_6 hashed with 2048 bits
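
The four versions above can be enumerated as in the sketch below. The number in ECFP_4 and ECFP_6 is a diameter, so the corresponding Morgan fingerprint radii are 2 and 3. The helper names and configuration keys are hypothetical and are not taken from the project's scripts. The RDKit call shown, `AllChem.GetMorganFingerprintAsBitVect`, is one common way to obtain hashed Morgan fingerprints, and it is an assumption that the scripts use it.

```python
ECFP_DIAMETERS = (4, 6)
HASH_SIZES = (1024, 2048)


def ecfp_configs():
    """List the four fingerprint versions in the order given above."""
    return [
        {"name": f"ECFP_{d}_{n_bits}", "radius": d // 2, "n_bits": n_bits}
        for n_bits in HASH_SIZES
        for d in ECFP_DIAMETERS
    ]


def smiles_to_ecfp(smiles, radius, n_bits):
    """Hashed Morgan fingerprint of a SMILES string as a list of 0/1 ints."""
    # RDKit is imported here so that ecfp_configs() works without it.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return [int(bit) for bit in fp.ToBitString()]


for config in ecfp_configs():
    print(config)
```

With RDKit installed, `smiles_to_ecfp("CCO", radius=2, n_bits=1024)` would give a 1024-element bit list for ethanol.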

To compute and record the features run the corresponding commands in the ~/scripts/ directory:

ECFP features

>>> python get_all_fingerprints_for_all_datasets.py

RDKit features

>>> python get_rdkit_descriptors_for_all_datasets.py
