Skip to content
/ haplm Public

Bayesian inference for pooled haplotype data using latent multinomials

License

Notifications You must be signed in to change notification settings

ysfoo/haplm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

72 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HapLM: Bayesian inference for pooled haplotype data using latent multinomials

This module implements MCMC methods for latent multinomial distributions, see [1] for details. The key application is for pooled genetic data, where counts of incomplete haplotypes are reported for separate pools. Each incomplete haplotype corresponds to a subset of full haplotypes, and the number of full haplotypes per pool are multinomially distributed. The observation of partial sums of these multinomial counts leads to a latent multinomial model. There are 3 proposed methods, all of which can be used either under a setting where the multinomial frequencies are shared across all pools, or under a hierarchical setting where the multinomial frequencies are given a hierarchical prior. According to the method names given in [1], the proposed methods are:

  • MCMC-Exact. An exact method that performs MCMC inference by enumerating all possible latent counts that are compatible with observed data. This is computationally intensive, but is considered the gold standard in terms of accuracy. The MCMC sampler is the No-U-Turn-Sampler (NUTS).
  • MCMC-Approx. An approximate method that performs MCMC inference via a multinormal approximation, also using NUTS. In the filenames, this method is named as mn_approx. This is similar to the HIPPO [2] method, with the key difference being that this method does not sample the list of input haplotypes, and uses NUTS as the MCMC sampler.
  • LC-Sampling. An exact method that performs MCMC inference by sampling the latent counts as model parameters under a Metropolis-within-Gibbs scheme. In the filenames, this method is named as cgibbs (Collapsed Gibbs) when a conjugate prior is used for the non-hierarchical case. In the hierarchical case, there is no conjugacy and the method is named gibbs, and NUTS is used for sampling the continuous model parameters.

The proposed methods are implemented under haplm/. Two existing methods, HIPPO [2] and AEML [3], are presented with modifictions under hippo_aeml/; see hippo_aeml/README.md for details. The existing methods are written only for the setting where multinomial frequencies are shared, and the incomplete haplotypes are restricted to be information on each marker separately.

Comparisons between the proposed and existing methods, and further examples for the hierarchical case are provided under test/:

  • sim_study Compare all methods with synthetic datasets over 3 markers (8 possible haplotypes). The pool size is varied across datasets to demonstrate how each method scales with pool size.
  • encode Compare all methods except MCMC-Exact with 100 datasets simulated based on human data from the 1000 Genomes Project. Each datasets covers 8 markers (256 possible haplotypes). Partition ligation is used to determine input haplotypes where appropriate. MCMC-Exact is excluded due to computational reasons.
  • time-series Demonstrate a hierarchical use-case for the proposed methods. The synthetic dataset used consists of 30 time points of allele counts over 3 markers (8 haplotypes), where the haplotype frequencies vary over time. The haplotype frequencies are given a Gaussian process (GP) prior.
  • dhps Application of MCMC-Exact to a real dataset of genetic data relevant to antimalarial drug resistance, collated by the Worldwide Antimalarial Resistance Network. 241 data points are used over 3 markers of the Pfdhps gene, where the incomplete haplotypes reported are not the same across data points. See [4] for details.

There is also a comparison of an exact and an approximate (multinormal) likelihood for a toy example of the latent multinomial distribution under mn_acc, which demonstrates cases where the multinormal approximation breaks down.

This project is licensed under the terms of the GNU General Public License v3.0.

Setup

The preferred method of installation is Mambaforge, a more reliable clone of Conda. See here for a user guide to Mamba.

Create a mamba environment and with the required dependencies:

mamba create -n [YOUR_ENV_NAME] 4ti2 atpbar numpyro pulp "pymc>=5" "sympy>=1.12"
mamba activate [YOUR_ENV_NAME]

To install the core funtionality as a module, clone the repository and run the following command from the root directory of this repository:

pip install .

To run the HIPPO and AEML C programs, you will need to install GCC and GSL, and run the following commands from the hippo_aeml/ directory:

gcc hippo.c -o hippo -lgsl -lgslcblas -lm
gcc AEML.c -o AEML -lgsl -lgslcblas -lm

Further dependencies are required to run the spatiotemporal example under test/dhps/:

pip install pycountry_convert
mamba install tqdm cartopy pandas xlrd

Tips

Depending on your computational resources, you may want to adjust to number of MCMC chains and CPU cores used. In particular, there is an overhead for multithreaded operations used by PyMC, and some people recommend setting the environment variables for the number of threads used by libraries such as MKL and OpenMP to 1. This can be done by editing the terminal configuration file, e.g. ~/.bashrc, or by adding the lines

import os
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

at the very top of the Python script being run.

Some methods run commands from 4ti2, which requires a directory for 4ti2 files to be stored. It is recommended to a create a directory 4ti2-files/ from the root directory of this repository.

More efficient integer programming solvers can speed up the pre-processing phase of the proposed methods. This can be configured by specifying a pulp solver when creating LatentMult objects.

Sampling with NUTS may sometimes take a long time (>5 minutes) to compile, which tends to occur for the multinormal approximation. If a long compilation time is observed, consider setting the jaxify argument (e.g. in haplm.lm_inference.latent_mult_mcmc) to True. This should speed up compilation at the cost of slower sampling.

References

[1] Foo, Y. S., Flegg, J. A. (2023). Haplotype inference from pooled genetic data with a latent multinomial model. arXiv preprint arXiv:2308.16465. https://doi.org/10.48550/arXiv.2308.16465

[2] Pirinen, M. (2009). Estimating population haplotype frequencies from pooled SNP data using incomplete database information. Bioinformatics, 25(24), 3296–3302. https://doi.org/10.1093/bioinformatics/btp584

[3] Kuk, A. Y. C., Zhang, H., & Yang, Y. (2009). Computationally feasible estimation of haplotype frequencies from pooled DNA with and without Hardy-Weinberg equilibrium. Bioinformatics, 25(3), 379–386. https://doi.org/10.1093/bioinformatics/btn623

[4] Foo, Y. S., Flegg, J. A. (2023). A spatiotemporal model of multi-marker antimalarial resistance. medRxiv preprint medRxiv:2023.10.03.23296508. https://doi.org/10.1101/2023.10.03.23296508

About

Bayesian inference for pooled haplotype data using latent multinomials

Resources

License

Stars

Watchers

Forks

Packages

No packages published