In this assignment you will develop and implement an algorithm that takes as input the genotypes of individuals from admixed populations and infers the haplotypes of those individuals. Genotyping technologies provide us with strings over {0, 1, 2}, representing the number of copies of the reference allele at each SNP. However, these technologies do not allow us to measure complete haplotypes (i.e. the exact sequence of alleles on each of the two chromosomes of an individual).
The problem is further complicated when phasing haplotypes of individuals from recently admixed populations. Recall that the creation of an admixed population is a process that starts with two or more homogeneous populations (i.e. each individual comes strictly from one particular population). Then, at each new generation, following a process of random mating, each individual receives two haplotypes, one per parent, from the populations of the previous generation (the two haplotypes may come from parents who belong to different populations). Importantly, each of the two haplotypes undergoes recombination at an unknown recombination rate r. In our case, we consider admixed individuals from two recently admixed populations (i.e. not many generations have passed since the populations were homogeneous), under the assumption of random mating.
Develop and implement an algorithm for phasing haplotypes of individuals from recently admixed populations. You are free to use any of the methodologies that we have seen in class or that you can find online, and you can use existing software packages as long as they were not designed for haplotype phasing.
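To make the relationship between genotypes and haplotypes concrete, here is a minimal sketch. It assumes that each haplotype is a 0/1 string (1 meaning a copy of the reference allele) and that the genotype at each SNP is the sum of the two haplotype alleles; the function names are illustrative and not part of the assignment. It also shows why phasing is ambiguous: every heterozygous site (genotype 1) can be resolved in two ways.

```python
from itertools import product

def genotype_from_haplotypes(h1, h2):
    """Genotype at each SNP = sum of the two haplotype alleles (assumed 0/1)."""
    return [a + b for a, b in zip(h1, h2)]

def compatible_haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with a genotype.

    Homozygous sites (0 or 2) are fully determined; each heterozygous
    site (1) can be phased two ways. The enumeration is exponential in the
    number of heterozygous sites, so it is only meant for tiny examples.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # 0 -> 0, 2 -> 1; het sites filled below
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = base[:], base[:]
        for i, b in zip(het_sites, bits):
            h1[i], h2[i] = b, 1 - b
        # Sort the two haplotypes so that (h1, h2) and (h2, h1) count once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return pairs

g = genotype_from_haplotypes([0, 1, 1, 0], [0, 0, 1, 1])
print(g)                                   # [0, 1, 2, 1]
print(len(compatible_haplotype_pairs(g)))  # 2
```

A genotype with k heterozygous sites has 2^(k-1) unordered phasings for k >= 1, which is why a practical method needs a model of the population (for example, shared haplotype structure across individuals) instead of enumeration.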
This assignment is accompanied by the following files:
- EXAMPLE GENOTYPES: “example data 1.txt”, “example data 2.txt”, “example data 3.txt” - three example datasets to develop and test your method on. Each file contains the genotypes for a set of individuals. Format of files: each row in the file is a genomic position (SNP), each column in the file is an individual. Values are separated by spaces.
- EXAMPLE HAPLOTYPES: “example data 1 sol.txt”, “example data 2 sol.txt”, “example data 3 sol.txt” - each file contains the true haplotypes for one of the three sets of example genotypes. Format of files: each row in the file is a genomic position (SNP), each pair of consecutive columns in the file is an individual (i.e. counting columns from zero, for an individual’s genotype column i, the corresponding haplotypes are located at columns 2i and 2i + 1). Values are separated by spaces.
- TEST GENOTYPES: “test data 1.txt”, “test data 2.txt” - each file contains the genotypes for a set of individuals. Once you have developed your method, run your method on these test genotypes and output your estimated haplotypes. These are the haplotypes you will submit for grading. Format of files: each row in the file is a genomic position (SNP), each column in the file is an individual. Values are separated by spaces.
- GENOTYPES POSITIONS: “example data 1 geno positions.txt”, “example data 2 geno positions.txt”, “example data 3 geno positions.txt”, “test data 1 geno positions.txt”, and “test data 2 geno positions.txt” - for each of the example and test datasets, you get the physical positions of the genotypes on the chromosome. This additional information can be incorporated into your algorithm (however, you are not required to use it).
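The file formats described above can be read with a few lines of code. The following is a minimal sketch, assuming zero-based column indices (so individual i occupies haplotype columns 2i and 2i + 1) and assuming that a genotype equals the sum of its two haplotype alleles; the helper names are hypothetical. It parses the space-separated matrices and checks a haplotype file against its genotype file, which is a useful sanity check on the example data.

```python
def parse_matrix(text):
    """Parse space-separated integers: one row per SNP, one column per sample."""
    return [[int(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]

def load_matrix(path):
    """Read a genotype or haplotype file from disk (path supplied by the caller)."""
    with open(path) as fh:
        return parse_matrix(fh.read())

def haplotype_columns(hap_rows, i):
    """Return the two haplotypes of individual i (zero-based), as lists over SNPs."""
    return ([row[2 * i] for row in hap_rows],
            [row[2 * i + 1] for row in hap_rows])

def is_consistent(geno_rows, hap_rows):
    """True if every genotype equals the sum of its two haplotype alleles."""
    if len(geno_rows) != len(hap_rows):
        return False
    for g_row, h_row in zip(geno_rows, hap_rows):
        if len(h_row) != 2 * len(g_row):
            return False
        for i, g in enumerate(g_row):
            if h_row[2 * i] + h_row[2 * i + 1] != g:
                return False
    return True

# Tiny made-up example: 3 SNPs (rows), 2 individuals (genotype columns).
geno = parse_matrix("0 1\n2 1\n1 0\n")
haps = parse_matrix("0 0 1 0\n1 1 0 1\n0 1 0 0\n")
print(is_consistent(geno, haps))   # True
print(haplotype_columns(haps, 1))  # ([1, 0, 0], [0, 1, 0])
```

The data in the example is invented for illustration and does not come from the assignment files.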
In your final submission you are required to submit your estimated haplotypes for all the individuals in the files “test data 1.txt” and “test data 2.txt”. Specifically, submit two files “test data 1 sol.txt” and “test data 2 sol.txt”, each with the inferred haplotypes of the genotypes in the corresponding test file using the same format as the example haplotype files (i.e. “example data 1 sol.txt”, see above).
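One way to produce a submission file in that layout is sketched below. It assumes the inferred haplotypes are held as a list of (h1, h2) pairs, one pair per individual in genotype-column order, each a list of 0/1 values over SNPs; the function names and this in-memory representation are illustrative choices, not requirements of the assignment.

```python
def format_haplotypes(hap_pairs):
    """Render haplotype pairs as text: one row per SNP, two columns per individual.

    hap_pairs: list of (h1, h2), one pair per individual, each haplotype a
    list of 0/1 alleles of equal length (an assumed representation).
    """
    n_snps = len(hap_pairs[0][0])
    lines = []
    for s in range(n_snps):
        row = []
        for h1, h2 in hap_pairs:
            row.extend((h1[s], h2[s]))  # the individual's two adjacent columns
        lines.append(" ".join(str(a) for a in row))
    return "\n".join(lines) + "\n"

def write_haplotypes(path, hap_pairs):
    """Write the formatted haplotypes to the given path."""
    with open(path, "w") as fh:
        fh.write(format_haplotypes(hap_pairs))

# Two made-up individuals over three SNPs.
pairs = [([0, 1, 0], [0, 1, 1]), ([1, 0, 0], [0, 1, 0])]
print(format_haplotypes(pairs), end="")
# 0 0 1 0
# 1 1 0 1
# 0 1 0 0
```

Before submitting, it is worth re-reading the written file and confirming that it has the same number of rows as the test genotype file and exactly twice as many columns.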
Importantly, the example datasets are provided merely as examples of possible input/output. Note that the inputs in these datasets are not necessarily representative of the inputs in the test data.