Haplotype phasing in recently admixed populations

In this assignment you will develop and implement an algorithm that takes as input the genotypes of individuals from admixed populations and infers their haplotypes. Genotyping technologies provide strings over {0, 1, 2}, where each value is the number of copies of the reference allele at a SNP. However, these technologies do not measure complete haplotypes (i.e. the exact sequence of alleles on each of an individual's two chromosomes).

The problem becomes harder when the individuals come from recently admixed populations. Recall that an admixed population starts from two or more homogeneous populations (i.e. each individual comes from exactly one population). Then, in each new generation, under random mating, each individual receives two haplotypes, one per parent, from the previous generation (the two parents may belong to different populations). Importantly, each of the two haplotypes undergoes recombination at an unknown recombination rate r. In our case, we consider admixed individuals from two recently admixed populations (i.e. few generations have passed since the populations were homogeneous), under the assumption of random mating.

Develop and implement an algorithm for phasing haplotypes of individuals from recently admixed populations. You are free to use any methodology that we have seen in class or that you can find online, and you may use existing software packages as long as they were not designed for haplotype phasing.
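To make the ambiguity concrete, the sketch below enumerates every haplotype pair that is consistent with one genotype. It assumes haplotypes are coded as 0/1 strings whose site-wise sum gives the 0/1/2 genotype; the function names and the toy genotype are invented for illustration and are not part of the assignment.

```python
from itertools import product


def genotype_from_haplotypes(h1, h2):
    """Sum two 0/1 haplotypes site by site to obtain a 0/1/2 genotype."""
    return [a + b for a, b in zip(h1, h2)]


def compatible_phasings(genotype):
    """Enumerate every unordered haplotype pair consistent with a genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fully determined: 0 -> (0, 0) and 2 -> (1, 1).
    # Heterozygous sites get a placeholder 0 that is overwritten below.
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(base), list(base)
        for site, bit in zip(het_sites, bits):
            h1[site], h2[site] = bit, 1 - bit
        # Sort the pair so that (h1, h2) and (h2, h1) count only once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


example_genotype = [2, 1, 0, 1]  # toy data, two heterozygous sites
phasings = compatible_phasings(example_genotype)
for h1, h2 in phasings:
    print(h1, h2)
```

A genotype with k heterozygous sites (k at least 1) has 2^(k-1) distinct unordered phasings, so exhaustive enumeration is only practical on short windows. A phasing method has to choose among these candidates using information shared across individuals.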

Assignment files and compiling your solutions

This assignment is accompanied by the following files:

EXAMPLE GENOTYPES: “example data 1.txt”, “example data 2.txt”, “example data 3.txt” - three example datasets to develop and test your method on. Each file contains the genotypes for a set of individuals. Format of files: each row in the file is a genomic position (SNP), each column in the file is an individual. Values are separated by spaces.
EXAMPLE HAPLOTYPES: “example data 1 sol.txt”, “example data 2 sol.txt”, “example data 3 sol.txt” - each file contains the true haplotypes for one of the three sets of example genotypes. Format of files: each row in the file is a genomic position (SNP), each pair of consecutive columns in the file is an individual (i.e. for an individual’s genotype column i, the corresponding haplotypes are located at columns 2i and 2i + 1, counting columns from 0). Values are separated by spaces.
TEST GENOTYPES: “test data 1.txt”, “test data 2.txt” - each file contains the genotypes for a set of individuals. Once you have developed your method, run your method on these test genotypes and output your estimated haplotypes. These are the haplotypes you will submit for grading. Format of files: each row in the file is a genomic position (SNP), each column in the file is an individual. Values are separated by spaces.
GENOTYPE POSITIONS: “example data 1 geno positions.txt”, “example data 2 geno positions.txt”, “example data 3 geno positions.txt”, “test data 1 geno positions.txt”, and “test data 2 geno positions.txt” - for each of the example and test datasets, you get the physical positions of the genotypes on the chromosome. This additional information can be incorporated into your algorithm (however, you are not required to use it).
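The file layouts described above can be checked with a few lines of code. The following is a minimal sketch that parses the space-separated matrices and verifies that haplotype columns 2i and 2i + 1 sum to genotype column i. It assumes 0-based column indices and 0/1 haplotype coding; the helper names and the toy matrices are invented for illustration.

```python
def parse_matrix(text):
    """Parse space-separated integers, one row per SNP."""
    return [[int(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]


def column(matrix, j):
    """Return column j of a row-major matrix as a list."""
    return [row[j] for row in matrix]


def haplotypes_match_genotypes(genotypes, haplotypes):
    """Check that haplotype columns 2i and 2i + 1 sum to genotype column i."""
    n_individuals = len(genotypes[0])
    for i in range(n_individuals):
        g = column(genotypes, i)
        h1 = column(haplotypes, 2 * i)
        h2 = column(haplotypes, 2 * i + 1)
        if [a + b for a, b in zip(h1, h2)] != g:
            return False
    return True


# Toy data: 3 SNPs (rows) and 2 individuals (genotype columns).
toy_genotypes = parse_matrix("1 2\n0 1\n2 1\n")
toy_haplotypes = parse_matrix("1 0 1 1\n0 0 0 1\n1 1 1 0\n")
print(haplotypes_match_genotypes(toy_genotypes, toy_haplotypes))
```

For a real file, the same parser can be fed the contents returned by `open(path).read()`, with `path` pointing at one of the example genotype or haplotype files.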

In your final submission you are required to submit your estimated haplotypes for all the individuals in the files “test data 1.txt” and “test data 2.txt”. Specifically, submit two files “test data 1 sol.txt” and “test data 2 sol.txt”, each with the inferred haplotypes of the genotypes in the corresponding test file using the same format as the example haplotype files (i.e. “example data 1 sol.txt”, see above).
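As a rough illustration of the expected output layout, the sketch below phases a toy genotype matrix with a deliberately naive rule and formats the result like the example haplotype files. The rule (always place the allele of a heterozygous site on the first haplotype) is only a placeholder baseline, not a recommended method, and all names here are hypothetical.

```python
def naive_phase(genotype):
    """Placeholder phasing: heterozygous sites always go to the first haplotype."""
    h1 = [1 if g >= 1 else 0 for g in genotype]
    h2 = [1 if g == 2 else 0 for g in genotype]
    return h1, h2


def format_solution(genotypes):
    """Phase each individual and lay the result out as rows of SNPs.

    Individual i occupies output columns 2i and 2i + 1 (0-based),
    with values separated by single spaces.
    """
    n_snps = len(genotypes)
    n_individuals = len(genotypes[0])
    rows = [[] for _ in range(n_snps)]
    for i in range(n_individuals):
        genotype = [row[i] for row in genotypes]
        h1, h2 = naive_phase(genotype)
        for s in range(n_snps):
            rows[s].extend([h1[s], h2[s]])
    return "\n".join(" ".join(str(v) for v in row) for row in rows) + "\n"


# Toy data: 3 SNPs (rows) and 2 individuals (columns).
toy_genotypes = [[1, 2], [0, 1], [2, 1]]
solution_text = format_solution(toy_genotypes)
print(solution_text, end="")

# To save a real submission, write the text to the solution file, e.g.:
# with open("test_data_1_sol.txt", "w") as f:
#     f.write(solution_text)
```

Swapping `naive_phase` for a real phasing routine leaves the formatting code unchanged, which keeps the output step separate from the inference step.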

Importantly, the example datasets are provided merely as examples of possible input/output. Note that the input of these datasets does not necessarily represent the input of the test data.

Repository files:

Assignment Details
README.md
haplotypePhaser.py
report.pdf
test_data_1_sol.txt
test_data_2_sol.txt

helengracehuang/Computational-Genetics
