-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More complete rGFA support (experimental) #4113
base: master
Are you sure you want to change the base?
Conversation
Just wanted to say I've been experimenting with an end user hack to do something similar*, and I've seen the largest impact in accuracy from variants called around insertions. Reads which straddle a reference and insertion contig will align to one or the other. As a result, depth around insertions flattens out, or even dips a little. Here's an example in HG002 around chr10:133111600, where vg haplotypes identified two overlapping 900bp insertions (one maternal, one paternal). Surjecting reads to both insertions and GRCh38 led to reduced depth in the region, causing deepvariant to miss some GIAB annotated variants. However, the reads which surjected to the reference sequence now had their non-reference sequence softclipped. This led to fewer false positive calls around insertions due to misaligned reads. PS: This is a really good test region. Both the insertions and the surrounding reference sequence are complex enough to allow for uniquely mapping reads, and the accuracy of variant calling is greatly improved when reads from either insertion are not mismapped to the linear reference sequence. It is also in an intron of a CNS expressed gene that has been linked to appetite and overeating, suggesting potential clinical relevance. I'm very excited to eventually see this code on the main branch!
Then: |
Hi @glennhickey, I wonder if you are still planning to implement this feature? -Joe |
Yes! But I've been completely bogged down with other things. The only good news is that I think I'm on the hook to present about it in a few weeks so I won't have much choice but to get moving. I still have your examples to reproduce problems that I will check out. Sorry for the long wait. |
Not a problem @glennhickey, I'm so grateful for the work you've put in. I just wanted to double-check, as I'm a few months out from putting together a manuscript comparing the effect of vg-deepvariant with bwamem-haplotypecaller and bwamem-deepvariant, and I think this would be a good addition to the paper. I'm in the middle of a too-many-projects-too-little-time period myself, so I am very, very familiar with the feeling. |
This reverts commit 89af9b6.
Changelog Entry
To be copied to the draft changelog by merger:
vg paths -R
option added to compute an rGFA cover (based on a reference sample) from a (mutable) graph, and add it as path fragments.Description
This is a fixup and refactor of #3891. It uses the same greedy (on steps) method to select snarl traversals to add to the cover. The cover is now stored in a (really simple) in-memory index that maps nodes to fragments that cover them, and these fragments can be walked back to the rank-0 reference.
rGFA covers can be computed from full paths, loaded from path fragments, and saved to path fragments. They are always defined relative to a rank-0 / reference sample which stays as a normal reference path in the graph.
Off-reference rGFA path fragments are REFERENCE-sense paths with a special sample name (
_rGFA_
) which allows them to be easily selected. Would like to find something less hacky, but this should work for experiments now.Corresponding Cactus update is: ComparativeGenomicsToolkit/cactus#1186