Skip to content

Detect Creation/Destruction of Sequence Motifs as a Result of Mutations

License

Notifications You must be signed in to change notification settings

LPCDRP/motif-variants

Repository files navigation

install with bioconda

NAME

mvp - detect creation/destruction of sequence motifs as a result of mutations

DESCRIPTION

Sequence variation may cause the appearance or disappearance of certain motifs. Since motifs can be recognition sites for biological functions such as regulation or DNA modification, their gain and loss can have additional consequences.

Using a list of variants in variant call format, the corresponding reference sequence, and a set of motifs to search for, mvp (motif-variant probe) identifies variants responsible for changing the number of occurrences of these motifs in the sequence. mvp can process both nucleotide and amino acid sequences. For the latter, the variant call format is still used to represent the amino acid changes. Motifs must be input using IUPAC ambiguity codes, simple regular expressions, or a combination of the two.

EXAMPLES

See the help menu for usage information:

$ mvp --help
usage: mvp [-h] [-o OUTFILE] -r REFERENCE (-f MOTIF_FILE | -m MOTIF_LIST)
           [-t {dna,aa}]
           infile

Motif-Variant Probe: detect motif gain and loss due to mutations

positional arguments:
  infile                vcf or vcf.gz file containing mutations (default:
                        stdin)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        results table (default: stdout)
  -r REFERENCE, --reference REFERENCE
                        reference sequence in fasta format
  -f MOTIF_FILE, --motif-file MOTIF_FILE
                        file containing a list of motifs to check
  -m MOTIF_LIST, --motif-list MOTIF_LIST
                        a comma-delimited string of motifs to check
  -t {dna,aa}, --sequence-type {dna,aa}
                        DNA or amino acid (default: dna)

Process the example genomic data:

$ mvp -r reference.fasta -m gagtc,agcta,aagctc example.vcf  | column -t
motif   strand  position  reference  variant
GAGTC   +       85        1          0
AAGCTC  +       1243      1          0
AGCTA   +       905       0          1
AGCTA   -       905       0          1

If the strand is negative, the information corresponds to the partner motif. position is that of the first VCF record in the set of variants responsible for the effect. Multiple variants can be responsible for an effect if they are close enough together with respect to the length of the motif. The number under reference is the number of occurrences of the motif in the reference segment, and likewise for the variant segment. Thus, for the example above, we see that two of our motifs (GAGTC and AAGTC) had instances destroyed by mutations, while both AGCTA and its partner motif (TAGCT) were instantiated by variants around position 905 with respect to the reference sequence.

It is also possible to specify motifs using IUPAC ambiguity codes, simple regular expressions, or a combination of the two. This works for both DNA and amino acid sequences.

Running the same example as above, but making the same motifs a little more ambiguous:

$ mvp -r reference.fasta -m grbts,[ac]gcta,a[ac]dctc example.vcf  | column -t
motif            strand  position  reference  variant
G[AG][CGT]T[GC]  +       85        1          0
G[AG][CGT]T[GC]  -       85        1          1
A[AC][AGT]CTC    +       1243      1          0
[AC]GCTA         +       905       0          1
[AC]GCTA         -       905       0          1

We provided one motif using only the IUPAC codes, the second using a simple regular expression ([ac] meaning either A or C, which would correspond to M in the IUPAC code), and the third using a mix of the two. The results are returned to us labeling all the motifs as simple regular expressions for consistency.

You can see now that we have now picked up more results due to our relaxing the motif specifications. In particular, the second line showing a single occurrence of the partner motif for G[AG][CGT]T[GC] in both the reference and the variant can have multiple meanings. Either it is the same motif unaffected by the variant's perturbation, or it may have shifted a bit along the sequence.

CITATION

Elghraoui, Afif, and Faramarz Valafar. MVP: Detection of Motif-Making and -Breaking Mutations. arXiv, October 15, 2022. doi:10.48550/arXiv.2210.09842

About

Detect Creation/Destruction of Sequence Motifs as a Result of Mutations

Resources

License

Stars

Watchers

Forks

Packages

No packages published