BEAST2 XML

This is a first, and very simplistic, cut at generating BEAST2 XML from Python (2.7, and 3.5 to 3.10 are all known to work).

BEAST2 is a complex program and so is its input XML. People normally generate the input XML using a GUI tool, BEAUti. BEAUti is also a complex tool, and the XML it generates can vary widely. Because BEAUti is a GUI tool it's not possible to use it to programmatically generate XML.

I wrote beast2-xml because I wanted a way to quickly and easily generate BEAST2 XML files, from the command line and also from my Python code.

There are a lot of things that could be added to this code! Contributions welcome.

The package provides

  • A command-line script (bin/beast2-xml.py) to generate BEAST2 XML files.
  • A simplistic Python class (in beast2xml/beast2.py) that may be helpful if you are writing Python that needs to generate BEAST2 XML files. (This Python class is of course used by the command line script.)

Installation

$ pip install beast2-xml

You can also get the source from PyPI or from GitHub.

Generate XML from the command line

You can use bin/beast2-xml.py to quickly generate BEAST2 XML. You must provide the sequences for the analysis (as FASTA or FASTQ), either on standard input or using the --fastaFile option.

Run beast2-xml.py --help to see currently supported options:

$ beast2-xml.py --help
usage: beast2-xml.py [-h] [--clock_model MODEL | --templateFile FILENAME]
                     [--chain_length LENGTH] [--age ID=N [ID=N ...]]
                     [--default_age N] [--date_unit UNIT]
                     [--date_direction DIRECTION]
                     [--log_file_basename BASE-FILENAME] [--trace_log_every N]
                     [--tree_log_every N] [--screen_log_every N] [--mimic_beauti]
                     [--sequence_id_date_regex REGEX]
                     [--sequence_id_age_regex REGEX]
                     [--sequenceIdRegexMayNotMatch] [--fastaFile FILENAME]
                     [--readClass CLASSNAME] [--fasta | --fastq | --fasta-ss]

Given FASTA on stdin (or in a file via the --fastaFile option), write an XML
BEAST2 input file on stdout.

optional arguments:
  -h, --help            show this help message and exit
  --clock_model MODEL    Specify the clock model. Possible values are 'random-
                        local', 'relaxed-exponential', 'relaxed-lognormal', or
                        'strict' (default: strict)
  --templateFile FILENAME
                        The XML template file to use. (default: None)
  --chain_length LENGTH  The MCMC chain length. (default: None)
  --age ID=N [ID=N ...]
                        The age of a sequence. The format is a sequence id, an
                        equals sign, then the age. For convenience, just the
                        first part of a full sequence id (i.e., up to the
                        first space) may be given. May be specified multiple
                        times. (default: None)
  --default_age N        The age to use for sequences that are not explicitly
                        given an age via --age. (default: 0.0)
  --date_unit UNIT       Specify the date unit. Possible values are 'day',
                        'month', or 'year'. (default: year)
  --date_direction DIRECTION
                        Specify whether dates are back in time from the
                        present or forward in time from some point in the
                        past. Possible values are 'forward' or 'backward'.
                        (default: backward)
  --log_file_basename BASE-FILENAME
                        The base filename to write logs to. A ".log" or
                        ".trees" suffix will be appended to this to make
                        complete log file names. (default: beast-output)
  --trace_log_every N     How often to write to the trace log file. (default:
                        2000)
  --tree_log_every N      How often to write to the tree log file. (default:
                        2000)
  --screen_log_every N    How often to write logging to the screen (i.e.,
                        terminal). (default: 2000)
  --mimic_beauti         If specified, add attributes to the <beast> tag that
                        mimic what BEAUti uses so that BEAUti will be able to
                        load the XML. (default: False)
  --sequence_id_date_regex REGEX
                        A regular expression that will be used to capture sequence
                        dates from their ids. The regular expression must have three
                        named capture regions ("year", "month", and "day"). Regular
                        expression matching is anchored to the start of the id string
                        (i.e., Python's re.match function is used, not the re.search
                        function), so you must explicitly match the id from its beginning.
                        For example, you might use
                        --sequence_id_date_regex '^.*_(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)'.
                        (default: None)
  --sequence_id_age_regex REGEX
                        A regular expression that will be used to capture sequence ages
                        from their ids. The regular expression must have a single capture
                        region. Regular expression matching is anchored to the start of
                        the id string (i.e., Python's re.match function is used, not the
                        re.search function), so you must explicitly match the id from its
                        beginning. For example, you might use --sequence_id_age_regex '^.*_(\d+)$'
                        to capture an age preceded by an underscore at the very end of the
                        sequence id. If --sequence_id_date_regex is also given, it
                        takes precedence when matching sequence ids. (default: None)
  --sequenceIdRegexMayNotMatch
                        If specified (and --sequence_id_date_regex or --sequence_id_age_regex is given)
                        it will not be considered an error if a sequence id does not
                        match the given regular expression. In that case, sequences will be assigned
                        an age of zero unless one is given via --age. (default: False)
  --fastaFile FILENAME  The name of the FASTA input file. Standard input will
                        be read if no file name is given.
  --readClass CLASSNAME
                        If specified, give the type of the reads in the input.
                        Possible choices: SSAARead, DNARead, TranslatedRead,
                        RNARead, SSAAReadWithX, AAReadORF, Read, AARead,
                        AAReadWithX. (default: DNARead)
  --fasta               If specified, input will be treated as FASTA. This is
                        the default. (default: False)
  --fastq               If specified, input will be treated as FASTQ.
                        (default: False)
  --fasta-ss            If specified, input will be treated as PDB FASTA
                        (i.e., regular FASTA with each sequence followed by
                        its structure). (default: False)
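
To make the anchoring behaviour of --sequence_id_date_regex and --sequence_id_age_regex concrete, here is a small standalone sketch using only Python's re module. The two patterns are the examples from the help text; the sequence ids are made up for illustration. It shows how the patterns behave under re.match and re.search, not how beast2-xml handles the matches internally.

```python
import re

# Example patterns taken from the help text above.
DATE_REGEX = r'^.*_(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)'
AGE_REGEX = r'^.*_(\d+)$'

# Hypothetical sequence ids, invented for this illustration.
dated_id = 'hCoV-19_England_2021-03-15'
aged_id = 'sample_A_42'

# re.match only succeeds if the pattern matches from the start of the id.
date_match = re.match(DATE_REGEX, dated_id)
date_parts = {name: int(value) for name, value in date_match.groupdict().items()}

age_match = re.match(AGE_REGEX, aged_id)
age = int(age_match.group(1))

# A pattern that does not account for the start of the id fails with
# re.match, even though re.search would find it further along the string.
unanchored = r'_(\d+)$'
matches_from_start = re.match(unanchored, aged_id) is not None
found_anywhere = re.search(unanchored, aged_id) is not None
```

This is why both example patterns begin with `^.*_`: the leading `.*` consumes the start of the id so that the match can begin at position zero.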

As mentioned, this is extremely simplistic. If you need to generate more complex XML, you can pass in a template file using --templateFile. Your template will need to have a high-level structure that's similar to those produced by BEAUti, otherwise the various command-line options for manipulating the template won't find what they need (you'll see an error message in this case).

If you don't pass a template file name, a default will be chosen based on the clock model (strict by default). The default templates all come from BEAUti, so if you generate a template yourself using BEAUti, you can almost certainly pass it to beast2-xml.py to use as a basis to create variants from.
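
If you want to drive the script from Python rather than a shell, something like the following may work. It is a minimal sketch, assuming Python 3.5 or later (for subprocess.run), that beast2-xml.py is on your PATH, and that sequences.fasta (a made-up filename) exists. build_command and write_xml are hypothetical helpers, not part of the package; only options listed in the help output above are used.

```python
import subprocess


def build_command(fasta_file, clock_model='strict', chain_length=None):
    """Build an argument list for beast2-xml.py (hypothetical helper)."""
    command = ['beast2-xml.py', '--fastaFile', fasta_file,
               '--clock_model', clock_model]
    if chain_length is not None:
        command += ['--chain_length', str(chain_length)]
    return command


def write_xml(command, output_path):
    """Run the command and save its standard output (the XML) to a file."""
    with open(output_path, 'w') as out:
        subprocess.run(command, stdout=out, check=True)


# No template is given, so the script picks its default template for
# the requested clock model.
command = build_command('sequences.fasta', clock_model='relaxed-lognormal',
                        chain_length=1000000)
# write_xml(command, 'relaxed.xml')  # Uncomment once the script is installed.
```

Note that --clock_model and --templateFile are mutually exclusive in the usage line, so a variant that passes a template should drop the --clock_model argument.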

Note that the generated XML contains just the first part of sequence ids in the given FASTA input. I'm not sure if this is a requirement, but it's what BEAUti does and so I have done the same.
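
To show what "the first part" of a sequence id means here (the text up to the first space, as described for the --age option), a rough sketch follows. short_id is a hypothetical helper and the ids are invented; the package's own code may do this differently.

```python
def short_id(full_id):
    """Return the part of a sequence id before the first space.

    Hypothetical helper for illustration; not part of beast2-xml.
    """
    return full_id.split(' ', 1)[0]


# Made-up FASTA ids: one with a description after the id, one without.
full_ids = ['seq1 Homo sapiens isolate 7', 'seq2']
short_ids = [short_id(full_id) for full_id in full_ids]
```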

Generate BEAST2 XML in Python

If you want to create BEAST2 XML from your own template XML in Python, you can use the BEAST2XML class defined in beast2xml/beast2.py, as shown below.

Note: as of version 1.3.2, many of these methods are not available from the command line.

from beast2xml.beast2 import BEAST2XML
from dark.fasta import FastaReads
import pandas as pd

temp_xml_file = 'template_BEAST2.xml' # Path to your template xml.
temp_xml = BEAST2XML(template=temp_xml_file)

fasta_alignment = 'alignment.fasta' # Path to your fasta file.
temp_xml.add_sequences(fasta_alignment) # NOTE this replaces the sequences in the template BEAST2 xml.

metadata_file = 'metadata.csv' # Path to your age information.
temp_xml.add_ages(metadata_file, seperator=',', age_column="year_decimal")
# NOTE currently only year decimals/fractions are accepted (NOT dates).

temp_xml.change_prior('origin', 'uniform', lower=0.5, start=1.5, upper=4, wild_card_ending=True)
# This will search for any prior for a parameter whose name starts with origin and then change the prior.

newick_tree = 'initial.newick' # Path to Newick file to be used as an initial tree.
temp_xml.add_initial_tree(newick_tree)

path_to_your_modified_xml = 'path_to_your_modified_xml.xml'
temp_xml.to_xml(path_to_your_modified_xml)
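
Because add_ages currently accepts only decimal years, calendar dates have to be converted before they go into the metadata file. The following is a minimal sketch of one way to do that with the standard library. to_decimal_year is a hypothetical helper, not part of beast2-xml, and the convention used (the fraction of the year elapsed at the start of the day) is an assumption you may want to adjust.

```python
from datetime import date


def to_decimal_year(d):
    """Convert a datetime.date to a decimal year.

    Hypothetical helper: the year plus the fraction of that year that
    has elapsed at the start of the given day.
    """
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    # float() keeps the division exact under Python 2.7 as well.
    return d.year + (d - start).days / float((end - start).days)


sampled = date(2021, 7, 2)  # Made-up sampling date.
year_decimal = to_decimal_year(sampled)
```

Values computed this way could be written to the column named by age_column (year_decimal in the example above).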

There are several options you can pass to the BEAST2XML constructor:

class BEAST2XML(object):
    """
    Create BEAST2 XML instance.

    Parameters
    ----------
    template: str, default=None
        A filename or an open file pointer to read the
        XML template from. If C{None}, a template based on C{clock_model}
        will be used.
    clock_model: str, default="strict"
        Clock model to be used. Possible values
        are 'random-local', 'relaxed-exponential', 'relaxed-lognormal',
        and 'strict'.
    sequence_id_date_regex: str, default=None
        If not C{None}, gives a C{str} regular
        expression that will be used to capture sequence dates from their ids.
        See the explanation in ../bin/beast2-xml.py
    sequence_id_age_regex: str, default=None
        If not C{None}, gives a C{str} regular
        expression that will be used to capture sequence ages from their ids.
        See the explanation in ../bin/beast2-xml.py
    sequence_id_regex_must_match: bool, default=True
        If C{True} it will be considered an error
        if a sequence id does not match the regular expression given by
        C{sequence_id_date_regex} or C{sequence_id_age_regex}.
    date_unit: str, default="year"
        A C{str}, either 'day', 'month', or 'year' indicating the
        date time unit.

    """

and options you can pass to its to_string or to_xml methods:

def to_string(self,
              chain_length=None,
              default_age=0.0,
              date_direction=None,
              log_file_basename=None,
              trace_log_every=None,
              tree_log_every=None,
              screen_log_every=None,
              store_state_every=None,
              transform_func=None,
              mimic_beauti=False):
        """ Generate str version of xml.etree.ElementTree for running on BEAST.

        Parameters
        ----------
        chain_length: int, default=None
            The length of the MCMC chain. If C{None}, the value in the template will
            be retained.
        default_age: float or int, default=0.0
            The age to use for sequences that have not explicitly been given
            an age (see C{add_age}, C{add_ages}, C{add_sequence},
            C{add_sequences}).
        date_direction: str, default=None
            A C{str}, either 'backward', 'forward' or 'date', indicating whether dates are
            back in time from the present or forward in time from some point in the
            past.
        log_file_basename: str, default=None
            The base filename to write logs to. A .log or .trees suffix will be appended
            to this to make the actual log file names. If None, the log file names in
            the template will be retained.
        trace_log_every: int, default=None
            How often to write to the trace log file. If None, the value in the
            template will be retained.
        tree_log_every: int, default=None
            How often to write to the tree log file. If None, the value in the
            template will be retained.
        screen_log_every: int, default=None
            How often to write to the terminal (screen) log. If None, the
            value in the template will be retained.
        store_state_every: int, default=None
            How often to write the MCMC state file. If None, the
            value in the template will be retained.
        transform_func: callable, default=None
            A callable that will be passed the C{ElementTree} instance and which
            must return an C{ElementTree} instance.
        mimic_beauti: bool, default=False
            If True, add attributes to the <beast> tag in the way that BEAUti does, to
            allow BEAUti to load the XML we produce.

        Returns
        -------
        tree: str
            String representation of xml.etree.ElementTree for running on BEAST
        """

def to_xml(self,
           path,
           chain_length=None,
           default_age=0.0,
           date_direction=None,
           log_file_basename=None,
           trace_log_every=None,
           tree_log_every=None,
           screen_log_every=None,
           store_state_every=None,
           transform_func=None,
           mimic_beauti=False):
        """
        Generate xml.etree.ElementTree for running on BEAST and write to xml file.

        Parameters
        ----------
        path: str
            Path to write xml file to.
        chain_length: int, default=None
            The length of the MCMC chain. If C{None}, the value in the template will
            be retained.
        default_age: float or int, default=0.0
            The age to use for sequences that have not explicitly been given
            an age (see C{add_age}, C{add_ages}, C{add_sequence},
            C{add_sequences}).
        date_direction: str, default=None
            A C{str}, either 'backward', 'forward' or 'date', indicating whether dates are
            back in time from the present or forward in time from some point in the
            past.
        log_file_basename: str, default=None
            The base filename to write logs to. A .log or .trees suffix will be appended
            to this to make the actual log file names. If None, the log file names in
            the template will be retained.
        trace_log_every: int, default=None
            How often to write to the trace log file. If None, the value in the
            template will be retained.
        tree_log_every: int, default=None
            How often to write to the tree log file. If None, the value in the
            template will be retained.
        screen_log_every: int, default=None
            How often to write to the terminal (screen) log. If None, the
            value in the template will be retained.
        store_state_every: int, default=None
            How often to write the MCMC state file. If None, the
            value in the template will be retained.
        transform_func: callable, default=None
            A callable that will be passed the C{ElementTree} instance and which
            must return an C{ElementTree} instance.
        mimic_beauti: bool, default=False
            If True, add attributes to the <beast> tag in the way that BEAUti does, to
            allow BEAUti to load the XML we produce.

        Returns
        -------
        None

        """

An example of using the Python class can be found in the beast2-xml.py script. Small examples showing all functionality can be found in the tests in test/test_beast2.py.
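
As one further illustration, the transform_func argument of to_string and to_xml is documented as a callable that receives the ElementTree and must return an ElementTree. Below is a minimal sketch of such a callable, exercised on a tiny stand-in document rather than a real BEAST2 template. add_provenance_comment and the comment text are made up for this example.

```python
import xml.etree.ElementTree as ET


def add_provenance_comment(tree):
    """Example transform: insert an XML comment as the first child of the root.

    Takes an ElementTree and returns the same ElementTree, which is the
    shape the transform_func docstring asks for.
    """
    root = tree.getroot()
    root.insert(0, ET.Comment(' generated by my-pipeline '))
    return tree


# A tiny stand-in document; a real BEAST2 template is far larger.
stand_in = ET.ElementTree(ET.fromstring('<beast><run id="mcmc"/></beast>'))
result = add_provenance_comment(stand_in)

# With a BEAST2XML instance, the call would presumably look like:
# temp_xml.to_xml('out.xml', transform_func=add_provenance_comment)
```

A comment is used here because it does not alter any element that BEAST2 reads. A transform that edits elements or attributes would follow the same pattern.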

Development

To run the tests:

$ make check

or if you have Twisted installed, you can use its trial test runner, via

$ make tcheck

You can also use

$ tox

to run tests for various versions of Python.