Methods

There are four popular molecular phylogenetic methods.

The following is a summary of information from a number of resources, workshops and other material I have come across while learning data analysis. As with the rest of this webpage, it is certainly not exhaustive and is written in a context relevant to the work I do (i.e. parasite identification and a dabble in taxonomy and systematics).

Australian National Postgraduate Training Workshop in Systematics, supported by the Society of Australian Systematic Biologists, the Australian Biological Resources Study and the Australian Centre for Evolutionary Biology and Biodiversity
Sydney Phylo Workshop by Simon Ho.

1. Maximum parsimony

Identifies the tree topology that explains the sequence data with the smallest number of inferred substitution events. It is commonly used for morphological data but is now rarely used for analysing genetic data. It cannot estimate evolutionary rates or timescales, and it does not correct for multiple substitutions at the same site, which makes it susceptible to what is known as "long-branch attraction".
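
To make "smallest number of inferred substitution events" concrete, here is a minimal sketch of the Fitch small-parsimony count for a fixed rooted binary tree. The function names, the nested-tuple tree format and the four-taxon alignment are all made up for illustration; real parsimony software also has to search over topologies, which this sketch does not attempt.

```python
def fitch_score(tree, states):
    """Minimum number of substitutions for one site on a rooted binary tree.

    tree   : nested 2-tuples with leaf names as strings, e.g. (("A", "B"), ("C", "D"))
    states : dict mapping leaf name -> character observed at this site
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (state_set(child) for child in node)
        shared = left & right
        if shared:                         # children agree: no change needed here
            return shared
        changes += 1                       # children disagree: one substitution
        return left | right

    state_set(tree)
    return changes


def parsimony_length(tree, alignment):
    """Sum the Fitch score over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical toy alignment and two candidate topologies.
alignment = {"A": "ACGT", "B": "ACGA", "C": "GCTA", "D": "GCTA"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(parsimony_length(tree_1, alignment))  # 3
print(parsimony_length(tree_2, alignment))  # 5
```

Under the parsimony criterion, tree_1 would be preferred because it needs fewer substitutions to explain the same data.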

2. Distance-based methods

Clustering algorithms

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
Neighbour joining (NJ)
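
As a rough illustration of how a clustering algorithm builds a tree from pairwise distances, the sketch below implements the merging step of UPGMA: repeatedly join the two closest clusters and average their distances to everything else, weighted by cluster size. The function name `upgma`, the input format and the distances are assumptions for this example, and branch lengths (node heights) are left out to keep it short.

```python
from itertools import combinations


def upgma(names, distances):
    """Return the UPGMA topology as nested tuples.

    names     : list of taxon labels
    distances : dict mapping (name_a, name_b) -> pairwise distance
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves it holds
    d = {frozenset(pair): value for pair, value in distances.items()}

    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted mean of the two old distances.
        for other in sizes:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    sizes[a] * d[frozenset((a, other))]
                    + sizes[b] * d[frozenset((b, other))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]

    return next(iter(sizes))


# Hypothetical distance matrix for four taxa.
toy_distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], toy_distances))  # ('D', ('C', ('A', 'B')))
```

Neighbour joining follows the same merge-and-update pattern but chooses pairs with a criterion that does not assume equal rates across lineages.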

Tree searching using optimality criteria

Minimum evolution
Least-squares inference

Pros

Able to correct for multiple substitutions, which reduces long-branch attraction
Useful for analysing very large data sets with many taxa (1000s or more)

Cons

Does not use all information in alignment
Loss of information in pairwise comparisons
Unable to implement sophisticated evolutionary models
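
The sketch below shows what a pairwise comparison and a correction for multiple substitutions can look like, using the Jukes-Cantor (JC69) formula as one simple example of a distance correction. The function names and the two sequences are hypothetical. Note how each pair of sequences is reduced to a single number, which is where the loss of information comes from.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ, ignoring sites with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical pair of aligned sequences differing at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTACGTTT")
print(p)                            # 0.2
print(round(jc69_distance(p), 4))   # 0.2326
```

The corrected distance is larger than the observed proportion because some sites are assumed to have changed more than once.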

3. Maximum likelihood (ML)

Search through the space of possible trees and parameter values
Calculate the likelihood of the data for each of these
Find best tree and model parameter values
Multivariate optimisation
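
The steps above can be sketched for the simplest possible case: two sequences, one parameter (the branch length between them) and the JC69 model. This is only an illustration under those assumptions, with made-up function names and a crude grid search; real ML programs optimise the topology, all branch lengths and the model parameters together.

```python
import math


def jc69_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of a pairwise alignment given branch length d under JC69."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * decay)  # probability of a specific identical site pattern
    p_diff = 0.25 * (0.25 - 0.25 * decay)  # probability of a specific differing site pattern
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)


def ml_branch_length(n_same, n_diff, grid_step=0.001, max_length=2.0):
    """Search a grid of candidate branch lengths and keep the most likely one."""
    n_points = int(max_length / grid_step)
    candidates = [grid_step * i for i in range(1, n_points + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(d, n_same, n_diff))


# Hypothetical data: 8 identical sites and 2 differing sites.
best = ml_branch_length(n_same=8, n_diff=2)
print(best, jc69_log_likelihood(best, 8, 2))
```

For this toy case the grid search should land close to the analytical JC69 distance of about 0.233 substitutions per site.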

Pros

Rigorous statistical method
Deals with multiple substitutions and long-branch attraction
Highly robust to violations of assumptions

Cons

Not feasible to implement very parameter-rich models
Searching tree space can be difficult

ML Analysis

RAxML = Randomized Axelerated Maximum Likelihood. Offers rapid bootstrapping and can run sequentially or in parallel.
PhyML
MEGA
PAML
IQ-Tree

4. Bayesian inference

Bayesian phylogenetic analysis was developed in the mid 1990s, and is now one of the most widely used methods for molecular phylogenetics.

Key features of Bayesian paradigm

Contrasts with frequentist statistics (e.g. maximum likelihood)
Parameters have distributions
Before the data are observed, each parameter has a prior distribution
The likelihood of the data is computed
The prior distribution is combined (updated) with the likelihood to yield the posterior distribution
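
A small numerical sketch of the prior-to-posterior update may help. It uses a single made-up parameter, theta (the probability that a site differs between two sequences), a flat prior over a grid of values, and invented data (2 differing sites out of 10). Nothing here is specific to any phylogenetics package.

```python
def grid_posterior(prior, likelihood):
    """Combine prior and likelihood point by point, then normalise to sum to 1."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Candidate values of theta and a flat (uninformative) prior over them.
grid = [i / 100 for i in range(1, 100)]
prior = [1 / len(grid)] * len(grid)

# Likelihood of the hypothetical data (2 differing, 8 identical sites) for each theta.
likelihood = [theta**2 * (1 - theta) ** 8 for theta in grid]

posterior = grid_posterior(prior, likelihood)
posterior_mean = sum(t * p for t, p in zip(grid, posterior))
most_probable = grid[posterior.index(max(posterior))]
print(round(posterior_mean, 3), most_probable)
```

With a flat prior the most probable value matches the maximum likelihood estimate (0.2), while the posterior as a whole also expresses how uncertain that estimate is.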

Priors

Priors are chosen in the form of probability distributions
Reflect our prior expectations (and uncertainty) about values of parameters (without knowledge of the data)
- Past observations
- Personal beliefs
- Use of a biological model
Uninformative priors

Prior options

Use a flat prior for tree topology (MrBayes)
- All trees have equal probability
- Also need a prior for branch lengths or node times
Use a biological model to generate prior distribution (BEAST and MrBayes)
- Among species: speciation model
- Within species: coalescent model

Markov Chain Monte Carlo Sampling

It is generally not possible to calculate the posterior directly. Instead, it is estimated by sampling with Markov chain Monte Carlo (MCMC) simulation, usually using the Metropolis-Hastings algorithm. There is a wealth of literature on this topic, and it can get very technical; only the basics are covered here.
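
To give a feel for how the sampling works, here is a minimal Metropolis-Hastings sketch for a single parameter. The target is a toy posterior (flat prior, 2 differing sites out of 10), the proposal is a symmetric random step, and all names and settings are assumptions for illustration. Phylogenetic MCMC applies the same accept/reject idea to trees, branch lengths and model parameters.

```python
import math
import random


def log_posterior(theta, n_diff=2, n_same=8):
    """Unnormalised log posterior for theta with a flat prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # outside the prior's support
    return n_diff * math.log(theta) + n_same * math.log(1.0 - theta)


def metropolis_hastings(log_target, start, n_steps, step_size, seed=1):
    """Sample from log_target using a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_target(proposal)
        # The proposal is symmetric, so the acceptance probability is just
        # the posterior ratio, capped at 1.
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)  # a rejected proposal repeats the current value
    return samples


samples = metropolis_hastings(log_posterior, start=0.5, n_steps=20000, step_size=0.1)
kept = samples[2000:]  # discard the early part of the chain as burn-in
print(sum(kept) / len(kept))
```

The mean of the retained samples should be close to 0.25, the exact posterior mean for this toy problem, which is the kind of check that MCMC diagnostics formalise.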

Pros

Able to implement highly parameterised models
Estimating tree uncertainty is straightforward
- Maximum likelihood can only do this indirectly (via bootstrapping)
Posterior probabilities have an intuitive interpretation
Can incorporate independent information (in the prior)

Popular software and programs for Bayesian analysis

MrBayes
- Primarily designed for species-level data
- Simultaneous estimation of tree and node times
- Range of clock models
- Range of tree priors
- Multiple chains and MCMC diagnostics
RevBayes
- Uses its own R-like language, Rev
- Interactive construction of graphical model
- Flexible and can be used for simulation and inference
- Ongoing development
BEAST 1 = Bayesian Evolutionary Analysis by Sampling Trees
- Analyse population- or species-level data
- Simultaneous estimation of tree and node times
- Range of clock models
- Range of tree priors and demographic models
BEAST2
- Re-write of BEAST to increase modularity
- Users can extend BEAST by adding packages
- Additional tree priors not available in BEAST 1
- Capacity to perform simulations
- For a comparison of BEAST 1 and 2: www.beast2.org/beast-features

Bootstrapping

Felsenstein (1985) Evolution 39(4):783-791.

Bootstrapping provides a confidence interval that contains the phylogeny that would be estimated from repeated sampling of many characters from the underlying set of all characters. Bootstrap values are measures of repeatability. They tend to be high when the data set is large, and so are not meaningful when analysing genome-scale data.
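
The resampling step can be sketched as follows: alignment columns are drawn with replacement to build each pseudo-replicate, an inference method is run on every replicate, and support is the fraction of replicates that recover a given result. The function names, the toy alignment and the stand-in "inference" (simply reporting the most similar pair of taxa) are assumptions for illustration; a real analysis would infer a full tree per replicate.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Toy stand-in for tree inference: the pair of taxa with the fewest differences."""
    def n_diff(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return min(combinations(sorted(alignment), 2), key=n_diff)


def bootstrap_support(alignment, infer_grouping, n_replicates=200, seed=1):
    """Fraction of bootstrap replicates in which each grouping is recovered."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        grouping = infer_grouping(bootstrap_alignment(alignment, rng))
        counts[grouping] = counts.get(grouping, 0) + 1
    return {grouping: count / n_replicates for grouping, count in counts.items()}


# Hypothetical four-taxon alignment.
toy_alignment = {"A": "ACGTACGTAC", "B": "ACGTACGTTC", "C": "GCTTACGAAC", "D": "GCTAACGAAT"}
print(bootstrap_support(toy_alignment, closest_pair))
```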