Molecular Clocks, Part 2: Practice

Here we will look at how a molecular clock analysis is actually done (in no great detail).

The first step is gathering molecular data. You have to choose relevant taxa and appropriate outgroups. The outgroups are basically there as a reference point for rooting the tree. Then you choose what you will be basing your analysis on: protein sequence or gene sequence? Nowadays, genetic sequences are the norm. But it’s not that simple: as I mentioned before, your choice of sequence has to be appropriate (nothing that’s evolving too fast and nothing that’s evolving too slowly). You also have to consider whether you use DNA sequences or RNA sequences. Do you take sequences from the nuclear genome, from organelle genomes (mitochondrial, chloroplast), or from ribosomal RNA genes? These are all things that must be considered. Of course, you then have to find and download the data from various databases.

Then comes the bioinformatics part, where you input your sequence data into a specific program. This program aligns your sequences so that corresponding positions line up with each other. Note that there are several programs available, each running different algorithms which might give different results. You take the produced alignment and put it into a different program, which calculates the genetic distance between your aligned sequences and creates a phylogenetic tree based on those calculations. The same warning about different algorithms applies here as well.
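To make "genetic distance" a bit more concrete, here is a minimal sketch in Python. It assumes the simplest possible measure (the proportion of differing sites, with the Jukes-Cantor correction for multiple hits) and an alignment that has already been produced elsewhere. The function names and the toy sequences are made up for illustration; real programs offer many other substitution models, and this is not how any particular package does it.

```python
import math
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ; sites with a gap are skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore alignment gaps
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor(p):
    """Jukes-Cantor (JC69) distance: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def pairwise_distances(alignment):
    """Corrected distance for every pair of taxa in a {name: sequence} dict."""
    return {
        (x, y): jukes_cantor(p_distance(alignment[x], alignment[y]))
        for x, y in combinations(alignment, 2)
    }


# Toy alignment, invented purely for illustration.
toy_alignment = {
    "taxon_a": "ACGTACGTAC",
    "taxon_b": "ACGTACGTAT",
    "taxon_c": "ACGAAC-TTT",
}
toy_distances = pairwise_distances(toy_alignment)
```

A tree-building method (neighbour joining, for example) would then take a table like toy_distances as its input. Swapping JC69 for another model changes the numbers, which is exactly the "different algorithms, different results" warning above.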

Those were the basics of how to construct a phylogenetic tree based on molecular data. So far, we only have (tentative) phylogenetic relationships. To add time, the nodes on the tree are calibrated using fossils. This is where you can run into considerable problems. Because of the nature of the fossil record, it is often impossible to calibrate every node on the tree (quality is much more important than quantity). You then have the fact that each node is supposed to represent the ‘last common ancestor’, which is an organism you will never ever find. The calibrations you’re doing are approximate: you can’t simply take the oldest fossil you have; you have to stay within the boundaries set up by the tree. Basically, this is the oldest and most fundamental problem of classifying organisms. As a general rule of thumb, the oldest possible age of the layer where the oldest relevant fossil was found is taken as a calibration point.

Another, much better, way of calibrating your nodes is with palaeogeography. Of course, there are constraints as to where you can use this, but when possible, it is much more accurate. Here’s an example.

Hawaiian island ages + Phylogenetic trees. Bromham & Penny, 2005.

The Hawaiian Islands lie on a hotspot. The magma plume forming the islands stays in the same place while the tectonic plate above it is moving to the northwest. This means that from the northwest to the southeast, the islands get progressively younger. Their origins have been dated by geologists (sorry for bastardising the formation of the islands, btw), and so we can assume that those dates are the dates of speciation, when there was a stop to genetic intermixing. Looking at the phylogenetic tree above, we can see that the tree’s topology agrees with our hypothesis.

Honeycreeper Distance. Bromham & Penny, 2005.

The line above shows the genetic distance between honeycreeper species on the different Hawaiian Islands. We see that the distance corresponds exactly to the time of separation, meaning that geographical separation can be used as a very reliable calibration method. The same can be seen when looking at Drosophila from the Hawaiian Islands.

Getting back to our computer, the good calibration dates are put in, and the program calculates a substitution rate based on the genetic distance and the calibration date. The same warning about different algorithms applies here. It then extrapolates back in time. So at its core, a molecular clock analysis is nothing more than a simple extrapolation.
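That extrapolation can be sketched in a few lines of Python. This assumes a strict clock (one constant rate everywhere on the tree) and the usual convention that the distance between two living species accumulates along both lineages, hence the factor of 2. The function names and all the numbers are hypothetical and only there to show the arithmetic.

```python
def substitution_rate(distance, calibration_age):
    """Rate per site per unit time along one lineage: r = d / (2 * t)."""
    if calibration_age <= 0:
        raise ValueError("calibration age must be positive")
    return distance / (2.0 * calibration_age)


def divergence_time(distance, rate):
    """Extrapolated age of a split under a strict clock: t = d / (2 * r)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate)


# Hypothetical calibrated pair: distance 0.04 substitutions per site,
# split dated (by fossils or geology) at 5 million years.
rate = substitution_rate(0.04, 5.0)

# Hypothetical uncalibrated pair with distance 0.10, dated using that rate.
estimated_age = divergence_time(0.10, rate)
print(f"rate = {rate:.4f} per site per Myr, estimated split = {estimated_age:.1f} Myr")
```

Everything hangs on that single rate: any error in the calibration age or in the distance is carried straight through to every date extrapolated from it.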

Of course, that’s an extremely silly thing to do, as we’ve seen by looking at all the different factors affecting the substitution rate in the previous post. This was realised pretty soon, so the mathematicians got to work on creating ever more complex algorithms to simulate biological stochasticity. The most advanced ones used nowadays are based on Bayesian statistics and use Markov Chain Monte Carlo (MCMC) methods.
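For a rough idea of what an MCMC method does, here is a toy Metropolis sampler in Python. It is a deliberately stripped-down sketch, not what real dating software runs: it assumes a single calibrated pair of sequences, a Poisson model for the number of observed differences, a flat prior on the rate, and no correction for multiple hits. All names and numbers are hypothetical.

```python
import math
import random


def log_posterior(rate, k, n_sites, age):
    """Unnormalised log posterior of the rate (flat prior, Poisson likelihood)."""
    if rate <= 0:
        return -math.inf
    expected = 2.0 * rate * age * n_sites  # expected number of differences
    return k * math.log(expected) - expected


def sample_rate(k, n_sites, age, n_steps=20000, step_fraction=0.15, seed=1):
    """Random-walk Metropolis sampler for the substitution rate."""
    rng = random.Random(seed)
    current = (k + 1) / (2.0 * age * n_sites)  # start near the point estimate
    current_lp = log_posterior(current, k, n_sites, age)
    step = step_fraction * current  # proposal standard deviation
    samples = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, k, n_sites, age)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples[n_steps // 5:]  # discard the first 20% as burn-in


def credible_interval(samples, lower=0.025, upper=0.975):
    """Approximate 95% credible interval from the sorted samples."""
    ordered = sorted(samples)
    return ordered[int(lower * len(ordered))], ordered[int(upper * len(ordered))]


# Hypothetical data: 40 differences over 1000 sites, split calibrated at 5 Myr.
rate_samples = sample_rate(k=40, n_sites=1000, age=5.0)
low, high = credible_interval(rate_samples)
```

Instead of one rate you get a cloud of plausible rates, and the spread between low and high is where the error bars on the final dates come from. Real analyses sample rates, dates and tree shapes together under far more elaborate models.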

Those who know me know how skeptical I am of such mathematical models. All these processes lead to a nice, fuzzy result with large error bars in a bid to appear accurate. It isn’t. These are nothing more than simulations, based not on the biological factors outlined in the previous post but on the calculated probabilities on which the tree is based. Add to this the inaccuracy of calibrations and you end up with an unrealistic and bloated result. All these mathematical processes do is lead to an illusion of precision, not a satisfactory date for species divergence.

In the next post, we will look at four examples of molecular clock dates and see how they compare to the fossil record.
