The Supertree Method

Supertrees are phylogenies reconstructed by stitching together many smaller phylogenies, giving us a coarser view that is useful for looking at higher-order systematics. When we say “we’re going to build a tree of all life”, we usually mean “we’re going to build a supertree of life”, because it’s more computationally efficient (and building a tree of life would be computationally impossible for now).

[Figure: supertree approaches]

There are two general approaches to building a supertree: the direct and the indirect one (diagram above, Bininda-Emonds et al., 2002). A thorough listing of methods can be found in Bininda-Emonds (2004b), and a comparison between them is provided by Wilkinson et al. (2005). Note that new methods and algorithms are being derived every year, so the classification given here is somewhat outdated and overgeneralised.

  • The direct approach simply constructs a supertree based on the consensus of the input trees; see the end of this post for an explanation of consensus trees. It was first properly formalised by Gordon (1986), but suffers from the problem that if the source trees conflict too much, the resultant supertree will just be an unresolved bush. It is therefore not a very popular method.
  • The indirect approach breaks down the input tree topologies into matrices. For every clade in a source tree, each taxon is coded according to whether it belongs to that clade (1 for yes, 0 for no), and the resultant matrix is then used to reconstruct a brand new supertree, in much the same way as a phylogeny is reconstructed from a character matrix. [Note: there are different ways of coding the matrix; this is the most popular one.] This is called matrix representation with parsimony (MRP), and was developed independently by Baum (1992) and Ragan (1992). The final supertree is the one that requires the fewest changes (steps) to explain the matrix, as the parsimony principle demands. Similarly, there are MRC (Compatibility), MRD (Distances), and MRF (Flipping).
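To make the 1/0 coding concrete, here is a minimal Python sketch of it for a single rooted source tree. The nested-tuple tree format, the function names, and the example topology are all my own assumptions for illustration; real MRP software also adds an all-zero outgroup row to root the analysis, which is left out here.

```python
def clades(tree):
    """Return (leaf set, list of the leaf sets under every internal node)."""
    if isinstance(tree, str):              # a leaf is just a taxon name
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)                   # this node's clade, after its children
    return leaves, found


def mrp_matrix(tree):
    """Code every non-root clade as one binary character (1 = member, 0 = not)."""
    taxa, all_clades = clades(tree)
    # The root clade contains every taxon, so it carries no information.
    informative = [c for c in all_clades if c != taxa]
    return {t: "".join("1" if t in c else "0" for c in informative)
            for t in sorted(taxa)}


# Hypothetical source tree: ((wolf, sheep), (mouse, human))
source = (("wolf", "sheep"), ("mouse", "human"))
for taxon, row in mrp_matrix(source).items():
    print(f"{taxon:6s} {row}")
```

Each column of the output is one clade of the source tree. The matrices from all source trees are placed side by side, and the result is handed to an ordinary parsimony program.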

[Figure: taxon overlap between source trees]

Note that there is no requirement for the source trees to share the same taxa. In other words, one tree can analyse wolf, sheep, and mice, and another tree can analyse wolf, sheep, and humans, and both trees can be combined to make a supertree. The way this is done in MRP and other matrix representation methods is by coding the missing taxa with a ? instead of a 1 or 0 (Pisani & Wilkinson, 2002). As long as there are overlapping taxa, it’s not a technical problem. This is demonstrated above (Bininda-Emonds et al., 2002). Obviously, the amount of overlap has to be sufficient for the tree to actually be useful: if your two input trees share only one species, then the resultant supertree won’t tell you anything new.
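The ? coding can be sketched in a few lines of Python using the wolf/sheep/mouse/human example. The `combined_matrix` function, its input format (each source tree given as its taxon set plus a list of its clades), and the assumption that both trees group wolf with sheep are mine, purely for illustration.

```python
def combined_matrix(sources):
    """sources: list of (taxa_in_tree, [clade, ...]) pairs, one per source tree."""
    all_taxa = sorted(set().union(*(taxa for taxa, _ in sources)))
    rows = {t: "" for t in all_taxa}
    for taxa, clade_list in sources:
        for clade in clade_list:
            for t in all_taxa:
                if t not in taxa:
                    rows[t] += "?"     # taxon was not sampled in this source tree
                elif t in clade:
                    rows[t] += "1"     # taxon belongs to this clade
                else:
                    rows[t] += "0"     # taxon was sampled but lies outside the clade
    return rows


# Assumed topologies: ((wolf, sheep), mouse) and ((wolf, sheep), human)
tree_a = ({"wolf", "sheep", "mouse"}, [{"wolf", "sheep"}])
tree_b = ({"wolf", "sheep", "human"}, [{"wolf", "sheep"}])

matrix = combined_matrix([tree_a, tree_b])
for taxon, row in matrix.items():
    print(f"{taxon:6s} {row}")
```

Wolf and sheep get a definite score in both columns because they are in both trees; mouse and human each get one ? for the tree they were not part of.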

Just as with regular trees, there are several statistics for measuring the possible validity of a supertree. The two most common ones are the Robinson-Foulds metric (Robinson & Foulds, 1981) and MAST (Maximum Agreement Sub-Tree; Finden & Gordon, 1985).

Since the input data for supertrees are phylogenies and topologies, they offer a distinct advantage in that they allow us to mix molecular and morphological analyses to make up the supertree (Liu et al., 2001). They also allow us to mix analyses where there is missing data, e.g. mix fossil-only phylogenies together with phylogenies of modern taxa, with the fossil ones missing a lot of the soft-part anatomical characters and all the molecular data (Sanderson et al., 1998); although see Wiens (2003) for a critique of such a justification.

Most commonly, supertrees are used to summarise the results of molecular analyses based on different single sequences (Daubin et al., 2002), but they can also be used to gain a general overview of where the major agreements are (the reasoning being that if all studies resolve a certain node, then it’s very likely to be correct), allowing us to focus our attention on the areas where the big incongruences are. This is the main advantage of supertrees: the amount of raw data available for phylogenetics is too great for current computers, so until we can analyse all of it at once, these summaries do an adequate job, and so can be of help in comparative biology and community ecology (Webb & Donoghue, 2005).

[Figure: the Gatesy et al. (2002) example]

The use of supertrees is not without controversy. For example, even something as fundamental as whether it’s a valid method for inferring phylogenies is still questioned by several authors. Gatesy et al. (2002) provide a great example of how a supertree analysis can produce an erroneous result, as summarised in the diagram above. If the supertree contains a clade that isn’t found in any of the source trees (a mathematical possibility), is this clade taken seriously or is it discarded as an analytical artefact? The way I see it, supertrees can only be used as phylogeny generators if all the input trees use the same evidence-based optimality criterion, or else we are just lumping together different phylogenetic signals and treating them as the same. Otherwise, supertrees are simply summaries of past data, offering no new hypotheses.

The main pitfalls of supertrees are summarised below:

  • They are only as good as the smaller phylogenies they’re stitched up from, as they weight poorly-supported and well-supported input trees equally. Any incongruences will end up as polytomies. However, Bininda-Emonds & Sanderson (2001) point out that the number of unresolved nodes in supertrees doesn’t differ much from that in regular trees.
  • Phylogenetic signal may be lost if present in only a few of the input trees, but this can be rectified by bootstrapping (see e.g. Burleigh et al., 2011).
  • The input trees must be rooted properly (Bininda-Emonds et al., 2005).
  • Branch lengths are meaningless (Sanderson et al., 1998), although this is one of the next frontiers to cross in supertree methodological research (Ropiquet et al., 2009) – there are now methods for including divergence time information (Semple et al., 2004) and for dating supertrees (Gernhard, 2008), for example.
  • The supertree resolution is too low and doesn’t allow a look past the broad systematic levels. This is due to the input trees clashing in their topologies (McMorris & Wilkinson, 2011).

If you want to know more about supertrees, then I heartily recommend the authoritative 2004 book edited by Bininda-Emonds, Phylogenetic supertrees: Combining information to reveal the Tree of Life.

For teachers, here’s a nifty question for your next exam: Is it correct to consider consensus tree building as a case of supertree building with input trees that have the same sets of leaves? The correct answer is yes, consensus trees are a special type of supertree. In fact, the original supertree methods were developed as generalisations of the well-established consensus methods (Wilkinson et al., 2005).
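For anyone who wants to see what a consensus actually computes, here is a small Python sketch of the two textbook flavours, strict and majority-rule, for input trees that share the same leaves. Representing each tree as a set of its clades is my own simplification, and the three example trees are made up.

```python
from collections import Counter


def strict_consensus(trees):
    """Keep only the clades that appear in every input tree."""
    return set.intersection(*trees)


def majority_consensus(trees):
    """Keep the clades that appear in more than half of the input trees."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade for clade, n in counts.items() if n > len(trees) / 2}


def fs(*taxa):
    """Shorthand for a clade: an immutable set of taxon names."""
    return frozenset(taxa)


# Three hypothetical trees on wolf, sheep, mouse, human, each given as its clades.
trees = [
    {fs("wolf", "sheep"), fs("wolf", "sheep", "mouse")},
    {fs("wolf", "sheep"), fs("wolf", "sheep", "human")},
    {fs("wolf", "sheep"), fs("wolf", "sheep", "mouse")},
]

print(strict_consensus(trees))
print(majority_consensus(trees))
```

The strict consensus keeps only wolf + sheep, since that is the one grouping all three trees agree on; the majority-rule version also keeps wolf + sheep + mouse, which two of the three trees support. Any conflict that fails the threshold collapses into a polytomy.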

Warning: The supertree approach is not to be confused with the more popular supermatrix approach, where character sets (e.g. genes) are concatenated together into new datasets and analysed. See de Queiroz & Gatesy (2007) for a summary of this. Bininda-Emonds & Sanderson (2001) and Criscuolo et al. (2006) provide some very basic comparisons between supertree and supermatrix approaches, finding that supermatrices are slightly more accurate, a conclusion that shouldn’t be taken at face value given that the comparisons weren’t very realistic. However, given the support for such a conclusion by many studies, it is generally accepted as true for most cases; see Kupczok et al. (2010). Note that there is also a way to combine supertree and supermatrix approaches to form mega-phylogenies, as outlined by Smith et al. (2009).
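To show the difference in kind, here is a minimal Python sketch of the concatenation step behind a supermatrix. The `concatenate` function, the toy sequences, and the choice to pad unsampled taxa with ? are assumptions for illustration; the point is only that the supermatrix combines the raw characters, whereas a supertree combines finished trees.

```python
def concatenate(alignments):
    """alignments: list of {taxon: aligned sequence} dicts, one per gene."""
    taxa = sorted(set().union(*alignments))
    supermatrix = {t: "" for t in taxa}
    for aln in alignments:
        length = len(next(iter(aln.values())))   # all sequences in a gene share a length
        for t in taxa:
            # A taxon with no sequence for this gene is padded with missing data.
            supermatrix[t] += aln.get(t, "?" * length)
    return supermatrix


# Two made-up gene alignments with partially overlapping taxa.
gene1 = {"wolf": "ACGT", "sheep": "ACGA", "mouse": "TCGA"}
gene2 = {"wolf": "GG", "sheep": "GA", "human": "AA"}

combined = concatenate([gene1, gene2])
for taxon, seq in combined.items():
    print(f"{taxon:6s} {seq}")
```

The resulting matrix is then analysed in one go with whatever phylogenetic method you prefer.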

References:

Baum BR. 1992. Combining Trees as a Way of Combining Data Sets for Phylogenetic Inference, and the Desirability of Combining Gene Trees. Taxon 41, 3-10.

Bininda-Emonds ORP (ed.). 2004a. Phylogenetic supertrees: Combining information to reveal the Tree of Life.

Bininda-Emonds ORP. 2004b. The evolution of supertrees. TrEE 19, 315-322.

Bininda-Emonds ORP & Sanderson MJ. 2001. Assessment of the Accuracy of Matrix Representation with Parsimony Analysis Supertree Construction. Systematic Biology 50, 565-579.

Bininda-Emonds ORP, Gittleman JL & Steel MA. 2002. THE (SUPER)TREE OF LIFE: Procedures, Problems, and Prospects. Annual Review of Ecology and Systematics 33, 265-289.

Bininda-Emonds ORP, Beck RMD & Purvis A. 2005. Getting to the Roots of Matrix Representation. Systematic Biology 54, 668-672.

Burleigh JG, Bansal MS, Eulenstein O, Hartmann S, Wehe A & Vision TJ. 2011. Genome-Scale Phylogenetics: Inferring the Plant Tree of Life from 18,896 Gene Trees. Systematic Biology 60, 117-125.

Criscuolo A, Berry V, Douzery EJP & Gascuel O. 2006. SDM: A Fast Distance-Based Approach for (Super)Tree Building in Phylogenomics. Systematic Biology 55, 740-755.

Daubin V, Gouy M & Perrière G. 2002. A Phylogenomic Approach to Bacterial Phylogeny: Evidence of a Core of Genes Sharing a Common History. Genome Research 12, 1080-1090.

De Queiroz A & Gatesy J. 2007. The supermatrix approach to systematics. TrEE 22, 34-41.

Finden CR & Gordon AD. 1985. Obtaining common pruned trees. Journal of Classification 2, 255-276.

Gatesy J, Matthee C, DeSalle R & Hayashi C. 2002. Resolution of a supertree/supermatrix paradox. Systematic Biology 51, 652-664.

Gernhard T. 2008. The conditioned reconstructed process. Journal of Theoretical Biology 253, 769-778.

Gordon AD. 1986. Consensus supertrees: The synthesis of rooted trees containing overlapping sets of labeled leaves. Journal of Classification 3, 335-348.

Kupczok A, Schmidt HA & von Haeseler A. 2010. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms for Molecular Biology 5, 37.

Liu F-GR, Miyamoto MM, Freire NP, Ong PQ, Tennant MR, Young TS & Gugel KF. 2001. Molecular and Morphological Supertrees for Eutherian (Placental) Mammals. Science 291, 1786-1789.

McMorris FR & Wilkinson M. 2011. Conservative Supertrees. Systematic Biology 60, 232-238.

Pisani D & Wilkinson M. 2002. Matrix Representation with Parsimony, Taxonomic Congruence, and Total Evidence. Systematic Biology 51, 151-155.

Ragan MA. 1992. Phylogenetic inference based on matrix representation of trees. Molecular Phylogenetics and Evolution 1, 53-58.

Robinson DF & Foulds LR. 1981. Comparison of phylogenetic trees. Mathematical Biosciences 53, 131-147.

Ropiquet A, Li B & Hassanin A. 2009. SuperTRI: A new approach based on branch support analyses of multiple independent data sets for assessing reliability of phylogenetic inferences. Comptes Rendus Biologies 332, 832-847.

Sanderson MJ, Purvis A & Henze C. 1998. Phylogenetic supertrees: Assembling the trees of life. TrEE 13, 105-109.

Semple C, Daniel P, Hordijk W, Page RDM & Steel M. 2004. Supertree algorithms for ancestral divergence dates and nested taxa. Bioinformatics 20, 2355-2360.

Smith SA, Beaulieu JM & Donoghue MJ. 2009. Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches. BMC Evolutionary Biology 9, 37.

Webb CO & Donoghue MJ. 2005. Phylomatic: tree assembly for applied phylogenetics. Molecular Ecology Notes 5, 181-183.

Wiens JJ. 2003. Missing Data, Incomplete Taxa, and Phylogenetic Accuracy. Systematic Biology 52, 528-538.

Wilkinson M, Cotton JA, Creevey C, Eulenstein O, Harris SR, Lapointe F-J, Levasseur C, Mcinerney JO, Pisani D & Thorley JL. 2005. The Shape of Supertrees to Come: Tree Shape Related Properties of Fourteen Supertree Methods. Systematic Biology 54, 419-431.
