The Principle of Parsimony in Evolution and Phylogenetics

Pluralitas non est ponenda sine necessitate.

This rule comes from theologian and philosopher William of Ockham (~ 1280–1349), and is now called the principle of parsimony (in the popular literature, Ockham’s Razor). It states that when trying to explain a phenomenon, it is always better to make the fewest assumptions possible. A rock stays on the ground because of gravity and density, not because an assumed invisible man is pushing it down.

However, this does not mean that the simplest possible explanation is the correct one. The simplest possible explanation for whale evolution is that whales evolved directly from already-marine fish, but their physiology and fossil record make it clear that a more complex scenario happened (secondary evolution to an aquatic lifestyle). What the principle of parsimony states is that if you had to explain it, it’s better to assume a complicated origin of aquatic whales from terrestrial ancestors rather than a simpler origin from fish, given that there is no evidence supporting a simpler origin.

In evolutionary biology, the principle of parsimony is very important because of the chaotic nature of evolution. The probability of the same mutations and developmental processes evolving independently is low; consequently, it’s safer to assume that similarities between organisms are due to common descent (i.e. the mutations occurred once in a common ancestor and have been passed on) rather than due to convergence (i.e. that they occurred multiple times independently).

This thinking underlies phylogenetic systematics: character analysis and homologisation are treated as matters of parsimony, and one prominent method of building phylogenetic trees is based purely on parsimonious principles. We refer to the number of assumptions required in a tree as the cost.

Immediately, you can start seeing when costs start piling up. Look at the hypothetical diagram below.


The star indicates a character that is present in taxa H, J, I, and G. Let’s examine some of the plausible explanations for this pattern.

  1. Origin at the base of the ((H,J),(I,G)) clade = 1 assumption of origin.
  2. Origins in the (H,J) clade and the (I, G) clade = 2 assumptions.
  3. Origins at each terminal taxon = 4 assumptions.
  4. Origins in (H,J) clade, in I, and in G = 3 assumptions.
  5. Origins in H, J, and in (I,G) clade = 3 assumptions.
  6. Origin at (A,((H,J),(I,G))) = 2 assumptions.

Given only this tree, we follow the principle of parsimony and say that the first scenario is the most likely: it requires only the one assumption that the trait originated in the common ancestor of the species that have the trait, and the trait is a homology. Scenario two requires two assumptions of origination, while scenario six requires one assumption of origination and one assumption of loss. These two scenarios are thus given a lower probability and should be considered only with further evidence. The rest seem absurd in comparison to these… but they similarly cannot be completely discounted.

This was a very simple example. Now do the same exercise with the following tree.


The simplest explanation here is convergence or analogy: independent origins at (H,J) and at (C,E) = 2 assumptions. There is no possible explanation with just one assumption. Whether the trait is analogous or convergent cannot be found out from the phylogeny. Convergent traits evolve due to similar evolutionary pressures or conditions; analogous traits evolve similarly just by chance. The neutral term, if you don’t know whether a trait is a convergence or an analogy, is homoplasy.

And now we move on to a third tree.


Oh boy. Let’s list some possible origins.

  1. (H,J), (F,D), and (C,E): 3 assumptions.
  2. H, J, F, D, C, and E: 6 assumptions.
  3. ((A,(H,J),(I,G)),(F,D)), and (C,E): 4 assumptions.
  4. (((A,((H,J),(I,G))),(F,D)),((C,E),B)): 4 assumptions.

Those are the plausible scenarios, not taking into account weird permutations. You’ll see that the simplest explanation, which would have the highest probability according to the principle of parsimony, requires three independent origins and thus three assumptions. Scenarios 4 and 3 have fewer origins of the trait (one and two, respectively), but they assume more trait loss, and thus are given a lower probability. Scenario 2, of independent origin in each species, is highly unlikely, but not impossible.
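The counting done by hand for the three trees can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topology `TREE` is reconstructed from the clade notation used in the lists above (the diagrams themselves are not reproduced here), the names `fitch` and `min_changes` are mine, and the counting procedure is the set-based one of Fitch (1971), which treats an origin and a loss as one assumption each.

```python
# Topology reconstructed from the clade notation in the text (an assumption).
TREE = ((("A", (("H", "J"), ("I", "G"))), ("F", "D")), (("C", "E"), "B"))


def fitch(node, has_trait):
    """Return (possible ancestral states, minimum number of changes) for a subtree."""
    if isinstance(node, str):
        # A tip: its state is simply whether it carries the trait.
        return {node in has_trait}, 0
    left_states, left_cost = fitch(node[0], has_trait)
    right_states, right_cost = fitch(node[1], has_trait)
    shared = left_states & right_states
    if shared:
        # The two daughters can agree on a state: no extra change needed.
        return shared, left_cost + right_cost
    # No state in common: one change (origin or loss) must be assumed here.
    return left_states | right_states, left_cost + right_cost + 1


def min_changes(has_trait):
    """Fewest origins plus losses needed to explain a trait distribution."""
    return fitch(TREE, set(has_trait))[1]


print(min_changes("HJIG"))    # first tree: 1
print(min_changes("HJCE"))    # second tree: 2
print(min_changes("HJFDCE"))  # third tree: 3
```

The procedure only reports the cost of the cheapest scenario; it says nothing about whether the more expensive scenarios in the lists actually happened.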

These trees may be hypothetical, but most character distributions on phylogenies are as spotty as this (leglessness in lizards is a prime example). The reason why systematists fight all the time and why we have so much trouble explaining the evolution of every taxon is precisely because it’s so rare to get clear-cut cases as in the first tree – and even then, there are so many possible scenarios that anyone claiming to have found the truth is deceiving himself. This is why phylogenetic trees and any conclusions drawn from them are always treated as hypotheses, always subject to change.

The most important thing to remember about the principle of parsimony, as applied to phylogenetics and evolution, is that it applies only as a guideline. Remember that we are talking about probabilities: a scenario with a low probability, that requires more assumptions than another scenario, may still be closer to what really happened. The most parsimonious tree is not necessarily the tree most reflective of history.

There is therefore a disconnect when discussing parsimony in phylogenetics and evolution. Parsimony as a philosophical position is different from parsimony as a methodological approach. The latter is what we will discuss from now.

The principle of minimum evolution by Edwards & Cavalli-Sforza (1963) was the first true application of parsimony in phylogenetics. According to minimum evolution, a phylogenetic tree which requires the least “amount of evolution” is the correct one, since evolution is a parsimonious process. Translated to actual data analysis, minimum evolution algorithms (Kluge & Farris, 1969) calculate the lengths of the branches of a phylogenetic tree; the shorter the sum of the branches, the more likely the tree is true. This means that parsimony here is limited to a philosophical position; the methodology itself is similar to neighbour-joining (NJ proved to be more popular due to greater efficiency and computing speed).

The method based purely on parsimony is maximum parsimony (MP). As we saw in our examples before though, parsimony as a criterion by itself doesn’t guarantee a reliable tree. In fact, like any phylogenetic method, analysing a dataset using MP will return you a lot of different trees that have the same cost, and these most parsimonious trees may not even be the proper phylogenetic tree (more on this later).

What maximum parsimony is most suited for is finding similarities that group organisms together (akin to phenetics) and thus the identification of homologies. In MP, the tree-building can be thought of as the data analysis, rather than the end result. In other phylogenetic methods, the tree is the end result.

The key to this method is understanding the subjectivity involved. Let’s modify one of our hypothetical trees by adding another trait.


Species I has its own unique trait, Blue Box. It doesn’t appear in any of the other species. For maximum parsimony, this trait is considered a trivial character: it cannot affect the calculation of the tree at all. Parsimony-informative characters are those that can be used to differentiate groups of species, such as Yellow Star. MP methods only work if the characters examined are homologies; consequently any characters found in only one taxon are useless.
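As a rough sketch of how a program might tell the two kinds of character apart, the usual working definition is that a character is parsimony-informative when at least two of its states each occur in at least two taxa. The function name and the trait distributions below are assumptions for illustration (Yellow Star is taken to be present in H, J, I, and G, as in the first tree).

```python
from collections import Counter


def is_parsimony_informative(states):
    """states maps each taxon to its character state."""
    counts = Counter(states.values())
    # Keep only the states shared by two or more taxa.
    shared_states = [state for state, n in counts.items() if n >= 2]
    return len(shared_states) >= 2


taxa = "ABCDEFGHIJ"
# Hypothetical codings: 1 = trait present, 0 = trait absent.
yellow_star = {t: int(t in "HJIG") for t in taxa}
blue_box = {t: int(t == "I") for t in taxa}

print(is_parsimony_informative(yellow_star))  # True
print(is_parsimony_informative(blue_box))     # False
```

Blue Box fails the test because its "present" state occurs in a single taxon, so every possible tree explains it with exactly one change.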

But as we saw in our hypothetical examples, we cannot automatically assume homology without corroborating that assumption with information from outside the phylogenetic tree. In other words, there is always a degree of subjectivity in choosing which traits to code for a maximum parsimony analysis.

Subjectivity is also introduced by the possibility of giving each possible character transition a weight. An insect evolving a field of sensory hairs is a fairly easy feat; an insect evolving an extra pair of limbs is very unlikely. Treating both character transitions as the same is disingenuous, and so you can weigh the evolution of a new pair of limbs as a very costly transition. If two species share that extra pair of limbs, the most parsimonious result will thus be that they are sister species, rather than them evolving it convergently.
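The arithmetic behind weighting is simple enough to sketch. Everything in this example is hypothetical: the two candidate trees, the step counts, and the weight of 5 given to the extra pair of limbs are made up purely to show how a weight can break a tie.

```python
def weighted_length(steps, weights):
    """Tree length = sum over characters of (weight x number of changes)."""
    return sum(weights[character] * n for character, n in steps.items())


# Hypothetical step counts for two candidate trees.
# "sisters": the two limb-bearing species are sister species (limbs evolve once).
# "convergent": they are placed apart (limbs evolve twice).
sisters = {"extra_limbs": 1, "sensory_hairs": 2}
convergent = {"extra_limbs": 2, "sensory_hairs": 1}

equal_weights = {"extra_limbs": 1, "sensory_hairs": 1}
limb_weighted = {"extra_limbs": 5, "sensory_hairs": 1}

print(weighted_length(sisters, equal_weights), weighted_length(convergent, equal_weights))  # 3 3
print(weighted_length(sisters, limb_weighted), weighted_length(convergent, limb_weighted))  # 7 11
```

With equal weights the two trees tie; once the limb transition is made costly, the tree in which the limbs evolve once becomes the most parsimonious one. The choice of 5 rather than 2 or 50 is exactly the subjective step.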

This concept of weighting is one of constant discussion among phylogeneticists (Cox et al., 2014), so much so that some derogatively refer to such subjective evaluation as “intuitive” (e.g. Yeates, 1995). It’s true that we should strive to keep our science as objective as possible. To that end, most programs can apply weights statistically, based on tree topologies… but this also doesn’t solve the problem, since the quality of the characters and their importance in the evolutionary history of a taxon are not taken into account.

Weightings play an important role in specialised cases when MP is applied, which we will explore below.

Wagner parsimony is based on the criteria set by Wagner (1961). Two assumptions rule Wagner parsimony:

  • Any character can be reversible, i.e. a transition from absence to presence of a character counts the same as a transition from presence to absence of a character.
  • For characters with varied levels of complexity, the most complex versions can only evolve by going through all the previous steps. In other words, the evolution of a character from absence to level 3 costs 3 assumptions (0 → 1, 1 → 2, 2 → 3).

Fitch (1971) modified the second rule of Wagner parsimony by saying that characters can evolve between any of their possible states in one go. In phylogenetic jargon, Wagner parsimony employs ordered characters while Fitch parsimony employs unordered characters.

In practical terms, Fitch parsimony is ideal for analysing DNA sequences, since a change from A to G is exactly equivalent to a change from A to C and that is exactly equivalent to a change from A to T. Morphological characters are deemed more complicated, and so for them, Wagner parsimony is more suitable.
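The difference between ordered and unordered characters comes down to how a single transition is costed. A minimal sketch, assuming character states are coded as integers 0 to 3 (the function names are mine):

```python
def wagner_cost(a, b):
    """Ordered character: every intermediate level must be passed through."""
    return abs(a - b)


def fitch_cost(a, b):
    """Unordered character: any state can change into any other in one step."""
    return 0 if a == b else 1


states = range(4)
for name, cost in (("Wagner", wagner_cost), ("Fitch", fitch_cost)):
    print(name)
    for a in states:
        # One row of the cost matrix: cost of going from state a to each state b.
        print("  ", [cost(a, b) for b in states])

print(wagner_cost(0, 3), fitch_cost(0, 3))  # 3 1
```

Note that the Wagner matrix is still symmetric (3 → 0 costs the same as 0 → 3), which is the reversibility assumption from the first rule.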

To explain the next flavour, Dollo parsimony, let’s bring out yet another hypothetical tree.


For this scenario, we know that the Star trait originated at the base of ((A,(H,J),(I,G)),(F,D)), but it got lost in the lineage leading to A. Normally, if this trait determined the tree’s topology, A would be lumped together with (B,(C,E)). By applying Dollo parsimony however, you assume that the loss of a character is less costly than the evolution of a brand new character that is the same as an ancestral one. Consider that the ancestor at the base of ((A,(H,J),(I,G)),(F,D)) was a deep-sea organism, and the star represents the character “eye loss” (no star means “eyes present”). It is then easier for A to move to a shallower ocean habitat and regain its eyes, rather than for A to move to shallower seas and evolve a whole new genetic and developmental program to evolve a whole new type of eye separate from its ancestral eyes.

This brings up the issue of subjectivity again, in that you, as the working scientist, have to judge the possibility and probability of each scenario. Dollo characters are given special costs: the cost of gaining a Dollo character (in our example, gaining eye loss) is high, while that of losing a Dollo character is low. Because absence of a trait should never define a taxon, Dollo characters are fit only as support for evolutionary scenarios.
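To see what such asymmetric costs do, here is a sketch that scores the Star trait with a small dynamic programme over a cost function (often called Sankoff-style or generalised parsimony; that choice of technique is mine, not something named in this post). The topology is reconstructed from the clade notation in the text, the trait is assumed present in H, J, I, G, F, and D, and the gain cost of 10 is an arbitrary illustrative value.

```python
INF = float("inf")

# Topology reconstructed from the clade notation in the text (an assumption).
TREE = ((("A", (("H", "J"), ("I", "G"))), ("F", "D")), (("C", "E"), "B"))
HAS_STAR = set("HJIGFD")
GAIN_COST = 10  # hypothetical: gaining a Dollo character is expensive
LOSS_COST = 1   # hypothetical: losing it is cheap


def unit_cost(parent, child):
    return 0 if parent == child else 1


def dollo_cost(parent, child):
    if parent == child:
        return 0
    return GAIN_COST if child == 1 else LOSS_COST


def sankoff(node, cost):
    """Return {state: cheapest cost of the subtree if this node has that state}."""
    if isinstance(node, str):
        present = int(node in HAS_STAR)
        return {present: 0, 1 - present: INF}
    tables = [sankoff(daughter, cost) for daughter in node]
    return {
        here: sum(min(cost(here, below) + table[below] for below in (0, 1))
                  for table in tables)
        for here in (0, 1)
    }


# Index 0 = cost when the root ancestor is assumed to lack the trait.
print(sankoff(TREE, unit_cost)[0])   # 2
print(sankoff(TREE, dollo_cost)[0])  # 11
```

Under unit costs, two independent origins and one origin plus a loss in A both cost 2, so the data cannot choose between them. Under the Dollo costs, one origin plus one loss (10 + 1 = 11) beats two origins (20), as long as the root is assumed to lack the trait.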

For a variant of Dollo parsimony in which reversals are not allowed, you have to check out Camin-Sokal parsimony (Camin & Sokal, 1965).

Like any other phylogenetic analysis, maximum parsimony of any kind leads to many trees being recovered, and a non-trivial number of them have the same amount of support. This happens especially when you have more taxa than characters. When it does, we call it a terrace (Sanderson et al., 2011). Therefore, there has to be a way to winnow down those trees to a workable number of plausible ones.

There are two problems associated with this. The smaller one, which we’ve discussed so far, is how to calculate the parsimony score for a single tree, and it’s more or less been solved.

The second, bigger challenge for parsimony is to find which is the most parsimonious tree of all the possible trees that can be built from the same data, a problem that is NP-hard (Foulds & Graham, 1982).

All tree-building methods are in themselves hefty computational challenges. The number of total possible trees increases faster than exponentially with the addition of every taxon (you can read about this from even before computers existed in Schröder (1870)). Imagine having to compute hundreds of thousands of trees for even small datasets… and then also having to compute the parsimony score of each tree. It’s simply way too inefficient, so methods have been developed to ease the process, mostly by reducing the number of trees that need to be searched (Varón et al., 2010).
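To get a feel for the numbers, the count of fully resolved (binary) trees can be computed directly: for n taxa there are (2n − 5)!! unrooted and (2n − 3)!! rooted trees, where !! is the double factorial. The function names in this sketch are mine.

```python
def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 (for odd n)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of unrooted binary trees for n_taxa >= 3."""
    return double_factorial(2 * n_taxa - 5)


def rooted_trees(n_taxa):
    """Number of rooted binary trees for n_taxa >= 2."""
    return double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10):
    print(n, unrooted_trees(n), rooted_trees(n))
# 4 3 15
# 5 15 105
# 10 2027025 34459425
```

The ten-taxon trees used as examples in this post already have more than 34 million possible rooted arrangements.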

These methods explore the tree space gradually and “intelligently”. A classic approach is branch-and-bound, first applied in phylogenetics by Hendy & Penny (1982). These algorithms do the tree-building one taxon at a time: every time a taxon is added, the partial trees are rescored, and any partial tree that is already longer than the best complete tree found so far gets discarded, together with everything that could be built on top of it. Since adding a taxon can never make a tree shorter, nothing optimal is thrown away, and the tree search space is constantly reduced. The catch is that branch-and-bound is only practical for small numbers of taxa, so larger datasets are analysed with heuristic searches that apply the same one-taxon-at-a-time idea greedily, keeping only the best partial trees at each step.

You can spot the potential problem with the heuristic shortcuts though. The regions of tree space that are not searched may have seemed bad initially, but the addition of more taxa and more characters might have made them much more parsimonious later. These methods may increase efficiency, but they sacrifice accuracy (Bastert et al., 2002). The latest programs, like POY 4, try to handle this by allowing you to select multiple starting points for the tree-building, so that the “next taxon” or the “bad neighbour” effects are minimised. Divide-and-conquer techniques designed for very large datasets have also been criticised as performing worse than dedicated parsimony searches (Goloboff & Pol, 2007).
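A toy version of the one-taxon-at-a-time search with discarding can be sketched as follows. It is a simplified illustration, not the published algorithm: the five-taxon matrix is invented, trees are scored with Fitch's unordered counting, and a partial tree is discarded only when it is already longer than the best complete tree found so far (which is safe, because adding a taxon can never shorten a tree). Trees are enumerated as rooted nested tuples, so the same unrooted tree shows up once per possible root position.

```python
# Hypothetical 0/1 character matrix: one string of four characters per taxon.
MATRIX = {"A": "1000", "B": "1001", "C": "0011", "D": "0110", "E": "0110"}


def fitch(node, states):
    """Return (state set, number of changes) for one character on a subtree."""
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_steps = fitch(node[0], states)
    right_set, right_steps = fitch(node[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, matrix):
    """Total number of changes over all characters."""
    n_chars = len(next(iter(matrix.values())))
    return sum(fitch(tree, {t: row[i] for t, row in matrix.items()})[1]
               for i in range(n_chars))


def insertions(tree, leaf):
    """Yield every tree obtained by attaching leaf as sister to some subtree."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def branch_and_bound(taxa, matrix):
    best = {"score": float("inf"), "trees": []}

    def extend(tree, remaining):
        score = tree_length(tree, matrix)
        if score > best["score"]:
            return  # bound: nothing built on this partial tree can be optimal
        if not remaining:
            if score < best["score"]:
                best["score"], best["trees"] = score, []
            best["trees"].append(tree)
            return
        for bigger in insertions(tree, remaining[0]):
            extend(bigger, remaining[1:])

    extend((taxa[0], taxa[1]), list(taxa[2:]))
    return best["score"], best["trees"]


best_score, best_trees = branch_and_bound(list("ABCDE"), MATRIX)
print(best_score)  # 5
for tree in best_trees:
    print(tree)
```

A greedy heuristic would differ in one place: instead of trying every insertion of the next taxon, it would keep only the cheapest few and never revisit the rest, which is where the accuracy is lost.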

The robustness of MP trees is usually tested in two ways.

Jackknifing consists of deleting random characters from the character matrix, then running the MP analysis again. This is repeated x times (at least 1000). Those clades that occur more frequently in these repeats are considered more robust (Farris et al., 1996). Of course, this does have the drawback that those lineages with more autapomorphic characters will have higher frequencies, but this is not a major problem given enough repetitions, and the speed of MP combined with jackknifing has made this a popular method.
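The resampling step itself is short. The sketch below only builds one jackknife pseudo-replicate of a character matrix; the matrix, the function name, and the deletion probability of 0.37 are illustrative assumptions, and the tree search that would be run on each replicate is left out.

```python
import random

# Hypothetical character matrix: one string of character states per taxon.
MATRIX = {"A": "10001101", "B": "10011100", "C": "00110011", "D": "01100011"}


def jackknife_replicate(matrix, delete_prob, rng):
    """Drop each character (column) independently with probability delete_prob."""
    n_chars = len(next(iter(matrix.values())))
    kept = [i for i in range(n_chars) if rng.random() >= delete_prob]
    # The same columns are removed from every taxon, so rows stay aligned.
    return {taxon: "".join(row[i] for i in kept) for taxon, row in matrix.items()}


rng = random.Random(42)  # fixed seed so the example is repeatable
replicate = jackknife_replicate(MATRIX, delete_prob=0.37, rng=rng)
for taxon, row in replicate.items():
    print(taxon, row)
```

In a full analysis you would build a tree from each of the x replicates and then report, for every clade, the fraction of replicates in which it was recovered.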

With weighted characters, another method for robustness testing is also used: Bremer support or the decay index. For a given clade, this calculates the difference in length between the most parsimonious tree and the shortest tree in which that clade does not appear. For example, if the most parsimonious tree has 100 steps and the shortest tree lacking the clade has 150 steps, the Bremer support/decay index for that clade is 50. The meaning of the number is that you have to accept trees up to 50 steps longer than the optimum before the clade collapses in the consensus. The main drawback of the method is its intense computational requirements and thus slow speed.

As has already been stated, MP cannot be used willy-nilly. You need to understand the taxon’s evolution, its history, its physiology, and all other pertinent characteristics, because you need to interpret the meaning of the recovered trees and analyse whether they make sense and identify proper homologies. However, MP does find the trees that require the fewest assumptions about underlying evolutionary processes, and these are thus ideal as null hypotheses.

I am not an expert in molecular phylogenetics and I personally wouldn’t use MP with DNA sequences, because there are too many analogous traits in molecular data which, I think, should throw the analysis off. That said, parsimony is pretty popular there, and was one of the first ways in which molecular phylogenetics was done (Goodman et al., 1979). Those who use MP with DNA should be warned about gaps in aligned sequences. Gaps can be treated as missing information, in which case the gaps are considered unimportant and not given much weight by the analysis. Alternatively, they can be treated as a fifth nucleotide, in which case the gap is treated as a proper character and can be given a weight and possibly be considered as a homology.
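The two ways of treating gaps can give different step counts for the same alignment column. A minimal sketch, assuming a made-up four-species tree and a single aligned site, scored with Fitch's unordered counting (all names here are hypothetical):

```python
NUCLEOTIDES = {"A", "C", "G", "T"}

# Hypothetical tree and one aligned site; "-" marks a gap.
TREE = (("sp1", "sp2"), ("sp3", "sp4"))
SITE = {"sp1": "-", "sp2": "-", "sp3": "G", "sp4": "G"}


def leaf_set(symbol, gap_is_fifth_state):
    if symbol == "-" and not gap_is_fifth_state:
        # Missing information: the tip is compatible with any nucleotide.
        return set(NUCLEOTIDES)
    return {symbol}


def fitch_steps(node, site, gap_is_fifth_state):
    """Return (state set, number of changes) for one site on a subtree."""
    if isinstance(node, str):
        return leaf_set(site[node], gap_is_fifth_state), 0
    left_set, left_steps = fitch_steps(node[0], site, gap_is_fifth_state)
    right_set, right_steps = fitch_steps(node[1], site, gap_is_fifth_state)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


print(fitch_steps(TREE, SITE, gap_is_fifth_state=False)[1])  # 0
print(fitch_steps(TREE, SITE, gap_is_fifth_state=True)[1])   # 1
```

Treated as missing information, the shared gap contributes nothing to the tree length; treated as a fifth state, it becomes one change that groups sp1 and sp2 together, which is exactly where the question of homology comes in.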

Overall, MP is so popular because it’s very efficient, applicable to just about any dataset, and fittingly is the most parsimonious of all the methods, containing fewer assumptions about the processes of evolution. The inconsistencies discussed in this post may have given it a bad image, but an article on parametric methods, model-based methods, Bayesian methods, and whatever else would highlight many drawbacks as well.

In special cases, MP offers unique advantages with sister groups that have long branches. Other phylogenetic methods sometimes have problems recovering these taxa as sister groups because of a phenomenon called long-branch repulsion; for some reason, maximum parsimony doesn’t have problems with them (Pol & Siddall, 2001). On the other hand, parsimony experiences lots of problems with long branch attraction (Felsenstein, 1978).

Besides in the building of individual phylogenies, MP methods have found use in supertree methods, mostly as tests of robustness.

References:

Bastert O, Rockmore D, Stadler PF & Tinhofer G. 2002. Landscapes on spaces of trees. Applied Mathematics and Computation 131, 439-459.

Camin JH & Sokal RR. 1965. A Method for Deducing Branching Sequences in Phylogeny. Evolution 19, 311-326.

Cox CJ, Li B, Foster BG, Embley TM & Civáň P. 2014. Conflicting Phylogenies for Early Land Plants are Caused by Composition Biases among Synonymous Substitutions. Systematic Biology 63, 272-279.

Edwards AWF & Cavalli-Sforza LL. 1963. The reconstruction of evolution. Annals of Human Genetics 27, 104-105.

Farris JS, Albert VA, Källersjö M, Lipscomb D & Kluge AG. 1996. Parsimony jackknifing outperforms neighbor-joining. Cladistics 12, 99-124.

Fitch WM. 1971. Toward Defining the Course of Evolution: Minimum Change for a Specific Tree Topology. Systematic Biology 20, 406-416.

Foulds LR & Graham RL. 1982. The Steiner problem in phylogeny is NP-complete. Advances in Applied Mathematics 3, 43-49.

Gascuel O, Bryant D & Denis F. 2001. Strengths and Limitations of the Minimum Evolution Principle. Systematic Biology 50, 621-627.

Goloboff PA & Pol D. 2007. On Divide-and-Conquer Strategies for Parsimony Analysis of Large Data Sets: Rec-I-DCM3 versus TNT. Systematic Biology 56, 485-495.

Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE & Matsuda G. 1979. Fitting the Gene Lineage into its Species Lineage, a Parsimony Strategy Illustrated by Cladograms Constructed from Globin Sequences. Systematic Biology 28, 132-163.

Hendy MD & Penny D. 1982. Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences 59, 277-290.

Kluge AG & Farris JS. 1969. Quantitative Phyletics and the Evolution of Anurans. Systematic Biology 18, 1-32.

Pol D & Siddall ME. 2001. Biases in Maximum Likelihood and Parsimony: A Simulation Approach to a 10-Taxon Case. Cladistics 17, 266-281.

Sanderson MJ, McMahon MM & Steel M. 2011. Terraces in phylogenetic tree space. Science 333, 448-450.

Schröder E. 1870. Vier combinatorische Probleme. Zeitschrift für Mathematik und Physik 15, 361-376.

Varón A, Vinh LS & Wheeler WC. 2010. POY version 4: phylogenetic analysis using dynamic homologies. Cladistics 26, 72-85.

Wagner WH. 1961. Problems in the classification of ferns. In: Recent Advances in Botany 1, 841-844.

Yeates DK. 1995. Groundplans and exemplars: paths to the tree of life. Cladistics 11, 343-357.

