Building Cladograms 1: Introduction, Building a Character Matrix

This series of posts is a walk-through tutorial showing how to build your own phylogenetic tree from morphological data. In the future I may do something similar for molecular data, and then for total evidence (combined molecular and morphological data).

This post is the introduction, to note some important things before setting you on the journey into the wonderful world of cladistics.

What I will outline here are comprehensive ways of getting reliable phylogenies, and methods that I personally use all the time (except for special purposes, obviously), but there are alternative ways and programs, which I can detail after the series is done.

Before you even begin to construct a phylogeny, you have to make your research question very clear in your head. Don’t expect to plot a phylogeny and then get some brilliant insight – that might occur with molecular data, but not with morphological data, because here, you’re the one building your dataset by hand, and you need to know what to populate it with. I sometimes only want to build identification keys and use only characters that are easily visible to the naked eye, and then use ancestral trait reconstruction to get an outline for a dichotomous key (as in the exercise here) – this isn’t a proper phylogeny, but is a valid use of the method. A proper phylogeny will take into account all the characters that can be mustered. One including fossils will have to take special precautions for the limited amount of data and possible homoplasies. Investigating the evolution of a single character or complex of characters will need some extra thought: do you plot a phylogeny based on the character you’re investigating, or do you plot a general phylogeny and then see how the character changes on this tree? Both methods are valid, but for different purposes, and you have to think carefully, unless you really want to spend weeks building character matrices.

Know what you are doing. Do not just follow my steps blindly. At several points, you will come to junctures where you have to choose an evolutionary model, or weight some parameters, or otherwise make decisions that determine how your dataset is computed. Remember that all these fancy algorithms are nothing more than statistics. They have no consciousness of their own, so unless you tell them to do specifically what you need, they will just output nonsense. This is why I will include explanations of all the relevant models you will be choosing from and the factors you will be altering. Be sure to read these, if only for the sake of good phylogenetics.

I am assuming a Linux OS. Mac users should sell their computers and use the money to buy 5 superior ones and install Slackware on them. Windows users should only need Past (which doubles as an excellent statistics program!), although all the programs here are available for Windows as well. Linux users, you can also run Past under Wine (or in a virtual machine), but while that’s fine for run-of-the-mill statistics or for just building a character matrix, it will be too slow for phylogenetics.

For this series of posts, you will need the following programs and packages, all of which are free:

  • Mesquite
  • R [sidenote: invest some time to learn how to write in R, you will not regret it. It’s currently the best platform for running statistics, but has a steep learning curve.]
  • Phylip
  • Dendroscope (register it, it’s free and you need to have it registered if you want to save the graphics.)

Step 1: Build your character matrix

This is the task that eats the most of your own time (as opposed to the computer’s). Get your taxon sampling straight, look at your organisms and code every character as thoroughly as you can. The characters here are all discrete. They can be binary, coding only for presence/absence (1/0, respectively), or they can be multistate, for when a character has multiple forms and it isn’t enough to simply code for its presence and absence. In those cases, you include numbers from 0 to 8. I don’t recommend going for more than 8 states for a single character; you won’t need to in most cases anyway (the most I’ve done is 5). If a character cannot be coded (often the case with fossils), do not put a 0. Either leave it blank or use a ? as a wildcard. Coding it as a 0 is a statement that you looked and the character is absent; with a wildcard, the state is treated as unknown and the algorithm will not take it into account.
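To make the coding rules above concrete, here is a minimal Python sketch of a sanity check you could run on a hand-coded matrix before feeding it to any program. It is not part of Mesquite or Past; the function names, the taxon names and the codings are all invented for illustration, and the allowed states simply follow the 0 to 8 range mentioned above.

```python
# Hypothetical toy example: names and codings are made up for illustration.
ALLOWED_STATES = set("012345678")  # states 0 to 8, as suggested in the text
MISSING = "?"                      # wildcard for characters that cannot be coded


def validate_matrix(matrix):
    """Return a cleaned copy of the matrix.

    Blank cells become the '?' wildcard. Raises ValueError if a taxon has a
    different number of characters than the first one, or uses a symbol that
    is neither an allowed state nor the wildcard.
    """
    cleaned = {}
    n_chars = None
    for taxon, states in matrix.items():
        row = [MISSING if s.strip() == "" else s.strip() for s in states]
        if n_chars is None:
            n_chars = len(row)
        elif len(row) != n_chars:
            raise ValueError(
                f"{taxon}: expected {n_chars} characters, got {len(row)}"
            )
        for i, s in enumerate(row, start=1):
            if s != MISSING and s not in ALLOWED_STATES:
                raise ValueError(f"{taxon}, character {i}: invalid state {s!r}")
        cleaned[taxon] = row
    return cleaned


def missing_fraction(row):
    """Fraction of characters in a row that are coded as unknown."""
    return row.count(MISSING) / len(row)


matrix = {
    "taxon_A": ["1", "0", "2", "0"],
    "taxon_B": ["1", "1", "0", "1"],
    "fossil_C": ["1", "", "?", "0"],  # blank and '?' both mean "unknown"
}

cleaned = validate_matrix(matrix)
for taxon, row in cleaned.items():
    print(taxon, "".join(row), f"missing: {missing_fraction(row):.0%}")
```

The point of the check is the distinction made above: an unknown cell stays a ? and is never silently turned into a 0. The missing-data fraction is just a quick way to see which taxa (typically the fossils) are poorly coded.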

You can write the character matrix in any spreadsheet or in a plain text editor, but to avoid formatting errors, I suggest using the character matrix editor in Mesquite (or the table in Past).
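If you do go the plain-text route, the matrix usually ends up in a NEXUS file, which is the format Mesquite reads and writes. The following Python sketch shows roughly what a minimal DATA block looks like. The function name and the toy data are hypothetical, and I have not checked this output against every program’s parser, so treat it as an illustration of the layout rather than a guaranteed-valid exporter.

```python
# Hypothetical sketch: builds a minimal NEXUS DATA block as text.
def to_nexus(matrix, symbols="012345678", missing="?"):
    """Format a {taxon: [states]} dictionary as a NEXUS DATA block."""
    # Spaces in taxon names are replaced by underscores, which NEXUS
    # conventionally treats as equivalent to spaces.
    names = [taxon.replace(" ", "_") for taxon in matrix]
    rows = list(matrix.values())
    n_chars = len(rows[0])
    width = max(len(name) for name in names)

    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(names)} NCHAR={n_chars};",
        f'  FORMAT DATATYPE=STANDARD MISSING={missing} SYMBOLS="{symbols}";',
        "  MATRIX",
    ]
    for name, row in zip(names, rows):
        if len(row) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters")
        # One taxon per line: padded name, then the states with no separators.
        lines.append(f"  {name.ljust(width)}  {''.join(row)}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


matrix = {
    "taxon_A": ["1", "0", "2", "0"],
    "taxon_B": ["1", "1", "0", "1"],
    "fossil_C": ["1", "?", "?", "0"],
}

nexus_text = to_nexus(matrix)
print(nexus_text)
```

Seeing the layout also explains why hand-typed files break so easily: the NTAX and NCHAR counts must match the matrix exactly, and every row must have the same length. The matrix editors keep these in sync for you.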

In Mesquite, go to File/New…, make your folder, and make sure you choose the same options as in the next two pictures:

You will then land on the Character Matrix Editor. Click on the “taxon” rows to change the taxon names, and on the character columns to edit the character names, then start adding the data in the cells. As a protip, I recommend coding in real time while examining your specimens. The next picture shows you what a complete binary character matrix looks like.

This dataset is for 215 creation myths (the taxa), with 46 characters coded for each (see the sidebar); more details in the interpretation post. In addition, I added their geographical extent as colour groups. This does not affect the analysis, and is a feature only in Mesquite, and makes the final tree nicer to look at (should you choose to make it in Mesquite) by allowing you to colour branches/names after each group. If I were doing a phylogeny of an insect family, I would make a colour group for every tribe or genus, for example, because this will allow me to instantly spot paraphyly in the tree produced by Mesquite.

To add new taxa, use the fourth tool from the sidebar (the one with the up-and-down arrows). To add more characters, use the one above it (with the left-and-right arrows).

Don’t forget to save.

The next posts will be about how to plot and visualise trees in various programs, followed by the various analyses possible (besides just staring at the trees and making up evolutionary trends). One post will be dedicated to editing your tree to make it publication-worthy. The final post will be the interpretation of the creation myth dataset, just to exemplify how to do such things properly. And yes, I did write this paragraph as a guide to myself, not for you. Seriously, there are over 10 draft posts I haven’t finished simply because I got distracted by a factoid while writing and ended up writing about a related, but different subject. Now I’m forced to do this series exactly in this way. No post schedule, as usual (it’s grant-writing season over the next month).

Jump to: Plotting in Mesquite; Plotting in Phylip; Plotting in R; Analyses in Mesquite; Analyses in R; Polishing the Tree; A Phylogeny of Creation Myths
