How to Build a Phylogenetic Tree: A Complete Guide
What Is a Phylogenetic Tree?
A phylogenetic tree is a diagram showing how species or genes evolved from common ancestors. It maps relationships between organisms based on genetic or morphological similarities.
Biologists use these trees to understand evolution. They answer questions like: which species are most closely related? When did a particular trait evolve? How did pathogens branch out?
If you're working in genetics, microbiology, systematics, or evolutionary biology, you'll need to build one eventually. This guide tells you exactly how.
Types of Phylogenetic Trees
Not all trees are the same. The format you choose depends on what you're trying to show.
Rooted vs. Unrooted Trees
A rooted tree has a single common ancestor at the base. It shows the direction of evolution and the order in which lineages split.
An unrooted tree shows relationships without specifying a common ancestor. Useful when you're still figuring out outgroups.
Cladogram vs. Phylogram
A cladogram shows branching order only. Branch lengths don't represent time or amount of change.
A phylogram scales branch lengths to the amount of genetic change. Longer branches = more substitutions.
Ultrametric Trees
Branch tips align at the same level, representing equal time since divergence. Useful for molecular clock analyses.
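To make the difference concrete, here is a minimal sketch that checks whether a tree is ultrametric by comparing root-to-tip distances. The nested-tuple tree format and both example trees are made up for illustration; real trees would come from a Newick or Nexus file.

```python
import math

# A node is (label_or_children, branch_length).
# Leaves carry a string label; internal nodes carry a list of child nodes.
# Both trees are hypothetical examples.
phylogram = ([("A", 0.1), ([("B", 0.3), ("C", 0.05)], 0.2)], 0.0)
ultrametric = ([("A", 0.5), ([("B", 0.3), ("C", 0.3)], 0.2)], 0.0)


def tip_depths(node, depth=0.0):
    """Return {tip label: root-to-tip distance} for every leaf under node."""
    content, length = node
    depth += length
    if isinstance(content, str):
        return {content: depth}
    depths = {}
    for child in content:
        depths.update(tip_depths(child, depth))
    return depths


def is_ultrametric(tree, tol=1e-9):
    """True when every tip sits at the same distance from the root."""
    depths = list(tip_depths(tree).values())
    return all(math.isclose(d, depths[0], abs_tol=tol) for d in depths)


print(is_ultrametric(phylogram), is_ultrametric(ultrametric))
```

In the first tree the tips end at different distances from the root, so it is a phylogram but not ultrametric. In the second, every tip ends at 0.5.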
Data You Need to Build a Phylogenetic Tree
You can't build a tree without data. Here's what works:
- DNA sequences — most common. Use mitochondrial DNA, nuclear genes, or whole genomes.
- Protein sequences — amino acid data. Good when you lack nucleotide data.
- Morphological traits — physical characteristics. Used for fossil species or organisms without genetic data.
- Combined datasets — multiple genes + morphology. Gives more robust results.
The quality of your tree depends almost entirely on the quality and quantity of your input data. Garbage in, garbage out.
How to Build a Phylogenetic Tree: Step by Step
Step 1: Collect and Align Your Sequences
Grab your sequences from databases like GenBank (hosted by NCBI) or UniProt. Format them as FASTA files.
Then align them. This is where most people screw up. Alignment quality determines tree quality. Use tools like:
- MAFFT: fast and accurate for most alignments
- MUSCLE: good for protein sequences
- Clustal Omega: simple, works for smaller datasets
Inspect your alignment manually. Remove poorly aligned regions. Check for conserved motifs.
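As a rough illustration of the cleanup step, the sketch below reads an aligned FASTA file and drops columns that are mostly gaps. The function names and the 50% gap threshold are assumptions chosen for the example, not a standard. Dedicated trimming tools use more careful criteria.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove alignment columns whose gap fraction exceeds the threshold."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences are not aligned to equal length")
    keep = [
        i for i in range(length)
        if sum(s[i] == "-" for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {n: "".join(s[i] for i in keep) for n, s in alignment.items()}


# Hypothetical three-taxon alignment; column 4 is all gaps.
example = """>sp1 demo
ACG-T
A
>sp2
AC--T
A
>sp3
A-G-TA
"""
trimmed = drop_gappy_columns(parse_fasta(example))
print(trimmed)
```

Automatic trimming is a starting point. It doesn't replace looking at the alignment yourself.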
Step 2: Choose a Substitution Model
Different sites and lineages evolve at different rates and in different patterns. Your substitution model accounts for this.
Common models include:
- GTR+G+I — general time reversible with gamma distribution and invariable sites. Most commonly used for good reason.
- HKY85 — simpler, faster. Works for less divergent sequences.
- JC69 — simplest. Only use for very similar sequences.
Most phylogenetic software can select the best model automatically using AIC or BIC criteria. Don't guess — let the software decide.
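For intuition, this is roughly what the comparison looks like. AIC is 2k - 2 ln L and BIC is k ln(n) - 2 ln L, where k is the number of free parameters, n the number of alignment sites, and ln L the log-likelihood. The log-likelihood values below are invented for illustration, and the parameter counts cover only the substitution model (branch lengths add the same number to every candidate on the same tree).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# model -> (hypothetical log-likelihood, free substitution-model parameters)
candidates = {
    "JC69": (-2650.0, 0),
    "HKY85": (-2600.0, 4),     # kappa + 3 free base frequencies
    "GTR+G+I": (-2597.0, 10),  # 5 rates + 3 frequencies + alpha + p-inv
}
n_sites = 1000  # assumed alignment length

best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_bic = min(candidates, key=lambda m: bic(*candidates[m], n_sites))
print(best_aic, best_bic)
```

With these made-up numbers the richest model fits slightly better but not enough to pay for its extra parameters, so both criteria pick HKY85.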
Step 3: Select a Tree-Building Method
Two main approaches exist. Each has trade-offs.
Distance-Based Methods
Calculate genetic distance between all pairs of sequences. Build a tree that best fits these distances.
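A small sketch of the first half of that recipe: pairwise distances from an alignment. It uses the proportion of differing sites (p-distance) with the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3). JC69 is chosen here only because it's the simplest correction; the toy sequences are hypothetical.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(a, b):
    """Jukes-Cantor corrected distance in expected substitutions per site."""
    p = p_distance(a, b)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment):
    """Return {(name1, name2): distance} for every pair of sequences."""
    return {
        (m, n): jc69_distance(alignment[m], alignment[n])
        for m, n in combinations(sorted(alignment), 2)
    }


# Hypothetical aligned sequences
alignment = {"a": "ACGTACGTAC", "b": "ACGTACGTAT", "c": "ACGAACGTTT"}
matrix = distance_matrix(alignment)
print(matrix)
```

The corrected distance is always a bit larger than the raw p-distance because it accounts for multiple substitutions at the same site.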
UPGMA — assumes constant evolution rate. Rarely used for real data because this assumption is usually wrong.
Neighbor-Joining — doesn't assume a molecular clock. Faster. Good for initial exploration.
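To show how a distance method turns a matrix into a tree, here is a bare-bones UPGMA sketch. It repeatedly merges the closest pair of clusters and places the new node at half their distance, which is exactly where the constant-rate assumption enters. The input format, function name, and distances are assumptions for the example. Neighbor-Joining follows a similar merge loop but uses a rate-corrected criterion, which is not shown here.

```python
def upgma(names, dist):
    """Build a UPGMA tree. dist maps (a, b) name pairs to distances.
    Returns a Newick string with branch lengths."""
    # cluster label -> (newick text, number of tips, node height)
    clusters = {n: (n, 1, 0.0) for n in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair)
        (text_a, size_a, h_a), (text_b, size_b, h_b) = clusters[a], clusters[b]
        height = d[pair] / 2              # new node sits halfway between them
        merged = f"({text_a}:{height - h_a:.3f},{text_b}:{height - h_b:.3f})"
        label = f"{a}+{b}"
        for c in clusters:
            if c in (a, b):
                continue
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            # size-weighted average keeps every tip equally weighted
            d[frozenset((label, c))] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[pair]
        del clusters[a], clusters[b]
        clusters[label] = (merged, size_a + size_b, height)
    (newick, _, _), = clusters.values()
    return newick + ";"


# Hypothetical clock-like distances between four taxa
distances = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
    ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
tree = upgma(["A", "B", "C", "D"], distances)
print(tree)
```

The output tree is ultrametric by construction. If the real rates differ between lineages, UPGMA forces them into that shape anyway, which is why it misleads on most real data.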
Character-Based Methods
These use the actual sequence data at each position, not just distances.
Maximum Likelihood (ML) — finds the tree most likely to produce your observed data given the evolutionary model. Best balance of accuracy and computational cost. Most researchers use this.
Bayesian Inference — calculates probability that a tree is correct given your data. Produces a posterior distribution of trees. Computationally intensive but often the most powerful method.
Maximum Parsimony — finds tree requiring fewest evolutionary changes. Simple but prone to long-branch attraction artifacts. Avoid for molecular data unless you have a specific reason.
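To see what "fewest evolutionary changes" means in practice, the sketch below scores a fixed tree with Fitch's algorithm, a standard way to count the minimum number of changes for each alignment column. Searching over tree shapes is a separate problem and is not shown. The nested-tuple tree format and the four-taxon alignment are made up for illustration.

```python
def fitch(node, states):
    """Return (possible ancestral states, minimum changes) for one column.
    node is a leaf name (str) or a 2-tuple of child nodes."""
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # children can agree on a state: no extra change needed here
        return common, left_cost + right_cost
    # children disagree: one change is required on this pair of branches
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical alignment and two candidate topologies
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CTT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

Parsimony would prefer the first topology here because it explains the same data with fewer changes.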
Step 4: Assess Tree Support
A tree without support values is useless. You need to know how reliable each branch is.
Bootstrap resampling — most common. Resample columns of your alignment, rebuild tree, see if same branches appear. Values above 70% generally indicate reasonable support. Below 50% — treat that branch with skepticism.
Bayesian posterior probabilities — for Bayesian trees. Values above 0.95 indicate strong support.
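The bootstrap idea is simple enough to sketch directly: resample alignment columns with replacement, rebuild the tree, and count how often a clade comes back. In the sketch below, `closest_pair_clade` is a deliberately crude stand-in for a real tree builder, and all names and sequences are hypothetical. Real software does this with a proper inference method on each replicate.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def closest_pair_clade(alignment):
    """Toy 'tree builder': report the two most similar sequences as a clade."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    pair = min(
        combinations(sorted(alignment), 2),
        key=lambda p: hamming(alignment[p[0]], alignment[p[1]]),
    )
    return {frozenset(pair)}


def bootstrap_support(alignment, build_clades, clade, replicates=100, seed=0):
    """Percentage of replicates whose tree contains the given clade."""
    rng = random.Random(seed)
    hits = sum(
        frozenset(clade) in build_clades(bootstrap_alignment(alignment, rng))
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


# Hypothetical data: A and B are identical, C differs at every site
alignment = {"A": "ACGTACGT", "B": "ACGTACGT", "C": "TGCATGCA"}
support = bootstrap_support(alignment, closest_pair_clade, {"A", "B"})
print(support)
```

With data this clean, every replicate recovers the same grouping. Conflicting signal in real alignments is what pulls the percentage down.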
Step 5: Visualize and Interpret Your Tree
Export your tree in Newick or Nexus format. Open it in:
- FigTree — designed specifically for phylogenetic trees
- MEGA — free, popular, easy to use
- Interactive Tree of Life (iTOL) — web-based, good for sharing
- R (ggtree package) — customizable, scriptable
Root your tree correctly. Include an outgroup: a taxon known to fall outside the group you're studying, but still close enough to align reliably. The outgroup determines where the root sits.
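If you want to script anything around your tree files, it helps to know that Newick is just nested parentheses with optional names and branch lengths. Here is a minimal parser sketch that handles plain names and lengths only (no quoted labels, comments, or support values). The dict layout and the example tree are assumptions for illustration. For real work, use an established library.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with
    'name', 'length', and 'children' keys."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # step past "("
            while True:
                children.append(parse_node())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def leaf_names(node):
    """List tip names in the order they appear in the tree."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


# Hypothetical tree rooted with an outgroup called "Out"
tree = parse_newick("(((A:0.1,B:0.2):0.05,C:0.3):0.1,Out:0.9);")
print(leaf_names(tree))
```

In this example the outgroup is one of the two branches at the root, which is what a correctly rooted tree should look like.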
Phylogenetic Tree Software Comparison
| Software | Type | Best For | Cost | Learning Curve |
|---|---|---|---|---|
| MEGA | GUI | Beginners, teaching, small datasets | Free | Low |
| RAxML | Command-line | Large datasets, ML analysis | Free | Medium |
| IQ-TREE | Command-line | ML with model testing, ultrafast bootstrap | Free | Medium |
| MrBayes | Command-line | Bayesian inference | Free | Medium-High |
| PhyML | Command-line | Fast ML analysis | Free | Medium |
| PAUP* | GUI/Command-line | Parsimony analysis | Commercial | High |
Getting Started: Quick Workflow
Here's a practical starting point for beginners:
- Download sequences in FASTA format from NCBI
- Align with MAFFT online (no install needed)
- Open alignment in MEGA
- Select Model Test to find best substitution model
- Run Maximum Likelihood tree building
- Apply 100 bootstrap replicates for support
- Visualize in FigTree
This gets you a basic, defensible tree in under an hour.
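If you'd rather script it than click through a GUI, the same pipeline can be driven from the command line. The sketch below assumes `mafft` and `iqtree2` are installed and on your PATH. The executable names and file names are assumptions, so check the flags against the versions you have installed. It swaps MEGA for IQ-TREE, using `-m MFP` for model selection and `-b 100` for 100 standard bootstrap replicates.

```python
import subprocess


def build_commands(fasta, aligned="aligned.fasta", bootstraps=100):
    """Return the MAFFT and IQ-TREE command lines as argument lists."""
    # MAFFT prints the alignment to stdout; --auto picks a strategy by dataset size
    mafft_cmd = ["mafft", "--auto", fasta]
    # -s alignment, -m MFP = model selection then tree search, -b = bootstrap replicates
    iqtree_cmd = ["iqtree2", "-s", aligned, "-m", "MFP", "-b", str(bootstraps)]
    return mafft_cmd, iqtree_cmd


def run_pipeline(fasta, aligned="aligned.fasta", bootstraps=100):
    """Align, then infer an ML tree with bootstrap support."""
    mafft_cmd, iqtree_cmd = build_commands(fasta, aligned, bootstraps)
    with open(aligned, "w") as out:
        subprocess.run(mafft_cmd, stdout=out, check=True)
    subprocess.run(iqtree_cmd, check=True)


# "seqs.fasta" is a hypothetical input file; nothing is executed here
mafft_cmd, iqtree_cmd = build_commands("seqs.fasta")
print(" ".join(mafft_cmd))
print(" ".join(iqtree_cmd))
```

Calling `run_pipeline("seqs.fasta")` would run both steps. IQ-TREE typically writes the resulting tree to a file ending in `.treefile`, which you can open in FigTree.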
Common Mistakes to Avoid
- Bad alignment — the #1 problem. Alignment errors propagate through everything downstream.
- Ignoring model selection — wrong model = wrong tree.
- No support values — never publish a tree without bootstrap or posterior probabilities.
- Wrong outgroup: the root lands in the wrong place, so the direction of evolution is misread.
- Too few characters — one gene rarely tells the whole story. Use multiple loci when possible.
- Long-branch attraction — distantly related fast-evolving taxa can cluster falsely. Use more data or different methods to test.
Advanced Considerations
Once you've mastered the basics, these areas will improve your trees:
Concatenation vs. coalescence — combine all genes into one supermatrix, or analyze each gene separately and compare trees? Coalescent methods are increasingly popular for species-level phylogenies.
Divergence time estimation — add fossil calibrations or known mutation rates to estimate when lineages split.
Species tree vs. gene tree — gene trees can disagree with species trees due to incomplete lineage sorting, hybridization, or horizontal transfer. Know which you're actually trying to reconstruct.
Whole genome approaches — for closely related taxa, whole-genome alignments or SNP-based methods outperform single-gene trees.
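For the concatenation route mentioned above, building the supermatrix is mostly bookkeeping: join each gene's alignment end to end and pad taxa that are missing a gene. A minimal sketch, with hypothetical gene alignments and a function name chosen for the example:

```python
def concatenate(alignments, missing="-"):
    """Join per-gene alignments into one supermatrix.
    Taxa absent from a gene are padded with the missing-data character.
    Returns (supermatrix, partitions) with 1-based inclusive column ranges."""
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    matrix = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            matrix[taxon].append(aln.get(taxon, missing * length))
        partitions.append((start, start + length - 1))
        start += length
    return {t: "".join(parts) for t, parts in matrix.items()}, partitions


# Hypothetical alignments: taxon B lacks gene 2, taxon C lacks gene 1
gene_1 = {"A": "ACGT", "B": "ACGA"}
gene_2 = {"A": "TT", "C": "TG"}
supermatrix, partitions = concatenate([gene_1, gene_2])
print(supermatrix, partitions)
```

Keeping the partition boundaries lets you assign a separate substitution model to each gene later.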
When to Use What
- Closely related species → use more characters, consider coalescent methods
- Distantly related taxa → use slowly evolving nuclear genes or protein sequences; fast-evolving mitochondrial DNA tends to saturate at deep divergences
- Large datasets (hundreds of taxa) → RAxML or IQ-TREE, not Bayesian
- Bacteria/archaea → use core genome alignments, not single genes
- Virus evolution → consider Bayesian phylodynamics (BEAST package)
The Bottom Line
Building a phylogenetic tree isn't magic. It's a pipeline: align → model → infer → support → visualize. Each step has standard tools and accepted practices.
Start simple. Use MEGA or IQ-TREE. Get one working tree. Then refine from there.
Don't overthink the theory before you've built your first tree. You'll learn more from doing than from reading documentation.