How to Build a Phylogenetic Tree: A Complete Guide

What Is a Phylogenetic Tree?

A phylogenetic tree is a diagram showing how species or genes evolved from common ancestors. It maps relationships between organisms based on genetic or morphological similarities.

Biologists use these trees to understand evolution. They answer questions like: which species are most closely related? When did a particular trait evolve? How did pathogens branch out?

If you're working in genetics, microbiology, systematics, or evolutionary biology, you'll need to build one eventually. This guide tells you exactly how.

Types of Phylogenetic Trees

Not all trees are the same. The format you choose depends on what you're trying to show.

Rooted vs. Unrooted Trees

A rooted tree has a single common ancestor at the base. It shows the direction of evolution and the order in which lineages split.

An unrooted tree shows relationships without specifying a common ancestor. It's useful when you don't yet have a reliable outgroup.

Cladogram vs. Phylogram

A cladogram shows branching order only. Branch lengths don't represent time or amount of change.

A phylogram adjusts branch lengths to reflect genetic distance. Longer branches = more mutations.

Ultrametric Trees

Branch tips align at the same level, representing equal time since divergence. Useful for molecular clock analyses.

Data You Need to Build a Phylogenetic Tree

You can't build a tree without data. Here's what works:

DNA sequences — most common. Use mitochondrial DNA, nuclear genes, or whole genomes.
Protein sequences — amino acid data. Good when you lack nucleotide data.
Morphological traits — physical characteristics. Used for fossil species or organisms without genetic data.
Combined datasets — multiple genes + morphology. Gives more robust results.

The quality of your tree depends almost entirely on the quality and quantity of your input data. Garbage in, garbage out.

How to Build a Phylogenetic Tree: Step by Step

Step 1: Collect and Align Your Sequences

Grab your sequences from databases like GenBank (part of NCBI) or UniProt. Format them as FASTA files.
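If you want to sanity-check a FASTA file before aligning, a few lines of code are enough. This is a minimal sketch using only the standard library. The function name `read_fasta` and the example records are made up for illustration, and a dedicated library will handle edge cases this does not.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into {record_id: sequence}, joining wrapped lines."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            current_id = fields[0]  # ID is the first word after '>'
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {rid: "".join(chunks) for rid, chunks in records.items()}


# Invented records; a real file would be opened with open(path).
example = io.StringIO(">seq1 made-up description\nACGTAC\nGTTA\n>seq2\nACGTTCGTTA\n")
sequences = read_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```

Checking that every record has a sensible length at this stage catches truncated downloads before they reach the aligner.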

Then align them. This is where most people screw up. Alignment quality determines tree quality. Use tools like:

MAFFT: fast and accurate for most alignments
MUSCLE: good for protein sequences
Clustal Omega: simple, works for smaller datasets

Inspect your alignment manually. Remove poorly aligned regions. Check for conserved motifs.
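Part of that inspection can be automated. The sketch below flags columns by the fraction of gaps they contain and drops the gappiest ones. The 0.5 threshold, the function names, and the toy alignment are assumptions for illustration; dedicated trimming tools apply more criteria than gap content alone.

```python
def gap_fraction_per_column(alignment):
    """Return the fraction of '-' characters in each alignment column."""
    rows = list(alignment.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("aligned sequences must all have the same length")
    return [sum(row[i] == "-" for row in rows) / len(rows) for i in range(length)]


def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only columns whose gap fraction is at or below the threshold."""
    fractions = gap_fraction_per_column(alignment)
    keep = [i for i, f in enumerate(fractions) if f <= max_gap_fraction]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy alignment, invented for illustration.
toy = {"A": "ACG-TA", "B": "AC--TA", "C": "ACGTTA", "D": "AC--TA"}
print(gap_fraction_per_column(toy))
print(trim_gappy_columns(toy))
```

Here the fourth column is 75% gaps and gets removed, while the third column (50% gaps) survives the default threshold.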

Step 2: Choose a Substitution Model

Nucleotides don't all substitute at the same rate, and sites differ in how fast they change. A substitution model accounts for this.

Common models include:

GTR+G+I: general time reversible with gamma-distributed rates and invariable sites. Widely used because it is the most general time-reversible nucleotide model and it lets rates vary across sites.
HKY85: simpler, faster. Works for less divergent sequences.
JC69: simplest. Only use for very similar sequences.

Most phylogenetic software can select the best model automatically using AIC or BIC criteria. Don't guess — let the software decide.
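To see what the software is doing when it ranks models, here is a small sketch of the two criteria: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters and n the number of alignment sites. The log-likelihoods and the site count below are invented placeholders, and k counts only substitution-model parameters (branch lengths would add the same amount to every model).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model -> (log-likelihood, free model parameters).
fits = {"JC69": (-5230.4, 0), "HKY85": (-5101.7, 4), "GTR+G+I": (-5060.2, 10)}
n_sites = 1200  # assumed alignment length

scores = {name: (aic(lnl, k), bic(lnl, k, n_sites)) for name, (lnl, k) in fits.items()}
best_by_aic = min(scores, key=lambda m: scores[m][0])
best_by_bic = min(scores, key=lambda m: scores[m][1])

for name, (a, b) in scores.items():
    print(f"{name:8s} AIC={a:.1f} BIC={b:.1f}")
print("best by AIC:", best_by_aic, "| best by BIC:", best_by_bic)
```

BIC penalizes extra parameters more heavily than AIC once the alignment is longer than a handful of sites, so the two can disagree on real data.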

Step 3: Select a Tree-Building Method

Two main approaches exist. Each has trade-offs.

Distance-Based Methods

Calculate genetic distance between all pairs of sequences. Build a tree that best fits these distances.
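As a rough illustration of the first step, the sketch below computes the proportion of differing sites (the p-distance) and applies the JC69 correction, d = -3/4 ln(1 - 4p/3), which accounts for multiple substitutions at the same site. The sequences and function names are made up, and skipping gapped columns is just one possible convention.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(seq_a, seq_b):
    """JC69-corrected distance in expected substitutions per site."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment):
    """Return {(name_a, name_b): distance} for every unordered pair."""
    names = sorted(alignment)
    return {
        (a, b): jc69_distance(alignment[a], alignment[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy alignment, invented for illustration.
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAT", "C": "ACTTACGAAT"}
for pair, d in distance_matrix(toy).items():
    print(pair, round(d, 4))
```

The corrected distance is always a little larger than the raw p-distance, and the gap widens as sequences diverge.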

UPGMA: assumes a constant rate of evolution (a strict molecular clock). Rarely used for real data because this assumption is usually wrong.

Neighbor-Joining: doesn't assume a molecular clock. Fast even on large datasets. Good for initial exploration.
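UPGMA is the simpler of the two to write out, so it makes a reasonable illustration of how a distance matrix becomes a tree: repeatedly merge the closest pair of clusters and average their distances to everything else. The sketch below is a bare-bones version with invented distances and a made-up input format (a dict keyed by `frozenset` pairs). It needs at least two taxa and does no input validation.

```python
from itertools import combinations


def upgma(distances):
    """Cluster taxa with UPGMA and return a Newick string.

    distances maps frozenset({a, b}) to the distance between taxa a and b.
    """
    d = dict(distances)
    # cluster label -> (Newick text so far, number of tips, node height)
    clusters = {}
    for pair in distances:
        for name in pair:
            clusters[name] = (name, 1, 0.0)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2  # new node sits halfway between them
        merged = f"{a}|{b}"
        text = f"({text_a}:{height - height_a:.3f},{text_b}:{height - height_b:.3f})"
        # Size-weighted average distance from the new cluster to each other cluster.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = (text, size_a + size_b, height)
    text, _, _ = next(iter(clusters.values()))
    return text + ";"


# Invented pairwise distances for four taxa.
toy = {
    frozenset(("A", "B")): 2.0, frozenset(("C", "D")): 4.0,
    frozenset(("A", "C")): 8.0, frozenset(("A", "D")): 8.0,
    frozenset(("B", "C")): 8.0, frozenset(("B", "D")): 8.0,
}
print(upgma(toy))
```

Because every merge places the new node at half the cluster distance, the output is always ultrametric. That is the clock assumption baked into the method.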

Character-Based Methods

These use the actual sequence data at each position, not just distances.

Maximum Likelihood (ML): finds the tree under which your observed data are most probable, given the evolutionary model. Best balance of accuracy and computational cost. Most researchers use this.
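The idea is easiest to see in the smallest possible case: two sequences and one unknown, the distance between them. Under JC69, a site stays identical with probability 1/4 + 3/4 e^(-4d/3) and changes to one specific other base with probability 1/4 - 1/4 e^(-4d/3). The sketch below scores candidate distances by log-likelihood and picks the best one by grid search. The site counts are invented, and real ML programs optimize topology, branch lengths, and model parameters over the whole tree rather than a single distance.

```python
import math


def jc69_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of distance d (substitutions per site) for two sequences."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay       # site ends in the same base
    p_diff_each = 0.25 - 0.25 * decay  # site ends in one specific other base
    # The factor 0.25 is the equilibrium frequency of the starting base.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff_each)


def ml_distance(n_same, n_diff, step=0.0005, max_d=3.0):
    """Grid-search the distance that maximizes the JC69 log-likelihood."""
    candidates = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(d, n_same, n_diff))


# Invented counts: 90 identical sites, 10 differing sites.
estimate = ml_distance(n_same=90, n_diff=10)
closed_form = -0.75 * math.log(1 - 4 * 0.10 / 3)
print(round(estimate, 4), round(closed_form, 4))
```

The grid-search estimate lands on the closed-form JC69 distance (to within the grid step), which is what you'd expect: that formula is the maximum likelihood solution for this two-sequence case.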

Bayesian Inference: calculates the posterior probability of trees given your data, model, and priors. Produces a posterior distribution of trees rather than a single best tree. Computationally intensive, but it expresses uncertainty directly.

Maximum Parsimony: finds the tree requiring the fewest evolutionary changes. Simple but prone to long-branch attraction artifacts. Avoid for molecular data unless you have a specific reason.
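"Fewest evolutionary changes" can be counted with Fitch's algorithm, which works up the tree one site at a time. Where the two child state sets share a state, it keeps the intersection; otherwise it takes the union and adds one change. The sketch below scores two candidate topologies on a toy alignment. The tree encoding (nested tuples of tip names) and the data are made up for illustration, and a real parsimony search would also explore tree space.

```python
def fitch(node, states):
    """Return (possible_states, change_count) for one site on a rooted binary tree.

    node is a tip name (str) or a (left, right) tuple; states maps tip -> base.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_cost = fitch(node[0], states)
    right_set, right_cost = fitch(node[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Toy two-site alignment and two candidate topologies (invented).
toy = {"A": "AG", "B": "AG", "C": "TG", "D": "TC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, toy), parsimony_score(tree_2, toy))
```

Parsimony prefers `tree_1` here because grouping A with B and C with D explains the first column with a single change instead of two.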

Step 4: Assess Tree Support

A tree without support values is useless. You need to know how reliable each branch is.

Bootstrap resampling: most common. Resample the columns of your alignment with replacement, rebuild the tree, and count how often each branch reappears. Values above 70% generally indicate reasonable support. Treat branches below 50% with skepticism.

Bayesian posterior probabilities: for Bayesian trees. Values above 0.95 indicate strong support.
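The resampling step of the bootstrap is simple enough to sketch directly. Below, `bootstrap_alignment` draws columns with replacement to build a pseudo-alignment of the original length, and `support_percent` counts how often a grouping shows up across replicate trees. Both names are hypothetical, the tree-building step is left out, and the replicate groupings are invented stand-ins for what a tree-building program would return.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def support_percent(group, replicate_groups):
    """Percentage of replicate trees that contain the given group of taxa."""
    hits = sum(group in groups for groups in replicate_groups)
    return 100.0 * hits / len(replicate_groups)


# Toy alignment (invented); a fixed seed keeps the example reproducible.
toy = {"A": "ACGTAC", "B": "ACGTAT", "C": "ACTTAT"}
replicate = bootstrap_alignment(toy, random.Random(0))
print(replicate)

# Invented groupings recovered from four hypothetical replicate trees.
replicate_groups = [
    {frozenset("AB")}, {frozenset("AB")}, {frozenset("AC")}, {frozenset("AB")},
]
print(support_percent(frozenset("AB"), replicate_groups))
```

Each replicate keeps the same taxa and the same number of sites, but some columns appear several times and others not at all. Branches that depend on only a few columns tend to drop out of many replicates.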

Step 5: Visualize and Interpret Your Tree

Export your tree in Newick or Nexus format. Open it in:

FigTree — designed specifically for phylogenetic trees
MEGA — free, popular, easy to use
Interactive Tree of Life (iTOL) — web-based, good for sharing
R (ggtree package) — customizable, scriptable
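If you've never looked inside a Newick file, it's just nested parentheses with optional branch lengths after colons. The sketch below writes one from a made-up in-memory encoding (a leaf is a `(name, branch_length)` pair, an internal node is a `(children, branch_length)` pair). The encoding and branch lengths are assumptions for illustration, and real files can also carry support values and comments that this ignores.

```python
def to_newick(node, is_root=True):
    """Serialize a nested (label_or_children, branch_length) tree to Newick."""
    label_or_children, length = node
    if isinstance(label_or_children, str):
        text = label_or_children
    else:
        text = "(" + ",".join(to_newick(child, False) for child in label_or_children) + ")"
    if is_root:
        return text + ";"  # the root carries no branch length here
    return f"{text}:{length}"


# Invented tree: A and B are sisters, C is the outgroup.
tree = ([([("A", 0.1), ("B", 0.2)], 0.05), ("C", 0.3)], 0.0)
newick = to_newick(tree)
print(newick)
```

Saving that string to a plain text file gives you something any of the viewers above should be able to open.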

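Root your tree correctly. Include an outgroup: a taxon known to fall outside the group you're studying, ideally a close relative rather than a very distant one. The outgroup sets the position of the root and therefore the direction of evolution in your tree.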

Phylogenetic Tree Software Comparison

Software	Type	Best For	Cost	Learning Curve
MEGA	GUI	Beginners, teaching, small datasets	Free	Low
RAxML	Command-line	Large datasets, ML analysis	Free	Medium
IQ-TREE	Command-line	ML with model testing, ultrafast bootstrap	Free	Medium
MrBayes	Command-line	Bayesian inference	Free	Medium-High
PhyML	Command-line	Fast ML analysis	Free	Medium
PAUP*	GUI/Command-line	Parsimony analysis	Commercial	High

Getting Started: Quick Workflow

Here's a practical starting point for beginners:

Download sequences in FASTA format from NCBI
Align with MAFFT online (no install needed)
Open alignment in MEGA
Use Models > Find Best DNA/Protein Models (ML) to find the best substitution model
Run Maximum Likelihood tree building
Apply 100 bootstrap replicates for support
Visualize in FigTree

This gets you a basic, defensible tree in under an hour.

Common Mistakes to Avoid

Bad alignment: the #1 problem. Alignment errors propagate through everything downstream.
Ignoring model selection: a poorly fitting model can bias both topology and branch lengths.
No support values: never publish a tree without bootstrap values or posterior probabilities.
Wrong outgroup: the root lands in the wrong place, so the inferred direction of evolution is wrong.
Too few characters: one gene rarely tells the whole story. Use multiple loci when possible.
Long-branch attraction: distantly related, fast-evolving taxa can cluster falsely. Use more data or different methods to test.

Advanced Considerations

Once you've mastered the basics, these areas will improve your trees:

Concatenation vs. coalescence — combine all genes into one supermatrix, or analyze each gene separately and compare trees? Coalescent methods are increasingly popular for species-level phylogenies.

Divergence time estimation — add fossil calibrations or known mutation rates to estimate when lineages split.

Species tree vs. gene tree — gene trees can disagree with species trees due to incomplete lineage sorting, hybridization, or horizontal transfer. Know which you're actually trying to reconstruct.

Whole genome approaches — for closely related taxa, whole-genome alignments or SNP-based methods outperform single-gene trees.

When to Use What

Closely related species → use more characters, consider coalescent methods
Distantly related taxa → slowly evolving nuclear genes or protein sequences usually resolve deep splits better than mitochondrial DNA, which evolves fast and saturates
Large datasets (hundreds of taxa) → RAxML or IQ-TREE, not Bayesian
Bacteria/archaea → use core genome alignments, not single genes
Virus evolution → consider Bayesian phylodynamics (BEAST package)

The Bottom Line

Building a phylogenetic tree isn't magic. It's a pipeline: align → model → infer → support → visualize. Each step has standard tools and accepted practices.

Start simple. Use MEGA or IQ-TREE. Get one working tree. Then refine from there.

Don't overthink the theory before you've built your first tree. You'll learn more from doing than from reading documentation.