The Tree of Life and Genome Diversity

Beginner Cell Biology ~35 min

Compare genomes across the three domains of life — Bacteria, Archaea, and Eukarya — and discover how sequence similarity, gene duplication, and horizontal transfer shape the diversity of genomes.

Introduction: Unity and Diversity

The previous lesson established what all cells share. This lesson asks the opposite question: how do cells differ, and what can their genomes tell us about the history of life?

The answer comes from comparing sequences. When we line up the DNA or protein sequences of different organisms, patterns emerge — patterns of conservation and change that reveal how species are related, how genes are born, and how evolution has shaped the staggering diversity of life on Earth. Along the way, we will introduce the computational tools that make these comparisons possible: similarity searching, multiple sequence alignment, and phylogenetic tree construction.

Diverse Energy Sources Power Different Cells

All cells need free energy, but they obtain it in remarkably different ways. The diversity of energy metabolism is one of the most striking features of the living world, and it is overwhelmingly concentrated among the prokaryotes.

Phototrophs harvest energy from sunlight. These include cyanobacteria (which carry out oxygenic photosynthesis, producing O₂ as a byproduct), purple and green bacteria (which use anoxygenic photosynthesis), and all photosynthetic eukaryotes (plants and algae, which acquired the ability through endosymbiosis with cyanobacteria).

Chemolithotrophs extract energy from inorganic chemical reactions — oxidizing hydrogen, iron, sulfur, ammonia, or other inorganic substrates. These organisms drive the biogeochemical cycles that shape Earth’s chemistry. For example, nitrifying bacteria oxidize ammonia to nitrate, and iron-oxidizing bacteria produce the rust-colored deposits seen in acid mine drainage.

Chemoorganotrophs (including animals, fungi, and most familiar bacteria) obtain energy by oxidizing organic molecules such as sugars and fats.

Some organisms play a special ecological role by fixing essential elements into biologically usable forms. Nitrogen-fixing bacteria and archaea convert atmospheric N₂ into ammonia (NH₃), making nitrogen available to all other life. Without biological nitrogen fixation, terrestrial ecosystems would be starved of this essential element. Similarly, carbon-fixing autotrophs convert CO₂ into organic molecules through photosynthesis or chemosynthesis, forming the base of virtually every food web.

Energy strategy	Energy source	Examples
Photoautotroph	Light + CO₂	Cyanobacteria, plants, algae
Photoheterotroph	Light + organic C	Purple non-sulfur bacteria
Chemolithoautotroph	Inorganic chemicals + CO₂	Nitrifying bacteria, iron oxidizers
Chemoorganoheterotroph	Organic molecules	Animals, fungi, most bacteria

The Greatest Biochemical Diversity Exists Among Prokaryotes

While eukaryotes have evolved extraordinary diversity of form — from diatoms to elephants — their basic metabolic repertoire is relatively limited. Almost all eukaryotes are aerobic chemoorganotrophs. The true biochemical innovators are the prokaryotes: bacteria and archaea.

Prokaryotes have colonized virtually every environment on Earth. Methanogens (archaea) produce methane from CO₂ and H₂ in anaerobic environments like swamps and the guts of ruminants. Thermophiles thrive at temperatures above 80°C in hot springs and hydrothermal vents. Halophiles tolerate salt concentrations that would kill most organisms. Acidophiles grow at pH values below 2. This metabolic versatility reflects billions of years of evolution in diverse and often extreme environments.

The total genetic diversity of the prokaryotic world dwarfs that of eukaryotes. Metagenomic studies — sequencing DNA directly from environmental samples without culturing — have revealed vast numbers of previously unknown bacterial and archaeal lineages, many of which have never been grown in the laboratory.

The Tree of Life: Bacteria, Archaea, and Eukarya

In the 1970s, Carl Woese and George Fox revolutionized our understanding of life’s diversity by comparing sequences of ribosomal RNA (rRNA) genes. Because all cells need ribosomes to make proteins, rRNA genes are universally present and highly conserved, making them ideal molecular markers for deep evolutionary comparisons.

Their analysis revealed that life divides into three domains: Bacteria, Archaea, and Eukarya. This was a surprise — archaea had previously been lumped together with bacteria as “prokaryotes.” But at the molecular level, archaea are in many ways more closely related to eukaryotes. They share similar transcription machinery, use histones to organize their DNA, and have homologous DNA replication proteins.

The root of the three-domain tree cannot be found by ordinary outgroup comparison, because no organism lies outside all three domains. Instead, the tree has been rooted using pairs of genes that duplicated before the domains diverged, so that each copy serves as an outgroup for the other. These analyses place the root on the bacterial branch, which makes Archaea and Eukarya sister groups.

Let’s see how sequence comparison reveals these relationships. Here, two closely related gene fragments produce a small evolutionary distance, while a more distant comparison produces a larger one:

let archaea = "ATGGCTAGCAAAGACTTCACC"
let eukarya = "ATGGCAAGCAAAGACTTCACC"
let bacteria = "ATGGATAGCGAAGATTTTACC"
let ae = Phylo.distance(archaea, eukarya, "jc69")
let ab = Phylo.distance(archaea, bacteria, "jc69")
print("Archaea vs Eukarya distance: " + ae)
print("Archaea vs Bacteria distance: " + ab)

The smaller distance between Archaea and Eukarya reflects their closer evolutionary relationship: fewer substitutions have accumulated since their divergence.
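The distances printed above can be checked by hand. The sketch below is written in Python rather than the lesson's own scripting language, and it assumes that the "jc69" option corresponds to the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the fraction of sites that differ between two gap-free sequences of equal length. The helper names p_distance and jc69_distance are illustrative and do not belong to any library.

```python
import math

def p_distance(a, b):
    """Fraction of sites that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

def jc69_distance(a, b):
    """Jukes-Cantor distance in expected substitutions per site."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # the correction is undefined once sites are saturated
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

archaea = "ATGGCTAGCAAAGACTTCACC"
eukarya = "ATGGCAAGCAAAGACTTCACC"
bacteria = "ATGGATAGCGAAGATTTTACC"

d_ae = jc69_distance(archaea, eukarya)
d_ab = jc69_distance(archaea, bacteria)
print("Archaea vs Eukarya:", round(d_ae, 4))
print("Archaea vs Bacteria:", round(d_ab, 4))
```

The corrected distance is always at least as large as the raw fraction of differences, because the model accounts for sites that may have changed more than once.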

Some Genes Evolve Rapidly; Others Are Highly Conserved

Not all genes change at the same rate. Highly conserved genes — such as those encoding ribosomal RNA, DNA polymerase, and core metabolic enzymes — have changed very little over billions of years because almost any mutation in these genes is harmful and is eliminated by natural selection (purifying selection).

Other genes evolve much more rapidly. Genes involved in the immune system, surface antigens of pathogens, and reproductive proteins often show high rates of change, driven by the constant evolutionary arms race between hosts and parasites or by sexual selection.

The concept of a molecular clock rests on the observation that, for a given gene, mutations accumulate at a roughly constant rate over time. By calibrating this clock against the fossil record, biologists can estimate when two lineages diverged. The molecular clock is imperfect — rates vary across lineages and genes — but it remains one of the most powerful tools for dating evolutionary events.
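As a rough illustration of the calibration step, divergence time can be estimated as t = d / (2r), where d is the pairwise distance in substitutions per site and r is the substitution rate per site per year along one lineage; the factor of 2 counts both lineages since the split. The Python sketch below uses made-up numbers chosen only to show the arithmetic: the calibration distance, its fossil date, and the query distance are all hypothetical.

```python
def calibrate_rate(distance, divergence_years):
    """Rate per site per year per lineage, from a pair with a known split time."""
    return distance / (2.0 * divergence_years)

def divergence_time(distance, rate):
    """Estimated years since two lineages split, assuming a constant rate."""
    return distance / (2.0 * rate)

# Hypothetical calibration: two lineages that a fossil dates to 100 million years ago
calib_distance = 0.20      # substitutions per site (made-up value)
calib_age_years = 100e6    # fossil-based split time (made-up value)
rate = calibrate_rate(calib_distance, calib_age_years)

# Hypothetical query pair for the same gene, with no fossil evidence
query_distance = 0.05      # substitutions per site (made-up value)
estimated_age = divergence_time(query_distance, rate)

print("Calibrated rate per site per year:", rate)
print("Estimated divergence (years ago):", estimated_age)
```

Because the estimate is a simple ratio, any error in the calibration date or any change in rate along a lineage carries straight through to the inferred age.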

let conserved_a = "ATGGCTAGCAAAGACTTCACCGAG"
let conserved_b = "ATGGCTAGCAAAGACTTCACCGAG"
let fast_a = "ATGGCTAGCAAAGACTTCACCGAG"
let fast_b = "ATGCATACCGAAGATTTCGCCAAG"
let cons_hamming = Seq.hamming(conserved_a, conserved_b)
let fast_hamming = Seq.hamming(fast_a, fast_b)
print("Conserved gene — Hamming distance: " + cons_hamming)
print("Fast-evolving gene — Hamming distance: " + fast_hamming)
let cons_dist = Phylo.distance(conserved_a, conserved_b, "jc69")
let fast_dist = Phylo.distance(fast_a, fast_b, "jc69")
print("Conserved gene — JC69 evolutionary distance: " + cons_dist)
print("Fast-evolving gene — JC69 evolutionary distance: " + fast_dist)

A perfectly conserved gene shows a Hamming distance and an evolutionary distance of zero. A rapidly evolving gene from the same two species shows many more differences over the same time span.

Prokaryotic Genome Sizes

Most bacteria and archaea have genomes encoding between 1,000 and 6,000 genes. Genome size correlates with lifestyle: free-living organisms with complex metabolic capabilities tend to have larger genomes, while obligate parasites and symbionts that depend on a host for many nutrients can lose genes they no longer need, resulting in smaller genomes.

Organism	Lifestyle	Genome size	Genes
Mycoplasma genitalium	Obligate parasite	580 kb	~475
Chlamydia trachomatis	Obligate intracellular	1.0 Mb	~900
Escherichia coli K-12	Free-living	4.6 Mb	~4,300
Bacillus subtilis	Free-living, soil	4.2 Mb	~4,100
Streptomyces coelicolor	Free-living, complex metabolism	8.7 Mb	~7,800
Sorangium cellulosum	Free-living, multicellular stages	13.0 Mb	~9,400

The largest bacterial genomes approach the size of small eukaryotic genomes, blurring the line that was once thought to separate prokaryotes from eukaryotes.

let sizes = '[{"label": "Mycoplasma", "value": 0.58}, {"label": "Chlamydia", "value": 1.0}, {"label": "E. coli", "value": 4.6}, {"label": "B. subtilis", "value": 4.2}, {"label": "Streptomyces", "value": 8.7}, {"label": "Sorangium", "value": 13.0}]'
let chart = Viz.bar(sizes, '{"title": "Prokaryotic Genome Sizes (Mb)", "color": "#06B6D4"}')
print(chart)

In prokaryotes, genome size and gene count show a strong linear correlation — a stark contrast to the C-value paradox seen in eukaryotes:

let data = '[{"x": 0.58, "y": 475, "label": "Mycoplasma"}, {"x": 1.0, "y": 900, "label": "Chlamydia"}, {"x": 4.6, "y": 4300, "label": "E. coli"}, {"x": 4.2, "y": 4100, "label": "B. subtilis"}, {"x": 8.7, "y": 7800, "label": "Streptomyces"}, {"x": 13.0, "y": 9400, "label": "Sorangium"}]'
let plot = Viz.scatter(data, '{"title": "Prokaryotic Genome Size vs Gene Count", "x_label": "Genome size (Mb)", "y_label": "Genes", "color": "#06B6D4"}')
print(plot)
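The strength of this relationship can be checked numerically. The Python sketch below takes the six genomes from the table above, computes a Pearson correlation coefficient between genome size and gene count, and reports gene density in genes per megabase. The function name pearson_r is illustrative; the calculation itself is the textbook formula.

```python
import math

# (organism, genome size in Mb, approximate gene count) from the table above
genomes = [
    ("Mycoplasma genitalium", 0.58, 475),
    ("Chlamydia trachomatis", 1.0, 900),
    ("Escherichia coli K-12", 4.6, 4300),
    ("Bacillus subtilis", 4.2, 4100),
    ("Streptomyces coelicolor", 8.7, 7800),
    ("Sorangium cellulosum", 13.0, 9400),
]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

sizes = [size for _, size, _ in genomes]
gene_counts = [count for _, _, count in genomes]
r = pearson_r(sizes, gene_counts)

# Genes per megabase for each organism
density = {name: count / size for name, size, count in genomes}

print("Pearson r:", round(r, 3))
for name, value in density.items():
    print(name, round(value), "genes per Mb")
```

The densities all fall in a narrow band of several hundred genes per megabase, which is why gene count tracks genome size so closely in these organisms.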

New Genes from Old: Duplication and Divergence

Where do new genes come from? The primary mechanism is gene duplication. When a segment of DNA is accidentally copied — through unequal crossing over, retrotransposition, or whole-genome duplication — the organism ends up with two copies of the same gene. One copy can continue its original function while the other is free to accumulate mutations. Over time, the duplicate may acquire a new function (neofunctionalization), specialize in a subset of the original function (subfunctionalization), or accumulate disabling mutations and become a pseudogene.

This process, repeated over millions of years, creates gene families — groups of related genes within one genome. The globin gene family is a classic example: a single ancestral globin gene has given rise to hemoglobin α, β, γ, δ, and ε chains, as well as myoglobin, each optimized for a different tissue or developmental stage.

let globin_alpha = "ATGGCTAGCAAAGACTTCACCGAG"
let globin_beta = "ATGGCTAGTCAAGACTTCGCCGAG"
let myoglobin = "ATGGCAACCGAAGACTTCACTCAG"
let d_ab = Phylo.distance(globin_alpha, globin_beta, "jc69")
let d_am = Phylo.distance(globin_alpha, myoglobin, "jc69")
let d_bm = Phylo.distance(globin_beta, myoglobin, "jc69")
print("Alpha vs Beta distance:    " + d_ab)
print("Alpha vs Myoglobin distance: " + d_am)
print("Beta vs Myoglobin distance:  " + d_bm)
let labels = '["Alpha", "Beta", "Myoglobin"]'
let distances = '[[0, ' + d_ab + ', ' + d_am + '], [' + d_ab + ', 0, ' + d_bm + '], [' + d_am + ', ' + d_bm + ', 0]]'
let tree = Phylo.nj(labels, distances)
print(tree)

Alpha and beta globin are more similar to each other (they duplicated more recently) than either is to myoglobin (which diverged earlier). Pairwise distances thus reconstruct the history of duplication events within a gene family.

More than 200 gene families are found in all three domains of life, indicating that the last universal common ancestor (LUCA) already possessed a sophisticated genetic toolkit. These universal families include genes for DNA replication, transcription, translation, and core metabolism — the minimal shared inheritance of all living things.

Horizontal Gene Transfer

Not all genes are inherited vertically from parent to offspring. Horizontal gene transfer (HGT) moves genes between organisms that are not parent and child — sometimes across vast evolutionary distances.

In prokaryotes, three major mechanisms drive HGT:

Transformation — uptake of free DNA from the environment
Transduction — transfer of DNA by bacteriophages (bacterial viruses)
Conjugation — direct transfer of DNA between cells through a pilus or bridge

HGT has profound practical consequences. The spread of antibiotic resistance genes among bacteria is largely driven by horizontal transfer, often on mobile genetic elements called plasmids. A resistance gene that arises in one species can rapidly spread to unrelated species in the same environment.

HGT also complicates evolutionary analysis. If genes move freely between lineages, the history of a gene may differ from the history of the organism that carries it. For this reason, the metaphor of a “tree” of life has been supplemented by the image of a web of life, reflecting the reticulate nature of prokaryotic evolution.

Sex and Genetic Recombination

While HGT transfers genes between species, sexual reproduction shuffles genetic information within a species. During meiosis, homologous chromosomes undergo crossing over (recombination), exchanging segments of DNA. Combined with the independent assortment of chromosomes, this generates an enormous number of genetically unique offspring from just two parents.

The evolutionary advantage of sex is thought to lie in this genetic diversity. By producing offspring with novel combinations of alleles, sexual reproduction allows populations to adapt more rapidly to changing environments, resist parasites, and purge harmful mutations. Nearly all eukaryotes engage in some form of sexual reproduction, and even bacteria exchange genetic material through conjugation — a form of “parasexual” gene exchange.

Deducing Gene Function from Sequence

One of the most powerful principles in genomics is that sequence similarity implies functional similarity. If a newly sequenced gene is highly similar to a gene whose function is already known in another organism, there is a strong inference that the new gene has a similar function.

This principle works because homologous genes — those descended from a common ancestor — tend to retain the same three-dimensional protein structure and biochemical activity, even after hundreds of millions of years of independent evolution. For example, the cell-division protein FtsZ in bacteria is clearly homologous to tubulin in eukaryotes; although their sequences have diverged significantly, they share the same GTPase fold and play analogous roles in cell division.

Functional annotation of newly sequenced genomes relies heavily on this approach. After a genome is assembled and genes are predicted, each predicted protein is compared against databases of characterized proteins to assign putative functions. Genes with no detectable homologs are labeled hypothetical proteins — and these often make up 20–40% of a newly sequenced prokaryotic genome, highlighting how much remains to be discovered.

Sequence Homology: Orthologs and Paralogs

Two genes are homologous if they share a common ancestor. Homology is an all-or-nothing concept — two genes are either homologous or they are not — though the degree of sequence similarity can vary enormously.

Homologs fall into two categories based on how they arose:

Orthologs arise by speciation. When a species splits into two lineages, each inherits a copy of the ancestral gene. Human insulin and mouse insulin are orthologs — they descend from the insulin gene in the common ancestor of humans and mice.
Paralogs arise by gene duplication within a genome. Human α-globin and β-globin are paralogs — they arose when an ancestral globin gene was duplicated and the two copies diverged.

The distinction matters for functional prediction. Orthologs typically retain the same function (because they continue to fill the same biological role in their respective organisms). Paralogs may retain similar functions, but they can also diverge significantly — one copy may evolve a new substrate specificity, expression pattern, or tissue localization.

BLAST and Sequence Similarity Searching

The most widely used tool in bioinformatics is BLAST (Basic Local Alignment Search Tool). Given a query sequence, BLAST rapidly searches a database of millions of sequences to find the best matches.

BLAST works by a seed-and-extend heuristic strategy:

Break the query into short seed words (typically 11 bases for DNA or 3 amino acids for protein)
Scan the database for exact or near-exact matches to these seeds
Extend each seed match in both directions using local alignment
Report matches with statistically significant scores
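The four steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not BLAST itself: it uses exact seed matches only, ungapped extension with an X-drop cutoff, and no statistics. The word size of 5, the scores (+1 match, -3 mismatch), and the X-drop value are assumptions chosen to suit a very short query, and all function names are hypothetical.

```python
from collections import defaultdict

def build_index(subject, k):
    """Map every k-mer in the subject to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index

def extend_one_way(query, subject, qi, si, step, match, mismatch, x_drop):
    """Best ungapped extension in one direction; returns (score gain, length)."""
    best_gain = gain = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += match if query[qi] == subject[si] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score has fallen too far below the best seen so far
        qi += step
        si += step
    return best_gain, best_len

def seed_and_extend(query, subject, k=5, match=1, mismatch=-3, x_drop=5):
    """Return sorted (score, query_start, subject_start, length) hits."""
    index = build_index(subject, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            right_gain, right_len = extend_one_way(
                query, subject, q + k, s + k, 1, match, mismatch, x_drop)
            left_gain, left_len = extend_one_way(
                query, subject, q - 1, s - 1, -1, match, mismatch, x_drop)
            score = k * match + left_gain + right_gain
            hits.add((score, q - left_len, s - left_len, left_len + k + right_len))
    return sorted(hits, reverse=True)

query = "ATGGCTAGCAAAGAC"
subject = "CCCATGGCAAGCAAAGACTTTCCC"
hits = seed_and_extend(query, subject)
print(hits)
```

Several different seeds extend to the same 15-base region here, so the set collapses them into one hit with a score of 11 (14 matches at +1 and one mismatch at -3).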

The key statistical measure in BLAST output is the E-value (expect value): the number of alignments with an equal or better score that would be expected by chance in a database of that size. An E-value of 10⁻⁵⁰ means the match is overwhelmingly unlikely to be a coincidence; an E-value near 1 means the match could easily be random.
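The dependence on database size can be made concrete with the Karlin-Altschul formula, E = K · m · n · e^(-λS), where m is the query length, n is the total database length, S is the raw alignment score, and K and λ are constants that depend on the scoring system. The Python sketch below uses placeholder values for K and λ and hypothetical query and database sizes; it illustrates the shape of the formula rather than reproducing the output of any real search.

```python
import math

def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score, lam, k):
    """Normalized score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Placeholder statistical parameters (assumptions for illustration only)
LAMBDA = 1.37
K = 0.71

query_len = 300    # hypothetical query length
small_db = 1e6     # hypothetical database sizes, in total residues
large_db = 1e9

e_small = expect_value(40, query_len, small_db, LAMBDA, K)
e_large = expect_value(40, query_len, large_db, LAMBDA, K)
e_weak = expect_value(15, query_len, small_db, LAMBDA, K)

print("Score 40, small database:", e_small)
print("Score 40, large database:", e_large)
print("Score 15, small database:", e_weak)
```

The same alignment score becomes less significant as the database grows, because E scales linearly with the search space, while each additional unit of score lowers E exponentially.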

BLAST comes in several flavors: blastn (DNA vs. DNA), blastp (protein vs. protein), blastx (translated DNA vs. protein), tblastn (protein vs. translated DNA), and tblastx (translated DNA vs. translated DNA).

At its core, BLAST relies on local alignment — the same Smith-Waterman principle we can demonstrate here:

let query = "ATGGCTAGCAAAGAC"
let database_hit = "CCCATGGCAAGCAAAGACTTTCCC"
let result = Align.local(query, database_hit)
print("Local alignment score: " + result.score)
print(result.alignment)

BLAST uses heuristics to make this process fast enough to search entire databases in seconds, trading a small amount of sensitivity for enormous gains in speed.
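For comparison, the exact dynamic-programming approach that BLAST approximates can be written out directly. The following Python sketch of Smith-Waterman local alignment uses a simple scoring scheme (match +2, mismatch -1, linear gap penalty -2). These values are assumptions for illustration and are not necessarily the ones used by Align.local, so the score may differ from the one printed by the lesson's tool.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which makes the alignment local
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

query = "ATGGCTAGCAAAGAC"
database_hit = "CCCATGGCAAGCAAAGACTTTCCC"
score, aligned_query, aligned_hit = smith_waterman(query, database_hit)
print("Local alignment score:", score)
print(aligned_query)
print(aligned_hit)
```

With these scores the best local alignment covers all 15 query bases with a single mismatch, giving 14 × 2 - 1 = 27. The flanking CCC and TTTCCC in the database sequence are ignored, which is what distinguishes a local alignment from a global one.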

Multiple Sequence Alignment

Pairwise alignment compares two sequences at a time. Multiple sequence alignment (MSA) extends this to three or more sequences simultaneously, revealing patterns of conservation across an entire gene family or set of orthologs.

MSA is computationally challenging — the optimal alignment of N sequences is an NP-hard problem. In practice, progressive alignment algorithms build the MSA by first aligning the most similar pair, then adding more distant sequences one by one. The most widely used MSA tools include:

ClustalW / Clustal Omega — the classic progressive aligner, widely used and well-established
MUSCLE — faster than ClustalW, uses iterative refinement to improve accuracy
MAFFT — highly accurate, especially for large datasets; uses fast Fourier transform for initial alignment

The output of an MSA reveals conserved positions (columns where all sequences agree) and variable positions (where sequences differ). Conserved positions often correspond to functionally or structurally important residues. The consensus derived from an MSA can also be used to build sequence profiles and hidden Markov models (HMMs) for more sensitive database searches.

Let’s align orthologous gene fragments from four species simultaneously. Notice which positions are conserved across all four — these are likely functionally critical:

let human =   "ATGGCTAGCAAAGACTTCACCGAG"
let mouse =   "ATGGCTAGCAAAGACTTCGCCGAG"
let chicken = "ATGGCAAGCAAAGACTTTACCGAG"
let fish =    "ATGGCAAGCGAAGACTTTACTGAG"
let msa = Seq.msa(human, mouse, chicken, fish)
print(msa)

The MSA reveals a core of positions conserved across all four species (likely essential for protein function) and variable positions that accumulate mutations at different rates.
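Conserved and variable columns can also be counted directly. The Python sketch below assumes the four fragments are already aligned, which holds here because they have equal length and no gaps; it marks each column with * if all four sequences agree and . otherwise. The function name column_conservation is illustrative.

```python
sequences = {
    "human":   "ATGGCTAGCAAAGACTTCACCGAG",
    "mouse":   "ATGGCTAGCAAAGACTTCGCCGAG",
    "chicken": "ATGGCAAGCAAAGACTTTACCGAG",
    "fish":    "ATGGCAAGCGAAGACTTTACTGAG",
}

def column_conservation(aligned):
    """Return a string with '*' for identical columns and '.' for variable ones."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    return "".join("*" if len(set(column)) == 1 else "." for column in zip(*aligned))

marks = column_conservation(list(sequences.values()))
variable_positions = [i for i, mark in enumerate(marks) if mark == "."]

for name, seq in sequences.items():
    print(name.ljust(8), seq)
print("".ljust(8), marks)
print("Conserved columns:", marks.count("*"), "of", len(marks))
print("Variable positions (0-based):", variable_positions)
```

For these fragments, 19 of the 24 columns are identical across all four species, and the five variable columns are where the lineages have diverged.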

Phylogenetic Tree Construction

A phylogenetic tree is a branching diagram that represents the evolutionary relationships among a set of organisms or sequences. Phylogenetic trees are built from multiple sequence alignments using one of several methods:

Distance-based methods compute a pairwise distance matrix from the alignment and use clustering algorithms to build the tree. Neighbor-joining is the most popular distance method — it is fast and produces reasonable trees for closely related sequences.

Maximum parsimony finds the tree that requires the fewest evolutionary changes (mutations) to explain the observed sequences. It works well when sequences have not diverged too much, but can be misled when rates of evolution vary across lineages.

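Maximum likelihood searches for the tree topology and branch lengths that maximize the probability of the observed alignment under a model of sequence evolution. Because the number of possible topologies grows explosively with the number of taxa, practical programs use heuristic search rather than scoring every tree. The method is statistically rigorous and generally among the most accurate, but it is computationally expensive.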
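The computational cost becomes clear from counting candidate trees. For n taxa, the number of distinct unrooted binary topologies is (2n - 5)!!, the product of the odd numbers from 1 up to 2n - 5. The short Python sketch below computes it; the function name is illustrative.

```python
def unrooted_tree_count(n_taxa):
    """Number of distinct unrooted binary trees on n_taxa labeled tips."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count

for n in (4, 5, 10, 20):
    print(n, "taxa:", unrooted_tree_count(n), "unrooted topologies")
```

Four taxa allow only 3 topologies, but ten taxa already allow more than two million, which is why exhaustive evaluation is limited to very small datasets.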

Bayesian inference is similar to maximum likelihood but incorporates prior probabilities and produces posterior probability distributions for tree topologies. Programs like MrBayes and BEAST use Markov chain Monte Carlo (MCMC) sampling.

Let’s build a tree from computed distances. We calculate pairwise evolutionary distances between short illustrative fragments from four species using the Jukes-Cantor model, then apply the neighbor-joining algorithm:

let human =   "ATGGCTAGCAAAGACTTCACCGAG"
let mouse =   "ATGGCTAGCAAAGACTTCGCCGAG"
let chicken = "ATGGCAAGCAAAGACTTTACCGAG"
let fish =    "ATGGCAAGCGAAGACTTTACTGAG"
let d_hm = Phylo.distance(human, mouse, "jc69")
let d_hc = Phylo.distance(human, chicken, "jc69")
let d_hf = Phylo.distance(human, fish, "jc69")
let d_mc = Phylo.distance(mouse, chicken, "jc69")
let d_mf = Phylo.distance(mouse, fish, "jc69")
let d_cf = Phylo.distance(chicken, fish, "jc69")
let labels = '["Human", "Mouse", "Chicken", "Fish"]'
let distances = '[[0, ' + d_hm + ', ' + d_hc + ', ' + d_hf + '], [' + d_hm + ', 0, ' + d_mc + ', ' + d_mf + '], [' + d_hc + ', ' + d_mc + ', 0, ' + d_cf + '], [' + d_hf + ', ' + d_mf + ', ' + d_cf + ', 0]]'
let tree = Phylo.nj(labels, distances)
print(tree)

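The neighbor-joining algorithm groups Human and Mouse together (mammals), separated by an internal branch from Chicken and Fish. Neighbor-joining trees are unrooted, so placing Fish as the outgroup, which matches the known vertebrate phylogeny, is a rooting choice made after the tree is built.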
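The joining step itself is compact enough to write out. The Python sketch below is a minimal neighbor-joining implementation under stated simplifications: it reports topology only (branch-length estimation is omitted to keep it short), it stops when three nodes remain and emits them as an unrooted trifurcation, and it assumes the "jc69" distances follow the standard Jukes-Cantor formula. It is not the implementation behind Phylo.nj, and the function names are hypothetical.

```python
import math

def jc69(a, b):
    """Jukes-Cantor distance between two equal-length, gap-free sequences."""
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def neighbor_joining(labels, dist):
    """Return (Newick topology, list of joined pairs) for an unrooted tree."""
    names = list(labels)
    d = [row[:] for row in dist]
    joins = []
    while len(names) > 3:
        n = len(names)
        totals = [sum(row) for row in d]
        pairs = [(a, b) for a in range(n) for b in range(a + 1, n)]
        # Pick the pair minimizing Q(i, j) = (n - 2) * d(i, j) - total(i) - total(j)
        i, j = min(pairs, key=lambda ab: (n - 2) * d[ab[0]][ab[1]]
                   - totals[ab[0]] - totals[ab[1]])
        joins.append((names[i], names[j]))
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance from the new internal node to every remaining node
        to_new = [0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in keep]
        d = [[d[a][b] for b in keep] + [to_new[x]] for x, a in enumerate(keep)]
        d.append(to_new + [0.0])
        names = [names[k] for k in keep] + ["(" + names[i] + "," + names[j] + ")"]
    return "(" + ",".join(names) + ");", joins

seqs = {
    "Human":   "ATGGCTAGCAAAGACTTCACCGAG",
    "Mouse":   "ATGGCTAGCAAAGACTTCGCCGAG",
    "Chicken": "ATGGCAAGCAAAGACTTTACCGAG",
    "Fish":    "ATGGCAAGCGAAGACTTTACTGAG",
}
labels = list(seqs)
matrix = [[0.0 if a == b else jc69(seqs[a], seqs[b]) for b in labels] for a in labels]
newick, joins = neighbor_joining(labels, matrix)
print(newick)
print("First pair joined:", joins[0])
```

With four taxa the two candidate pairs, Human with Mouse and Chicken with Fish, have equal Q values in exact arithmetic, so either may be joined first depending on floating-point rounding. Both outcomes describe the same unrooted tree.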

Once a tree is constructed, we can inspect its properties — number of taxa, internal nodes, and total branch length — using Phylo.info():

let newick = "((Human:0.04,Mouse:0.04):0.08,(Chicken:0.13,Fish:0.17):0.04);"
let info = Phylo.info(newick)
print("Tree summary:")
print(info)

The reliability of a tree is assessed by bootstrap analysis: the alignment is resampled many times, a tree is built from each resampled dataset, and the percentage of trees supporting each branch is reported. Bootstrap values above 70–80% are generally considered well-supported.

Reading Phylogenetic Trees: Newick Format

Phylogenetic trees are stored as text strings in Newick format, a compact parenthetical notation. A tree with three species might look like:

((Human:0.1, Mouse:0.12):0.08, Fish:0.35);

This says Human and Mouse are sister taxa (joined at an internal node), with branch lengths of 0.1 and 0.12 substitutions per site respectively. Their common ancestor is 0.08 substitutions per site from the root, and Fish is an outgroup on a branch of length 0.35.

Key rules for reading trees: the tips (leaves) represent extant sequences or organisms; internal nodes represent inferred common ancestors; branch lengths represent evolutionary distance (usually substitutions per site); and sister groups share a more recent common ancestor with each other than with any other group. A common mistake is reading left-to-right order as meaningful — it is not. The branching pattern (topology) is what matters, not the order of the tips.
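To make the notation concrete, the Python sketch below parses a Newick string into nested dictionaries and sums branch lengths from the root to each tip. It is a minimal parser written for illustration: it handles names, branch lengths, and nested parentheses, but not quoted labels, comments, or support values. The function names are hypothetical.

```python
def parse_newick(text):
    """Parse a simple Newick string into {'name', 'length', 'children'} dicts."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and the final ';'
    pos = 0

    def parse_node():
        nonlocal pos
        node = {"name": "", "length": 0.0, "children": []}
        if text[pos] == "(":
            pos += 1  # consume '('
            while True:
                node["children"].append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another child follows
                    continue
                pos += 1  # consume ')'
                break
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        node["name"] = text[start:pos]
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    return parse_node()

def tip_depths(node, depth=0.0, out=None):
    """Map each tip name to its summed branch length from the root."""
    if out is None:
        out = {}
    depth += node["length"]
    if not node["children"]:
        out[node["name"]] = depth
    for child in node["children"]:
        tip_depths(child, depth, out)
    return out

tree = parse_newick("((Human:0.1, Mouse:0.12):0.08, Fish:0.35);")
depths = tip_depths(tree)
for name, depth in depths.items():
    print(name, round(depth, 2))
```

Swapping the order of Human and Mouse in the string changes nothing in the output apart from print order, which is a direct demonstration that tip order carries no meaning.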

Genome Databases and Browsers

The comparative genomics discussed in this lesson depends on a vast infrastructure of public databases and genome browsers:

NCBI Genome — hosts thousands of complete genome assemblies with gene annotations, linked to GenBank, RefSeq, and the taxonomy database
Ensembl (EBI/EMBL) — provides gene annotations, comparative genomics, and variation data with a powerful genome browser and programmatic API access
UCSC Genome Browser — a highly customizable browser for visualizing genomic data in the context of a reference genome, with hundreds of annotation tracks

These browsers allow researchers to zoom from whole-chromosome views down to individual bases, examine gene structures, compare synteny (conserved gene order) across species, and overlay functional data like gene expression and epigenetic marks. Programmatic access via APIs and tools like BioMart enables large-scale automated queries.

Exercise: Compute Evolutionary Distances

You have rRNA gene fragments from three domains of life. Compute the Jukes-Cantor evolutionary distance between each pair. Which pair has the smallest distance (i.e., which two domains are most closely related)?

let bacteria = "ATGGATAGCGAAGATTTTACCGAT"
let archaea =  "ATGGCTAGCAAAGACTTCACCGAG"
let eukarya =  "ATGGCAAGCAAAGACTTCACCGAG"
let d_ba = Phylo.distance(bacteria, archaea, "jc69")
let d_be = Phylo.distance(bacteria, eukarya, "jc69")
let d_ae = Phylo.distance(archaea, eukarya, "jc69")
print("Bacteria-Archaea distance: " + d_ba)
print("Bacteria-Eukarya distance: " + d_be)
print("Archaea-Eukarya distance:  " + d_ae)
// Which pair is closest? Set answer to the pair name
let answer = "Archaea-Eukarya"
print(answer)

Exercise: Build a Phylogenetic Tree from Distance Data

Three globin paralogs have diverged at different rates. Compute all pairwise evolutionary distances and use the UPGMA algorithm to build a tree. Which two globins cluster together first (i.e., which pair duplicated most recently)?

let alpha =    "ATGGCTAGCAAAGACTTCACCGAG"
let beta =     "ATGGCTAGTCAAGACTTCGCCGAG"
let myoglobin = "ATGGCAACCGAAGACTTCACTCAG"
let d_ab = Phylo.distance(alpha, beta, "jc69")
let d_am = Phylo.distance(alpha, myoglobin, "jc69")
let d_bm = Phylo.distance(beta, myoglobin, "jc69")
print("Alpha-Beta distance:      " + d_ab)
print("Alpha-Myoglobin distance: " + d_am)
print("Beta-Myoglobin distance:  " + d_bm)
let labels = '["Alpha", "Beta", "Myoglobin"]'
let distances = '[[0, ' + d_ab + ', ' + d_am + '], [' + d_ab + ', 0, ' + d_bm + '], [' + d_am + ', ' + d_bm + ', 0]]'
let tree = Phylo.upgma(labels, distances)
print(tree)
// Which pair clusters first?
let answer = "Alpha-Beta"
print(answer)

Exercise: Interpret a Multiple Sequence Alignment

Align the rRNA fragments from all three domains of life simultaneously. Examine the MSA output and count how many positions are perfectly conserved across all three. Report whether most positions are conserved or variable.

let bacteria = "ATGGATAGCGAAGATTTTACCGAT"
let archaea =  "ATGGCTAGCAAAGACTTCACCGAG"
let eukarya =  "ATGGCAAGCAAAGACTTCACCGAG"
let msa = Seq.msa(bacteria, archaea, eukarya)
print(msa)
// Are most positions conserved or variable?
let answer = "conserved"
print(answer)

Summary

In this lesson you covered the diversity of genomes and the computational tools for comparing them:

Diverse energy sources power cells — phototrophy, chemolithotrophy, and chemoorganotrophy — with prokaryotes showing by far the greatest metabolic diversity
Nitrogen-fixing and carbon-fixing organisms make essential elements available to all other life
The tree of life has three domains — Bacteria, Archaea, and Eukarya — revealed by rRNA sequence comparison, with Archaea and Eukarya as sister groups
Gene conservation varies — some genes (rRNA, core enzymes) are nearly identical across all life; others evolve rapidly under selection pressure
Prokaryotic genomes typically encode 1,000–6,000 genes, with genome size correlating with lifestyle complexity
Gene duplication and divergence creates gene families; more than 200 gene families are universal to all three domains
Horizontal gene transfer moves genes between unrelated organisms via transformation, transduction, and conjugation — driving antibiotic resistance and complicating the tree of life
Sexual recombination shuffles genetic variation within species through crossing over and independent assortment
Sequence similarity predicts function — homologous genes typically perform similar roles across organisms
Orthologs (from speciation) and paralogs (from duplication) are the two classes of homologous genes
BLAST searches databases for sequence matches using a seed-and-extend heuristic, with E-values indicating statistical significance
Multiple sequence alignment (ClustalW, MUSCLE, MAFFT) reveals conserved positions across gene families
Phylogenetic trees are built by neighbor-joining, maximum parsimony, maximum likelihood, or Bayesian methods, and stored in Newick format
Genome databases and browsers (NCBI, Ensembl, UCSC) provide the infrastructure for comparative genomics

References

Alberts B, Johnson A, Lewis J, Morgan D, Raff M, Roberts K, Walter P. Molecular Biology of the Cell, 7th ed. New York: W.W. Norton; 2022. Chapter 1: Cells and Genomes.
Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci USA. 1977;74(11):5088–5090.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–410.
Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–425.
Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22(22):4673–4680.
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–780.
Ochman H, Lawrence JG, Groisman EA. Lateral gene transfer and the nature of bacterial innovation. Nature. 2000;405(6784):299–304.
Zuckerkandl E, Pauling L. Molecules as documents of evolutionary history. J Theor Biol. 1965;8(2):357–366.
