How Genomes Evolve

Intermediate Molecular Biology ~35 min

Learn how mutations, duplications, rearrangements, and selection drive genome evolution — and the computational methods for detecting these forces through whole-genome comparison.

Introduction

Genomes are not static. Every generation, DNA replication introduces a small number of errors, recombination shuffles existing variation, and transposable elements rearrange genomic architecture. Over evolutionary time, these changes accumulate and diverge between lineages, providing both the raw material for adaptation and a molecular record of evolutionary history.

This lesson explores the forces that shape genome evolution — from point mutations and gene duplications to whole-chromosome rearrangements — and introduces the computational methods used to compare genomes, detect selection, and reconstruct the history of life at the DNA sequence level.

Conservation Reveals Function

The most fundamental principle of comparative genomics is that functional DNA sequences are conserved through evolution. Mutations that disrupt an important function reduce the organism’s fitness and are eliminated by purifying (negative) selection. Regions under purifying selection therefore change much more slowly than neutral sequences (those with no function).

By aligning genomes from multiple species, we can identify conserved sequences — and these are likely to be functionally important. This approach has been spectacularly successful: comparison of the human genome with those of other mammals identifies roughly 5% of the human genome as being under evolutionary constraint, far more than the ~1.5% that encodes proteins. The remaining constrained sequence includes regulatory elements, structural RNAs, and sequences of unknown function.

let species_a = "ATGGCTAGCAAAGACTTCACC"
let species_b = "ATGGCTAGCAAAGACTTCGCC"
let dist = Seq.hamming(species_a, species_b)
print("Hamming distance: " + dist)

A single-base difference (Hamming distance of 1) between these orthologous sequences is the pattern strong purifying selection produces: the gene has barely changed despite millions of years of divergence. Hamming distance is a quick measure of sequence divergence, but it applies only to sequences of equal length that are already aligned position by position.
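The calculation above can be sketched in plain Python, independent of the lesson's built-in `Seq.hamming`. The function name `hamming` and the derived `p_distance` (differences per site) are illustrative choices, not part of the lesson's library.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


species_a = "ATGGCTAGCAAAGACTTCACC"
species_b = "ATGGCTAGCAAAGACTTCGCC"

dist = hamming(species_a, species_b)
# Dividing by the length gives the proportion of differing sites (p-distance).
p_distance = dist / len(species_a)
print("Hamming distance:", dist)
print("Differences per site:", round(p_distance, 3))
```

Normalizing by length makes distances comparable between alignments of different sizes, which the raw count is not.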

Genome Alterations: Mutations, Repair Failures, and Transposons

Genome change arises from three broad categories of events:

Failures of DNA replication and repair introduce point mutations (single nucleotide changes), small insertions and deletions (indels), and larger rearrangements when replication forks stall or repair pathways make errors. The human germline mutation rate is approximately 1–1.5 × 10⁻⁸ mutations per base pair per generation, amounting to roughly 60–80 new mutations per individual.

Chromosomal rearrangements reshuffle genome architecture through:

Inversions — a segment flips orientation within a chromosome
Translocations — a segment moves to a different chromosome
Fusions and fissions — chromosomes join or split (human chromosome 2 is a fusion of two ancestral ape chromosomes)
Segmental duplications — large blocks (>1 kb) are copied to new locations

Transposable elements are sequences that copy or move within the genome, creating insertions, deletions, and rearrangements. About 45% of the human genome consists of recognizable TE-derived sequences. While most are now inactive fossils, some LINE-1 elements remain active and continue to generate new insertions. TEs can also serve as a creative evolutionary force: TE insertions near genes can donate new regulatory elements, and some protein-coding genes have been derived from “domesticated” transposons.

Sequence Divergence Tracks Evolutionary Time

The genome sequences of two species differ roughly in proportion to the time since they diverged from a common ancestor. This approximate proportionality is the basis of the molecular clock: by calibrating sequence divergence against fossil dates, biologists can estimate divergence times for lineages without a fossil record.

For a given gene, the rate of sequence change depends on functional constraint:

Highly constrained genes (rRNA, histones) change very slowly
Moderately constrained genes (most enzymes) change at intermediate rates
Pseudogenes and neutral intergenic DNA change at the highest rate (the neutral substitution rate)

The neutral substitution rate provides the clock’s tick rate — it approximates the underlying mutation rate, because neutral mutations are fixed by drift at the same rate they arise.
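The reasoning behind that statement can be written out as a short calculation. In a diploid population of size N, about 2Nμ new neutral mutations arise per site per generation, and each has a fixation probability of 1/(2N), so the population size cancels. The sketch below is a minimal illustration under those textbook assumptions; the function names and the numbers in the example are hypothetical, chosen only to show the arithmetic.

```python
def neutral_substitution_rate(pop_size: int, mu: float) -> float:
    """Rate at which neutral mutations fix, per site per generation."""
    new_mutations = 2 * pop_size * mu          # supply of new neutral mutations
    fixation_probability = 1 / (2 * pop_size)  # chance that any one of them drifts to fixation
    return new_mutations * fixation_probability


def divergence_time(k_per_site: float, rate_per_site_per_year: float) -> float:
    """Both lineages accumulate changes, so K = 2 * rate * T and T = K / (2 * rate)."""
    return k_per_site / (2 * rate_per_site_per_year)


# Hypothetical values for illustration only.
for n in (1_000, 1_000_000):
    print(n, neutral_substitution_rate(n, 1e-8))  # the same rate regardless of N
print(divergence_time(0.012, 1e-9), "years")
```

The factor of 2 in `divergence_time` reflects that the observed divergence between two species sums the changes along both branches back to their common ancestor.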

We can observe the molecular clock in action by simulating sequence evolution. Starting from an ancestral sequence, mutations accumulate over generations at a rate we control, producing a diverged descendant:

let ancestor = "ATGGCTAGCAAAGACTTCACCGAG"
let descendant = Phylo.simulate(ancestor, 500, 0.002, 42)
print("Ancestral sequence:  " + ancestor)
print("After 500 generations: " + descendant)
let dist = Seq.hamming(ancestor, descendant)
print("Observed differences: " + dist)
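For readers who want to see what such a simulation might do internally, here is a minimal Python sketch. It assumes the rate is a per-site, per-generation probability of substitution and that every other base is equally likely as the replacement; whether `Phylo.simulate` works this way is not stated in the lesson, so treat `simulate` below as a hypothetical stand-in.

```python
import random


def simulate(ancestor: str, generations: int, rate: float, seed: int):
    """Return (descendant, number of mutation events) under a simple uniform model."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    seq = list(ancestor)
    events = 0
    for _ in range(generations):
        for i, base in enumerate(seq):
            if rng.random() < rate:
                # Replace the base with one of the three other bases.
                seq[i] = rng.choice([b for b in "ACGT" if b != base])
                events += 1
    return "".join(seq), events


ancestor = "ATGGCTAGCAAAGACTTCACCGAG"
descendant, events = simulate(ancestor, 500, 0.002, 42)
observed = sum(a != d for a, d in zip(ancestor, descendant))
print("Ancestral sequence: ", ancestor)
print("Descendant sequence:", descendant)
print("Mutation events:", events, "| observed differences:", observed)
```

Under this model the observed differences can never exceed the number of mutation events, and they fall increasingly short of it as sites are hit more than once. That saturation is why clock analyses correct raw distances with a substitution model.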

Synonymous vs. Nonsynonymous Substitutions

Because the genetic code is degenerate, some mutations change the DNA but not the protein:

Synonymous (silent) substitutions change a codon but not the amino acid (e.g., GCT → GCC, both alanine)
Nonsynonymous substitutions change the amino acid (e.g., GCT [Ala] → GAT [Asp])

The ratio of nonsynonymous to synonymous substitution rates, dN/dS (also called ω or Ka/Ks), is one of the most informative statistics in evolutionary genomics:

dN/dS (ω)	Interpretation
ω < 1	Purifying selection — amino acid changes are harmful (most genes)
ω ≈ 1	Neutral evolution — no constraint on amino acid changes (pseudogenes)
ω > 1	Positive selection — amino acid changes are favored (adaptive evolution)

let gene_v1 = "ATGGCTAGCAAAGACTGA"
let gene_v2 = "ATGGCCAGCAAGGACTGA"
let prot_v1 = Seq.translate(gene_v1)
let prot_v2 = Seq.translate(gene_v2)
print("Gene v1 protein: " + prot_v1)
print("Gene v2 protein: " + prot_v2)

Both genes encode the same protein despite two DNA differences — these are synonymous substitutions, accumulating at the neutral rate while purifying selection prevents amino acid changes.
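The classification of each codon change can be sketched directly in Python using the standard genetic code. This only counts differing codons as synonymous or nonsynonymous; a real dN/dS estimate also normalizes by the number of synonymous and nonsynonymous sites and corrects for multiple hits, which this sketch does not attempt. The function name `classify_codon_changes` is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_changes(gene1: str, gene2: str):
    """Return (synonymous, nonsynonymous) counts of codons that differ."""
    if len(gene1) != len(gene2) or len(gene1) % 3 != 0:
        raise ValueError("need aligned in-frame sequences of equal length")
    syn = nonsyn = 0
    for i in range(0, len(gene1), 3):
        c1, c2 = gene1[i:i + 3], gene2[i:i + 3]
        if c1 == c2:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1      # same amino acid: silent change
        else:
            nonsyn += 1   # different amino acid
    return syn, nonsyn


gene_v1 = "ATGGCTAGCAAAGACTGA"
gene_v2 = "ATGGCCAGCAAGGACTGA"
print(classify_codon_changes(gene_v1, gene_v2))  # (synonymous, nonsynonymous)
```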

The Tree of Life from Genome Comparisons

Phylogenetic trees can be constructed from single genes, but whole-genome comparisons provide far more robust evolutionary histories by sampling hundreds or thousands of gene families simultaneously. Concatenated gene alignments, coalescent-based methods (ASTRAL), and whole-genome distance methods all contribute to building the tree of life.

Genome-scale comparisons have helped resolve many long-standing evolutionary questions: the placement of turtles as the sister group of archosaurs (closer to birds and crocodilians than to lizards), the branching order of early animal phyla, and the relationships among the major groups of bacteria and archaea.

Human and Mouse: How Genome Structures Diverge

The human and mouse genomes provide a detailed case study in genome evolution. Both contain approximately 20,000 protein-coding genes, and about 80% of human genes have a clear ortholog in mouse. Overall, about 40% of the human genome can be aligned to the mouse genome at the nucleotide level (sequence identity is much higher in protein-coding regions).

However, the genomes have been extensively rearranged. About 300 syntenic blocks (regions where gene order is conserved) can be identified, but these blocks are reshuffled across chromosomes. Human chromosome 1, for example, corresponds to segments of mouse chromosomes 1, 3, 4, and 6. These rearrangements — primarily inversions and translocations — disrupted gene order without disrupting gene function.

The sizes of the two genomes also differ: the human genome (~3.2 Gb) is about 14% larger than the mouse genome (~2.8 Gb). This reflects different rates of DNA addition (transposon expansion, segmental duplication) and DNA loss (deletion) in the two lineages. In vertebrates generally, genome size reflects the balance between these opposing forces, not the number of genes.

Ancient Genomes and Paleogenomics

Advances in ancient DNA extraction and sequencing have enabled researchers to reconstruct the sequences of extinct organisms and ancient populations:

The Neanderthal genome revealed that modern non-African humans carry 1–4% Neanderthal DNA, evidence of ancient interbreeding
The Denisovan genome (from a finger bone) revealed another archaic human lineage that interbred with ancestors of present-day Melanesians and some Asian populations
Ancient DNA from preserved remains has been used to track the spread of agriculture, map past migrations, and study the evolution of pathogens

These studies have transformed our understanding of human evolution, demonstrating that the human lineage is not a simple branching tree but a reticulate network involving multiple episodes of gene flow between diverging populations.

Multispecies Comparisons Identify Conserved Elements of Unknown Function

Aligning the genomes of multiple mammalian species has identified hundreds of thousands of conserved non-coding elements — sequences under purifying selection that do not encode proteins. Many of these correspond to known regulatory elements (enhancers, silencers, non-coding RNAs), but a substantial fraction has no known function.

The ENCODE project has characterized many of these elements using functional assays (ChIP-seq, ATAC-seq, reporter assays), but the function of many conserved non-coding elements remains mysterious. They represent some of the most fascinating open questions in genome biology: why has evolution preserved these sequences if they appear to have no detectable biochemical activity?

Human Accelerated Regions

While most of the genome changes slowly in the human lineage, a small number of sequences that were highly conserved across mammalian evolution have changed rapidly in the human lineage. These human accelerated regions (HARs) are candidates for sequences that contributed to uniquely human traits.

The most dramatic example, HAR1, is a 118-bp region of a non-coding RNA gene expressed in the developing brain. It acquired 18 substitutions in the human lineage, compared with just 2 changes in the entire lineage from chicken to chimpanzee. Other HARs are enriched near genes involved in brain development, limb morphology, and other traits that distinguish humans from other primates.

HARs are identified computationally by comparing substitution rates in the human lineage to background rates estimated from multispecies alignments. The excess of substitutions is tested statistically to distinguish genuine acceleration from neutral rate variation.
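A deliberately simplified version of that statistical test treats the number of substitutions in an element as Poisson-distributed around the count expected from the background rate, then asks how surprising the observed count is. Published HAR scans use likelihood-ratio tests on phylogenetic models, so the Poisson tail below is a stand-in for the idea, not the actual method. The expected count of 0.3 is a hypothetical value chosen for illustration.

```python
import math


def poisson_upper_tail(observed: int, expected: float, terms: int = 500) -> float:
    """P(X >= observed) for X ~ Poisson(expected), summed term by term in log space."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    total = sum(
        math.exp(-expected + k * math.log(expected) - math.lgamma(k + 1))
        for k in range(observed, observed + terms)
    )
    return min(1.0, total)


observed_substitutions = 18   # human-lineage count reported for HAR1
expected_substitutions = 0.3  # hypothetical background expectation
p_value = poisson_upper_tail(observed_substitutions, expected_substitutions)
print("P(X >= 18 | background):", p_value)
```

Summing the upper tail directly, and not computing one minus the lower tail, avoids losing very small probabilities to floating-point cancellation.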

Whole-Genome Alignment Tools

Comparing entire genomes requires specialized alignment tools:

MUMmer — uses maximal unique matches (MUMs) for rapid alignment of closely related genomes; excellent for bacterial genome comparisons
Progressive Mauve — aligns genomes that may contain rearrangements (inversions, translocations), identifying locally collinear blocks
minimap2 — an ultrafast aligner used for both long-read mapping and genome-to-genome alignment
LASTZ and MULTIZ — used by UCSC Genome Browser for pairwise and multiple genome alignments across vertebrates

Synteny Analysis and Visualization

Synteny analysis identifies regions where gene order is conserved between species, revealing the rearrangements that have occurred since divergence:

SynMap (CoGe) — an interactive tool for comparing gene order between any two genomes, identifying syntenic blocks and visualizing rearrangements
Circos — produces circular layout diagrams that beautifully display chromosomal correspondences and rearrangements between genomes

Synteny analysis is invaluable for genome assembly validation (if gene order disagrees with a closely related species, the assembly may have an error), for studying the evolution of genome architecture, and for transferring gene annotations from a well-studied genome to a newly sequenced one.
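The core idea of finding syntenic blocks can be sketched as a search for runs of genes whose orthologs are adjacent, in the same or reversed order, on one chromosome of the other species. The sketch below is a toy: the function `syntenic_blocks`, the `min_genes` parameter, and the example positions are all hypothetical, and real tools tolerate gaps, missing genes, and small local rearrangements.

```python
def syntenic_blocks(positions, min_genes=2):
    """Find collinear runs.

    positions: (chromosome_b, gene_index_b) of each ortholog, listed in
    species-A gene order. Returns (start, end, chromosome_b, strand) tuples,
    where start and end are indexes into the species-A order.
    """
    blocks = []
    start, direction = 0, 0
    for i in range(1, len(positions) + 1):
        extend = False
        if i < len(positions):
            (c_prev, p_prev), (c_cur, p_cur) = positions[i - 1], positions[i]
            step = p_cur - p_prev
            # Same chromosome, adjacent gene, and consistent orientation.
            if c_cur == c_prev and step in (1, -1) and direction in (0, step):
                direction, extend = step, True
        if not extend:
            if i - start >= min_genes:
                strand = "+" if direction >= 0 else "-"
                blocks.append((start, i - 1, positions[start][0], strand))
            start, direction = i, 0
    return blocks


# Toy data: three genes collinear on chr1, three inverted on chr3, one singleton.
orthologs = [("chr1", 10), ("chr1", 11), ("chr1", 12),
             ("chr3", 7), ("chr3", 6), ("chr3", 5),
             ("chr4", 1)]
for block in syntenic_blocks(orthologs):
    print(block)
```

A block reported with strand "-" corresponds to an inversion relative to species A, and a change of chromosome between consecutive blocks marks a translocation or fusion/fission breakpoint.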

dN/dS Analysis and Selection Detection

Computational detection of selection relies on comparing substitution patterns:

dN/dS analysis compares the rates of nonsynonymous and synonymous substitutions across a phylogeny. Tools like PAML (Phylogenetic Analysis by Maximum Likelihood) and HyPhy implement sophisticated codon substitution models that can detect:

Gene-wide selection — is the average ω across the gene significantly different from 1?
Branch-specific selection — did selection pressure change on a particular lineage?
Site-specific selection — which individual codons are under positive selection?

Molecular clock estimation uses calibrated phylogenies to date divergence events. Programs like BEAST and r8s implement relaxed molecular clock models that allow the rate of evolution to vary across lineages, producing more accurate divergence time estimates.

Ancestral genome reconstruction uses parsimony or maximum likelihood methods to infer the gene content and gene order of ancestral genomes. This has allowed researchers to reconstruct the genomes of hypothetical ancestors — such as the ancestral boreoeutherian mammal — and trace the rearrangements that occurred in each descendant lineage.

let alpha_globin = "ATGGTGCTGTCTCCTGCCGAC"
let beta_globin = "ATGGTGCATCTGACTCCTGAG"
let gamma_globin = "ATGGTGCATCTGACTCCTGAA"
let msa = Seq.msa([alpha_globin, beta_globin, gamma_globin])
print("Globin paralog family alignment:")
print(msa)

Multiple sequence alignment of paralog family members reveals both the conservation of core functional residues and the divergence that enabled specialization. The α-, β-, and γ-globins arose by ancient gene duplications and have since diverged to serve distinct roles in oxygen transport.

let alpha = "ATGGTGCTGTCTCCTGCCGAC"
let beta = "ATGGTGCATCTGACTCCTGAG"
let gamma = "ATGGTGCATCTGACTCCTGAA"
let delta = "ATGGTGCATCTGACTCCTGAT"
let dist = Phylo.distance([alpha, beta, gamma, delta])
let tree = Phylo.nj(dist)
print("Globin gene family tree:")
print(tree)

Neighbor-joining tree construction from pairwise distances reveals the branching history of gene family evolution. The β-, γ-, and δ-globins cluster together, reflecting their more recent duplication, while α-globin diverged earlier.
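To make the procedure concrete, here is a compact Python sketch of neighbor joining on a matrix of per-site differences. It is an illustrative implementation of the standard algorithm, not the code behind `Phylo.nj`; the helper names and the Newick output format are choices made for this example. With the four sequences above, β, γ, and δ are mutually equidistant, so the join order among them is decided by tie-breaking and not by the data.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def neighbor_joining(names, dist):
    """names: labels; dist: {frozenset({a, b}): distance}. Returns a Newick string."""
    nodes, d = list(names), dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) for i in nodes}
        # Join the pair that minimizes Q = (n - 2) * d_ij - r_i - r_j.
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dab = d[frozenset((a, b))]
        len_a = dab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        new = f"({a}:{len_a:.4f},{b}:{len_b:.4f})"
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((new, k))] = (d[frozenset((a, k))] + d[frozenset((b, k))] - dab) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [new]
    a, b = nodes
    return f"({a}:{d[frozenset((a, b))]:.4f},{b});"


seqs = {
    "alpha": "ATGGTGCTGTCTCCTGCCGAC",
    "beta": "ATGGTGCATCTGACTCCTGAG",
    "gamma": "ATGGTGCATCTGACTCCTGAA",
    "delta": "ATGGTGCATCTGACTCCTGAT",
}
matrix = {frozenset((x, y)): p_distance(seqs[x], seqs[y])
          for x, y in combinations(seqs, 2)}
print(neighbor_joining(list(seqs), matrix))
```

Neighbor joining produces an unrooted tree. Reading "α diverged earlier" from it relies on the long α branch or on an outgroup, not on the join order alone.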

let pop = PopGen.wright_fisher(100, 0.1, 200)
print("Wright-Fisher drift simulation:")
print("Population: 100, initial freq: 0.1, generations: 200")
print(pop)

The Wright-Fisher model simulates allele frequency change due to genetic drift alone. In finite populations, neutral alleles fluctuate randomly and may be lost or fixed entirely by chance. Smaller populations experience stronger drift, making it harder for weak selection to override stochastic effects.
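The model itself is short enough to write out. The sketch below assumes a diploid population, so each generation draws 2N gene copies from the previous generation's allele frequency. How `PopGen.wright_fisher` interprets its population-size argument is not documented here, so the function below is a hypothetical equivalent and its `seed` parameter is an addition for reproducibility.

```python
import random


def wright_fisher(pop_size: int, initial_freq: float, generations: int, seed: int = 0):
    """Return the allele-frequency trajectory under drift alone."""
    rng = random.Random(seed)
    copies = 2 * pop_size  # diploid: two gene copies per individual
    freq = initial_freq
    trajectory = [freq]
    for _ in range(generations):
        # Binomial sampling: each copy in the next generation carries the
        # allele with probability equal to its current frequency.
        count = sum(rng.random() < freq for _ in range(copies))
        freq = count / copies
        trajectory.append(freq)
    return trajectory


small = wright_fisher(50, 0.5, 100, seed=1)
large = wright_fisher(500, 0.5, 100, seed=1)
print("N=50  final frequency:", small[-1])
print("N=500 final frequency:", large[-1])
```

Once the frequency reaches 0 or 1 it stays there, because sampling from a population with no variation cannot reintroduce the lost allele. Mutation or migration would be needed for that.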

Exercise: Build a Phylogenetic Tree from Gene Sequences

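Compute pairwise distances for four homologous sequences and build a neighbor-joining tree. Determine which pair of sequences is most closely related based on the branching pattern. Check the distance matrix as well: Seq C differs from both Seq A and Seq D at a single site, so the matrix alone does not single out one closest pair.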

let seq_a = "ATGGCTAGCAAAGACTTCACC"
let seq_b = "ATGCATACCGAAGATTTCGCC"
let seq_c = "ATGGCTAGCAAAGACTTCGCC"
let seq_d = "ATGGCTAGCAAAGACTTCGAC"
let dist = Phylo.distance([seq_a, seq_b, seq_c, seq_d])
print("Distance matrix:")
print(dist)
let tree = Phylo.nj(dist)
print("Neighbor-joining tree:")
print(tree)
let answer = "Seq C and Seq D"
print(answer)

Exercise: Simulate Genetic Drift

Run Wright-Fisher simulations with two different population sizes to observe how population size affects the strength of genetic drift. Determine which population shows more stable allele frequencies.

let small_pop = PopGen.wright_fisher(50, 0.5, 100)
let large_pop = PopGen.wright_fisher(500, 0.5, 100)
print("Small population (N=50) drift:")
print(small_pop)
print("Large population (N=500) drift:")
print(large_pop)
let answer = "Large population"
print(answer)

Exercise: Compare Paralog Divergence

Align a set of paralogs using multiple sequence alignment and use Hamming distance to determine which pair has diverged the most since duplication. Greater divergence suggests an older duplication event or relaxed constraint.

let alpha = "ATGGTGCTGTCTCCTGCCGAC"
let beta = "ATGCATACCGAAGATTTCGCC"
let gamma = "ATGGTGCATCTGACTCCTGAG"
let dist_ab = Seq.hamming(alpha, beta)
let dist_ag = Seq.hamming(alpha, gamma)
let dist_bg = Seq.hamming(beta, gamma)
print("alpha vs beta: " + dist_ab)
print("alpha vs gamma: " + dist_ag)
print("beta vs gamma: " + dist_bg)
let msa = Seq.msa([alpha, beta, gamma])
print("Paralog MSA:")
print(msa)
let answer = "beta vs gamma"
print(answer)

Summary

In this lesson you covered the mechanisms and analysis of genome evolution:

Conservation reveals function — ~5% of the human genome is under evolutionary constraint, far exceeding the ~1.5% that encodes protein
Genome alterations arise from replication errors, repair failures, transposon activity, and chromosomal rearrangements
Sequence divergence tracks time — the molecular clock relates substitution rates to divergence dates
dN/dS analysis distinguishes purifying selection (ω < 1), neutral evolution (ω ≈ 1), and positive selection (ω > 1)
Whole-genome comparisons build robust phylogenies and reveal ~300 syntenic blocks between human and mouse
Genome size reflects the balance of DNA addition and loss, not gene number
Ancient genomes (Neanderthal, Denisovan) reveal past admixture; 1–4% of non-African human DNA is Neanderthal-derived
Conserved non-coding elements of unknown function represent open questions in genome biology
Human accelerated regions (HARs) are candidates for sequences driving uniquely human traits
Whole-genome aligners (MUMmer, Mauve, minimap2) handle genome-scale comparisons
Synteny visualization (SynMap, Circos) displays conserved gene order and rearrangements
dN/dS tools (PAML, HyPhy) detect selection at gene, branch, and site levels; BEAST estimates divergence times

References

Alberts B, Johnson A, Lewis J, Morgan D, Raff M, Roberts K, Walter P. Molecular Biology of the Cell, 7th ed. New York: W.W. Norton; 2022. Chapter 4: DNA, Chromosomes, and Genomes.
Ohno S. Evolution by Gene Duplication. Berlin: Springer-Verlag; 1970.
Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290(5494):1151–1155.
Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–1591.
Pond SLK, Frost SDW, Muse SV. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 2005;21(5):676–679.
Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29(8):1969–1973.
Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:238.
Kurtz S, Phillippy A, Delcher AL, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12.
