DNA Replication

Intermediate Molecular Biology ~35 min


Understand how cells faithfully copy their DNA — from the replication fork machinery and error correction to replication origins, telomeres, and the computational tools for analyzing replication dynamics.

Introduction

Every time a cell divides, it must accurately copy its entire genome — 6.4 billion base pairs in a human diploid cell. This replication must be fast (the entire genome is copied in about 8 hours) and extraordinarily accurate, with a final error rate of only about one mistake per billion base pairs. The machinery that achieves this is one of the most impressive molecular systems in biology.

This lesson examines how the replication fork works, the multiple layers of error correction that ensure fidelity, the coordination of replication across large eukaryotic chromosomes, and the special problem of replicating chromosome ends. We also cover the computational tools for analyzing mutation rates, mutational signatures, and replication dynamics.

Mutation Rates Are Extremely Low

The spontaneous mutation rate in humans is approximately 1–1.5 × 10⁻⁸ per base pair per generation, or roughly 60–80 new mutations per individual. This extraordinarily low rate is achieved through the combined action of accurate base selection by DNA polymerase, proofreading, and post-replication mismatch repair.
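The per-individual figure follows from multiplying the per-base rate by the number of bases copied. The Python sketch below is a back-of-envelope check using only the numbers quoted in this lesson. The function name is illustrative and not part of any library. The simple product gives a range that starts inside the quoted 60–80 figure and runs somewhat above it at the upper rate.

```python
# Back-of-envelope estimate of new mutations per generation.
# Inputs are the figures quoted in the lesson; nothing here is measured data.
DIPLOID_GENOME_BP = 6.4e9               # diploid human genome size (from the text)
RATE_LOW, RATE_HIGH = 1.0e-8, 1.5e-8    # mutations per bp per generation (from the text)


def expected_new_mutations(rate_per_bp, genome_bp=DIPLOID_GENOME_BP):
    """Expected de novo mutations = per-base rate x number of bases copied."""
    return rate_per_bp * genome_bp


low = expected_new_mutations(RATE_LOW)
high = expected_new_mutations(RATE_HIGH)
print(f"Expected new mutations per generation: {low:.0f} to {high:.0f}")
```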

Why must mutation rates be so low? A human genome with 6.4 billion base pairs and ~20,000 genes can tolerate only a small number of mutations per generation before the cumulative mutational load begins to degrade fitness. Low mutation rates are necessary for life as we know it — if the error rate were even 10-fold higher, the number of deleterious mutations per generation would overwhelm natural selection’s ability to eliminate them, leading to a phenomenon called error catastrophe.

Yet the rate cannot be zero. Mutations are the ultimate source of genetic variation, and some level of variation is essential for adaptation. The observed mutation rate represents an evolutionary balance between the cost of errors and the cost of maintaining higher-fidelity replication machinery.

The Replication Fork Is Asymmetrical

DNA replication begins at specific sequences called origins of replication, where the double helix is unwound to form a replication fork — the Y-shaped structure where new DNA is synthesized. E. coli has a single origin; human chromosomes have thousands.

Base-pairing underlies DNA replication: each strand of the parental double helix serves as a template for the synthesis of a new complementary strand, following the Watson-Crick rules (A pairs with T, G pairs with C).

Because DNA polymerase can only synthesize in the 5′→3′ direction, the two template strands at the fork must be handled differently:

The leading strand is synthesized continuously in the same direction as fork movement
The lagging strand is synthesized discontinuously as short Okazaki fragments (100–200 bp in eukaryotes, 1,000–2,000 bp in bacteria), each initiated by a new RNA primer

This asymmetry is a fundamental consequence of the antiparallel structure of the double helix and the unidirectional activity of DNA polymerase.
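Discontinuous synthesis has a cost in priming events, because every Okazaki fragment needs its own RNA primer. The Python sketch below estimates how many fragments are needed to copy the lagging-strand side of a genome, using the fragment lengths given above. The E. coli genome size of about 4.6 Mb is an added assumption, and the calculation treats every base pair as copied once by lagging-strand synthesis on one of the two daughter duplexes.

```python
# Rough count of Okazaki fragments (and therefore RNA primers) per genome copy.
HUMAN_DIPLOID_BP = 6_400_000_000   # from the lesson introduction
ECOLI_BP = 4_600_000               # assumption: approximate E. coli genome size


def okazaki_fragment_range(genome_bp, min_fragment_bp, max_fragment_bp):
    """Return (fewest, most) fragments needed to cover genome_bp.

    Longer fragments mean fewer fragments, so the bounds are swapped.
    """
    fewest = genome_bp // max_fragment_bp
    most = genome_bp // min_fragment_bp
    return fewest, most


human = okazaki_fragment_range(HUMAN_DIPLOID_BP, 100, 200)
ecoli = okazaki_fragment_range(ECOLI_BP, 1_000, 2_000)
print(f"Human: {human[0]:,} to {human[1]:,} fragments")
print(f"E. coli: {ecoli[0]:,} to {ecoli[1]:,} fragments")
```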

Only synthesis in the 5′→3′ direction allows efficient error correction: the 3′→5′ exonuclease (proofreading) activity of DNA polymerase can remove a mismatched nucleotide from the growing 3′ end and replace it with the correct one. If synthesis occurred 3′→5′, the energy-rich triphosphate would be at the growing end and removal of a mismatch would destroy the energy source needed for the next addition.

let template = "TACGACTGCAGTACGATCG"
let new_strand = Seq.complement(template)
print("Template (3'→5'): " + template)
print("New strand (5'→3'): " + new_strand)
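For readers working outside this lesson's environment, the same operation can be sketched in plain Python. This is an illustrative equivalent, not the implementation behind Seq.complement. Note the orientation: because the template is written 3'→5', its base-by-base complement already reads 5'→3'. A reverse complement is only needed when both strands are written 5'→3'.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Base-by-base complement; the output runs antiparallel to the input."""
    return seq.upper().translate(COMPLEMENT_TABLE)


def reverse_complement(seq):
    """Complement, then reverse, so input and output are both read 5'->3'."""
    return complement(seq)[::-1]


template = "TACGACTGCAGTACGATCG"   # written 3'->5', as in the lesson
new_strand = complement(template)  # therefore reads 5'->3'
print("Template (3'->5'):   " + template)
print("New strand (5'->3'): " + new_strand)
```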

Detecting AT-Rich Replication Origins

Origins of replication in bacteria (and many eukaryotic origins) are enriched in AT base pairs because A-T pairs have only two hydrogen bonds, making the helix easier to unwind. We can scan a sequence for AT-rich regions by counting short k-mers.

let origin_region = "AATATTGATCTCTTATTAGGATCATTTATTAAATAAT"
let flanking_region = "GCGCCTGCAGGCGTACCGCGGCTAGCGCCGG"
print("Origin-region k-mer profile (k=2):")
print(Seq.kmer_count(origin_region, 2))
print("Flanking-region k-mer profile (k=2):")
print(Seq.kmer_count(flanking_region, 2))

The origin region shows high counts of AA, AT, TA, and TT dinucleotides, reflecting the AT-richness that facilitates strand separation during replication initiation.
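A dinucleotide profile can be condensed into a single AT fraction, and a sliding window shows where along a sequence the AT-rich stretch lies. The Python sketch below illustrates that idea with the two sequences above. The window size and function names are arbitrary choices made for this example.

```python
def at_fraction(seq):
    """Fraction of bases in seq that are A or T."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


def sliding_at(seq, window=10, step=1):
    """AT fraction in each window, returned as (start_index, fraction) pairs."""
    return [
        (start, at_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


origin_region = "AATATTGATCTCTTATTAGGATCATTTATTAAATAAT"
flanking_region = "GCGCCTGCAGGCGTACCGCGGCTAGCGCCGG"

print(f"Origin AT fraction:   {at_fraction(origin_region):.2f}")
print(f"Flanking AT fraction: {at_fraction(flanking_region):.2f}")

# The window with the highest AT fraction is the easiest stretch to unwind.
best_start, best_at = max(sliding_at(origin_region), key=lambda pair: pair[1])
print(f"Most AT-rich 10-mer starts at index {best_start} (AT = {best_at:.2f})")
```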

The Replication Machinery

The replication fork is serviced by a team of proteins working together as a replication machine:

Protein	Function
Helicase (DnaB/MCM)	Unwinds the double helix ahead of the fork
Single-strand binding proteins (SSB/RPA)	Stabilize exposed single-stranded DNA
Primase (DnaG)	Synthesizes short RNA primers to initiate each Okazaki fragment
DNA polymerase (Pol III/Pol ε, Pol δ)	Main replicative polymerase; synthesizes new DNA 5′→3′
Sliding clamp (β-clamp/PCNA)	Ring-shaped protein that tethers polymerase to DNA for processive synthesis
Clamp loader (γ complex/RFC)	Uses ATP to load the sliding clamp onto primed DNA
DNA ligase	Seals the nicks between Okazaki fragments after RNA primers are replaced
Topoisomerase (gyrase/Topo I, II)	Relieves torsional stress (supercoiling) ahead of the fork

These proteins cooperate as a coordinated machine. The leading- and lagging-strand polymerases are physically linked, allowing the lagging strand to loop back so that both strands are synthesized in the same direction at the fork. This trombone model explains how the replication machinery moves in one direction while synthesizing both strands.

Proofreading and Error Correction

DNA replication achieves extraordinary fidelity through three layers of error control. Base selection sets the initial error rate, and each later layer lowers it by a further factor of roughly 100–1,000:

Layer	Mechanism	Error rate
Base selection	DNA polymerase preferentially inserts the correct nucleotide	~10⁻⁵ (1 in 100,000)
Proofreading	3′→5′ exonuclease removes mismatches immediately	~10⁻⁷ (1 in 10 million)
Mismatch repair	Post-replication MMR system scans and corrects remaining errors	~10⁻⁹ (1 in 1 billion)
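The rates in the table compound multiplicatively: each layer only sees the errors that the previous layer let through. The short Python sketch below reproduces the table's cumulative rates, assuming a 100-fold improvement at each correction step, and converts the final rate into an expected number of errors per copy of a 6.4-billion-bp genome.

```python
# Cumulative error rate after each fidelity layer (values from the table).
BASE_SELECTION_ERROR = 1e-5     # errors per base after base selection alone
PROOFREADING_FACTOR = 100       # assumed fold-improvement from proofreading
MISMATCH_REPAIR_FACTOR = 100    # assumed fold-improvement from mismatch repair
DIPLOID_GENOME_BP = 6.4e9       # from the lesson introduction

after_proofreading = BASE_SELECTION_ERROR / PROOFREADING_FACTOR
after_mismatch_repair = after_proofreading / MISMATCH_REPAIR_FACTOR
errors_per_genome_copy = after_mismatch_repair * DIPLOID_GENOME_BP

print(f"After base selection:  {BASE_SELECTION_ERROR:.0e} per bp")
print(f"After proofreading:    {after_proofreading:.0e} per bp")
print(f"After mismatch repair: {after_mismatch_repair:.0e} per bp")
print(f"Expected errors per genome copy: {errors_per_genome_copy:.1f}")
```

Without mismatch repair, the same arithmetic gives several hundred errors per genome copy instead of a handful.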

The strand-directed mismatch repair system is the final error-correction layer, acting after the polymerase’s own proofreading. In E. coli, MutS detects mismatches, MutL coordinates the repair response, and MutH nicks the newly synthesized strand (identified by its transient lack of methylation). The mismatch and surrounding sequence are excised and resynthesized. In eukaryotes, the mechanism is similar but strand discrimination uses a different signal (likely nicks associated with the replication process).

Inherited defects in mismatch repair genes cause Lynch syndrome (hereditary nonpolyposis colorectal cancer): cells that lose mismatch repair have a dramatically elevated mutation rate, which predisposes carriers to cancer.

Replication Origins and Timing

Bacterial chromosomes typically have a single origin of replication (oriC in E. coli), from which two replication forks proceed bidirectionally around the circular chromosome until they meet at the terminus.

Eukaryotic chromosomes are far too large for a single origin. Instead, replication initiates at thousands of origins distributed along each chromosome. A large multisubunit complex, the pre-replication complex (pre-RC), assembles at each origin during G1 phase, “licensing” it for replication. The complex includes the ORC (Origin Recognition Complex), Cdc6, Cdt1, and the MCM helicase. Each licensed origin fires at most once per cell cycle (many remain dormant and are replicated passively by forks from neighboring origins), ensuring that no region is replicated more than once.
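A quick calculation shows why a single origin cannot work for a eukaryotic chromosome. The Python sketch below uses a hypothetical 150 Mb chromosome and an assumed fork speed of 1.5 kb/min, and takes the roughly 8-hour replication time from the introduction. It assumes each origin launches two forks and that all origins fire at the start of S phase, so the result is a lower bound on the number of origins needed.

```python
import math

CHROMOSOME_BP = 150_000_000     # hypothetical chromosome size
FORK_SPEED_BP_PER_MIN = 1_500   # assumed fork speed (1.5 kb/min)
S_PHASE_MIN = 8 * 60            # ~8 hours, from the lesson introduction


def single_origin_minutes(length_bp, fork_speed):
    """Time to copy length_bp from one origin with two diverging forks."""
    return length_bp / (2 * fork_speed)


def min_origins(length_bp, fork_speed, available_min):
    """Fewest evenly spaced origins that finish within available_min."""
    bp_per_origin = 2 * fork_speed * available_min
    return math.ceil(length_bp / bp_per_origin)


minutes = single_origin_minutes(CHROMOSOME_BP, FORK_SPEED_BP_PER_MIN)
origins = min_origins(CHROMOSOME_BP, FORK_SPEED_BP_PER_MIN, S_PHASE_MIN)
print(f"Single origin: {minutes / 60 / 24:.1f} days")
print(f"Minimum origins to finish in 8 h: {origins}")
```

Real chromosomes use many more origins than this minimum, since origins fire at different times during S phase.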

DNA replication in eukaryotes takes place during S phase only, not throughout the cell cycle. The licensing system ensures that origins cannot re-fire until the cell passes through mitosis and enters the next G1 phase.

Different regions on the same chromosome replicate at distinct times during S phase. Gene-rich, euchromatic regions tend to replicate early, while gene-poor heterochromatin replicates late. The timing of replication is remarkably consistent between cells of the same type and correlates with chromatin state and gene expression.

Behind the replication fork, new nucleosomes are assembled on the daughter strands. Both old histones (recycled from the parental chromatin) and newly synthesized histones are incorporated, and the parental pattern of histone modifications must be restored — a process critical for maintaining epigenetic information through replication.

Features of the human genome that specify origins of replication remain to be fully understood. Unlike bacteria, which have well-defined origin sequences, human replication origins do not share a strong consensus sequence. Origins are associated with open chromatin, CpG islands, and specific histone marks, but the rules governing which sites are selected remain an active area of research.

Telomeres and the End-Replication Problem

Linear chromosomes face a fundamental challenge: the lagging strand cannot be fully replicated at the chromosome end because there is no place to lay down the last RNA primer. Without a solution, chromosomes would shorten by 50–200 bp with each division.

Telomerase solves this. It is a specialized reverse transcriptase that carries its own RNA template and adds TTAGGG repeats (in humans) to the 3′ end of the chromosome. This extension provides a template for conventional lagging-strand synthesis, compensating for the end-replication problem.

Telomerase is active in germ cells, stem cells, and most cancer cells, but is largely silenced in normal somatic cells. The result is progressive telomere shortening with each division in somatic cells — a molecular clock that limits the number of times a cell can divide (the Hayflick limit).
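The molecular clock can be illustrated with a simple countdown. The Python sketch below asks how many divisions a telomere can sustain when it loses a fixed number of base pairs per division, using the 50–200 bp range given above. The starting length (10,000 bp) and the critical length (4,000 bp) are hypothetical round numbers chosen for illustration, not measured values.

```python
def divisions_until_critical(initial_bp, critical_bp, loss_per_division_bp):
    """Count divisions before the telomere would fall below critical_bp.

    Assumes no telomerase activity and a constant loss per division.
    """
    length = initial_bp
    divisions = 0
    while length - loss_per_division_bp >= critical_bp:
        length -= loss_per_division_bp
        divisions += 1
    return divisions


INITIAL_BP = 10_000    # hypothetical starting telomere length
CRITICAL_BP = 4_000    # hypothetical length that triggers senescence

for loss in (50, 100, 200):
    n = divisions_until_critical(INITIAL_BP, CRITICAL_BP, loss)
    print(f"Loss of {loss} bp/division -> {n} divisions")
```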

Telomere length is regulated by a multi-protein complex called shelterin, which caps and protects the telomere from being recognized as a DNA break. When telomeres become critically short, the shelterin complex can no longer fully protect them, triggering a DNA damage response that leads to cell cycle arrest (senescence) or apoptosis.

let telomere_unit = "TTAGGG"
let telomere = telomere_unit + telomere_unit + telomere_unit + telomere_unit
print("Telomere repeat: " + telomere)
print("Length: " + Seq.length(telomere) + " bases")
print("GC content: " + Seq.gc_content(telomere))

Replication Error Rates

We can simulate the effect of replication errors by comparing an original strand to a replicated copy that has accumulated mutations. The Hamming distance counts the number of positions where the two sequences differ — a direct measure of the error count.

let original  = "ATGGCTAGCAAAGACTTCACCGAGTATCCG"
let replicated = "ATGGCTAGCAAAGACTTCACCGAGTATCCG"
let errors = Seq.hamming(original, replicated)
print("Errors after faithful replication: " + errors)
let mutated = "ATGGCTAGCAAAGACTTTACCGAGTTTCCG"
let errors2 = Seq.hamming(original, mutated)
print("Errors after impaired proofreading: " + errors2)

With normal proofreading, the replicated strand is identical (0 errors). When proofreading is impaired, mismatches accumulate — here we see the effect of just two base substitutions.
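The Hamming distance itself takes only a few lines. The Python sketch below is a plain re-implementation for illustration, not the code behind Seq.hamming. It also reports where the mismatches fall and converts the count into a per-base error rate for this 30-base toy sequence.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def mismatch_positions(a, b):
    """List of (1-based position, original base, copied base) per mismatch."""
    return [
        (i, x, y)
        for i, (x, y) in enumerate(zip(a, b), start=1)
        if x != y
    ]


original = "ATGGCTAGCAAAGACTTCACCGAGTATCCG"
mutated = "ATGGCTAGCAAAGACTTTACCGAGTTTCCG"

errors = hamming(original, mutated)
print(f"Mismatches: {errors}")
print(f"Positions:  {mismatch_positions(original, mutated)}")
print(f"Error rate: {errors / len(original):.3f} per base")
```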

Mutation Databases and Mutational Signatures

The consequences of replication errors and DNA damage are cataloged in major databases:

ClinVar (NCBI) — links genetic variants to clinical phenotypes, classifying variants as pathogenic, benign, or of uncertain significance
COSMIC (Catalogue Of Somatic Mutations In Cancer) — the most comprehensive resource for somatic mutations in cancer, with data from millions of tumors
dbSNP — catalogs single nucleotide polymorphisms and small variants in human populations

Mutational signatures are characteristic patterns of mutations left by different mutagenic processes. For example:

Signature SBS1 (aging) — C→T transitions at CpG sites, caused by spontaneous deamination of methylated cytosine
Signature SBS7 (UV light) — C→T mutations at dipyrimidine sites
Signature SBS4 (tobacco) — C→A transversions from benzo[a]pyrene adducts
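Signature analysis starts by sorting substitutions into classes. By convention, single-base substitutions are reported relative to the pyrimidine of the mutated base pair, so that G→A on one strand is counted as C→T on the other, which leaves six classes. The Python sketch below applies that normalization to a made-up list of variants. Full signature analysis also records the flanking bases, giving 96 categories, which this sketch omits.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref, alt):
    """Return the substitution as 'ref>alt' with a pyrimidine reference base."""
    if ref == alt or ref not in COMPLEMENT or alt not in COMPLEMENT:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    if ref in "AG":
        # Purine reference: describe the same event on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


# Hypothetical (ref, alt) pairs, for illustration only.
variants = [("G", "A"), ("C", "T"), ("G", "T"), ("A", "G"), ("T", "C")]
spectrum = Counter(substitution_class(ref, alt) for ref, alt in variants)
print(dict(spectrum))
```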

The transition/transversion ratio (Ti/Tv) is a basic quality metric: in human germline variants from whole-genome data, Ti/Tv is typically ~2.0–2.1. Transitions are more common because they swap a purine for a purine or a pyrimidine for a pyrimidine, which involves less structural change. Deviations suggest either selection or technical artifacts.
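Each base can undergo one transition and two transversions, so if all substitutions were equally likely the ratio would be 0.5. The Python sketch below computes Ti/Tv for a list of (reference, alternate) base pairs. The variant list is made up for illustration, and the function names are not taken from any particular tool.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(variants):
    """Transitions divided by transversions for (ref, alt) base pairs."""
    transitions = sum(is_transition(ref, alt) for ref, alt in variants)
    transversions = len(variants) - transitions
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


# Hypothetical variants: four transitions and two transversions.
variants = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(f"Ti/Tv = {ti_tv_ratio(variants):.2f}")
```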

Analysis of mutational signatures from cancer genomes can reveal the mutagenic processes active in a tumor, guide treatment decisions (e.g., tumors with homologous recombination deficiency respond to PARP inhibitors), and estimate the age of individual mutations.

Replication Origin and Timing Analysis

Computational tools analyze replication dynamics genome-wide:

Ori-Finder and oris databases predict replication origins in bacterial genomes from sequence features (GC skew, DnaA box motifs)
Replication timing profiles from BrdU-seq, Repli-seq, or sort-seq data map when each region of the genome replicates during S phase
OK-seq (Okazaki fragment sequencing) maps replication fork directionality by sequencing the short Okazaki fragments produced on the lagging strand
Simulation tools model replication dynamics to predict how changes in origin firing, fork speed, or fork stalling affect genome stability
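GC skew, one of the sequence features mentioned above, can be illustrated in a few lines. In many bacteria the leading strand carries more G than C, so a running total of G minus C along the chromosome tends to reach its minimum near the origin and its maximum near the terminus. The Python sketch below computes a cumulative skew over a short toy sequence. It is a simplified illustration of the idea, not the algorithm used by any specific tool, and it treats the sequence as linear.

```python
def cumulative_gc_skew(seq):
    """Running total of (+1 for G, -1 for C) along the sequence."""
    total = 0
    skew = []
    for base in seq.upper():
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        skew.append(total)
    return skew


def predicted_origin_index(seq):
    """0-based index of the first minimum of the cumulative GC skew."""
    skew = cumulative_gc_skew(seq)
    return skew.index(min(skew))


# Toy sequence: a C-rich stretch followed by a G-rich stretch.
toy = "ACCTCCACCC" + "GGTGGAGGGT"
print(cumulative_gc_skew(toy))
print(f"Skew minimum at index {predicted_origin_index(toy)}")
```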

Telomere length estimation from whole-genome sequencing counts telomeric reads (TTAGGG repeats) in WGS data to estimate average telomere length without specialized assays. Tandem repeat analysis tools like TRF (Tandem Repeats Finder) identify and characterize tandem repeats throughout the genome, including telomeric, centromeric, and satellite sequences.
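The read-counting idea can be sketched as follows. The Python example flags a read as telomeric if it contains a run of consecutive TTAGGG repeats, or CCCTAA for reads from the opposite strand, and reports the telomeric fraction. The threshold of four repeats and the toy reads are hypothetical. Converting such a fraction into a length in base pairs requires further normalization that is not shown here.

```python
FORWARD_UNIT = "TTAGGG"
REVERSE_UNIT = "CCCTAA"   # reverse complement of TTAGGG


def is_telomeric(read, min_repeats=4):
    """True if the read holds min_repeats consecutive telomeric repeats."""
    read = read.upper()
    return (FORWARD_UNIT * min_repeats in read
            or REVERSE_UNIT * min_repeats in read)


def telomeric_fraction(reads, min_repeats=4):
    """Fraction of reads classified as telomeric."""
    hits = sum(is_telomeric(read, min_repeats) for read in reads)
    return hits / len(reads)


# Hypothetical reads, for illustration only.
reads = [
    "TTAGGG" * 5 + "ACGT",            # telomeric, forward strand
    "GATTACA" * 4,                    # not telomeric
    "AC" + "CCCTAA" * 4,              # telomeric, reverse strand
    "TTAGGGTTAGGGACGTACGTACGTACGT",   # only two repeats: below threshold
]
print(f"Telomeric fraction: {telomeric_fraction(reads):.2f}")
```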

Replication Timing Data

Replication timing profiles assign each genomic region a value reflecting when it replicates during S phase. We can summarize these data with descriptive statistics and visualize origin firing times.

let timing = [0.12, 0.18, 0.25, 0.41, 0.55, 0.62, 0.73, 0.81, 0.88, 0.95]
print("Replication timing (fraction of S phase):")
print(Stats.describe(timing))

let labels = ["Origin A", "Origin B", "Origin C", "Origin D", "Origin E"]
let fire_times = [0.08, 0.22, 0.45, 0.67, 0.89]
print("Origin firing time during S phase:")
print(Viz.bar(labels, fire_times))

Early-firing origins (Origin A, B) correspond to euchromatic, gene-rich regions. Late-firing origins (Origin D, E) correspond to heterochromatic, gene-poor regions.
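The same summary can be reproduced with Python's standard library. The sketch below computes basic descriptive statistics for the timing values and labels each origin as early or late. The 0.5 cutoff (mid-S phase) is an arbitrary threshold chosen for this example, not an established standard.

```python
import statistics

timing = [0.12, 0.18, 0.25, 0.41, 0.55, 0.62, 0.73, 0.81, 0.88, 0.95]
labels = ["Origin A", "Origin B", "Origin C", "Origin D", "Origin E"]
fire_times = [0.08, 0.22, 0.45, 0.67, 0.89]

mean_timing = statistics.mean(timing)
median_timing = statistics.median(timing)
stdev_timing = statistics.stdev(timing)   # sample standard deviation
print(f"mean={mean_timing:.3f} median={median_timing:.3f} stdev={stdev_timing:.3f}")


def classify(fire_time, cutoff=0.5):
    """Label an origin by whether it fires before the cutoff fraction of S phase."""
    return "early" if fire_time < cutoff else "late"


classes = {label: classify(t) for label, t in zip(labels, fire_times)}
for label, timing_class in classes.items():
    print(f"{label}: {timing_class}")
```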

Exercise: Detect an AT-Rich Origin of Replication

Bacterial origins of replication are AT-rich to facilitate strand separation. Use Seq.kmer_count() to compare the dinucleotide profiles of two regions and determine which one is the likely origin. Print the name of the AT-rich region.

let region_a = "TAATATATTGATAAATATTAATTATAA"
let region_b = "GCGCCGCGGCGCTAGCGCCGGCGCGC"
print("Region A dinucleotides:")
print(Seq.kmer_count(region_a, 2))
print("Region B dinucleotides:")
print(Seq.kmer_count(region_b, 2))
let answer = "region_a"
print(answer)

Exercise: Measure Replication Error Rate

Compare an original DNA strand to two replicated copies — one from a normal polymerase and one from a proofreading-deficient mutant. Use Seq.hamming() to count the mismatches in each case and print the error count for the mutant.

let original = "ATGGCTAGCAAAGACTTCACCGAGTATCCG"
let normal_copy = "ATGGCTAGCAAAGACTTCACCGAGTATCCG"
let mutant_copy = "ATGGCTAGCAAAGAATTCACTGAGTTTCCA"
print("Normal polymerase errors: " + Seq.hamming(original, normal_copy))
let mutant_errors = Seq.hamming(original, mutant_copy)
print("Proofreading-deficient errors: " + mutant_errors)
print(mutant_errors)

Exercise: Visualize Replication Fork Speed

Different genomic regions are replicated at different rates depending on chromatin state, origin density, and obstacles. Use Viz.bar() to display replication fork speeds across five chromosomal regions and identify the slowest region.

let regions = ["Early euchromatin", "Gene body", "Fragile site", "Heterochromatin", "Telomere-proximal"]
let speeds = [1.9, 1.5, 0.8, 0.6, 1.1]
print("Replication fork speed (kb/min) by region:")
print(Viz.bar(regions, speeds))
let slowest = "Heterochromatin"
print(slowest)


Summary

In this lesson you covered DNA replication and its analysis:

Mutation rates are extremely low (~10⁻⁹ per bp per division), which is essential for maintaining genome integrity across generations
The replication fork is asymmetrical — leading strand synthesized continuously, lagging strand as Okazaki fragments, because DNA polymerase works only 5′→3′
Three layers of error correction (base selection, proofreading, mismatch repair) achieve 10⁻⁹ fidelity
A multi-protein replication machine coordinates helicase, primase, polymerase, clamp, ligase, and topoisomerase
Eukaryotic replication uses thousands of origins, fires during S phase only, with early/late replication timing correlating with chromatin state
New nucleosomes are assembled behind the fork, maintaining epigenetic information
Telomerase extends chromosome ends with TTAGGG repeats; telomere length is regulated by shelterin
Mutation databases (ClinVar, COSMIC, dbSNP) catalog the consequences of replication errors
Mutational signatures reveal mutagenic processes active in tumors; Ti/Tv ratio is a basic quality metric
Replication origin prediction, timing profiles, OK-seq, and telomere length estimation provide computational windows into replication dynamics

