The Structure and Function of DNA

Beginner Molecular Biology ~30 min

← Previous Next →

Explore the double helix in depth — from nucleotide chemistry and base pairing to Chargaff's rules, DNA topology, and the computational methods for analyzing DNA sequence composition.

Introduction

DNA is the molecule of heredity — the chemical substance in which the genetic instructions for building and maintaining an organism are encoded. While earlier lessons introduced DNA in broad strokes, this lesson examines the molecule in detail: the chemistry of its building blocks, the elegant logic of the double helix, and the computational tools used to analyze DNA sequences at scale.

Understanding DNA structure is not merely an exercise in chemistry. The physical properties of the molecule — its base composition, its topology, its melting behavior — directly influence how genes are replicated, expressed, and regulated. The computational methods we introduce here form the analytical foundation for everything that follows in molecular biology and genomics.

A DNA Molecule Consists of Two Complementary Chains of Nucleotides

DNA is a polymer of nucleotides. Each nucleotide has three components: a five-carbon sugar (deoxyribose), a phosphate group, and a nitrogenous base. The bases come in two chemical types: the purines adenine (A) and guanine (G), which have a double-ring structure, and the pyrimidines thymine (T) and cytosine (C), which have a single ring.

Nucleotides are linked by phosphodiester bonds between the 5′ carbon of one sugar and the 3′ carbon of the next, creating a sugar-phosphate backbone with a defined directionality: every strand has a 5′ end (with a free phosphate) and a 3′ end (with a free hydroxyl group). By convention, DNA sequences are always written in the 5′→3′ direction.

The two strands of a DNA molecule are held together by hydrogen bonds between complementary bases:

Adenine (A) pairs with Thymine (T) via two hydrogen bonds
Guanine (G) pairs with Cytosine (C) via three hydrogen bonds

Because a purine always pairs with a pyrimidine, the width of the double helix is constant along its length (~2 nm). The two strands wind around each other in a right-handed helix with approximately 10.5 base pairs per turn and a pitch (rise per turn) of 3.4 nm. This is the B-form of DNA, the predominant form under physiological conditions.

The base-pairing rules mean that the two strands are complementary: knowing the sequence of one strand immediately tells you the sequence of the other. They are also antiparallel — they run in opposite 5′→3′ directions.

let template = "ATGGCTAGCAAAGAC"
let complement = Seq.complement(template)
let rc = Seq.reverse_complement(template)
print("Template (5'→3'):          " + template)
print("Complement:                " + complement)
print("Reverse complement (5'→3'): " + rc)

The complement shows the bases on the opposite strand in the same left-to-right order. The reverse complement reads the opposite strand in the conventional 5′→3′ direction — this is the sequence that would be encountered by enzymes reading the other strand. The reverse complement is one of the most frequently used operations in bioinformatics.

The Structure of DNA Provides a Mechanism for Heredity

Watson and Crick recognized immediately that the base-pairing rules provide a mechanism for heredity. During replication, the two strands separate, and each serves as a template for the synthesis of a new complementary strand. The result is two identical double helices, each containing one parental strand and one newly synthesized strand (semiconservative replication).

This mechanism ensures that the genetic information is faithfully copied from one generation of cells to the next. The beauty of the double helix is that the same structure that stores information also provides the means to copy it.

In Eukaryotes, DNA Is Enclosed in a Cell Nucleus

Prokaryotic cells (bacteria and archaea) typically have a single circular chromosome located in the nucleoid — a region of the cytoplasm not bounded by a membrane. Eukaryotic cells, by contrast, enclose their DNA within a membrane-bound nucleus, separating the genome from the cytoplasm.

This compartmentalization has profound consequences. In eukaryotes, transcription occurs in the nucleus while translation occurs in the cytoplasm, creating a spatial and temporal separation that enables elaborate regulation — including mRNA processing (capping, splicing, polyadenylation) that must occur before the mRNA is exported for translation. In prokaryotes, transcription and translation can occur simultaneously because there is no nuclear envelope.

Eukaryotic genomes are organized into multiple linear chromosomes (23 pairs in humans), each a single continuous DNA molecule complexed with proteins. The ends of linear chromosomes are protected by specialized structures called telomeres, and each chromosome has a centromere for attachment to the mitotic spindle during cell division, plus multiple origins of replication where DNA synthesis initiates.

Chargaff’s Rules and Base Composition

In the late 1940s, Erwin Chargaff analyzed the base composition of DNA from many organisms and discovered a fundamental regularity: in any double-stranded DNA molecule, the amount of A equals T and the amount of G equals C (Chargaff’s first parity rule). This was a crucial clue for Watson and Crick’s discovery of base pairing.

Chargaff’s second parity rule — less well known but equally remarkable — states that even within a single strand, the frequencies of A ≈ T and G ≈ C approximately hold, at least for large sequences. This reflects the roughly equal distribution of genes on both strands of a chromosome.

The GC content (the fraction of bases that are G or C) varies widely across organisms and across regions within a genome:

let gc_rich = "GCGCCGATCGCGGCCGCGATCGGC"
let at_rich = "ATAAATGATATTAACGATAAAGAT"
let balanced = "ATGGCTAGCAAAGACTTCACCGAG"
print("GC-rich:  " + Seq.gc_content(gc_rich))
print("AT-rich:  " + Seq.gc_content(at_rich))
print("Balanced: " + Seq.gc_content(balanced))

GC content influences several physical and biological properties: melting temperature (higher GC = more thermally stable), gene density (GC-rich regions in the human genome tend to be gene-rich), staining pattern (forming the light and dark bands visible in chromosome preparations), and codon usage (organisms with high GC content prefer G/C-ending codons).

Exercise: Chargaff’s Rules in Practice

Compute the GC content of gene fragments from three different organisms. How many sequences did you analyze?

let thermophile = "GCGCCGATCGCGGCCGCGATCGGC"
let mesophile = "ATGGCTAGCAAAGACTTCACCGAG"
let at_rich = "ATAAATGATATTAACGATAAAGAT"
print("Thermophile GC: " + Seq.gc_content(thermophile))
print("Mesophile GC:   " + Seq.gc_content(mesophile))
print("AT-rich GC:     " + Seq.gc_content(at_rich))
let num_sequences = "3"
print(num_sequences)

DNA Topology and Alternative Structures

While B-DNA is the standard form, DNA can adopt alternative structures under specific conditions:

A-DNA — a wider, more compact right-handed helix that forms under dehydrating conditions and is similar to the structure of RNA-DNA hybrids
Z-DNA — a left-handed helix that can form in sequences with alternating purine-pyrimidine bases (e.g., GCGCGC); it may play roles in gene regulation
Cruciform structures — form at palindromic sequences where each strand can fold back on itself
G-quadruplexes — four-stranded structures formed by guanine-rich sequences, found at telomeres and in gene promoters

The topology of DNA — its supercoiling state — is biologically important. Circular bacterial chromosomes and closed loops of eukaryotic chromatin can be overwound (positive supercoiling) or underwound (negative supercoiling). Negative supercoiling, maintained by topoisomerases, facilitates strand separation during replication and transcription.

Sequence Length and Genome Scale

DNA molecules range from a few kilobases (small viral genomes) to billions of base pairs (vertebrate chromosomes). Gene sizes vary enormously: a small bacterial gene may encode a 100-amino-acid protein (300 bp of coding sequence), while the human dystrophin gene spans 2.4 million base pairs (though its coding sequence is only 14 kb — the rest is introns).

let small_gene = "ATGGCTAGCAAAGACTGA"
let medium_gene = "ATGGCTAGCAAAGACTTCACCGAGTACCTGCAGAACCTGATCGGCAAATGA"
print("Small gene:  " + Seq.length(small_gene) + " bp → " + Seq.length(Seq.translate(small_gene)) + " aa")
print("Medium gene: " + Seq.length(medium_gene) + " bp → " + Seq.length(Seq.translate(medium_gene)) + " aa")

Base Composition Analysis: GC Content and Skew

Beyond overall GC content, bioinformaticians analyze base composition along the length of a genome using sliding windows. GC content plots reveal regional variation: in vertebrate genomes, distinct regions of high and low GC content (called isochores) span hundreds of kilobases to megabases.

GC skew measures the asymmetry between G and C on a single strand: (G − C)/(G + C). In bacterial genomes, GC skew changes sign at the origin of replication, providing a computational method for identifying replication origins without any experimental data.

Dinucleotide frequency analysis examines how often each pair of adjacent bases occurs compared to what would be expected from the individual base frequencies. The most notable deviation is the CpG suppression in vertebrate genomes: CpG dinucleotides occur at only ~25% of the expected frequency because methylated cytosines at CpG sites tend to deaminate to thymine over evolutionary time.

CpG Islands and Methylation-Sensitive Analysis

CpG islands are regions of DNA where CpG dinucleotides occur at or near the expected frequency, bucking the genome-wide trend of CpG depletion. They are typically 500–2,000 bp long, have a GC content above 50%, and are found at approximately 60% of human gene promoters.

CpG islands are normally unmethylated and associated with open chromatin and active transcription. When they become methylated (as in cancer or during developmental silencing), the associated gene is stably repressed. Computational prediction of CpG islands from sequence alone — based on GC content, CpG observed/expected ratio, and length thresholds — is a standard tool in genome annotation.

Bisulfite sequencing converts unmethylated cytosines to uracil (which reads as thymine after PCR) while methylated cytosines are protected. Comparing bisulfite-converted and unconverted sequences reveals the methylation status of every cytosine in the genome.

Sequence Complexity

Not all DNA sequences are equally informative. Sequence complexity measures how much of a sequence consists of unique (non-repetitive) information:

Low-complexity sequences contain simple repeats (e.g., ATATAT…, GCGCGC…, polyA tracts) and carry little unique information
High-complexity sequences have a more uniform distribution of bases and substrings

Low-complexity regions can cause false positives in database searches (BLAST automatically filters them using the SEG algorithm for proteins and DUST for DNA) and pose challenges for genome assembly. Linguistic complexity measures treat DNA sequences like texts and apply information-theoretic metrics (entropy, compression ratio) to quantify their information content.

let complex_seq = "ATGGCTAGCAAAGACTTCACCGAG"
let simple_seq = "ATATATATATATATATATATATATAT"
print("Complex: " + Seq.gc_content(complex_seq))
print("Simple:  " + Seq.gc_content(simple_seq))
print("Complex length: " + Seq.length(complex_seq))
print("Simple length:  " + Seq.length(simple_seq))

Exercise: K-mer Diversity and Sequence Complexity

A complex sequence has many distinct k-mers, while a repetitive sequence has few. Compare the dinucleotide diversity of a complex and a simple (repeat) sequence to see this principle in action.

let complex_seq = "ATGGCTAGCAAAGACTTCACCGAG"
let simple_seq = "ATATATATATATATATATATATATAT"
print("Complex dinucleotides:")
print(Seq.kmer_count(complex_seq, 2))
print("Simple dinucleotides:")
print(Seq.kmer_count(simple_seq, 2))
let more_diverse = "Complex"
print(more_diverse)

Simple (low-complexity) sequences have extreme base compositions — here the alternating AT repeat has ~0% GC content — while biologically informative sequences typically show more balanced composition.

Exercise: Analyze Both Strands

Given a DNA sequence, compute the reverse complement, and compare the GC content of the original and reverse complement (they should be identical, since both are the same molecule).

let dna = "ATGGCTAGCAAAGACTGA"
let rc = Seq.reverse_complement(dna)
print("Original: " + dna + " GC: " + Seq.gc_content(dna))
print("RevComp:  " + rc + " GC: " + Seq.gc_content(rc))
print(rc)

Exercise: Compare GC Content Across Species

Compare the GC content of gene fragments from three organisms representing different GC content ranges. Identify which is from a thermophilic organism.

let gene_a = "ATGAAATTTGATACTGAAGATAAATGA"
let gene_b = "ATGGCTAGCAAAGACTTCACCGAGTGA"
let gene_c = "GCGCCGATCGCGGCCGCGATCGGCTGA"
print("Gene A GC: " + Seq.gc_content(gene_a))
print("Gene B GC: " + Seq.gc_content(gene_b))
print("Gene C GC: " + Seq.gc_content(gene_c))
let answer = "Gene C"
print(answer)

Knowledge Check

Summary

In this lesson you covered the structure and analysis of DNA in depth:

DNA consists of two complementary, antiparallel chains of nucleotides linked by phosphodiester bonds and held together by A-T (2 H-bonds) and G-C (3 H-bonds) base pairing
B-DNA is the standard form: right-handed helix, ~10.5 bp/turn, 3.4 nm pitch
DNA structure provides a mechanism for heredity — semiconservative replication uses each strand as a template
Eukaryotic DNA is enclosed in a nucleus, separating transcription from translation; chromosomes have centromeres, telomeres, and replication origins
Chargaff’s rules (A = T, G = C) reflect base pairing; GC content varies across organisms and within genomes
Alternative DNA structures (A-DNA, Z-DNA, G-quadruplexes) form under specific conditions; supercoiling is regulated by topoisomerases
Genome scale ranges from kilobases (viruses) to gigabases (vertebrates)
GC content analysis, GC skew, and dinucleotide frequencies reveal replication origins, isochores, and CpG suppression
CpG islands resist CpG depletion and mark ~60% of human gene promoters; bisulfite sequencing maps methylation
Sequence complexity distinguishes informative sequences from simple repeats; low-complexity filtering is essential for database searches

References

Alberts B, Johnson A, Lewis J, Morgan D, Raff M, Roberts K, Walter P. Molecular Biology of the Cell, 7th ed. New York: W.W. Norton; 2022. Chapter 4: DNA, Chromosomes, and Genomes.
Watson JD, Crick FHC. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature. 1953;171(4356):737–738.
Franklin RE, Gosling RG. Molecular configuration in sodium thymonucleate. Nature. 1953;171(4356):740–741.
Chargaff E. Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia. 1950;6(6):201–209.
Wing R, Drew H, Takano T, et al. Crystal structure analysis of a complete turn of B-DNA. Nature. 1980;287(5784):755–758.
Rich A, Zhang S. Z-DNA: the long road to biological function. Nat Rev Genet. 2003;4(7):566–572.
Lorenz R, Bernhart SH, Höner zu Siederdissen C, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011;6:26.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–410.

Powered by

cyanea-seq

DNA double helix base pairing nucleotides Chargaff GC content CpG islands sequence complexity dinucleotide frequency genome scale