Cells, Genomes, and the Diversity of Life
Explore the universal features shared by all living cells — from DNA and the central dogma to minimal genomes — and learn the computational foundations for working with biological sequences.
Introduction: The Unity of Life
Life on Earth takes breathtakingly diverse forms — from bacteria that thrive in boiling hydrothermal vents to blue whales, from single-celled amoebae to trillion-cell organisms like ourselves. Yet beneath this diversity lies a remarkable unity. Every living cell, without exception, stores its hereditary information in DNA, reads that information through the same molecular logic, and uses it to build the proteins that carry out the work of the cell.
This lesson explores the features that all cells share, then introduces the computational perspective: how biologists represent and analyze the sequences at the heart of life. By the end, you will understand both the biology of the universal cell and the data formats and operations you need to start working with genomic data.
All Cells Store Hereditary Information in DNA
The molecule at the center of heredity is deoxyribonucleic acid (DNA). DNA is a long polymer built from just four nucleotide building blocks, each identified by the nitrogenous base it carries: Adenine, Thymine, Guanine, and Cytosine. The order of these bases along the strand — the sequence — encodes the instructions for building and maintaining a cell.
DNA normally exists as a double helix: two complementary strands wound around each other in an antiparallel arrangement. The bases on opposite strands pair specifically — A always pairs with T, and G always pairs with C. An A-T pair is held together by two hydrogen bonds, while a G-C pair is held together by three hydrogen bonds, making GC-rich regions more thermally stable.
Each strand has a chemical directionality, running from its 5′ end to its 3′ end. The two strands run in opposite directions (antiparallel), so when one strand reads 5′→3′, its partner reads 3′→5′. By convention, sequences are written in the 5′→3′ direction. Because base pairing is strict, one strand’s sequence fully determines the other’s: the partner strand, written 5′→3′, is the reverse complement of the first.
This chemistry is ideally suited for heredity: the molecule is chemically stable enough to preserve information across generations, yet the two strands can be separated to allow copying.
let dna = "ATGAAAGCGATTCTGACC"
let comp = Seq.complement(dna)
let rc = Seq.reverse_complement(dna)
print("5' strand: " + dna)
print("Complement: " + comp)
print("Reverse complement: " + rc)
The complement shows the bases on the opposite strand, while the reverse complement reads that opposite strand in the conventional 5′→3′ direction — the sequence a cell’s machinery would encounter when reading genes on the other strand.
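For readers working outside the lesson's Seq library, the same two operations can be sketched in plain Python. The function names below are illustrative and are not part of Seq; the sketch assumes an uppercase sequence containing only A, C, G, and T.

```python
# Complement table for the four unambiguous DNA bases (assumes uppercase input).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna: str) -> str:
    """Return the opposite strand written in the conventional 5'->3' direction."""
    return complement(dna)[::-1]


dna = "ATGAAAGCGATTCTGACC"
print(complement(dna))          # TACTTTCGCTAAGACTGG
print(reverse_complement(dna))  # GGTCAGAATCGCTTTCAT
```

Applying the reverse complement twice returns the original sequence, which makes a convenient sanity check.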
DNA Replication by Templated Polymerization
Before a cell divides, it must copy its entire genome so that each daughter cell inherits a complete set of instructions. DNA replication is semiconservative: the double helix unwinds, and each strand serves as a template for the synthesis of a new complementary strand. The result is two identical double helices, each containing one old strand and one new strand.
The enzyme DNA polymerase carries out this synthesis. It reads the template strand in the 3′→5′ direction and synthesizes the new strand in the 5′→3′ direction, adding nucleotides one at a time according to the base-pairing rules. Because the two strands of the helix are antiparallel, replication is continuous on one strand (the leading strand) but must proceed in short segments called Okazaki fragments on the other (the lagging strand). These fragments are later joined by DNA ligase.
Replication is extraordinarily accurate. DNA polymerase has a built-in proofreading activity that detects and corrects mismatched bases immediately after they are incorporated. Combined with post-replication mismatch repair systems, the overall error rate is approximately one mistake per 10⁹ (one billion) bases copied. This high fidelity is essential: too many errors would be lethal, yet some level of mutation provides the raw genetic variation on which natural selection acts.
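To get a feel for what this error rate means, the short Python sketch below multiplies it by a genome size. The genome sizes are the approximate figures used elsewhere in this lesson (4.6 Mb for E. coli, 3.2 billion base pairs for human); the function name is illustrative, and the calculation ignores variation in error rate along the genome.

```python
# Expected number of uncorrected errors per round of replication,
# using the approximate rate quoted in the text.
ERROR_RATE_PER_BASE = 1e-9  # about 1 mistake per 10^9 bases copied


def expected_errors(genome_size_bp: float,
                    error_rate: float = ERROR_RATE_PER_BASE) -> float:
    """Expected errors = number of bases copied x error probability per base."""
    return genome_size_bp * error_rate


print(expected_errors(4.6e6))  # E. coli genome
print(expected_errors(3.2e9))  # human genome (one copy of 3.2 billion bp)
```

Under these assumptions a bacterial genome is usually copied without a single error, while one copy of the human genome picks up about three new changes per replication.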
The Central Dogma: Transcription and Translation
Cells read their genetic instructions in two steps, a flow of information known as the central dogma: DNA → RNA → Protein.
Transcription is the first step. The enzyme RNA polymerase binds to a gene’s promoter sequence and copies the DNA template into a messenger RNA (mRNA) molecule. The RNA sequence is complementary to the template strand, and thus identical to the coding strand except that thymine (T) is replaced by uracil (U). In eukaryotic cells, the initial mRNA transcript undergoes several processing steps before it leaves the nucleus: a 5′ cap is added for stability, a poly-A tail is appended to the 3′ end, and introns (non-coding intervening sequences) are removed by splicing, leaving only the exons (expressed sequences) joined together.
Translation is the second step. The ribosome — a large molecular machine composed of ribosomal RNA (rRNA) and proteins — reads the processed mRNA in three-nucleotide units called codons. Each codon specifies one of the 20 standard amino acids (or a stop signal). Translation begins at the start codon AUG, which encodes methionine, and continues until the ribosome encounters a stop codon (UAA, UAG, or UGA).
The genetic code — the mapping from 64 possible codons to 20 amino acids plus stop signals — is nearly universal across all life. Bacteria, archaea, plants, fungi, and animals all use essentially the same code, powerful evidence of a single common ancestor. A few minor exceptions exist: mitochondria use a slightly modified code, and a handful of organisms have reassigned stop codons to encode rare amino acids like selenocysteine (the “21st amino acid”) or pyrrolysine (the “22nd amino acid”).
let gene = "ATGAAAGCGATTCTGACCGAATGA"
let mrna = Seq.transcribe(gene)
let protein = Seq.translate(gene)
print("DNA: " + gene)
print("mRNA: " + mrna)
print("Protein: " + protein)
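The two steps can also be sketched in plain Python. This is an illustration rather than the implementation behind Seq.transcribe or Seq.translate: it assumes the input is the coding strand in uppercase, uses the standard genetic code, and stops at the first stop codon without emitting a symbol for it (the lesson's library may handle stops differently).

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# in T, C, A, G order at each position; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """mRNA matches the coding strand, with U in place of T."""
    return coding_dna.replace("T", "U")


def translate(coding_dna: str) -> str:
    """Read codons from position 0 and stop at the first stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGAAAGCGATTCTGACCGAATGA"
print(transcribe(gene))  # AUGAAAGCGAUUCUGACCGAAUGA
print(translate(gene))   # MKAILTE
```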
Exercise: The Central Dogma in Practice
Given a DNA sequence encoding a short protein, transcribe it to mRNA and translate it. What is the protein product?
let gene = "ATGAAACGTGATTGA"
let mrna = Seq.transcribe(gene)
let protein = Seq.translate(gene)
print("mRNA: " + mrna)
print(protein)
What Is a Gene?
A gene is the segment of DNA that encodes one functional product — typically a protein, but sometimes a functional RNA molecule. The concept seems simple, but the details reveal surprising complexity.
A typical protein-coding gene in a eukaryote includes several elements: a promoter region where transcription factors and RNA polymerase bind, exons that encode the protein sequence, introns that are transcribed but spliced out before translation, and a terminator sequence that signals the end of transcription. Prokaryotic genes are generally simpler, lacking introns and sometimes organized into operons where multiple genes are transcribed as a single mRNA.
Not all genes encode proteins. Cells also contain genes for transfer RNA (tRNA), which delivers amino acids to the ribosome; ribosomal RNA (rRNA), a structural and catalytic component of the ribosome; microRNA (miRNA), which regulates gene expression; and long non-coding RNA (lncRNA), whose diverse roles are still being discovered. These non-coding RNA genes make up a significant portion of the genome.
The human genome contains roughly 20,000 protein-coding genes, yet the exons of these genes account for only about 1.5% of the genome’s 3.2 billion base pairs. The rest includes introns, regulatory sequences, transposable elements, and sequences whose functions remain under investigation.
| Organism | Genome size | Protein-coding genes |
|---|---|---|
| Mycoplasma genitalium | 580 kb | ~475 |
| Escherichia coli | 4.6 Mb | ~4,300 |
| Saccharomyces cerevisiae (yeast) | 12 Mb | ~6,000 |
| Drosophila melanogaster (fruit fly) | 140 Mb | ~14,000 |
| Homo sapiens (human) | 3,200 Mb | ~20,000 |
Notice that genome size does not scale linearly with organismal complexity — the so-called C-value paradox. Much of the size difference is due to non-coding DNA, not additional genes.
let sizes = '[{"label": "Mycoplasma", "value": 0.58}, {"label": "E. coli", "value": 4.6}, {"label": "Yeast", "value": 12}, {"label": "Fruit fly", "value": 140}, {"label": "Human", "value": 3200}]'
let chart = Viz.bar(sizes, '{"title": "Genome Sizes (Mb)", "color": "#06B6D4"}')
print(chart)
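Plotting genome size against gene count makes the C-value paradox even more vivid. Gene count does rise with genome size, but far less than proportionally: from Mycoplasma to human, the genome grows roughly 5,500-fold while the number of protein-coding genes grows only about 40-fold: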
let data = '[{"x": 0.58, "y": 475, "label": "Mycoplasma"}, {"x": 4.6, "y": 4300, "label": "E. coli"}, {"x": 12, "y": 6000, "label": "Yeast"}, {"x": 140, "y": 14000, "label": "Fruit fly"}, {"x": 3200, "y": 20000, "label": "Human"}]'
let plot = Viz.scatter(data, '{"title": "Genome Size vs Gene Count", "x_label": "Genome size (Mb)", "y_label": "Protein-coding genes", "color": "#06B6D4"}')
print(plot)
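Another way to see the same pattern is to compute gene density, the number of protein-coding genes per megabase. The plain-Python sketch below uses only the approximate values from the table above; the dictionary and function names are illustrative.

```python
# Genome size (Mb) and approximate protein-coding gene counts from the table.
GENOMES = {
    "Mycoplasma genitalium": (0.58, 475),
    "Escherichia coli": (4.6, 4300),
    "Saccharomyces cerevisiae": (12, 6000),
    "Drosophila melanogaster": (140, 14000),
    "Homo sapiens": (3200, 20000),
}


def genes_per_mb(size_mb: float, gene_count: int) -> float:
    """Gene density: protein-coding genes per megabase of genome."""
    return gene_count / size_mb


for name, (size_mb, gene_count) in GENOMES.items():
    print(f"{name}: {genes_per_mb(size_mb, gene_count):.1f} genes per Mb")
```

With these figures the two bacteria come out at several hundred genes per megabase, while the human genome has only about six, which is the C-value paradox expressed as a single number.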
let gene = "ATGAAAGCGATTCTGACCGAATGA"
let protein = Seq.translate(gene)
let codons = Seq.codon_usage(gene)
print("Protein: " + protein)
print(codons)
Life Requires Free Energy
A cell is not a static structure — it is a dynamic, far-from-equilibrium system that requires a continuous supply of free energy to maintain its organization, grow, and reproduce. Without energy input, the ordered structures of a cell would decay toward thermodynamic equilibrium, and the cell would die.
Living organisms obtain free energy in two fundamentally different ways. Phototrophs capture energy from sunlight through photosynthesis; this group includes plants, algae, and cyanobacteria. Chemotrophs extract energy from the chemical bonds in organic or inorganic molecules through oxidation reactions; animals, fungi, and most bacteria fall into this category.
Regardless of the energy source, virtually all cells use adenosine triphosphate (ATP) as their universal energy currency. Energy-releasing reactions (such as glucose oxidation) are coupled to ATP synthesis, and ATP hydrolysis in turn powers energy-requiring processes like biosynthesis, transport, and movement. This coupling of favorable and unfavorable reactions is a hallmark of cellular metabolism.
Organisms can also be classified by their carbon source: autotrophs fix inorganic carbon (CO₂) into organic molecules, while heterotrophs obtain carbon by consuming organic compounds produced by other organisms. The interplay between autotrophs and heterotrophs forms the basis of ecological food webs.
Cells as Biochemical Factories
Despite their enormous diversity in form and habitat, all cells work with the same basic set of molecular building blocks. Four classes of small organic molecules serve as the monomers for all of life’s macromolecules:
| Macromolecule | Monomer | Key functions |
|---|---|---|
| Proteins | Amino acids (20 types) | Catalysis, structure, signaling, transport |
| Nucleic acids | Nucleotides (4 types each for DNA and RNA) | Information storage, gene regulation |
| Polysaccharides | Sugars (glucose, etc.) | Energy storage, structural support |
| Lipids | Fatty acids, glycerol | Membranes, energy storage, signaling |
The same 20 amino acids are used to build proteins in every known organism. The same 4 DNA bases store genetic information in all cells. Core metabolic pathways — glycolysis (the breakdown of glucose), the citric acid cycle (also called the TCA or Krebs cycle), and oxidative phosphorylation — are conserved across bacteria, archaea, and eukaryotes. This deep conservation of biochemistry is one of the strongest arguments for the common ancestry of all life.
The Plasma Membrane
Every cell is bounded by a plasma membrane that separates its interior from the external environment. This membrane is built from phospholipids — amphipathic molecules with a hydrophilic (water-loving) head and two hydrophobic (water-fearing) fatty acid tails. In an aqueous environment, phospholipids spontaneously assemble into a lipid bilayer, with the hydrophobic tails facing inward and the hydrophilic heads facing outward on both surfaces.
The lipid bilayer acts as a selectively permeable barrier. Small nonpolar molecules (like O₂ and CO₂) can diffuse across freely, but ions, sugars, amino acids, and other polar molecules cannot. To transport these essential nutrients in and waste products out, the membrane contains embedded transport proteins: channels that form selective pores, carriers that undergo conformational changes to shuttle molecules across, and pumps that use ATP to move molecules against their concentration gradient.
The membrane also houses receptor proteins that detect signals from the environment and enzymes that catalyze reactions at the cell surface. In eukaryotic cells, internal membranes create compartments (organelles) such as the nucleus, mitochondria, and endoplasmic reticulum — a key innovation that allows different biochemical processes to occur simultaneously in optimized environments.
Minimal Genomes: How Few Genes Can Sustain Life?
If all cells share a core set of molecular machinery, what is the minimum number of genes needed to support life? This question has fascinated biologists and has become experimentally testable.
Mycoplasma genitalium, a parasitic bacterium, has one of the smallest genomes of any naturally occurring organism that can be grown on its own in laboratory culture: just 580 kilobases encoding approximately 475 protein-coding genes. It lacks a cell wall, cannot synthesize many of its own nutrients, and in nature depends heavily on its host, yet it can grow and reproduce independently in a rich culture medium.
In a landmark synthetic biology experiment, Craig Venter’s team went further. They designed and chemically synthesized a drastically reduced version of the Mycoplasma mycoides genome, creating JCVI-syn3.0 with just 473 genes, fewer than any known naturally occurring cell that can replicate independently in culture. Remarkably, about one-third of these genes have no known function, reminding us how much basic cell biology remains to be discovered.
The essential gene categories in a minimal cell include: DNA replication, transcription, and translation machinery; enzymes for basic metabolism and energy production; and components for membrane structure and transport. This minimal instruction set defines what it means to be alive at the molecular level.
Let’s compare the GC content of a minimal-genome organism with a more complex one. GC content often reflects an organism’s lifestyle and evolutionary history:
let mycoplasma = "ATGAAATTTAATAAAGATAATAAATTTGATACT"
let ecoli = "ATGGCGATTCTGACCGCGGCGATCCTGCCG"
print("Mycoplasma-like GC: " + Seq.gc_content(mycoplasma))
print("E. coli-like GC: " + Seq.gc_content(ecoli))
let gc_data = '[{"label": "Mycoplasma", "value": 25}, {"label": "E. coli", "value": 51}]'
let gc_chart = Viz.bar(gc_data, '{"title": "GC Content (%)", "color": "#10B981"}')
print(gc_chart)
Mycoplasma genomes are notably AT-rich (GC content ~25–33%), while E. coli has a more balanced composition (~50.8% GC). The reasons for this variation are complex and involve mutation bias, selection, and metabolic constraints.
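GC content itself is a simple calculation: count the G and C bases and divide by the sequence length. The plain-Python sketch below is an illustration, not the code behind Seq.gc_content, and it returns a fraction between 0 and 1 (the lesson's function may report a percentage instead).

```python
def gc_content(dna: str) -> float:
    """Fraction of bases that are G or C; 0.0 for an empty sequence."""
    if not dna:
        return 0.0
    seq = dna.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


mycoplasma = "ATGAAATTTAATAAAGATAATAAATTTGATACT"
ecoli = "ATGGCGATTCTGACCGCGGCGATCCTGCCG"
print(f"Mycoplasma-like GC: {gc_content(mycoplasma):.2f}")
print(f"E. coli-like GC: {gc_content(ecoli):.2f}")
```

These two fragments are only about 30 bases long, so their GC fractions (roughly 0.12 and 0.67) are more extreme than the genome-wide averages quoted above. Short windows are noisy, and reliable estimates need much longer sequences.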
We can use descriptive statistics to quantify the composition of each sequence more precisely:
let mycoplasma = "ATGAAATTTAATAAAGATAATAAATTTGATACT"
let ecoli = "ATGGCGATTCTGACCGCGGCGATCCTGCCG"
print("Mycoplasma-like sequence stats:")
print(Stats.describe(mycoplasma))
print("E. coli-like sequence stats:")
print(Stats.describe(ecoli))
Biological Sequences as Strings
From a computational perspective, all of the biological molecules we have discussed — DNA, RNA, and proteins — can be represented as character strings over defined alphabets. This simple abstraction is the foundation of bioinformatics.
| Molecule | Alphabet | Characters |
|---|---|---|
| DNA | 4 bases | A, C, G, T |
| RNA | 4 bases | A, C, G, U |
| Protein | 20 amino acids | A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y |
The single-letter amino acid codes may seem arbitrary, but they are standardized by IUPAC convention. Some are intuitive (Alanine = A, Glycine = G), while others are less so (W = tryptophan, because “W” suggests the double ring in its structure). Ambiguity codes like N (any base in DNA) and X (any amino acid in protein) handle positions of uncertainty.
Representing biological molecules as strings lets us apply the full power of algorithmic analysis: searching for patterns, comparing sequences by alignment, computing statistical properties, and building evolutionary trees. Every tool in bioinformatics ultimately operates on these string representations.
let dna = "ATGAAAGCGATTCTGACCGAATGA"
let protein = Seq.translate(dna)
print("DNA length: " + Seq.length(dna) + " bases")
print("DNA GC content: " + Seq.gc_content(dna))
print("Protein: " + protein)
print("Protein length: " + Seq.length(protein) + " amino acids")
Before performing any analysis, it is good practice to validate that a sequence contains only the expected characters for its type. Seq.validate() checks whether a string is a valid DNA, RNA, or protein sequence:
let dna = "ATGAAAGCGATTCTGACC"
let rna = "AUGAAAGCGAUUCUGACC"
let protein = "MKAILTE"
print("DNA valid: " + Seq.validate(dna, "DNA"))
print("RNA valid: " + Seq.validate(rna, "RNA"))
print("Protein valid: " + Seq.validate(protein, "protein"))
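Conceptually, validation is a set-membership test against the alphabets in the table above. The following plain-Python sketch shows the idea; it is not the implementation of Seq.validate, the names are illustrative, and it deliberately rejects ambiguity codes such as N and X.

```python
# Strict alphabets from the table above (no ambiguity codes).
ALPHABETS = {
    "DNA": set("ACGT"),
    "RNA": set("ACGU"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def is_valid(seq: str, kind: str) -> bool:
    """True if seq is non-empty and uses only characters allowed for kind."""
    return bool(seq) and set(seq.upper()) <= ALPHABETS[kind]


print(is_valid("ATGAAAGCGATTCTGACC", "DNA"))      # True
print(is_valid("AUGAAAGCGAUUCUGACC", "DNA"))      # False: U is not a DNA base
print(is_valid("MKAILTE", "protein"))             # True
print(is_valid("ATGAAAGCGATTCTGACC", "protein"))  # True: A, C, G, T are also amino acid codes
```

The last line shows why validation cannot replace knowing the sequence type: every DNA string is also a syntactically valid protein string.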
K-mer Analysis
A k-mer is a subsequence of length k drawn from a longer sequence. Counting k-mer frequencies is one of the most fundamental operations in sequence analysis — it underpins genome assembly, species identification, and motif discovery. Dinucleotides (k=2) and trinucleotides (k=3) reveal biases in nucleotide composition that simple GC content cannot capture.
let dna = "ATGAAAGCGATTCTGACCGAATGA"
print("Dinucleotide frequencies:")
print(Seq.kmer_count(dna, 2))
print("Trinucleotide frequencies:")
print(Seq.kmer_count(dna, 3))
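K-mer counting is usually done with a sliding window that advances one base at a time, so consecutive k-mers overlap. The plain-Python sketch below illustrates that convention on a toy sequence; it is an assumption that Seq.kmer_count follows the same convention.

```python
from collections import Counter


def kmer_count(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k (window step of 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = kmer_count("ATATAT", 2)
print(counts["AT"], counts["TA"])  # 3 2
```

A sequence of length n contains n - k + 1 overlapping k-mers, so the six-base toy sequence yields five dinucleotides.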
Exercise: Trinucleotide Frequencies
Count the trinucleotide (k=3) frequencies in a gene sequence. How many times does the start codon ATG appear as a k-mer?
let gene = "ATGAAAGCGATTCTGACCGAATGA"
let trimers = Seq.kmer_count(gene, 3)
print(trimers)
let atg_count = "2"
print(atg_count)
Sequence Representation and Databases
To share and analyze biological sequences, the scientific community has developed standard file formats and centralized databases.
FASTA is the most basic sequence format. Each record consists of a header line starting with > (containing an identifier and optional description) followed by one or more lines of sequence. FASTA is used for reference genomes, protein databases, and virtually any situation where sequences need to be stored or exchanged.
FASTQ extends FASTA by adding per-base quality scores. Each record has four lines: a header (starting with @), the sequence, a separator line (+), and a quality string where each ASCII character encodes a Phred quality score. A Phred score of Q30 means a 1-in-1,000 chance of error (99.9% accuracy). FASTQ is the standard output format for DNA sequencing instruments.
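The relationship between a quality character, its Phred score Q, and the error probability p = 10^(-Q/10) can be sketched in a few lines of Python. The ASCII offset of 33 used below is the Sanger convention also used by current Illumina instruments; older files may use an offset of 64, so treat the default as an assumption. Function names are illustrative.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred score Q into an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


scores = decode_quality("I?5+!")
print(scores)  # [40, 30, 20, 10, 0]
for q in scores:
    print(q, phred_to_error_probability(q))
```

Each step of 10 in Q corresponds to a tenfold change in error probability, so Q20 is a 1-in-100 chance of error and Q40 is 1 in 10,000.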
GenBank (NCBI) and EMBL (EBI) formats are richly annotated flat-file formats that include not just the sequence but also gene coordinates, protein products, literature references, and taxonomic information. These are the primary formats used by major sequence databases.
Three databases form the backbone of modern genomics:
- NCBI (National Center for Biotechnology Information) — hosts GenBank (nucleotide sequences), RefSeq (curated reference sequences), SRA (raw sequencing reads), and many other resources
- UniProt — the definitive protein sequence and annotation database, combining Swiss-Prot (manually curated) and TrEMBL (automatically annotated)
- Ensembl — provides genome assemblies, gene annotations, and comparative genomics data with powerful browser and API access
Every sequence deposited in these databases receives a unique accession number (e.g., NM_000518.5 for human beta-globin mRNA). Accession numbers are versioned, so updates can be tracked over time. Cross-references link entries across databases, so a single gene can be followed from its DNA sequence in GenBank to its protein in UniProt to its genomic context in Ensembl.
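Because the version is appended after a final dot, a versioned accession can be split into its stable identifier and version number with simple string handling. The helper below is a hypothetical illustration in plain Python, not part of any database client.

```python
def split_accession(versioned: str) -> tuple[str, int]:
    """Split 'ACCESSION.VERSION' into (accession, version)."""
    accession, _, version = versioned.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not a versioned accession: {versioned!r}")
    return accession, int(version)


print(split_accession("NM_000518.5"))  # ('NM_000518', 5)
```

Comparing the stable part tells you whether two records describe the same entry; comparing the version tells you whether the sequence has been revised.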
The growth of sequence data has been exponential. The first complete genome of a free-living organism (Haemophilus influenzae, 1995) contained 1.8 million base pairs. Today, public databases hold sequences from millions of organisms totaling trillions of base pairs, and the total has been doubling approximately every 18 months.
let fasta = ">beta_globin Human HBB gene\nATGGTGCATCTGACTCCTGA\n>insulin Human INS gene\nATGGCCCTGTGGATGCGC"
let records = IO.parse_fasta(fasta)
print("Parsed " + records.length + " FASTA records:")
print(records)
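The FASTA rules described above (a header line starting with >, followed by one or more sequence lines) are simple enough to parse by hand. The sketch below is a minimal plain-Python parser for illustration only; it is not the implementation of IO.parse_fasta, and the record layout (a dictionary with id, description, and sequence keys) is an assumption made for this example.

```python
def parse_fasta(text: str) -> list[dict[str, str]]:
    """Parse FASTA text into records with 'id', 'description', 'sequence'."""
    records: list[dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header: the identifier runs up to the first space.
            seq_id, _, description = line[1:].partition(" ")
            records.append({"id": seq_id, "description": description, "sequence": ""})
        elif records:
            # Sequence may be wrapped over several lines; join them.
            records[-1]["sequence"] += line
    return records


fasta = ">beta_globin Human HBB gene\nATGGTGCATCTGACTCCTGA\n>insulin Human INS gene\nATGGCCCTGTGGATGCGC"
records = parse_fasta(fasta)
print(len(records))  # 2
for record in records:
    print(record["id"], len(record["sequence"]))
```

For real genome-scale files, a streaming parser from an established library is a better choice than reading the whole text into memory.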
Exercise: Parse and Count FASTA Records
Parse a multi-record FASTA file and determine how many sequences it contains. This is the first step in any sequence analysis pipeline.
let fasta = ">seq1 Gene A\nATGGCTAGCAAA\n>seq2 Gene B\nATGAAACGTGAT\n>seq3 Gene C\nATGGTGCATCTG"
let records = IO.parse_fasta(fasta)
print(records)
let count = records.length
print(count)
Basic String Operations on Biological Sequences
Several fundamental string operations arise repeatedly in bioinformatics. Understanding them is essential before moving to more complex analyses like alignment or assembly.
Length is the simplest property — the number of characters (bases or amino acids) in a sequence. DNA sequence lengths range from a few hundred bases (a single gene) to billions (a complete genome).
Composition describes the frequency of each character. For DNA, the most common summary statistic is GC content — the fraction of bases that are G or C. GC content varies widely across organisms (from ~25% in Mycoplasma to ~72% in some Streptomyces species) and even within a genome (GC-rich and GC-poor isochores in vertebrates).
Reverse complement converts a DNA sequence to the sequence of the opposite strand read in the 5′→3′ direction. This operation is critical because genes can be encoded on either strand. A genome’s two strands together provide six reading frames — three on the forward strand (starting at positions 1, 2, and 3) and three on the reverse strand — and a gene could occupy any one of them.
Codon usage measures the frequency of each three-letter codon in a coding sequence. Because the genetic code is degenerate (most amino acids are specified by more than one codon), different organisms show preferences for particular synonymous codons — a phenomenon called codon usage bias. Highly expressed genes tend to use codons matched to the most abundant tRNAs, optimizing translation speed.
let gene1 = "ATGAAAGCGATTCTGACCGAATGA"
let gene2 = "ATGAAATTTAATACTGAAGATAAATGA"
print("Gene 1 codon usage:")
print(Seq.codon_usage(gene1))
print("Gene 2 codon usage:")
print(Seq.codon_usage(gene2))
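Codon usage differs from k-mer counting in one important way: codons are read in frame, in non-overlapping steps of three. The plain-Python sketch below shows that difference; it is an illustration, not the code behind Seq.codon_usage, and it assumes the sequence starts at the first base of a codon.

```python
from collections import Counter


def codon_usage(coding_dna: str) -> Counter:
    """Count in-frame, non-overlapping codons, ignoring any trailing partial codon."""
    usable = len(coding_dna) - len(coding_dna) % 3
    return Counter(coding_dna[i:i + 3] for i in range(0, usable, 3))


gene2 = "ATGAAATTTAATACTGAAGATAAATGA"
usage = codon_usage(gene2)
print(usage["AAA"])         # 2
print(sum(usage.values()))  # 9
```

In this 27-base example the lysine codon AAA is used twice and the synonymous codon AAG not at all, a very small-scale instance of the kind of preference that codon usage bias describes.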
Exercise: Analyze an Unknown Gene
You are given a DNA sequence from a newly sequenced bacterial genome. Compute its reverse complement, then translate the original (forward-strand) sequence to determine the protein it encodes.
let gene = "ATGAAAGCGATCCTGACCGAGTGA"
let rc = Seq.reverse_complement(gene)
let protein = Seq.translate(gene)
print("Reverse complement: " + rc)
print(protein)
Exercise: Compare Sequence Composition
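G-C base pairs, with their three hydrogen bonds, are more thermally stable than A-T pairs, which has led to the hypothesis that thermophilic organisms (those that thrive at high temperatures) favor GC-rich sequences. The trend is clearest in structural RNA genes such as rRNA and tRNA and is much weaker across whole genomes, so treat GC content as a hint rather than proof. Compare the GC content of these two sequences and identify which is the more likely candidate for a thermophile under this hypothesis.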
let seq_a = "ATGAAATTTGATACTGAAGATAAATGA"
let seq_b = "ATGGCGCCGCTGACCGCGGCGATCCGG"
let gc_a = Seq.gc_content(seq_a)
let gc_b = Seq.gc_content(seq_b)
print("Organism A GC: " + gc_a)
print("Organism B GC: " + gc_b)
let answer = "Thermophile"
print(answer)
Summary
In this lesson you covered the universal features of cells and the computational foundations for working with biological sequences:
- All cells store hereditary information in DNA — a double-stranded molecule with complementary base pairing (A-T with 2 hydrogen bonds, G-C with 3)
- DNA replication is semiconservative — each strand templates a new copy with an error rate of ~1 per 10⁹ bases
- The central dogma — DNA is transcribed to mRNA (T→U), which is translated into protein via a nearly universal genetic code
- A gene encodes one functional product (protein or RNA) — humans have ~20,000 protein-coding genes in 3.2 billion base pairs
- Life requires free energy — obtained by phototrophy or chemotrophy, with ATP as the universal energy currency
- All cells use the same building blocks — 20 amino acids, 4 DNA bases, and conserved metabolic pathways like glycolysis
- The plasma membrane — a lipid bilayer with embedded transport proteins that controls what enters and exits the cell
- Minimal genomes — a cell able to grow independently in culture can exist with fewer than 500 protein-coding genes (M. genitalium: ~475)
- Biological sequences are strings — DNA, RNA, and protein can be represented as characters over defined alphabets
- Standard formats — FASTA (sequences), FASTQ (sequences + quality), GenBank (annotated sequences)
- Major databases — NCBI, UniProt, and Ensembl store and organize the world’s sequence data
- Basic string operations — length, GC content, reverse complement, and codon usage are the starting tools of sequence analysis
References
- Alberts B, Johnson A, Lewis J, Morgan D, Raff M, Roberts K, Walter P. Molecular Biology of the Cell, 7th ed. New York: W.W. Norton; 2022. Chapter 1: Cells and Genomes.
- Watson JD, Crick FHC. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature. 1953;171(4356):737–738.
- Meselson M, Stahl FW. The replication of DNA in Escherichia coli. Proc Natl Acad Sci USA. 1958;44(7):671–682.
- Crick F. Central dogma of molecular biology. Nature. 1970;227(5258):561–563.
- Hutchison CA III, Chuang RY, Noskov VN, et al. Design and synthesis of a minimal bacterial genome. Science. 2016;351(6280):aad6253.
- Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010;38(6):1767–1771.
- NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2022;50(D1):D13–D16. https://www.ncbi.nlm.nih.gov/
- The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023;51(D1):D523–D531. https://www.uniprot.org/