The RNA World and the Origins of Life
Explore the evidence for an RNA world before DNA and protein — from RNA's catalytic abilities and self-replicating molecules to computational methods for predicting RNA structure and searching for ribozymes in genomes.
Introduction
Today’s cells use DNA to store genetic information, RNA as an intermediate, and proteins to catalyze most reactions. But this elegant division of labor raises a profound chicken-and-egg problem: DNA needs proteins to replicate, yet the instructions for making proteins are encoded in DNA. Which came first?
The RNA world hypothesis proposes that before both DNA and protein, RNA served as both the genetic material and the catalyst, with a single type of molecule performing both roles. This idea is supported by discoveries about what RNA can do: RNA molecules can fold into complex three-dimensional structures, catalyze chemical reactions, and, in laboratory experiments, copy other RNA molecules. In modern cells, RNA retains several catalytic functions that may date from that era, most notably in the ribosome, where rRNA catalyzes peptide bond formation.
This lesson explores the evidence for the RNA world, how RNA structure enables catalysis, and the computational tools for predicting RNA structure and identifying catalytic RNAs in genomes.
Single-Stranded RNA Molecules Can Fold into Highly Elaborate Structures
Unlike DNA, which typically forms a regular double helix, RNA is usually single-stranded and folds back on itself to form complex three-dimensional structures. The key structural elements include:
- Stems (helical regions) — formed by Watson-Crick base pairing between complementary regions of the same strand (A-U and G-C)
- Loops — unpaired regions that connect stems (hairpin loops, internal loops, bulges, multi-branch junctions)
- Non-canonical base pairs — G-U wobble pairs (very common in RNA), A-G pairs, and others that are rare in DNA but frequent in RNA
- Tertiary interactions — long-range contacts that fold the secondary structure into a compact 3D shape (pseudoknots, base triples, ribose zippers, A-minor motifs)
RNA also has a 2′-hydroxyl group on its ribose sugar (absent in DNA’s deoxyribose), which participates in hydrogen bonding and makes the RNA backbone more chemically versatile — and more susceptible to hydrolysis.
The ability of RNA to form complex, specific three-dimensional structures is the foundation of its catalytic activity. Just as protein enzymes have precisely shaped active sites, catalytic RNAs (ribozymes) use their folded structures to bind substrates and catalyze reactions.
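The pairing rules listed above (Watson-Crick pairs plus the G-U wobble) can be made concrete with a minimal Python sketch. The function name `hairpin_stem`, the example sequence, and the minimum loop size of 3 are illustrative assumptions. The sketch only zips the two ends of a sequence together into a single stem and ignores energies, bulges, and tertiary contacts.

```python
# Canonical Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def hairpin_stem(seq: str, min_loop: int = 3) -> str:
    """Pair the two ends of seq inward until a mismatch is reached or the
    hairpin loop would drop below min_loop unpaired bases.
    Returns the result in dot-bracket notation."""
    structure = ["."] * len(seq)
    i, j = 0, len(seq) - 1
    while j - i > min_loop and (seq[i], seq[j]) in PAIRS:
        structure[i], structure[j] = "(", ")"
        i += 1
        j -= 1
    return "".join(structure)


# Hypothetical 12-nt hairpin: a four-pair stem (two of them G-U wobbles)
# closing a four-base loop.
print(hairpin_stem("GGCUGAAAGGUC"))  # ((((....))))
```

Real RNAs contain many stems, internal loops, and junctions, so this is a way to read the pairing rules, not a structure predictor.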
Self-Replicating Molecules Underwent Natural Selection
The RNA world hypothesis proposes a scenario for the origin of life:
- Abiotic synthesis of nucleotides and short RNA polymers occurred in a prebiotic environment (driven by UV radiation, volcanic energy, or mineral surface catalysis)
- Some RNA molecules could catalyze their own replication (or the replication of other RNA molecules), however imperfectly
- Replication with imperfect copying created a population of diverse RNA molecules
- Natural selection favored molecules that replicated faster, more accurately, or in ways that increased their own survival — launching the evolutionary process before cells or DNA existed
Laboratory evolution experiments have provided support for this scenario. In the 1960s, Sol Spiegelman demonstrated that RNA molecules evolving in a test tube (with the enzyme Qβ replicase providing the copying machinery) rapidly evolved to become shorter and replicate faster. More recently, researchers have created RNA polymerase ribozymes — RNA molecules that can copy other RNA molecules — though none yet achieve the efficiency needed for full self-replication.
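The logic of selection among replicators can be illustrated with a deliberately simple, deterministic toy model in Python. It is not a reconstruction of Spiegelman's experiment. The two replicator classes, their replication rates, and the starting frequency are hypothetical values chosen only to show how a faster-copying variant takes over a population.

```python
def select(freq_fast: float, rate_fast: float, rate_slow: float,
           generations: int) -> list[float]:
    """Track the frequency of a fast replicator competing with a slow one.

    Each generation, every molecule leaves offspring in proportion to its
    replication rate, and the population is renormalized to frequencies.
    """
    history = [freq_fast]
    for _ in range(generations):
        fast = freq_fast * rate_fast
        slow = (1.0 - freq_fast) * rate_slow
        freq_fast = fast / (fast + slow)
        history.append(freq_fast)
    return history


# Hypothetical numbers: the fast variant starts at 1% of the population
# and replicates 1.5 times as quickly as the slow one.
history = select(freq_fast=0.01, rate_fast=1.5, rate_slow=1.0, generations=20)
for gen in (0, 5, 10, 20):
    print(f"generation {gen:2d}: fast replicator frequency = {history[gen]:.3f}")
```

Even a modest rate advantage compounds every generation, which is why shorter, faster-copying molecules can come to dominate a test-tube population.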
The transition from an RNA world to the modern DNA/protein world likely occurred gradually:
- Proteins may have first appeared as short peptides synthesized on RNA templates, eventually outcompeting RNA as catalysts due to the greater chemical diversity of the 20 amino acids
- DNA may have replaced RNA as the genetic material because its deoxyribose sugar makes it more chemically stable (the 2′-OH of RNA makes it susceptible to alkaline hydrolysis)
- The ribosome — an RNA machine that synthesizes proteins — may be a relic of this transition, preserving the RNA-catalyzed peptide synthesis of the RNA world
RNA in Present-Day Cells Serves as Specialized Adapter and Catalyst
Modern cells retain multiple RNA molecules with catalytic and structural roles that may be vestiges of the RNA world:
| RNA | Catalytic/structural role | Evidence for RNA world origin |
|---|---|---|
| rRNA (ribosome) | Catalyzes peptide bond formation | The most fundamental protein synthesis activity is RNA-catalyzed |
| tRNA | Adaptor for translation | May have evolved from a simpler aminoacylated RNA |
| snRNA (spliceosome) | Catalyzes intron removal | Spliceosome mechanism resembles group II self-splicing introns |
| RNase P | Cleaves pre-tRNA | The catalytic subunit is RNA, not protein |
| SRP RNA | Part of signal recognition particle | RNA component is essential for function |
| Telomerase RNA | Template for telomere extension | RNA component provides both template and scaffolding |
Additionally, riboswitches (structured RNA elements, typically in the 5′ UTR of bacterial mRNAs, that directly sense small-molecule metabolites and regulate gene expression) demonstrate that RNA can serve as a sensor and regulatory switch without any protein partner. Riboswitches bind metabolites such as S-adenosylmethionine (SAM), thiamine pyrophosphate (TPP), and flavin mononucleotide (FMN) with affinities comparable to those of protein enzymes for their ligands.
let ribosomal_gene = "ATGGCTAGCAAAGACTTCACCGAGTACCTGCAGAACCTGATCGGCAAATGA"
let protein = Seq.translate(ribosomal_gene)
print("A ribosomal protein gene: " + ribosomal_gene)
print("Protein product: " + protein)
print("GC content: " + Seq.gc_content(ribosomal_gene))
Ribosomal protein genes are among the most highly conserved across all life, reflecting the ancient and essential nature of the translation machinery. Yet the catalytic core of the ribosome is RNA, not protein, and the ribosomal proteins are thought to have been added later in evolution.
How Did Protein Synthesis Evolve?
The evolution of protein synthesis — the translation apparatus — is one of the deepest questions in biology. The ribosome is so complex that it could not have arisen fully formed. A plausible scenario involves a series of incremental steps:
- Random aminoacylation of RNA — short RNAs became covalently linked to amino acids by chance
- Selection for RNA-amino acid combinations that improved ribozyme function — amino acids added chemical versatility to RNA catalysts
- A primitive peptidyl transferase — an RNA catalyst that could join amino acids together, perhaps initially producing very short peptides
- Evolution of the genetic code — initially crude and ambiguous, gradually becoming more specific as the system became more complex
- The modern ribosome — a large, sophisticated machine optimized over billions of years of evolution
The near-universality of the genetic code across all life (only minor variants are known, for example in mitochondria) is strong evidence that the translation machinery evolved only once and was inherited by all descendant lineages. The code itself may be partly a “frozen accident”: once established, any change would be catastrophic because it would alter nearly every protein in the cell. It also appears partly optimized by natural selection, since amino acids with similar chemical properties tend to have similar codons, which minimizes the impact of mutations.
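The claim that similar amino acids share similar codons can be checked directly against the standard genetic code. The following Python sketch builds the codon table from the conventional 64-letter amino acid string (codons ordered with bases U, C, A, G) and groups amino acids by the middle base of their codons. The variable names are illustrative.

```python
from collections import defaultdict

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
# (first base varies slowest, third base fastest); "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Group amino acids by the second (middle) base of their codons.
by_second_base = defaultdict(set)
for codon, amino_acid in CODON_TABLE.items():
    by_second_base[codon[1]].add(amino_acid)

for base in BASES:
    print(base, "".join(sorted(by_second_base[base])))
```

Every codon with U in the middle position encodes a hydrophobic amino acid (F, L, I, M, or V), so a mutation in the first or third base of such a codon still yields a hydrophobic residue.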
RNA Secondary Structure Prediction
Computational prediction of RNA structure begins with secondary structure — the pattern of base-paired stems and unpaired loops:
- RNAfold (ViennaRNA package) — predicts the minimum free energy (MFE) secondary structure by dynamic programming, using thermodynamic parameters for base pairing, stacking, and loop energies. Also computes base-pairing probabilities and generates dot-bracket notation (e.g., `(((...)))` for a stem-loop)
- Mfold — a classic web server for RNA secondary structure prediction, generating multiple near-optimal structures
- CentroidFold — uses a probabilistic model to predict the structure that maximizes expected accuracy rather than minimizing free energy
The minimum free energy approach has limitations: it assumes the RNA folds to its thermodynamic equilibrium structure, which may not be the case in the cell (where folding occurs co-transcriptionally and is influenced by protein binding). Comparative analysis — identifying base pairs that are conserved across related sequences (compensatory mutations) — is often more accurate than thermodynamic prediction alone.
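To show what "dynamic programming over base pairs" means, here is a minimal Python sketch of the Nussinov algorithm. This is a simpler stand-in, not what RNAfold does: it maximizes the number of base pairs and does not minimize free energy with thermodynamic parameters. The function name, the minimum loop size of 3, and the example sequence are assumptions made for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq: str, min_loop: int = 3) -> tuple[int, str]:
    """Return (maximum number of nested base pairs, one optimal dot-bracket)."""
    n = len(seq)
    # score[i][j] = maximum number of pairs within seq[i..j] (inclusive).
    score = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = score[i + 1][j]  # case 1: base i stays unpaired
            for k in range(i + min_loop + 1, j + 1):  # case 2: i pairs with k
                if (seq[i], seq[k]) in PAIRS:
                    right = score[k + 1][j] if k < j else 0
                    best = max(best, 1 + score[i + 1][k - 1] + right)
            score[i][j] = best

    # Traceback: recover one structure that achieves the optimal score.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if score[i][j] == score[i + 1][j]:
            stack.append((i + 1, j))
            continue
        for k in range(i + min_loop + 1, j + 1):
            if (seq[i], seq[k]) in PAIRS:
                right = score[k + 1][j] if k < j else 0
                if score[i][j] == 1 + score[i + 1][k - 1] + right:
                    structure[i], structure[k] = "(", ")"
                    stack.append((i + 1, k - 1))
                    if k < j:
                        stack.append((k + 1, j))
                    break
    return (score[0][n - 1] if n else 0), "".join(structure)


best_pairs, structure = nussinov("GGGAAAUCCC")
print(best_pairs, structure)
```

Several different structures can share the same maximum pair count, so the traceback returns only one of them. Energy-based methods break such ties using stacking and loop energies.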
RNA 3D Structure Modeling
Beyond secondary structure, several tools predict or model RNA three-dimensional structures:
- RNAComposer — builds 3D structures by assembling fragments from a library of known RNA structures, guided by the predicted secondary structure
- Rosetta (FARFAR2) — uses fragment assembly and energy minimization to predict 3D RNA structures ab initio
- MC-Fold/MC-Sym — predicts secondary structure with non-canonical base pairs and builds 3D models
Experimental methods for determining RNA 3D structure include X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy, with the latter becoming increasingly powerful for large RNA complexes like the ribosome and spliceosome.
let rna_like = "GCGCUAGCGCGCUAGCGC"
let gc = Seq.gc_content(rna_like)
print("RNA-like sequence: " + rna_like)
print("GC content: " + gc)
print("Length: " + Seq.length(rna_like) + " nt")
Exercise: RNA Composition Analysis
Compare the GC content and dinucleotide frequencies of two RNA-like sequences. In GC-rich RNA, which dinucleotide do you expect to be most common?
let rna_gc_rich = "GCGCGCGCUAGCGCGCGCGC"
let rna_balanced = "AUGCUAGCUAGCUAGCUAGC"
print("GC-rich GC content: " + Seq.gc_content(rna_gc_rich))
print("Balanced GC content: " + Seq.gc_content(rna_balanced))
print("GC-rich dinucleotides:")
print(Seq.kmer_count(rna_gc_rich, 2))
let most_common = "GC"
print(most_common)
High GC content in an RNA sequence typically indicates the potential for extensive base pairing and stable secondary structures, as G-C pairs have three hydrogen bonds (compared to two for A-U pairs).
Ribozyme and Riboswitch Identification from Genome Sequences
Ribozymes (catalytic RNAs) and riboswitches (metabolite-sensing RNA elements) can be identified in genomes through sequence and structure searches:
- Infernal (INFERence of RNA ALignment) — the standard tool for searching genomes for homologs of known RNA families. It uses covariance models (CMs) — probabilistic models that describe both the sequence and the secondary structure of an RNA family, capturing the pattern of base-pair conservation (compensatory mutations)
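- cmscan — the Infernal program that scans a query sequence (such as a genome) against a database of CMs to find matches to known RNA families; its counterpart, cmsearch, searches a single CM against a sequence database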
Covariance models are the RNA equivalent of profile hidden Markov models (HMMs) used for protein families. They are essential because RNA homology is defined as much by conserved structure as by conserved sequence — two homologous RNAs may have very different sequences but maintain the same base-paired secondary structure through compensatory mutations.
The Rfam Database
Rfam is the comprehensive database of RNA families, analogous to Pfam for protein domains. Each Rfam family is defined by:
- A seed alignment of representative sequences with annotated secondary structure
- A covariance model built from the seed alignment
- A full alignment of all detected family members from genome databases
Rfam catalogs riboswitches, ribozymes, rRNAs, tRNAs, snRNAs, snoRNAs, miRNAs, lncRNAs, and many other non-coding RNA families. As of recent releases, it contains over 4,000 families representing the known diversity of functional RNAs. Rfam searches are a standard component of genome annotation pipelines, identifying non-coding RNA genes that would be missed by protein-coding gene prediction algorithms.
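Rfam seed alignments are distributed in Stockholm format, where a `#=GC SS_cons` line carries the consensus secondary structure using paired bracket characters. The sketch below is a simplified Python reader, not a full Stockholm or WUSS parser. The two-sequence alignment is a made-up toy record, and pseudoknot letters and other annotation lines are ignored.

```python
OPEN_TO_CLOSE = {"<": ">", "(": ")", "[": "]", "{": "}"}
CLOSE_TO_OPEN = {close: open_ for open_, close in OPEN_TO_CLOSE.items()}

# Hypothetical toy alignment, not a real Rfam family.
TOY_STOCKHOLM = """\
# STOCKHOLM 1.0
seq_a         GGCUGAAAGGUC
seq_b         CACAGAAAUGUG
#=GC SS_cons  <<<<....>>>>
//
"""


def parse_stockholm(text: str) -> tuple[dict[str, str], str]:
    """Return ({sequence name: aligned sequence}, consensus structure string)."""
    sequences: dict[str, str] = {}
    ss_cons = ""
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//":
            continue
        if line.startswith("#=GC SS_cons"):
            ss_cons += line.split()[2]
        elif line.startswith("#"):
            continue  # header and other annotation lines are skipped
        else:
            name, aligned = line.split()
            sequences[name] = sequences.get(name, "") + aligned
    return sequences, ss_cons


def paired_columns(ss_cons: str) -> list[tuple[int, int]]:
    """Match each closing bracket to its opener, one stack per bracket type."""
    stacks = {open_: [] for open_ in OPEN_TO_CLOSE}
    pairs = []
    for column, char in enumerate(ss_cons):
        if char in stacks:
            stacks[char].append(column)
        elif char in CLOSE_TO_OPEN:
            pairs.append((stacks[CLOSE_TO_OPEN[char]].pop(), column))
    return sorted(pairs)


sequences, ss_cons = parse_stockholm(TOY_STOCKHOLM)
for i, j in paired_columns(ss_cons):
    print(i, j, [seq[i] + "-" + seq[j] for seq in sequences.values()])
```

In this toy record the two sequences differ at every stem position yet both satisfy the same consensus structure, which is the kind of signal a covariance model is built to capture.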
Comparative RNA Structure Analysis
Comparative analysis is the gold standard for determining RNA secondary structure:
- Collect homologous RNA sequences from multiple species
- Build a multiple sequence alignment
- Identify compensatory base changes — positions where both members of a base pair mutate together (e.g., G-C in one species, A-U in another) while maintaining base pairing
- These covarying positions identify true base pairs with high confidence
Infernal's covariance models formalize this approach computationally. Comparative analysis was used to determine the secondary structures of rRNAs, tRNAs, and many other functional RNAs long before crystal structures were available, and its predictions have been spectacularly confirmed by subsequent structural data.
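The covariation step can be sketched with mutual information between alignment columns, one common way to score compensatory changes. This is an illustration in Python, not the scoring that Infernal performs internally, and the four-sequence alignment is a hypothetical toy example.

```python
import math
from collections import Counter


def mutual_information(alignment: list[str], i: int, j: int) -> float:
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    joint = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Hypothetical toy alignment of a 7-nt hairpin from four "species".
# Columns 0 and 6 change together (G-C, A-U, C-G, U-A);
# columns 1 and 5 are perfectly conserved (always G and C).
ALIGNMENT = [
    "GGAAACC",
    "AGAAACU",
    "CGAAACG",
    "UGAAACA",
]

print("MI(0, 6) =", mutual_information(ALIGNMENT, 0, 6))
print("MI(1, 5) =", mutual_information(ALIGNMENT, 1, 5))
```

Columns 0 and 6 reach the maximum of 2 bits because each base in one column fully determines its partner. The conserved pair at columns 1 and 5 scores 0, which shows that covariation detects pairs only where the sequences actually vary.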
let species_1 = "ATGGCTAGCAAAGACTTCACCGAG"
let species_2 = "ATGGCTAGCAAAGACTTCGCCAAG"
let result = Align.global(species_1, species_2)
print("Alignment of conserved RNA gene region:")
print("Score: " + result.score)
print(result.alignment)
Aligning homologous RNA gene sequences across species reveals which positions are conserved (functionally important) and which covary (base-paired). This principle underlies all covariance model-based RNA analysis.
Exercise: Analyze an RNA Gene Sequence
Examine the GC content of an RNA gene and its reverse complement. For structured RNAs, high GC content correlates with structural stability.
let rna_gene = "ATGGCTAGCAAAGACTTCACCGAG"
let rc = Seq.reverse_complement(rna_gene)
print("RNA gene: " + rna_gene + " GC: " + Seq.gc_content(rna_gene))
print("RevComp: " + rc + " GC: " + Seq.gc_content(rc))
let mrna = Seq.transcribe(rc)
print(mrna)
Exercise: Compare Conservation of RNA vs. Protein Genes
Compare the alignment scores of a highly conserved ribosomal RNA gene region versus a rapidly evolving sequence. Structural RNAs tend to be among the most conserved sequences in genomes.
let rrna_a = "ATGGCTAGCAAAGACTTCACCGAG"
let rrna_b = "ATGGCTAGCAAAGACTTCACCGAG"
let fast_a = "ATGGCTAGCAAAGACTTCACCGAG"
let fast_b = "ATGCATACCGAAGATTTCGCCAAG"
let score_rrna = Align.global(rrna_a, rrna_b).score
let score_fast = Align.global(fast_a, fast_b).score
print("rRNA gene score: " + score_rrna)
print("Fast-evolving score: " + score_fast)
let answer = "rRNA"
print(answer)
Summary
In this lesson you covered the RNA world and RNA structure analysis:
- The RNA world hypothesis proposes that RNA served as both genetic material and catalyst before DNA and proteins evolved
- Single-stranded RNA folds into elaborate structures through stems, loops, non-canonical base pairs, and tertiary interactions
- Self-replicating RNA molecules could have undergone natural selection, launching evolution before cells existed
- The transition from RNA to DNA/protein was likely gradual: proteins offered greater chemical diversity; DNA offered greater chemical stability
- Modern RNAs retain catalytic roles: rRNA (peptide bond formation), snRNA (splicing), RNase P (tRNA processing), riboswitches (metabolite sensing)
- The near-universality of the genetic code is evidence that translation evolved once; the code is partly a frozen accident and partly optimized by selection
- RNA secondary structure prediction (RNAfold, Mfold) uses thermodynamic models and dynamic programming
- RNA 3D structure modeling (RNAComposer, Rosetta FARFAR2) builds three-dimensional models from sequence
- Ribozymes and riboswitches are identified in genomes using Infernal and covariance models (CMs) that capture both sequence and structure
- Rfam is the comprehensive database of RNA families (>4,000 families), analogous to Pfam for proteins
- Comparative RNA structure analysis uses compensatory base changes across species to identify true base pairs, confirming secondary structure predictions
References
- Alberts B, Johnson A, Lewis J, Morgan D, Raff M, Roberts K, Walter P. Molecular Biology of the Cell, 7th ed. New York: W.W. Norton; 2022. Chapter 6: How Cells Read the Genome: From DNA to Protein.
- Nirenberg MW, Matthaei JH. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proc Natl Acad Sci USA. 1961;47(10):1588–1602.
- Crick FHC, Barnett L, Brenner S, Watts-Tobin RJ. General nature of the genetic code for proteins. Nature. 1961;192(4809):1227–1232.
- Barrell BG, Bankier AT, Drouin J. A different genetic code in human mitochondria. Nature. 1979;282(5735):189–194.
- Knight RD, Freeland SJ, Landweber LF. Rewiring the keyboard: evolvability of the genetic code. Nat Rev Genet. 2001;2(1):49–58.
- Lorenz R, Bernhart SH, Höner zu Siederdissen C, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011;6:26.
- Chan PP, Lowe TM. tRNAscan-SE: searching for tRNA genes in genomic sequences. Methods Mol Biol. 2019;1962:1–14.
- Kalvari I, Argasinska J, Quinones-Olvera N, et al. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 2018;46(D1):D335–D342. https://rfam.org/