Analyzing DNA: From Cloning to Sequencing

Intermediate Experimental Methods ~40 min

Learn how DNA is manipulated, sequenced, and analyzed — from restriction enzymes, cloning, and PCR to next-generation sequencing technologies, genome assembly, variant calling, and CRISPR-based functional genomics.

Introduction

The ability to isolate, copy, sequence, and manipulate specific DNA sequences is the technical foundation of modern molecular biology and genomics. What began with restriction enzymes and DNA cloning in the 1970s has evolved into an extraordinary toolkit that includes the polymerase chain reaction, massively parallel sequencing, and CRISPR-based gene editing.

This lesson covers the key techniques for analyzing and manipulating DNA — from classical methods that remain essential to cutting-edge sequencing technologies and genome-scale functional screens — and the computational pipelines that transform raw sequencing data into biological insights.

Restriction Nucleases Cut DNA at Defined Sequences

Restriction enzymes (restriction endonucleases) are bacterial enzymes that cut double-stranded DNA at specific recognition sequences, usually 4–8 bp palindromes. Bacteria use them as a defense against bacteriophage DNA; their own DNA is protected by methylation at the recognition sites.

Different restriction enzymes cut at different sequences:

EcoRI recognizes GAATTC (cuts between G and A)
HindIII recognizes AAGCTT
BamHI recognizes GGATCC

Some enzymes cut to produce sticky ends (short single-stranded overhangs that can base-pair with complementary overhangs), while others produce blunt ends. Sticky ends are especially useful for cloning because fragments from different sources can be joined efficiently whenever their overhangs are complementary, as they are when both were cut with the same enzyme. Blunt ends can also be ligated, but less efficiently.
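
As a rough illustration of how recognition sites define fragments, the Python sketch below scans a linear sequence for a site and cuts at a fixed offset. The function names, the example sequence, and the enzyme table are illustrative assumptions (offsets follow the standard top-strand cut positions G^AATTC, A^AGCTT, and G^GATCC). Only the top strand is modeled, so overhangs are not represented.

```python
# Toy single-strand model of a restriction digest (top strand only).
# Each entry: recognition site and cut offset within the site on the top strand.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
    "BamHI": ("GGATCC", 1),    # G^GATCC
}

def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions

def digest(seq, enzyme):
    """Cut a linear sequence at every site for `enzyme` and return the fragments."""
    site, offset = ENZYMES[enzyme]
    cuts = [pos + offset for pos in find_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

example = "ATCGGAATTCTTAGCGGATCCAAGAATTCGG"  # made-up sequence for illustration
fragments = digest(example, "EcoRI")
print(fragments)
```

Joining the fragments back together reproduces the input, which is a quick sanity check that no bases were lost at the cut sites.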

Gel Electrophoresis Separates DNA by Size

Gel electrophoresis separates DNA fragments by size. DNA is loaded onto an agarose (for large fragments) or polyacrylamide (for small fragments) gel and an electric field is applied. Because DNA is negatively charged (due to its phosphate backbone), it migrates toward the positive electrode. Smaller fragments migrate faster through the gel pores, producing size-based separation.

Fragment sizes are determined by comparison to a molecular weight ladder — a set of DNA fragments of known sizes run alongside the samples. Ethidium bromide or SYBR staining allows visualization of DNA bands under UV light.
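
Reading a size off a ladder can be sketched as interpolation. The example below assumes that log10(fragment size) falls roughly linearly with migration distance between neighboring ladder bands, an approximation that holds only over a limited size range. The function name and the ladder measurements are hypothetical.

```python
import math

def estimate_size(distance_mm, ladder):
    """Estimate fragment size (bp) by interpolating log10(size) against
    migration distance between the two flanking ladder bands.
    `ladder` is a list of (distance_mm, size_bp) pairs."""
    bands = sorted(ladder)  # ascending distance, so descending size
    for (d1, s1), (d2, s2) in zip(bands, bands[1:]):
        if d1 <= distance_mm <= d2:
            frac = (distance_mm - d1) / (d2 - d1)
            log_size = math.log10(s1) + frac * (math.log10(s2) - math.log10(s1))
            return 10 ** log_size
    raise ValueError("band lies outside the ladder range")

# Hypothetical ladder measurements: (distance migrated in mm, size in bp)
ladder = [(10.0, 10000), (20.0, 1000), (30.0, 100)]
print(round(estimate_size(25.0, ladder)))  # band halfway between 1,000 bp and 100 bp
```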

DNA Cloning Using Bacteria

DNA cloning produces many identical copies of a DNA fragment by inserting it into a vector and replicating it in bacteria:

The target DNA and vector are cut with the same restriction enzyme, creating compatible sticky ends
The fragment is ligated (joined by DNA ligase) into the vector
The recombinant vector is transformed into bacteria
Bacteria containing the vector are selected (typically by antibiotic resistance encoded on the vector)
As bacteria grow, they replicate the vector and its insert, producing millions of copies

Common vectors include:

Plasmids — circular DNA molecules (carry inserts up to ~15 kb)
BACs (bacterial artificial chromosomes) — carry inserts of 100–300 kb
Phage vectors — carry inserts of 10–20 kb

A DNA library is a collection of clones that together represent an entire genome (genomic library) or all the mRNAs in a cell (cDNA library, made by reverse-transcribing mRNA). Genomic libraries contain all sequences including introns and intergenic regions; cDNA libraries contain only expressed sequences.

Hybridization Detects Specific Nucleotide Sequences

Hybridization exploits the base-pairing rules to detect specific DNA or RNA sequences. A labeled probe (a single-stranded DNA or RNA with a known sequence) is incubated with denatured target DNA. If the probe finds a complementary sequence, it base-pairs to form a stable hybrid.

Applications include Southern blotting (DNA), Northern blotting (RNA), fluorescence in situ hybridization (FISH) for localizing sequences on chromosomes, and microarrays (thousands of probes on a chip for genome-wide analysis).
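
The base-pairing logic behind a probe can be sketched in a few lines of Python: a probe binds wherever the single-stranded target matches the probe's reverse complement. The function name and the `max_mismatches` parameter are assumptions made for illustration, and the probe is hypothetical. The target is the beta-globin fragment used in this lesson's FASTA example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def probe_binding_sites(target, probe, max_mismatches=0):
    """Return 0-based positions on `target` (5'->3') where `probe` (5'->3')
    can base-pair, i.e. where the target matches the probe's reverse
    complement with at most `max_mismatches` mismatched positions."""
    footprint = probe.translate(COMPLEMENT)[::-1]
    hits = []
    for i in range(len(target) - len(footprint) + 1):
        window = target[i:i + len(footprint)]
        mismatches = sum(a != b for a, b in zip(window, footprint))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits

target = "ATGGTGCATCTGACTCCTGAGGAG"  # beta-globin fragment from this lesson
probe = "AGTCAGATGC"                 # hypothetical probe
print(probe_binding_sites(target, probe))
```

Raising `max_mismatches` loosely mimics lower-stringency conditions, under which imperfect hybrids are tolerated.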

PCR: Copying DNA In Vitro

The polymerase chain reaction (PCR) enables amplification of any DNA sequence from minuscule amounts of starting material. PCR uses:

Two short primers (~18–25 nt) that flank the target sequence
A heat-stable DNA polymerase (Taq polymerase, from the thermophilic bacterium Thermus aquaticus)
Repeated cycles of denaturation (95°C, separates strands), annealing (50–65°C, primers bind), and extension (72°C, polymerase synthesizes new strand)

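In the ideal case each cycle doubles the target DNA, so 30 cycles produce approximately 2^30 ≈ 10^9 (one billion) copies from a single molecule. Real reactions fall somewhat short of perfect doubling and plateau as primers and nucleotides are depleted, but this exponential amplification is what makes PCR so powerful.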
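
The amplification arithmetic can be checked with a short Python sketch. It uses the common model N = N0 × (1 + E)^n, where E is the per-cycle efficiency. The function name and the 90% efficiency value are assumptions chosen for illustration.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Copies after `cycles` rounds, where `efficiency` is the fraction of
    templates duplicated per cycle (1.0 means perfect doubling)."""
    return initial_copies * (1.0 + efficiency) ** cycles

ideal = pcr_copies(1, 30)             # perfect doubling: 2**30
realistic = pcr_copies(1, 30, 0.9)    # assumed 90% efficiency: 1.9**30
print(f"ideal: {ideal:.3e}, 90% efficiency: {realistic:.3e}")
```

Even a modest drop in efficiency compounds over 30 cycles, which is one reason quantitative PCR assays are calibrated against standards.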

Primer design is critical: primers must be specific to the target (checked with Primer-BLAST), have similar melting temperatures, avoid self-complementarity (which causes hairpins), and not pair with each other to form primer dimers. Primer3 is the standard computational tool for automated primer design.
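
A few of these checks can be sketched in plain Python. This is not how Primer3 or Primer-BLAST work internally. The melting temperature uses the Wallace rule (2 °C per A/T, 4 °C per G/C), a rough estimate suitable only for short oligos, and the 5 °C matching threshold is an assumed default. The two primers are the ones used in this lesson's PCR example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq):
    """Wallace rule: 2 degrees C per A/T plus 4 degrees C per G/C (rough estimate)."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def check_pair(forward, reverse, max_tm_diff=5):
    """Simple screening metrics for a primer pair (threshold is an assumption)."""
    return {
        "gc_forward": gc_fraction(forward),
        "gc_reverse": gc_fraction(reverse),
        "tm_forward": wallace_tm(forward),
        "tm_reverse": wallace_tm(reverse),
        "tm_matched": abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_tm_diff,
        # A primer equal to its own reverse complement can pair with itself end to end.
        "self_complementary": forward == reverse_complement(forward)
        or reverse == reverse_complement(reverse),
    }

report = check_pair("ATGGCTAGCAAAGAC", "TTTGCCGATCAGGTT")
print(report)
```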

PCR applications extend well beyond cloning:

Diagnostic PCR detects pathogens (e.g., COVID-19 testing by RT-qPCR)
Forensic DNA profiling amplifies short tandem repeats (STRs)
Quantitative RT-PCR measures mRNA levels with high sensitivity

let template = "ATGGCTAGCAAAGACTTCACCGAGTACCTGCAGAACCTGATCGGCAAATGA"
let forward = "ATGGCTAGCAAAGAC"
let reverse = "TTTGCCGATCAGGTT"
let reverse_site = Seq.reverse_complement(reverse)
print("Template:        " + template)
print("Fwd primer:      " + forward)
print("Rev primer:      " + reverse)
print("Rev primer site: " + reverse_site)
print("Template length: " + Seq.length(template) + " bp")

Exercise: PCR Primer Design

Given a template DNA sequence, compute the reverse complement of a primer region to design the reverse primer. Verify the primer’s GC content is suitable for PCR (ideally 40–60%).

let template = "ATGGCTAGCAAAGAC"
let rev_primer = Seq.reverse_complement(template)
print("Template:    " + template)
print("Rev primer:  " + rev_primer)
print("Primer GC:   " + Seq.gc_content(rev_primer))
print(rev_primer)

DNA Sequencing: Sanger to Next-Generation

Sanger sequencing (chain termination method) was the dominant technology for 30 years. It uses fluorescently labeled dideoxynucleotides (ddNTPs) that terminate DNA synthesis when incorporated, producing fragments of every possible length. Capillary electrophoresis separates these fragments, and the terminal fluorescent label identifies each base. Sanger sequencing reads ~800–1,000 bases per reaction with very high accuracy.

Next-generation sequencing (NGS) technologies revolutionized genomics by producing millions to billions of reads in parallel:

Technology	Read length	Throughput	Error rate	Best for
Illumina	150–300 bp	Very high (tens of Gb to several Tb per run, depending on instrument)	~0.1% (substitution)	Most applications
PacBio HiFi	10–20 kb	Moderate	~0.1% (after CCS)	De novo assembly, structural variants
Oxford Nanopore	1 kb–1 Mb+	Moderate	~1–5% (raw)	Long reads, real-time sequencing, field applications
Sanger	800–1,000 bp	Low (1 read/reaction)	~0.001%	Targeted sequencing, validation

Each technology has trade-offs. Illumina dominates for short-read applications (RNA-seq, ChIP-seq, variant calling) due to its very high throughput and accuracy. Long-read technologies (PacBio, Nanopore) are essential for resolving repetitive regions, structural variants, and phasing haplotypes.
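
Error rates like those in the table are usually reported on the Phred scale, Q = -10 × log10(p), which is also what quality-control tools plot per base. The short Python sketch below converts between the two. The function names are illustrative, and the 3% figure is simply one value picked from the 1–5% Nanopore range.

```python
import math

def error_rate_to_phred(p):
    """Phred quality score for a per-base error probability p."""
    return -10 * math.log10(p)

def phred_to_error_rate(q):
    """Per-base error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

examples = [
    ("Illumina (~0.1%)", 0.001),
    ("Nanopore raw (using 3%)", 0.03),
    ("Sanger (~0.001%)", 0.00001),
]
for label, rate in examples:
    print(f"{label}: Q{error_rate_to_phred(rate):.1f}")
```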

Genome Annotation

Raw genome sequences must be annotated to be useful — genes must be identified, their boundaries defined, and functions predicted:

Prokaryotic annotation (Prokka) is relatively straightforward because prokaryotic genes lack introns and can be found by open reading frame detection combined with homology searches
Eukaryotic annotation (MAKER, BRAKER) is more challenging due to introns, alternative splicing, and non-coding RNAs; it combines ab initio gene prediction (GeneMark, Augustus), RNA-seq evidence (transcript alignments), and protein homology from databases
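
Open reading frame detection, the core of the prokaryotic case, can be sketched as a six-frame scan for ATG...stop stretches. This is a minimal illustration, not Prokka's actual method: real gene finders also model alternative start codons, ribosome binding sites, and codon usage. The function name and the `min_codons` parameter are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf_seq) for ATG...stop ORFs in all six frames.
    Coordinates are 0-based, end-exclusive, relative to the strand scanned.
    `min_codons` counts the start and stop codons."""
    orfs = []
    for strand, s in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens an ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ORF in this frame
    return orfs

# Template sequence from this lesson's PCR example
template = "ATGGCTAGCAAAGACTTCACCGAGTACCTGCAGAACCTGATCGGCAAATGA"
for orf in find_orfs(template):
    print(orf[:3])
```

The same scan fails on eukaryotic genomic DNA because introns interrupt the reading frame, which is why eukaryotic pipelines lean on RNA-seq and homology evidence.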

Recombinant DNA Methods Have Revolutionized Human Health

The techniques above have had transformative medical applications:

Recombinant human insulin (first produced in bacteria in 1978 and approved for clinical use in 1982) treats diabetes
Gene therapy corrects genetic defects by delivering functional gene copies
mRNA vaccines (e.g., against COVID-19) deliver synthetic mRNA that encodes a viral antigen
CRISPR-based therapeutics (e.g., Casgevy for sickle cell disease) edit the genomes of a patient’s own cells

CRISPR and Programmable Nucleases

CRISPR-Cas9 is a programmable nuclease guided by a short guide RNA (gRNA) that specifies the target DNA sequence through base complementarity. Applications include:

Gene knockout — Cas9 creates a double-strand break; error-prone NHEJ repair introduces frameshift mutations
Gene knock-in — a repair template is provided for precise editing via HDR
CRISPRa/CRISPRi — catalytically dead Cas9 (dCas9) fused to activation or repression domains controls transcription without cutting DNA
Base editing — converts specific bases (C-to-T or A-to-G) without double-strand breaks
Prime editing — uses a reverse transcriptase fused to Cas9 for precise insertions, deletions, or substitutions
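
Choosing where Cas9 can cut starts with finding candidate target sites. The sketch below assumes the commonly used Streptococcus pyogenes Cas9, which needs a 5'-NGG-3' PAM immediately 3' of a 20-nt protospacer. The PAM requirement is not described above and is stated here as background. The function name and example sequence are illustrative, only the given strand is scanned, and no on-target or off-target scoring is attempted.

```python
import re

def find_cas9_targets(seq, protospacer_len=20):
    """Return (start, protospacer, pam) for every NGG PAM on the given strand
    that has a full-length protospacer immediately 5' of it."""
    targets = []
    # A lookahead is used so that overlapping PAMs are all reported.
    for match in re.finditer(r"(?=([ACGT]GG))", seq):
        pam_start = match.start()
        start = pam_start - protospacer_len
        if start >= 0:
            targets.append((start, seq[start:pam_start], match.group(1)))
    return targets

example = "ACGTACGTACGTACGTACGT" + "TGG" + "AA"  # made-up site: protospacer + PAM + filler
print(find_cas9_targets(example))
```

Scanning the reverse complement in the same way would list target sites on the opposite strand.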

CRISPR Screens Identify Gene Function at Genome Scale

Pooled CRISPR screens test thousands of genes simultaneously:

A library of guide RNAs (one per gene, or multiple per gene) is packaged in lentivirus
Cells are transduced at a low multiplicity of infection so that most transduced cells receive a single guide (one gene knockout)
Cells are subjected to a selective condition (drug treatment, growth competition, etc.)
Guide RNAs are counted by sequencing before and after selection: depleted guides point to essential genes, enriched guides to resistance genes
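
The readout of such a screen boils down to comparing guide counts before and after selection. The Python sketch below computes a normalized log2 fold change per guide and summarizes each gene by the median across its guides. It is a deliberately simplified stand-in and not the statistical model used by dedicated screen analysis tools. The counts, guide names, and pseudocount are hypothetical.

```python
import math
from statistics import median

def log2_fold_changes(before, after, pseudocount=1.0):
    """Per-guide log2 fold change after normalizing each sample to its total count."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    lfc = {}
    for guide, count in before.items():
        freq_before = (count + pseudocount) / total_before
        freq_after = (after.get(guide, 0) + pseudocount) / total_after
        lfc[guide] = math.log2(freq_after / freq_before)
    return lfc

def gene_scores(lfc, guide_to_gene):
    """Median log2 fold change across the guides targeting each gene."""
    per_gene = {}
    for guide, value in lfc.items():
        per_gene.setdefault(guide_to_gene[guide], []).append(value)
    return {gene: median(values) for gene, values in per_gene.items()}

# Hypothetical read counts for four guides targeting two genes
before = {"A_g1": 100, "A_g2": 100, "B_g1": 100, "B_g2": 100}
after = {"A_g1": 10, "A_g2": 20, "B_g1": 180, "B_g2": 190}
guide_to_gene = {"A_g1": "geneA", "A_g2": "geneA", "B_g1": "geneB", "B_g2": "geneB"}

scores = gene_scores(log2_fold_changes(before, after), guide_to_gene)
print(scores)  # negative score: guides depleted; positive score: guides enriched
```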

Guide RNA design tools (CRISPRscan, Benchling) optimize on-target efficiency and minimize off-target effects. Screen analysis tools include:

MAGeCK identifies significantly enriched/depleted genes using robust rank aggregation (RRA)
BAGEL uses Bayesian analysis to classify genes as essential or non-essential

Perturb-seq combines CRISPR perturbation with single-cell RNA-seq, measuring the transcriptional consequences of knocking out each gene at single-cell resolution.

RNAi and Reporter Genes

RNA interference (RNAi) provides a complementary approach to CRISPR for testing gene function. Synthetic siRNAs or shRNAs reduce target mRNA levels through the RISC pathway. While less permanent than CRISPR knockout, RNAi is fast, reversible, and can achieve dose-dependent knockdown.

Reporter genes (GFP, luciferase, β-galactosidase) fused to gene promoters reveal when and where genes are expressed. In situ hybridization uses labeled probes to detect specific mRNAs in tissue sections, revealing the spatial pattern of gene expression.

Quantitative RT-PCR (reverse transcription PCR) measures specific mRNA levels with high sensitivity and dynamic range. For genome-wide expression analysis, RNA-seq has largely replaced microarrays.

Gene Set Enrichment Analysis

When a genome-wide experiment identifies hundreds of differentially expressed or essential genes, gene set enrichment analysis (GSEA) determines whether specific biological pathways or functions are over-represented:

GSEA (Broad Institute) tests whether predefined gene sets show statistically significant enrichment at the top or bottom of a ranked gene list
g:Profiler and Enrichr perform over-representation analysis against Gene Ontology (GO), KEGG, Reactome, and other pathway databases
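
Over-representation analysis is, at its core, a hypergeometric test: given N background genes of which K belong to a pathway, how surprising is it to see at least k pathway genes in a hit list of size n? The sketch below computes that tail probability with the standard library. The function name and the example numbers are hypothetical, and real tools additionally correct for testing many gene sets.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the gene set."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# Hypothetical example: 20,000 background genes, a 100-gene pathway,
# a 200-gene hit list, and 10 hits that fall in the pathway.
p = hypergeom_pvalue(k=10, n=200, K=100, N=20000)
print(f"p = {p:.2e}")
```

With these numbers only about one pathway gene is expected by chance, so ten hits give a very small p-value.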

Read Quality Control and Alignment

The computational pipeline for analyzing NGS data begins with quality control:

FastQC — generates per-base quality scores, GC content plots, adapter contamination checks, and sequence duplication analysis
MultiQC — aggregates FastQC reports across many samples into a single summary
Trimmomatic or fastp — trims adapter sequences and low-quality bases from reads
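
To make the trimming step concrete, the sketch below decodes a FASTQ quality string and removes low-quality bases from the 3' end. It assumes the Phred+33 encoding and a simple "trim until a good base is reached" rule, which is simpler than the sliding-window strategies real trimmers offer. The function names, the threshold of 20, and the degraded quality string are illustrative.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]

def trim_3prime(sequence, quality, min_q=20):
    """Trim bases from the 3' end until a base with quality >= min_q is reached."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality[:end]

# Read from this lesson's FASTQ example, with a hypothetical low-quality tail
sequence = "ATGGCTAGCAAA"
quality = "IIIIIHHHGG#!"
print(phred_scores(quality))
print(trim_3prime(sequence, quality))
```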

Reads are then aligned (mapped) to a reference genome:

BWA (Burrows-Wheeler Aligner) and Bowtie2 — standard short-read aligners for DNA-seq
minimap2 — handles both short and long reads; the standard for PacBio and Nanopore data

Variant Calling and Annotation

Variant calling identifies positions where the sequenced genome differs from the reference:

GATK (Genome Analysis Toolkit) — the industry standard, using haplotype-based calling (HaplotypeCaller) with base quality score recalibration and variant quality score recalibration
DeepVariant — uses deep learning (convolutional neural networks) trained on pileup images to call variants; achieves state-of-the-art accuracy
FreeBayes — a Bayesian variant caller that handles polyploid samples
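
The basic idea behind variant calling, comparing the bases stacked at each position against the reference, can be shown with a toy pileup. This is a naive sketch and not how GATK, DeepVariant, or FreeBayes work: it assumes ungapped alignments and ignores base qualities, diploid genotypes, and indels. The function name, thresholds, and reads are hypothetical.

```python
from collections import Counter

def call_variants(reference, alignments, min_depth=3, min_fraction=0.8):
    """Naive SNV caller. `alignments` is a list of (start, read_seq) pairs
    giving ungapped, 0-based placements of reads on the reference."""
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1
    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            variants.append((pos, reference[pos], base, depth))
    return variants

reference = "ATGGCTAGCAAA"
alignments = [(0, "ATGGCTTGCAAA"), (2, "GGCTTGCA"), (4, "CTTGCAAA")]  # hypothetical reads
print(call_variants(reference, alignments))
```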

Variant annotation tools determine the functional impact:

VEP (Variant Effect Predictor, Ensembl) — classifies variants by consequence (synonymous, missense, frameshift, splice site, etc.)
SnpEff — similar annotation with integrated database of known functional effects
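
Consequence classification for a coding substitution comes down to translating the affected codon before and after the change. The sketch below does this with the standard genetic code for a single in-frame coding sequence. It is an illustration only and not the logic of VEP or SnpEff, which also handle transcripts, splicing, and indels. The function name is an assumption. The example uses the beta-globin fragment from this lesson, where the A-to-T change at 0-based position 19 is the sickle cell substitution (Glu to Val).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_snv(cds, pos, alt):
    """Classify a substitution at 0-based `pos` of an in-frame coding sequence."""
    codon_start = pos - pos % 3
    ref_codon = cds[codon_start:codon_start + 3]
    offset = pos - codon_start
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        consequence = "synonymous"
    elif alt_aa == "*":
        consequence = "nonsense"
    elif ref_aa == "*":
        consequence = "stop_lost"
    else:
        consequence = "missense"
    return consequence, ref_aa, alt_aa

cds = "ATGGTGCATCTGACTCCTGAGGAG"  # beta-globin fragment from this lesson
print(classify_snv(cds, 19, "T"))  # GAG -> GTG
print(classify_snv(cds, 20, "A"))  # GAG -> GAA
print(classify_snv(cds, 18, "T"))  # GAG -> TAG
```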

De Novo Genome Assembly

For organisms without a reference genome, de novo assembly builds contiguous sequences (contigs) from overlapping reads:

SPAdes — assembler for short reads and hybrid (short + long) assemblies, widely used for bacterial genomes
Hifiasm — assembler optimized for PacBio HiFi reads; produces near-complete assemblies with high accuracy
Flye — assembler for Oxford Nanopore or PacBio CLR reads

Assembly quality is evaluated by metrics such as N50 (the contig length at which 50% of the assembly is in contigs of this size or larger), total assembly size, and BUSCO (Benchmarking Universal Single-Copy Orthologs, which checks whether expected genes are present and complete).
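
The N50 definition translates directly into code: sort contigs from longest to shortest and walk down the list until half of the total assembly length is covered. The function name and the contig lengths below are illustrative.

```python
def n50(lengths):
    """Contig length L such that contigs of length >= L contain
    at least 50% of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")

contigs = [100, 80, 60, 40, 20]  # hypothetical contig lengths (total 300)
print(n50(contigs))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why it is reported alongside completeness checks such as BUSCO.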

let fasta = ">gene1 Beta-globin\nATGGTGCATCTGACTCCTGAGGAG\n>gene2 Alpha-globin\nATGGTGCTGTCTCCTGCCGACAAG"
let records = IO.parse_fasta(fasta)
print("FASTA records:")
print(records)

Systems Biology and Computational Modeling

Modern biology increasingly takes a systems approach — studying how molecular components interact as networks:

ODE modeling (ordinary differential equations) describes the dynamics of regulatory circuits (e.g., gene regulatory networks, signaling cascades)
Stochastic simulation (Gillespie algorithm) captures the inherent randomness of molecular processes, important when molecule counts are low
Network topology analysis examines properties like centrality (which nodes are most connected), modularity (are there distinct functional modules?), and network motifs (recurring circuit patterns like feed-forward loops)
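
Stochastic simulation can be illustrated with the simplest possible circuit: a molecule produced at a constant rate and degraded in proportion to its copy number. The sketch below applies the Gillespie direct method to this birth-death process. The function name, rate constants, and seed are assumptions chosen for illustration.

```python
import random

def gillespie_birth_death(k_prod, k_deg, x0, t_end, seed=0):
    """Simulate production (rate k_prod) and degradation (rate k_deg * x)
    of one molecular species with the Gillespie direct method."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        rate_prod = k_prod
        rate_deg = k_deg * x
        total = rate_prod + rate_deg
        if total == 0:
            break                         # nothing can happen any more
        t += rng.expovariate(total)       # waiting time to the next reaction
        if t > t_end:
            break
        if rng.random() * total < rate_prod:
            x += 1                        # production event
        else:
            x -= 1                        # degradation event
        times.append(t)
        counts.append(x)
    return times, counts

times, counts = gillespie_birth_death(k_prod=10.0, k_deg=1.0, x0=0, t_end=50.0)
print(f"{len(times) - 1} reaction events, final count {counts[-1]}")
```

The corresponding ODE, dx/dt = k_prod - k_deg × x, settles at k_prod / k_deg, while the stochastic trajectory keeps fluctuating around that value. Those fluctuations matter most when copy numbers are small.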

Machine learning is transforming biological data analysis:

Deep learning for sequences (CNNs, transformers) predicts protein function, structure, binding sites, and regulatory element activity from sequence alone (e.g., ESM for proteins, Enformer for gene expression)
Image analysis using convolutional neural networks classifies cell types, detects phenotypes, and segments cells in microscopy data

Biological databases and API access (NCBI E-utilities, Ensembl REST API, UniProt API) enable programmatic retrieval and integration of data from major repositories.

let fastq = "@read1\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF\n@read2\nGCTAGCAAATTT\n+\nIIIHHHGGGFFF"
let reads = IO.parse_fastq(fastq)
print("FASTQ reads:")
print(reads)

Summary statistics help quickly assess read quality across a sequencing run. Here we compute descriptive statistics on the GC content of a set of reads:

let gc_values = [0.50, 0.42, 0.58, 0.47, 0.53, 0.44, 0.61, 0.39]
let summary = Stats.describe(gc_values)
print("GC content summary statistics:")
print(summary)

A narrow distribution centered near the expected GC content for the organism suggests high-quality, unbiased sequencing. Skewed or multimodal GC distributions may indicate contamination or library preparation artifacts.

Exercise: Analyze FASTQ Read Quality

Parse FASTQ reads and compute the GC content for each read. How many reads did you process?

let fastq = "@read1\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF\n@read2\nGCTAGCAAATTT\n+\nIIIHHHGGGFFF"
let reads = IO.parse_fastq(fastq)
print("Read 1 GC: " + Seq.gc_content(reads[0].sequence))
print("Read 2 GC: " + Seq.gc_content(reads[1].sequence))
let num_reads = reads.length
print(num_reads)

Exercise: Parse and Analyze Sequence Data

Parse the following FASTA data, compute the GC content of each sequence, and identify which gene has higher GC content.

let fasta = ">gene1\nATGAAATTTGATACTGAAGAT\n>gene2\nATGGCTAGCAAATTCACCGAG"
let records = IO.parse_fasta(fasta)
print("Gene 1 GC: " + Seq.gc_content(records[0].sequence))
print("Gene 2 GC: " + Seq.gc_content(records[1].sequence))
let answer = "gene2"
print(answer)

Exercise: Design a Reverse Complement Primer

Given a target sequence, compute the reverse complement to design a primer that would bind the opposite strand.

let target = "ATGGCTAGCAAAGAC"
let primer = Seq.reverse_complement(target)
print("Target:  " + target)
print("Primer:  " + primer)
print(primer)

Summary

In this lesson you covered DNA analysis and functional genomics:

Restriction enzymes cut DNA at specific palindromic sequences, producing defined fragments with sticky or blunt ends
Gel electrophoresis separates DNA by size; DNA cloning produces copies via vectors and bacterial replication
DNA libraries (genomic or cDNA) represent an organism’s genome or transcriptome as collections of clones
Hybridization (Southern, Northern, FISH, microarrays) detects specific sequences via probe base-pairing
PCR amplifies DNA exponentially (up to ~10^9 copies in 30 cycles); Primer3 and Primer-BLAST automate primer design
Sanger sequencing provides long, accurate reads for targeted applications
NGS technologies (Illumina, PacBio HiFi, Oxford Nanopore) differ in read length, throughput, and error profiles
Genome annotation (Prokka, MAKER, BRAKER) identifies genes from assembled sequences
CRISPR-Cas9 enables gene knockout, knock-in, CRISPRa/i, base editing, and prime editing
Pooled CRISPR screens (analyzed by MAGeCK, BAGEL) test gene function at genome scale; Perturb-seq adds scRNA-seq readout
RNAi provides reversible gene knockdown; reporter genes and in situ hybridization reveal expression patterns
GSEA and pathway analysis tools (g:Profiler, Enrichr) interpret gene lists in biological context
Quality control (FastQC, MultiQC), read alignment (BWA, Bowtie2, minimap2), and variant calling (GATK, DeepVariant, FreeBayes) form the core NGS analysis pipeline
De novo assembly (SPAdes, Hifiasm, Flye) builds genomes without a reference; quality assessed by N50 and BUSCO
Systems biology uses ODE modeling, stochastic simulation, and network analysis to understand biological circuits
Machine learning (deep learning for sequences and images) and database APIs enable computational biology at scale

Powered by

cyanea-seq cyanea-stats
