Analyzing Proteins: Purification, Mass Spectrometry, and Structural Methods
Learn how proteins are purified, identified by mass spectrometry, and studied by structural methods — from chromatography and SDS-PAGE to shotgun proteomics, cross-linking MS, and structural bioinformatics.
Introduction
While DNA and RNA can be analyzed by sequencing, proteins require a different set of techniques. Proteins cannot be amplified like DNA (there is no “protein PCR”), so analysis depends on purifying sufficient quantities from cells, separating complex mixtures, and identifying individual proteins by mass spectrometry. Understanding these methods is essential because proteins are the primary functional molecules of the cell — they carry out virtually every cellular process.
This lesson covers cell culture methods (the starting point for most protein analysis), protein purification techniques, mass spectrometry-based proteomics, and structural methods for determining how proteins work at the atomic level.
Cells Can Be Isolated from Tissues and Grown in Culture
Before proteins can be studied, cells must be obtained. Cell isolation from tissues uses enzymatic digestion (trypsin, collagenase) to dissociate the extracellular matrix, followed by separation into different cell types by:
- Fluorescence-activated cell sorting (FACS) — separates cells labeled with fluorescent antibodies against surface markers
- Magnetic-activated cell sorting (MACS) — uses antibody-coated magnetic beads
- Density gradient centrifugation — separates cells by buoyant density
Cell Culture and Cell Lines
Cell culture — growing cells outside the organism in a controlled environment — provides the large quantities of homogeneous cells needed for biochemical analysis:
- Primary cells are isolated directly from tissue and have a limited lifespan in culture (they undergo senescence after a defined number of divisions)
- Cell lines are cells that have acquired the ability to proliferate indefinitely, either through spontaneous transformation or deliberate immortalization (e.g., by introduction of telomerase or viral oncogenes)
Widely used cell lines include HeLa (human cervical cancer; the first immortal human cell line, established in 1951), HEK 293 (human embryonic kidney, widely used for protein expression), and CHO (Chinese hamster ovary, the standard for therapeutic protein production).
Stem Cells and Organoids
Embryonic stem (ES) cells can differentiate into any cell type (they are pluripotent), making them powerful tools for studying development and disease.
Induced pluripotent stem cells (iPSCs) are produced by reprogramming adult cells with the Yamanaka factors (Oct4, Sox2, Klf4, Myc), avoiding the ethical concerns of ES cell derivation. Patient-derived iPSCs enable study of disease-specific cells from any individual.
Hybridoma cell lines are created by fusing antibody-producing B cells with immortal myeloma cells, producing unlimited quantities of a single monoclonal antibody — an indispensable tool for protein detection, purification, and therapy.
Organoids are three-dimensional cell cultures that self-organize into structures resembling miniature organs (intestinal crypts, brain regions, kidney tubules). They bridge the gap between 2D cell culture and in vivo experiments, providing more physiologically relevant models for drug testing and disease study.
Cells Can Be Broken Open and Their Contents Fractionated
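To study proteins, cells are first lysed (broken open) by detergent, mechanical disruption (homogenization, sonication), or freeze-thaw cycles. The resulting cell extract (lysate) contains the cell's soluble components together with organelles and membrane fragments, which can then be separated.
Differential centrifugation separates organelles by size and density in successive spins at increasing force (the values below are approximate):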
- 1,000 g pellets nuclei and large debris
- 10,000 g pellets mitochondria, lysosomes, peroxisomes
- 100,000 g pellets microsomes (ER fragments) and small vesicles
- The supernatant contains soluble cytoplasmic proteins
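The forces listed above are relative centrifugal forces (multiples of g), which depend on rotor radius as well as rotor speed. As a rough sketch in Python (the lesson's own snippets use a different scripting language), the standard conversion RCF = 1.118 × 10⁻⁵ × r × rpm², with r in cm, translates between the two. The 10 cm radius is an assumed value for illustration only; real rotors differ.

```python
import math

K = 1.118e-5  # standard constant for radius in cm and speed in rpm


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (multiples of g) at a given rotor speed."""
    return K * radius_cm * rpm ** 2


def rpm_from_rcf(rcf, radius_cm):
    """Rotor speed needed to reach a given relative centrifugal force."""
    return math.sqrt(rcf / (K * radius_cm))


RADIUS_CM = 10.0  # assumed rotor radius, for illustration only

for target_g in (1_000, 10_000, 100_000):
    rpm = rpm_from_rcf(target_g, RADIUS_CM)
    print(f"{target_g:>7,} x g -> about {rpm:,.0f} rpm at r = {RADIUS_CM} cm")
```

Because force scales with the square of the speed, a 100-fold increase in g requires only a 10-fold increase in rpm.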
Protein Purification by Chromatography
Proteins are separated based on their physical and chemical properties:
| Method | Separation basis | Principle |
|---|---|---|
| Ion exchange | Charge | Proteins bind a charged resin; eluted by increasing salt concentration |
| Gel filtration (size exclusion) | Size | Larger proteins elute first; smaller proteins enter pores and are delayed |
| Hydrophobic interaction | Hydrophobicity | Proteins bind a hydrophobic resin at high salt; eluted by decreasing salt |
| Affinity | Specific binding | Protein binds a ligand immobilized on beads; eluted by competing ligand or pH change |
Affinity chromatography is the most powerful single purification step. Common approaches include:
- Nickel-NTA for His-tagged proteins (a polyhistidine tag engineered onto the protein)
- Glutathione beads for GST-tagged proteins
- Anti-FLAG beads for FLAG-tagged proteins
- Antibody-conjugated beads for endogenous proteins (immunoprecipitation)
Genetically engineered tags have revolutionized protein purification. By fusing a small affinity tag to the protein of interest using recombinant DNA techniques, even rare proteins can be purified in a single step.
SDS-PAGE and 2D Gel Electrophoresis
SDS-PAGE (SDS polyacrylamide-gel electrophoresis) separates denatured proteins by molecular weight. The detergent SDS binds uniformly to proteins, giving them a negative charge proportional to their length. In the electric field, smaller proteins migrate faster through the polyacrylamide matrix.
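Because migration depends on size, the molecular weight of an unknown band is usually estimated from a ladder of marker proteins: log10(MW) is approximately linear in relative mobility (Rf). The sketch below fits that line by least squares. The marker weights and Rf values are hypothetical numbers chosen for illustration, not measurements.

```python
import math

# Hypothetical marker ladder: (molecular weight in Da, relative mobility Rf)
markers = [(97000, 0.10), (66000, 0.22), (45000, 0.35),
           (31000, 0.48), (21500, 0.62), (14400, 0.78)]


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rf_values = [rf for _, rf in markers]
log_mw = [math.log10(mw) for mw, _ in markers]
slope, intercept = fit_line(rf_values, log_mw)


def estimate_mw(rf):
    """Interpolate a molecular weight (Da) from a band's relative mobility."""
    return 10 ** (intercept + slope * rf)


unknown_rf = 0.40  # hypothetical band position
print(f"Estimated MW at Rf {unknown_rf}: {estimate_mw(unknown_rf):,.0f} Da")
```

The negative slope reflects the rule stated above: the further a band migrates, the smaller the protein.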
Two-dimensional gel electrophoresis (2D-PAGE) achieves extraordinary resolution by combining two separation dimensions:
- Isoelectric focusing (separation by charge/pI) in the first dimension
- SDS-PAGE (separation by size) in the second dimension
More than 1,000 proteins can be resolved on a single 2D gel. Spots can be excised and identified by mass spectrometry.
Mass Spectrometry Identifies Unknown Proteins
Mass spectrometry (MS) is the workhorse of modern proteomics. The basic workflow:
- Protein digestion — proteins are cleaved into peptides by trypsin (which cuts after lysine and arginine), producing fragments of manageable size
- Ionization — peptides are ionized by electrospray ionization (ESI) or MALDI (matrix-assisted laser desorption/ionization)
- Mass analysis — the mass-to-charge ratio (m/z) of each peptide ion is measured
- Fragmentation (MS/MS or MS2) — individual peptide ions are fragmented, and the pattern of fragment masses reveals the amino acid sequence
- Database search — fragmentation spectra are matched to predicted spectra from protein sequence databases
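The digestion and mass-analysis steps above can be sketched in a few lines of Python. This is an illustrative in silico digest, not the algorithm of any particular search engine. It cleaves after K or R, applies the commonly used convention (an assumption here, not stated in the lesson) that trypsin does not cut before proline, and computes monoisotopic peptide masses and m/z values from standard residue masses.

```python
# Monoisotopic residue masses in Da
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # mass of each charge-carrying proton


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def mz(peptide, charge):
    """Mass-to-charge ratio of the peptide carrying `charge` protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


# The sequence used in this lesson's insulin example
sequence = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKT"
peptides = tryptic_digest(sequence)
for pep in peptides:
    print(f"{pep:<42} M = {peptide_mass(pep):9.4f}  [M+2H]2+ = {mz(pep, 2):9.4f}")
```

A database search engine performs the same digest on every protein in the database and compares the predicted peptide and fragment masses with the measured ones.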
Proteomics Data Analysis
MS data analysis involves several computational steps:
- Database search engines match experimental spectra to theoretical spectra: Mascot, X!Tandem, and MSFragger (notable for its speed, which makes open searches for unexpected modifications practical)
- MaxQuant is a comprehensive analysis platform for shotgun proteomics, providing peptide identification, protein quantification (LFQ, SILAC), and post-translational modification localization
- Proteome Discoverer (Thermo Fisher) is another widely used commercial platform
Label-free quantification compares peptide intensities or spectral counts (the number of spectra matched to a protein) across samples. While less precise than isotope-labeling methods, it requires no special sample preparation.
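Raw spectral counts favor long proteins (more peptides, more spectra) and deeply sampled runs, so they are usually normalized before comparison. One common scheme is the normalized spectral abundance factor (NSAF): each count is divided by protein length, then by the sum of those values in the sample. The sketch below uses hypothetical proteins, lengths, and counts, and is not a description of what MaxQuant or Proteome Discoverer compute internally.

```python
import math

# Hypothetical data: protein lengths (residues) and spectral counts per sample
lengths = {"ProtA": 100, "ProtB": 200, "ProtC": 400}
counts = {
    "control": {"ProtA": 10, "ProtB": 20, "ProtC": 40},
    "treated": {"ProtA": 20, "ProtB": 20, "ProtC": 40},
}


def nsaf(sample_counts, protein_lengths):
    """Length-normalize spectral counts, then scale so the sample sums to 1."""
    saf = {p: c / protein_lengths[p] for p, c in sample_counts.items()}
    total = sum(saf.values())
    return {p: value / total for p, value in saf.items()}


normalized = {sample: nsaf(c, lengths) for sample, c in counts.items()}

for protein in lengths:
    ratio = normalized["treated"][protein] / normalized["control"][protein]
    print(f"{protein}: log2(treated/control) = {math.log2(ratio):+.2f}")
```

Proteins with zero counts in one sample would need a pseudocount or imputation before taking the log ratio; that step is omitted here.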
Protein Complex Identification
Understanding protein function requires knowing which proteins interact:
- AP-MS (affinity purification followed by mass spectrometry) — a tagged bait protein is purified along with its binding partners, which are identified by MS. This identifies stable protein complexes but may miss transient interactions
- BioID and TurboID (proximity labeling) — a promiscuous biotin ligase is fused to the bait protein. It biotinylates nearby proteins (within ~10 nm), which are captured on streptavidin beads and identified by MS. This captures transient and weak interactions that are lost during affinity purification
- Cross-linking mass spectrometry (XL-MS) — chemical cross-linkers covalently link proteins that are in close proximity. The cross-linked peptides are identified by MS, revealing distance constraints between residues that inform structural modeling
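To illustrate how XL-MS distance constraints are used, the sketch below checks whether each cross-linked residue pair in a structural model lies within the reach of the cross-linker. The coordinates and residue numbers are hypothetical, and the 30 Å Cα–Cα cutoff is an assumed value of the kind often applied to lysine-reactive cross-linkers; the appropriate cutoff depends on the reagent.

```python
import math

# Hypothetical C-alpha coordinates (angstroms), keyed by (chain, residue number)
ca_coords = {
    ("A", 12): (0.0, 0.0, 0.0),
    ("A", 87): (10.0, 12.0, 5.0),
    ("B", 45): (40.0, 30.0, 10.0),
}

# Hypothetical cross-linked residue pairs identified by MS
crosslinks = [(("A", 12), ("A", 87)), (("A", 12), ("B", 45))]

MAX_CA_DISTANCE = 30.0  # assumed cutoff in angstroms


def check_crosslinks(links, coords, cutoff):
    """Return (residue_a, residue_b, distance, satisfied) for each cross-link."""
    results = []
    for a, b in links:
        distance = math.dist(coords[a], coords[b])
        results.append((a, b, distance, distance <= cutoff))
    return results


results = check_crosslinks(crosslinks, ca_coords, MAX_CA_DISTANCE)
for a, b, distance, ok in results:
    status = "satisfied" if ok else "violated"
    print(f"{a} - {b}: {distance:5.1f} A ({status})")
```

A model that violates many cross-links is probably wrong, or the complex adopts more than one conformation.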
Single-Cell and Spatial Data Analysis
Modern cell biology increasingly requires analysis at the single-cell level:
- Single-cell RNA-seq clustering groups cells by expression profiles, revealing cell types. Marker gene discovery identifies genes that distinguish each cluster
- Cell type annotation maps clusters to known cell types using reference databases and marker gene lists
- Batch effect correction (Harmony, BBKNN, scVI) removes technical variation between samples or experiments while preserving biological variation
- Spatial transcriptomics (Visium, MERFISH, Slide-seq) measures gene expression while preserving tissue context
- Cell-cell communication inference (CellChat, CellPhoneDB) predicts signaling interactions between cell types based on ligand-receptor pair expression
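// Note: this sequence is the N-terminal part of human preproinsulin (signal peptide plus B chain), not mature two-chain insulin
let insulin = Struct.protein_props("MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKT")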
print("Insulin properties:")
print(insulin)
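The Struct.protein_props() function returns key physicochemical properties including molecular weight (MW), isoelectric point (pI), and GRAVY (grand average of hydropathicity). MW determines how far a protein migrates on SDS-PAGE, and MW together with pI determines its position on a 2D gel. GRAVY does not affect gel migration; it summarizes overall hydrophobicity, which matters for solubility and for hydrophobic interaction chromatography.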
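For readers who want to see how such properties can be derived from sequence alone, here is a minimal Python sketch. It is not the implementation of Struct.protein_props(). It assumes average residue masses, the Kyte-Doolittle hydropathy scale, and one commonly used set of side-chain pKa values; pKa sets differ between tools, so pI estimates vary by a few tenths of a pH unit.

```python
AVG_MASS = {  # average residue masses in Da
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
HYDROPATHY = {  # Kyte-Doolittle scale
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
WATER = 18.01528
# Assumed pKa values; other published sets give slightly different results
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6


def molecular_weight(seq):
    """Average molecular weight of the linear chain in Da."""
    return sum(AVG_MASS[aa] for aa in seq) + WATER


def gravy(seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value."""
    return sum(HYDROPATHY[aa] for aa in seq) / len(seq)


def net_charge(seq, ph):
    """Henderson-Hasselbalch estimate of net charge at a given pH."""
    positive = 1 / (1 + 10 ** (ph - PKA_N_TERM))
    positive += sum(1 / (1 + 10 ** (ph - PKA_POSITIVE[aa]))
                    for aa in seq if aa in PKA_POSITIVE)
    negative = 1 / (1 + 10 ** (PKA_C_TERM - ph))
    negative += sum(1 / (1 + 10 ** (PKA_NEGATIVE[aa] - ph))
                    for aa in seq if aa in PKA_NEGATIVE)
    return positive - negative


def isoelectric_point(seq):
    """Bisect for the pH at which the estimated net charge is zero."""
    low, high = 0.0, 14.0
    for _ in range(60):
        mid = (low + high) / 2
        if net_charge(seq, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


seq = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKT"
print(f"MW    = {molecular_weight(seq):.1f} Da")
print(f"pI    = {isoelectric_point(seq):.2f}")
print(f"GRAVY = {gravy(seq):.3f}")
```

The pI search works because net charge falls steadily as pH rises, so there is a single crossing point to bisect toward.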
let proteins = "[{\"name\":\"Insulin\",\"MW\":5808,\"pI\":5.4},{\"name\":\"Lysozyme\",\"MW\":14300,\"pI\":11.0},{\"name\":\"BSA\",\"MW\":66400,\"pI\":4.7},{\"name\":\"Hemoglobin\",\"MW\":64500,\"pI\":6.8},{\"name\":\"Cytochrome c\",\"MW\":12400,\"pI\":10.0},{\"name\":\"Myoglobin\",\"MW\":17000,\"pI\":7.2}]"
print("2D gel simulation — MW vs pI for common reference proteins:")
Viz.scatter(proteins, "pI", "MW")
This scatter plot simulates a 2D gel: each protein occupies a unique position defined by its pI (horizontal, from isoelectric focusing) and MW (vertical, from SDS-PAGE). Proteins with similar MW but different pI are resolved horizontally; proteins with similar pI but different MW are resolved vertically.
// Both sequences are N-terminal fragments, not the full-length proteins
let hemoglobin_alpha = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
let cytochrome_c = "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQA"
let props_hemo = Struct.protein_props(hemoglobin_alpha)
let props_cyto = Struct.protein_props(cytochrome_c)
print("Hemoglobin alpha:")
print(props_hemo)
print("Cytochrome c:")
print(props_cyto)
// Which protein migrates faster on SDS-PAGE? Assign to result.
let result = "Cytochrome c"
print(result)
Structural Methods: X-ray, NMR, and Cryo-EM
Determining the three-dimensional structure of a protein reveals how it works at the atomic level:
X-ray crystallography remains the most common method. A protein crystal is bombarded with X-rays, and the diffraction pattern is used to calculate the electron density map, from which the atomic model is built. It requires protein crystals, which can be difficult to obtain.
NMR spectroscopy (nuclear magnetic resonance) determines structures in solution, providing information about protein dynamics as well as structure. It is generally limited to relatively small proteins (roughly under 40 kDa).
Cryo-electron microscopy (cryo-EM) has undergone a “resolution revolution.” Protein particles are flash-frozen in vitreous ice, imaged in the electron microscope, and thousands of 2D images are computationally combined into a 3D reconstruction. Cryo-EM can determine structures of large complexes (ribosomes, spliceosomes, ion channels) without crystallization and is increasingly achieving near-atomic resolution (<3 Å).
Structural Bioinformatics
Computational tools for structural analysis include:
- PDB (Protein Data Bank) — the central repository for experimentally determined 3D structures
- Structure-based drug design uses molecular docking (AutoDock, Glide) to predict how small molecules bind to protein targets, guiding drug optimization
- Molecular dynamics simulations (GROMACS, OpenMM) model protein motion and flexibility at atomistic or coarse-grained resolution, revealing conformational changes, binding mechanisms, and allosteric pathways
- AlphaFold Protein Structure Database provides predicted structures for more than 200 million protein sequences, covering most of UniProt; ColabFold makes AlphaFold predictions faster and more accessible, running in Google Colab notebooks or on local hardware
- Protein design tools (RFdiffusion, ProteinMPNN) use deep learning to design novel proteins with desired structures and functions
- Structure comparison (TM-align, DALI) identifies structural similarities even when sequence similarity is low
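The simplest measure behind structure comparison is the root-mean-square deviation (RMSD) between matched atoms. The sketch below computes it for two sets of Cα coordinates that are assumed to be already superimposed and matched one to one. Tools such as TM-align and DALI also find the alignment and optimal superposition, which this example does not attempt. The coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long, pre-superimposed coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical C-alpha traces (angstroms); model_b is model_a shifted 1 A in x
model_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model_b = [(x + 1.0, y, z) for x, y, z in model_a]

print(f"RMSD = {rmsd(model_a, model_b):.2f} A")
```

RMSD is sensitive to a few badly placed residues, which is one reason length-normalized scores such as the TM-score are preferred when comparing distantly related folds.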
let globin = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
let props = Struct.protein_props(globin)
print("Hemoglobin alpha-chain properties:")
print(props)
let ss = Struct.secondary_structure(globin)
print("Predicted secondary structure:")
print(ss)
Protein families like the globins share detectable sequence similarity that reflects shared 3D structure. Structural comparison often reveals evolutionary relationships that are invisible at the sequence level. The secondary structure prediction shows which residues form alpha-helices (H), beta-strands (E), or coils (C) — globins are famously almost entirely alpha-helical.
Cryo-EM Data Processing
Cryo-EM data analysis involves specialized computational pipelines:
- RELION and cryoSPARC — the two dominant software packages for single-particle cryo-EM, handling particle picking, 2D classification, 3D reconstruction, and refinement
- Model building fits atomic coordinates into the cryo-EM density map (using Coot or Phenix)
- Electron tomography reconstructs 3D volumes from tilt series of thick specimens (cells, organelles)
Exercise: Proteomics Dimensionality Reduction with PCA
Shotgun proteomics experiments quantify thousands of proteins across many samples. Use PCA to reduce a high-dimensional protein abundance dataset to its principal components, revealing sample groupings.
let abundance_data = "[{\"s\":\"Tumor_1\",\"EGFR\":8.2,\"TP53\":3.1,\"BRCA1\":2.5,\"MYC\":7.8},{\"s\":\"Tumor_2\",\"EGFR\":7.9,\"TP53\":3.4,\"BRCA1\":2.2,\"MYC\":8.1},{\"s\":\"Normal_1\",\"EGFR\":2.1,\"TP53\":6.8,\"BRCA1\":5.9,\"MYC\":1.9},{\"s\":\"Normal_2\",\"EGFR\":2.4,\"TP53\":7.1,\"BRCA1\":6.2,\"MYC\":2.1}]"
let pca_result = ML.pca(abundance_data, 4, 2)
print("PCA of proteomics data (4 proteins, 2 components):")
print(pca_result)
Viz.scatter(pca_result, "PC1", "PC2")
let result = "2"
print(result)
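To make the PCA step less of a black box, here is a dependency-free Python sketch of the same idea on the same four samples. It is not the implementation behind ML.pca(). It centers the data, builds the covariance matrix, and finds only the first principal component by power iteration; a full PCA would use an eigendecomposition or SVD to obtain all components.

```python
import math

proteins = ["EGFR", "TP53", "BRCA1", "MYC"]
samples = {
    "Tumor_1":  [8.2, 3.1, 2.5, 7.8],
    "Tumor_2":  [7.9, 3.4, 2.2, 8.1],
    "Normal_1": [2.1, 6.8, 5.9, 1.9],
    "Normal_2": [2.4, 7.1, 6.2, 2.1],
}

names = list(samples)
rows = [samples[name] for name in names]
n, p = len(rows), len(proteins)

# Center each protein (column) on its mean across samples
means = [sum(row[j] for row in rows) / n for j in range(p)]
centered = [[row[j] - means[j] for j in range(p)] for row in rows]

# Sample covariance matrix between proteins
cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
       for i in range(p)]


def leading_eigenvector(matrix, iterations=100):
    """Power iteration: repeatedly multiply and renormalize a start vector."""
    v = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


pc1 = leading_eigenvector(cov)
eigenvalue = sum(pc1[i] * sum(cov[i][j] * pc1[j] for j in range(p))
                 for i in range(p))
explained = eigenvalue / sum(cov[i][i] for i in range(p))

# Project each centered sample onto PC1
scores = {name: sum(c * w for c, w in zip(row, pc1))
          for name, row in zip(names, centered)}

print("PC1 loadings:", {prot: round(w, 3) for prot, w in zip(proteins, pc1)})
print(f"Variance explained by PC1: {explained:.1%}")
for name, score in scores.items():
    print(f"{name}: PC1 = {score:+.2f}")
```

In this toy dataset almost all of the variance lies along one axis that contrasts EGFR and MYC with TP53 and BRCA1, so the tumor and normal samples fall on opposite sides of PC1.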
Exercise: Compare Protein Properties Across a Family
Compute physicochemical properties for three related proteins and use descriptive statistics to summarize the distribution of molecular weights across the family.
let alpha = Struct.protein_props("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH")
let beta = Struct.protein_props("MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST")
let myo = Struct.protein_props("MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK")
print("Alpha-globin:")
print(alpha)
print("Beta-globin:")
print(beta)
print("Myoglobin:")
print(myo)
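// Illustrative MW values (Da) entered by hand; substitute the MW values printed above for a real comparison
let mw_values = "[5600, 5900, 5800]"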
let stats = Stats.describe(mw_values)
print("MW distribution across globin family:")
print(stats)
let result = "3"
print(result)
In proteomics, we often want to know whether two measured properties are correlated across a set of proteins. Spearman rank correlation works on ranks rather than raw values, so it is robust to outliers and detects any monotonic relationship, linear or not. This makes it well suited for comparing abundance data with physicochemical properties like molecular weight.
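// Despite the variable names, these are raw values; Spearman correlation is defined on their ranks
let abundance_ranks = "[8.2, 7.9, 2.1, 2.4, 5.5, 3.8]"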
let mw_ranks = "[5808, 14300, 66400, 64500, 12400, 17000]"
let rho = Stats.spearman(abundance_ranks, mw_ranks)
print("Spearman correlation (abundance vs MW):")
print(rho)
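The calculation itself is short enough to write out. The Python sketch below is an illustration of the definition, not the code behind Stats.spearman(). It converts each list to ranks (averaging ranks for ties) and then computes the Pearson correlation of the ranks, using the same six values as above.

```python
import math


def rank(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(rank(xs), rank(ys))


abundance = [8.2, 7.9, 2.1, 2.4, 5.5, 3.8]
mw = [5808, 14300, 66400, 64500, 12400, 17000]

rho = spearman(abundance, mw)
print(f"Spearman rho = {rho:.3f}")  # prints "Spearman rho = -0.943"
```

The strongly negative value says that, in this small made-up dataset, heavier proteins tend to have lower abundance.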
Summary
In this lesson you covered protein analysis and structural methods:
- Cell isolation (FACS, MACS) and cell culture (primary cells, cell lines, organoids) provide material for protein studies
- ES cells and iPSCs can differentiate into any cell type; hybridomas produce monoclonal antibodies
- Cell lysis and differential centrifugation fractionate cellular contents by organelle
- Chromatography (ion exchange, gel filtration, affinity) separates proteins by charge, size, and binding specificity
- Affinity tags (His, GST, FLAG) enable single-step purification; SDS-PAGE separates by molecular weight; 2D-PAGE resolves >1,000 proteins
- Mass spectrometry identifies proteins from peptide fragmentation spectra; database search engines (Mascot, X!Tandem, MSFragger) match spectra to sequences
- MaxQuant and Proteome Discoverer provide comprehensive proteomics analysis pipelines
- AP-MS identifies stable complexes; BioID/TurboID captures transient interactions; XL-MS provides structural distance constraints
- scRNA-seq analysis with batch correction (Harmony, BBKNN) and spatial transcriptomics reveal cell-type diversity in tissue context
- Cell-cell communication tools (CellChat, CellPhoneDB) predict signaling interactions
- X-ray crystallography, NMR, and cryo-EM determine protein structures at atomic resolution
- PDB stores structures; AlphaFold predicts them; RFdiffusion and ProteinMPNN design novel proteins
- Molecular docking (AutoDock, Glide) and MD simulations (GROMACS) model protein-ligand interactions and dynamics
- Cryo-EM pipelines (RELION, cryoSPARC) achieve near-atomic resolution without crystallization
References
- Alberts B, Johnson A, Lewis J, Morgan D, Raff M, Roberts K, Walter P. Molecular Biology of the Cell, 7th ed. New York: W.W. Norton; 2022. Chapter 8: Analyzing Cells, Molecules, and Systems.
- Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003;422(6928):198–207.
- Scheres SHW. RELION: implementation of a Bayesian approach to cryo-EM structure determination. J Struct Biol. 2012;180(3):519–530.
- Punjani A, Rubinstein JL, Fleet DJ, Brubaker MA. cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination. Nat Methods. 2017;14(3):290–296.
- Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol. 2008;26(12):1367–1372.
- Perez-Riverol Y, Csordas A, Bai J, et al. The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 2019;47(D1):D442–D450.
- Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods. 2014;11(11):1114–1125.
- Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007;4(3):207–214.