Analyzing Proteins: Purification, Mass Spectrometry, and Structural Methods

Intermediate Experimental Methods ~30 min

Learn how proteins are purified, identified by mass spectrometry, and studied by structural methods — from chromatography and SDS-PAGE to shotgun proteomics, cross-linking MS, and structural bioinformatics.

Introduction

While DNA and RNA can be analyzed by sequencing, proteins require a different set of techniques. Proteins cannot be amplified like DNA (there is no “protein PCR”), so analysis depends on purifying sufficient quantities from cells, separating complex mixtures, and identifying individual proteins by mass spectrometry. Understanding these methods is essential because proteins are the primary functional molecules of the cell — they carry out virtually every cellular process.

This lesson covers cell culture methods (the starting point for most protein analysis), protein purification techniques, mass spectrometry-based proteomics, and structural methods for determining how proteins work at the atomic level.

Cells Can Be Isolated from Tissues and Grown in Culture

Before proteins can be studied, cells must be obtained. Cell isolation from tissues uses enzymatic digestion (trypsin, collagenase) to dissociate the extracellular matrix, followed by separation into different cell types by:

Fluorescence-activated cell sorting (FACS) — separates cells labeled with fluorescent antibodies against surface markers
Magnetic-activated cell sorting (MACS) — uses antibody-coated magnetic beads
Density gradient centrifugation — separates cells by buoyant density

Cell Culture and Cell Lines

Cell culture — growing cells outside the organism in a controlled environment — provides the large quantities of homogeneous cells needed for biochemical analysis:

Primary cells are isolated directly from tissue and have a limited lifespan in culture (they undergo senescence after a defined number of divisions)
Cell lines are cells that have acquired the ability to proliferate indefinitely, either through spontaneous transformation or deliberate immortalization (e.g., by introduction of telomerase or viral oncogenes)

Widely used cell lines include HeLa (human cervical cancer, the first human cell line established in 1951), HEK 293 (human embryonic kidney, widely used for protein expression), and CHO (Chinese hamster ovary, the standard for therapeutic protein production).

Stem Cells and Organoids

Embryonic stem (ES) cells can differentiate into any cell type (they are pluripotent), making them powerful tools for studying development and disease.

Induced pluripotent stem cells (iPSCs) are produced by reprogramming adult cells with the Yamanaka factors (Oct4, Sox2, Klf4, Myc), avoiding the ethical concerns of ES cell derivation. Patient-derived iPSCs enable study of disease-specific cells from any individual.

Hybridoma cell lines are created by fusing antibody-producing B cells with immortal myeloma cells, producing unlimited quantities of a single monoclonal antibody — an indispensable tool for protein detection, purification, and therapy.

Organoids are three-dimensional cell cultures that self-organize into structures resembling miniature organs (intestinal crypts, brain regions, kidney tubules). They bridge the gap between 2D cell culture and in vivo experiments, providing more physiologically relevant models for drug testing and disease study.

Cells Can Be Broken Open and Their Contents Fractionated

To study proteins, cells are first lysed (broken open) by detergent, mechanical disruption (homogenization, sonication), or freeze-thaw cycles. The resulting cell extract (lysate or homogenate) contains the soluble cellular components together with organelles and membrane fragments, which can then be separated from one another.

Differential centrifugation separates organelles by size and density through successive spins at increasing speed, with each supernatant carried forward to the next spin. Approximate forces:

1,000 × g pellets nuclei and large debris
10,000 × g pellets mitochondria, lysosomes, peroxisomes
100,000 × g pellets microsomes (ER fragments) and small vesicles
The final supernatant contains soluble cytoplasmic proteins
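
Centrifuge protocols quote force in multiples of g, but many instruments are set in rpm. The sketch below converts between the two with the standard relation RCF = 1.118 × 10^-5 × r × rpm², where r is the rotor radius in cm. The 8 cm radius is a hypothetical value chosen for illustration; the real figure comes from the rotor's specification.

```python
import math

def rcf_from_rpm(rpm: float, radius_cm: float) -> float:
    """Relative centrifugal force (multiples of g) for a given speed and rotor radius."""
    return 1.118e-5 * radius_cm * rpm ** 2

def rpm_for_rcf(rcf: float, radius_cm: float) -> float:
    """Rotor speed (rpm) needed to reach a target relative centrifugal force."""
    return math.sqrt(rcf / (1.118e-5 * radius_cm))

RADIUS_CM = 8.0  # hypothetical rotor radius, for illustration only

# The three spins of a typical differential centrifugation series.
for target in (1_000, 10_000, 100_000):
    rpm = rpm_for_rcf(target, RADIUS_CM)
    print(f"{target:>7,} x g  ->  about {rpm:,.0f} rpm at r = {RADIUS_CM} cm")
```

Because force grows with the square of the speed, a tenfold increase in g requires only about a 3.2-fold increase in rpm.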

Protein Purification by Chromatography

Proteins are separated based on their physical and chemical properties:

Method	Separation basis	Principle
Ion exchange	Charge	Proteins bind a charged resin; eluted by increasing salt concentration
Gel filtration (size exclusion)	Size	Larger proteins elute first; smaller proteins enter pores and are delayed
Hydrophobic interaction	Hydrophobicity	Proteins bind a hydrophobic resin at high salt; eluted by decreasing salt
Affinity	Specific binding	Protein binds a ligand immobilized on beads; eluted by competing ligand or pH change

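Affinity chromatography is often the most powerful single purification step, because it exploits a specific binding interaction rather than a bulk property such as charge or size. Common approaches include: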

Nickel-NTA for His-tagged proteins (a polyhistidine tag engineered onto the protein)
Glutathione beads for GST-tagged proteins
Anti-FLAG beads for FLAG-tagged proteins
Antibody-conjugated beads for endogenous proteins (immunoprecipitation)

Genetically engineered tags have revolutionized protein purification. By fusing a small affinity tag to the protein of interest using recombinant DNA techniques, even rare proteins can be purified in a single step.

SDS-PAGE and 2D Gel Electrophoresis

SDS-PAGE (sodium dodecyl sulfate polyacrylamide gel electrophoresis) separates denatured proteins by molecular weight. The detergent SDS binds along the unfolded polypeptide at a roughly constant ratio, giving every protein a negative charge approximately proportional to its length. In the electric field, all proteins therefore move toward the positive electrode, and smaller proteins migrate faster through the polyacrylamide matrix.
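
In practice, the size of an unknown band is usually estimated from a marker ladder, since relative mobility (Rf) is approximately linear in log10 of molecular weight over the gel's resolving range. The sketch below fits that line by least squares and reads off an unknown. The marker values are hypothetical numbers invented for illustration, not measurements.

```python
import math

# Hypothetical marker ladder: (molecular weight in kDa, relative mobility Rf).
MARKERS = [(100, 0.20), (70, 0.32), (50, 0.44), (35, 0.57), (25, 0.68), (15, 0.86)]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Calibrate log10(MW) against Rf using the ladder.
SLOPE, INTERCEPT = fit_line(
    [rf for _, rf in MARKERS],
    [math.log10(mw) for mw, _ in MARKERS],
)

def estimate_mw_kda(rf: float) -> float:
    """Estimate molecular weight (kDa) of a band from its relative mobility."""
    return 10 ** (SLOPE * rf + INTERCEPT)

print(f"Band at Rf 0.50 is roughly {estimate_mw_kda(0.50):.0f} kDa")
```

The negative slope is the numerical form of the statement that smaller proteins migrate farther.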

Two-dimensional gel electrophoresis (2D-PAGE) achieves extraordinary resolution by combining two separation dimensions:

Isoelectric focusing (separation by charge/pI) in the first dimension
SDS-PAGE (separation by size) in the second dimension

More than 1,000 proteins can be resolved on a single 2D gel. Spots can be excised and identified by mass spectrometry.

Mass Spectrometry Identifies Unknown Proteins

Mass spectrometry (MS) is the workhorse of modern proteomics. The basic workflow:

Protein digestion — proteins are cleaved into peptides by trypsin (which cuts after lysine and arginine), producing fragments of manageable size
Ionization — peptides are ionized by electrospray ionization (ESI) or MALDI (matrix-assisted laser desorption/ionization)
Mass analysis — the mass-to-charge ratio (m/z) of each peptide ion is measured
Fragmentation (MS/MS or MS2) — individual peptide ions are fragmented, and the pattern of fragment masses reveals the amino acid sequence
Database search — fragmentation spectra are matched to predicted spectra from protein sequence databases
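
The digestion and mass-analysis steps above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular search engine works: it assumes complete digestion (no missed cleavages), the common rule that trypsin does not cut before proline, and unmodified residues (cysteines are not alkylated). The sequence is the one used in this lesson's insulin example.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free N- and C-termini
PROTON = 1.007276  # mass of the charge carrier in positive-mode ionization

def tryptic_digest(seq: str) -> list[str]:
    """Cleave after K or R unless the next residue is P; no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide without K/R
    return peptides

def monoisotopic_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def mz(mass: float, charge: int) -> float:
    """Mass-to-charge ratio of the peptide carrying `charge` protons."""
    return (mass + charge * PROTON) / charge

SEQUENCE = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKT"

for pep in tryptic_digest(SEQUENCE):
    m = monoisotopic_mass(pep)
    print(f"{pep:<42} mass {m:9.4f} Da   m/z (2+) {mz(m, 2):9.4f}")
```

A search engine performs the same in-silico digestion on every database protein, then compares the predicted peptide and fragment masses with the measured spectra.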

Proteomics Data Analysis

MS data analysis involves several computational steps:

Database search engines match experimental spectra to theoretical spectra: Mascot, X!Tandem, and MSFragger (whose fragment-ion indexing makes it fast enough for open searches that reveal unexpected modifications)
MaxQuant is a comprehensive analysis platform for shotgun proteomics, providing peptide identification, protein quantification (LFQ, SILAC), and post-translational modification localization
Proteome Discoverer (Thermo Fisher) is another widely used commercial platform

Label-free quantification compares peptide intensities or spectral counts (the number of spectra matched to a protein) across samples. While less precise than isotope-labeling methods, it requires no special sample preparation.
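
Raw spectral counts favor long proteins, which yield more peptides. One widely used correction is the normalized spectral abundance factor (NSAF): each count is divided by protein length, then scaled so the values in a sample sum to 1. The sketch below uses hypothetical protein names, counts, and lengths purely for illustration.

```python
def nsaf(spectral_counts: dict[str, int], lengths: dict[str, int]) -> dict[str, float]:
    """Normalized spectral abundance factor for each protein in one sample."""
    # Spectral abundance factor: counts per residue of protein length.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: value / total for p, value in saf.items()}

# Hypothetical sample: matched spectra and protein lengths (residues).
counts = {"protein_A": 120, "protein_B": 30, "protein_C": 60}
lengths = {"protein_A": 600, "protein_B": 150, "protein_C": 100}

for protein, value in nsaf(counts, lengths).items():
    print(f"{protein}: NSAF = {value:.2f}")
```

Here protein_A has four times the spectral count of protein_B but the same NSAF, because it is also four times as long.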

Protein Complex Identification

Understanding protein function requires knowing which proteins interact:

AP-MS (affinity purification followed by mass spectrometry) — a tagged bait protein is purified along with its binding partners, which are identified by MS. This identifies stable protein complexes but may miss transient interactions
BioID and TurboID (proximity labeling) — a promiscuous biotin ligase is fused to the bait protein. It biotinylates nearby proteins (within ~10 nm), which are captured on streptavidin beads and identified by MS. This captures transient and weak interactions that are lost during affinity purification
Cross-linking mass spectrometry (XL-MS) — chemical cross-linkers covalently link proteins that are in close proximity. The cross-linked peptides are identified by MS, revealing distance constraints between residues that inform structural modeling

Single-Cell and Spatial Data Analysis

Modern cell biology increasingly requires analysis at the single-cell level:

Single-cell RNA-seq clustering groups cells by expression profiles, revealing cell types. Marker gene discovery identifies genes that distinguish each cluster
Cell type annotation maps clusters to known cell types using reference databases and marker gene lists
Batch effect correction (Harmony, BBKNN, scVI) removes technical variation between samples or experiments while preserving biological variation
Spatial transcriptomics (Visium, MERFISH, Slide-seq) measures gene expression while preserving tissue context
Cell-cell communication inference (CellChat, CellPhoneDB) predicts signaling interactions between cell types based on ligand-receptor pair expression

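// Note: this 54-residue sequence is the N-terminal part of human preproinsulin
// (24-residue signal peptide followed by the 30-residue B chain), not mature two-chain insulin.
let insulin = Struct.protein_props("MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKT")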
print("Insulin properties:")
print(insulin)

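The Struct.protein_props() function returns key physicochemical properties, including molecular weight (MW), isoelectric point (pI), and GRAVY (grand average of hydropathicity). MW determines how far a protein migrates on SDS-PAGE, and pI determines where it focuses in the first dimension of a 2D gel. GRAVY does not set gel position; it summarizes overall hydrophobicity, which matters more for solubility and for hydrophobic interaction chromatography.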
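
To make two of these properties concrete, here is a minimal Python sketch of how they are conventionally defined. It is not the implementation behind Struct.protein_props(), whose internals are not shown in this lesson. GRAVY is the mean Kyte-Doolittle hydropathy over all residues, and average molecular weight is the sum of average residue masses plus one water. The pI is omitted because it requires a pKa model.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Average (isotope-weighted) residue masses in daltons.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_AVG = 18.01528  # one water per chain for the free termini

def gravy(seq: str) -> float:
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

def average_mw(seq: str) -> float:
    """Average molecular weight (Da) of an unmodified polypeptide chain."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq) + WATER_AVG

seq = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKT"
print(f"Length: {len(seq)} residues")
print(f"Average MW: {average_mw(seq):.1f} Da")
print(f"GRAVY: {gravy(seq):.3f}")
```

A positive GRAVY indicates a predominantly hydrophobic sequence and a negative one a predominantly hydrophilic sequence.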

let proteins = "[{\"name\":\"Insulin\",\"MW\":5808,\"pI\":5.4},{\"name\":\"Lysozyme\",\"MW\":14300,\"pI\":11.0},{\"name\":\"BSA\",\"MW\":66400,\"pI\":4.7},{\"name\":\"Hemoglobin\",\"MW\":64500,\"pI\":6.8},{\"name\":\"Cytochrome c\",\"MW\":12400,\"pI\":10.0},{\"name\":\"Myoglobin\",\"MW\":17000,\"pI\":7.2}]"
print("2D gel simulation — MW vs pI for common reference proteins:")
Viz.scatter(proteins, "pI", "MW")

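This scatter plot simulates a 2D gel: each protein occupies a position defined by its pI (horizontal, from isoelectric focusing) and MW (vertical, from SDS-PAGE). Proteins with similar MW but different pI are resolved horizontally; proteins with similar pI but different MW are resolved vertically. Two simplifications are worth noting. On a real gel, migration distance scales roughly with the logarithm of MW rather than with MW itself. And because the second dimension is denaturing, hemoglobin would run as its individual subunits of about 16 kDa, not as the 64.5 kDa tetramer listed here.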

let hemoglobin_alpha = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
let cytochrome_c = "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQA"
let props_hemo = Struct.protein_props(hemoglobin_alpha)
let props_cyto = Struct.protein_props(cytochrome_c)
print("Hemoglobin alpha:")
print(props_hemo)
print("Cytochrome c:")
print(props_cyto)
// Which protein migrates faster on SDS-PAGE? Assign to result.
let result = "Cytochrome c"
print(result)

Structural Methods: X-ray, NMR, and Cryo-EM

Determining the three-dimensional structure of a protein reveals how it works at the atomic level:

X-ray crystallography remains the most common method. A protein crystal is bombarded with X-rays, and the diffraction pattern is used to calculate the electron density map, from which the atomic model is built. It requires protein crystals, which can be difficult to obtain.

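NMR spectroscopy (nuclear magnetic resonance) determines structures in solution, providing information about protein dynamics as well as structure. It is generally limited to relatively small proteins (roughly under 40 kDa), because the spectra of larger proteins become crowded and their signals broaden.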

Cryo-electron microscopy (cryo-EM) has undergone a “resolution revolution.” Protein particles are flash-frozen in vitreous ice, imaged in the electron microscope, and thousands of 2D images are computationally combined into a 3D reconstruction. Cryo-EM can determine structures of large complexes (ribosomes, spliceosomes, ion channels) without crystallization and is increasingly achieving near-atomic resolution (<3 Å).

Structural Bioinformatics

Computational tools for structural analysis include:

PDB (Protein Data Bank) — the central repository for experimentally determined 3D structures
Structure-based drug design uses molecular docking (AutoDock, Glide) to predict how small molecules bind to protein targets, guiding drug optimization
Molecular dynamics simulations (GROMACS, OpenMM) model protein motion and flexibility at atomistic or coarse-grained resolution, revealing conformational changes, binding mechanisms, and allosteric pathways
AlphaFold Protein Structure Database provides predicted structures for most protein sequences in UniProt (more than 200 million entries); ColabFold pairs AlphaFold2 with fast MMseqs2 homology search and can be run in a hosted notebook or installed locally
Protein design tools (RFdiffusion, ProteinMPNN) use deep learning to design novel proteins with desired structures and functions
Structure comparison (TM-align, DALI) identifies structural similarities even when sequence similarity is low

let globin = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
let props = Struct.protein_props(globin)
print("Hemoglobin alpha-chain properties:")
print(props)
let ss = Struct.secondary_structure(globin)
print("Predicted secondary structure:")
print(ss)

Protein families like the globins share detectable sequence similarity that reflects shared 3D structure. Structural comparison often reveals evolutionary relationships that are invisible at the sequence level. The secondary structure prediction shows which residues form alpha-helices (H), beta-strands (E), or coils (C) — globins are famously almost entirely alpha-helical.

Cryo-EM Data Processing

Cryo-EM data analysis involves specialized computational pipelines:

RELION and cryoSPARC — the two dominant software packages for single-particle cryo-EM, handling particle picking, 2D classification, 3D reconstruction, and refinement
Model building fits atomic coordinates into the reconstructed cryo-EM density map, which represents electrostatic potential rather than electron density (using Coot, Phenix)
Electron tomography reconstructs 3D volumes from tilt series of thick specimens (cells, organelles)

Exercise: Proteomics Dimensionality Reduction with PCA

Shotgun proteomics experiments quantify thousands of proteins across many samples. Use PCA to reduce a high-dimensional protein abundance dataset to its principal components, revealing sample groupings.

let abundance_data = "[{\"s\":\"Tumor_1\",\"EGFR\":8.2,\"TP53\":3.1,\"BRCA1\":2.5,\"MYC\":7.8},{\"s\":\"Tumor_2\",\"EGFR\":7.9,\"TP53\":3.4,\"BRCA1\":2.2,\"MYC\":8.1},{\"s\":\"Normal_1\",\"EGFR\":2.1,\"TP53\":6.8,\"BRCA1\":5.9,\"MYC\":1.9},{\"s\":\"Normal_2\",\"EGFR\":2.4,\"TP53\":7.1,\"BRCA1\":6.2,\"MYC\":2.1}]"
let pca_result = ML.pca(abundance_data, 4, 2)
print("PCA of proteomics data (4 proteins, 2 components):")
print(pca_result)
Viz.scatter(pca_result, "PC1", "PC2")
let result = "2"
print(result)

Exercise: Compare Protein Properties Across a Family

Compute physicochemical properties for three related proteins and use descriptive statistics to summarize the distribution of molecular weights across the family.

let alpha = Struct.protein_props("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH")
let beta = Struct.protein_props("MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST")
let myo = Struct.protein_props("MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK")
print("Alpha-globin:")
print(alpha)
print("Beta-globin:")
print(beta)
print("Myoglobin:")
print(myo)
// Placeholder MW values (Da); replace them with the MWs printed for the three proteins above.
let mw_values = "[5600, 5900, 5800]"
let stats = Stats.describe(mw_values)
print("MW distribution across globin family:")
print(stats)
let result = "3"
print(result)

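In proteomics, we often want to know whether two measured properties are correlated across a set of proteins. Spearman rank correlation compares the rank order of the two variables rather than their raw values, so it is robust to outliers and detects any monotonic relationship, linear or not. That makes it well suited to comparing abundance data with physicochemical properties like molecular weight. Despite their names, the two variables below hold raw values; Spearman's method works on their ranks.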

let abundance_ranks = "[8.2, 7.9, 2.1, 2.4, 5.5, 3.8]"
let mw_ranks = "[5808, 14300, 66400, 64500, 12400, 17000]"
let rho = Stats.spearman(abundance_ranks, mw_ranks)
print("Spearman correlation (abundance vs MW):")
print(rho)
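
For readers who want to see what the statistic does, the sketch below reimplements Spearman correlation in plain Python on the same six values. It is an independent illustration, not the code behind Stats.spearman(): each variable is converted to ranks (ties share their mean rank), and the Pearson correlation of the ranks is returned.

```python
def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

abundance = [8.2, 7.9, 2.1, 2.4, 5.5, 3.8]
mw = [5808, 14300, 66400, 64500, 12400, 17000]

print(round(spearman(abundance, mw), 3))  # prints -0.943
```

The strongly negative value says that, in this toy dataset, the heavier proteins tend to be the less abundant ones.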

Summary

In this lesson you covered protein analysis and structural methods:

Cell isolation (FACS, MACS) and cell culture (primary cells, cell lines, organoids) provide material for protein studies
ES cells and iPSCs can differentiate into any cell type; hybridomas produce monoclonal antibodies
Cell lysis and differential centrifugation fractionate cellular contents by organelle
Chromatography (ion exchange, gel filtration, affinity) separates proteins by charge, size, and binding specificity
Affinity tags (His, GST, FLAG) enable single-step purification; SDS-PAGE separates by molecular weight; 2D-PAGE resolves >1,000 proteins
Mass spectrometry identifies proteins from peptide fragmentation spectra; database search engines (Mascot, X!Tandem, MSFragger) match spectra to sequences
MaxQuant and Proteome Discoverer provide comprehensive proteomics analysis pipelines
AP-MS identifies stable complexes; BioID/TurboID captures transient interactions; XL-MS provides structural distance constraints
scRNA-seq analysis with batch correction (Harmony, BBKNN) and spatial transcriptomics reveal cell-type diversity in tissue context
Cell-cell communication tools (CellChat, CellPhoneDB) predict signaling interactions
X-ray crystallography, NMR, and cryo-EM determine protein structures at atomic resolution
PDB stores structures; AlphaFold predicts them; RFdiffusion and ProteinMPNN design novel proteins
Molecular docking (AutoDock, Glide) and MD simulations (GROMACS) model protein-ligand interactions and dynamics
Cryo-EM pipelines (RELION, cryoSPARC) achieve near-atomic resolution without crystallization
