Protein Shape and Structure
Learn how amino acid sequences fold into three-dimensional protein structures — from primary sequence through α helices, β sheets, and domains to the computational tools for predicting and classifying protein folds.
Introduction
Proteins are the molecular machines of the cell. They catalyze reactions, transmit signals, provide structural support, transport molecules, and defend against pathogens. This extraordinary functional diversity arises from a single principle: the amino acid sequence of a protein determines its three-dimensional shape, and that shape determines its function.
This lesson explores how proteins are built from amino acids, how they fold into specific structures, and how those structures are classified and predicted. We will also introduce the computational tools that have transformed our ability to understand protein structure, from sequence databases and substitution matrices to deep-learning predictors such as AlphaFold.
Amino Acids: The Building Blocks
Proteins are linear polymers of amino acids, linked by peptide bonds. There are 20 standard amino acids, each with the same backbone (amino group, central α-carbon, carboxyl group) but a distinctive side chain (R group) that determines its chemical personality.
| Category | Amino acids | Properties |
|---|---|---|
| Nonpolar (hydrophobic) | Gly (G), Ala (A), Val (V), Leu (L), Ile (I), Pro (P), Met (M) | Tend to cluster in the protein interior |
| Aromatic | Phe (F), Trp (W), Tyr (Y) | Bulky rings; Tyr and Trp can be polar |
| Polar uncharged | Ser (S), Thr (T), Asn (N), Gln (Q) | Form hydrogen bonds; on protein surfaces |
| Positively charged (basic) | Lys (K), Arg (R), His (H) | Carry positive charge at physiological pH |
| Negatively charged (acidic) | Asp (D), Glu (E) | Carry negative charge at physiological pH |
| Special | Cys (C) | Forms disulfide bonds within or between chains |
The identity of the side chain at each position along the chain determines how the protein folds and what it can do. A single amino acid change — as in sickle-cell disease, where a glutamic acid (E) is replaced by valine (V) in hemoglobin — can profoundly alter a protein’s behavior.
// Translate a short coding sequence; the final codon (TGA) is a stop codon
let gene = "ATGGCTAGCAAAGACTTCACCGAGTACCTGCAGAACCTGATCGGCAAATGA"
let protein = Seq.translate(gene)
print("Protein: " + protein)
print("Length: " + Seq.length(protein) + " amino acids")
// Summarize the composition and properties of the translated peptide
let protein = "MASKDFTEYLQNLIGK"
print("Amino acid composition and properties:")
print(Struct.protein_props(protein))
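To make the translation step concrete, here is a minimal Python sketch of what a function like `Seq.translate` presumably does. It is not the lesson's own implementation: the `translate` helper and the `CODON_TABLE` name are ours, and the sketch assumes the standard genetic code, a sequence that starts in frame, and translation that stops at the first stop codon.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3].upper()]
        if residue == "*":  # stop codon: end of the polypeptide
            break
        protein.append(residue)
    return "".join(protein)


gene = "ATGGCTAGCAAAGACTTCACCGAGTACCTGCAGAACCTGATCGGCAAATGA"
protein = translate(gene)
print(protein, len(protein))  # MASKDFTEYLQNLIGK 16
```

Run on the gene above, the sketch yields the same 16-residue peptide, MASKDFTEYLQNLIGK, that the second snippet analyzes.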
The Shape of a Protein Is Specified by Its Amino Acid Sequence
The central insight of structural biology is that the primary structure (amino acid sequence) of a protein contains all the information needed to specify its three-dimensional shape. Christian Anfinsen demonstrated this in the early 1960s with the enzyme ribonuclease: when the enzyme was unfolded (denatured) by reducing its disulfide bonds and adding a denaturing agent such as urea, it lost its activity, yet it spontaneously refolded into its native, active structure once the denaturant and reducing agent were removed. The sequence alone was sufficient.
This means that the three-dimensional structure of every protein is, in principle, encoded in the genome. The challenge is decoding that information — predicting the folded structure from the sequence.
Proteins Fold into a Conformation of Lowest Energy
A protein folds into the conformation that minimizes its free energy — the thermodynamically most stable state. The driving forces include:
- The hydrophobic effect — the dominant force. Nonpolar side chains are buried in the protein interior, away from water. This releases ordered water molecules from the protein surface, increasing entropy.
- Hydrogen bonds — between backbone atoms (stabilizing secondary structures) and between polar side chains and water or other side chains
- Van der Waals interactions — close packing of atoms in the interior contributes favorable contacts
- Electrostatic interactions — salt bridges between oppositely charged residues
- Disulfide bonds — covalent S-S bonds between cysteine residues, especially important for secreted proteins
Folding is not a random search through all possible conformations — that would take longer than the age of the universe (the Levinthal paradox). Instead, proteins fold along energy funnels, rapidly collapsing into near-native states guided by the formation of local secondary structures and hydrophobic collapse, then fine-tuning to the final native conformation.
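The Levinthal argument is easy to check with back-of-the-envelope arithmetic. The numbers in this sketch are illustrative assumptions rather than values from the lesson: a 100-residue chain, three backbone conformations per residue, and a generous sampling rate of 10^13 conformations per second.

```python
# Illustrative assumptions (not measured values):
residues = 100                # length of a small protein
states_per_residue = 3        # coarse count of backbone conformations per residue
samples_per_second = 1e13     # optimistic rate of conformational sampling

conformations = states_per_residue ** residues
seconds = conformations / samples_per_second
years = seconds / (365.25 * 24 * 3600)
age_of_universe_years = 1.38e10   # approximate

print(f"conformations to search: {conformations:.2e}")
print(f"exhaustive search time:  {years:.2e} years")
print(f"multiples of the age of the universe: {years / age_of_universe_years:.2e}")
```

Under these assumptions an exhaustive search would take on the order of 10^27 years, which is why real proteins, which often fold in milliseconds to seconds, cannot be searching at random.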
In the cell, molecular chaperones assist folding by preventing aggregation. Hsp70 binds exposed hydrophobic patches on nascent proteins, while the barrel-shaped chaperonin GroEL/GroES provides an enclosed chamber where proteins can fold in isolation. When proteins fail to fold correctly, they may be targeted for destruction by the proteasome — or, if they aggregate, they can cause diseases such as Alzheimer’s, Parkinson’s, and prion diseases.
α Helices and β Sheets
The two most common secondary structure elements are the α helix and the β sheet, both stabilized by hydrogen bonds between backbone atoms.
The α helix is a right-handed coil with 3.6 amino acids per turn. Each backbone N−H group forms a hydrogen bond with the C=O group four residues earlier in the sequence. The side chains project outward from the helix surface. α helices are common in membrane-spanning regions (where hydrophobic side chains face the lipid bilayer) and in structural proteins like α-keratin (hair, nails).
The β sheet consists of extended strands (called β strands) lying side by side, connected by hydrogen bonds between adjacent strands. Sheets can be parallel (all strands run in the same N→C direction), antiparallel (alternating directions), or mixed. Antiparallel sheets are somewhat more stable because the hydrogen bonds are better aligned. β sheets form the core of many globular proteins and are the dominant structure in silk fibroin and amyloid fibrils.
Other secondary structure elements include turns (tight loops connecting helices and strands), 3₁₀ helices (tighter than α helices, with hydrogen bonds spanning three residues), and coils or loops (irregular regions that are nonetheless defined in three-dimensional space).
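Returning to the α helix, its geometry follows from a few numbers. The sketch below uses the 3.6 residues per turn quoted above together with the standard rise of 1.5 Å per residue; the 30 Å thickness for the hydrophobic core of a lipid bilayer is a rough assumption for illustration, and the `wheel_angles` helper is a hypothetical name of ours.

```python
RESIDUES_PER_TURN = 3.6        # from the text
RISE_PER_RESIDUE = 1.5         # angstroms along the helix axis (standard value)
BILAYER_CORE_THICKNESS = 30.0  # angstroms; rough assumption for illustration

degrees_per_residue = 360.0 / RESIDUES_PER_TURN          # rotation per residue
pitch = RESIDUES_PER_TURN * RISE_PER_RESIDUE             # rise per full turn
residues_to_span = BILAYER_CORE_THICKNESS / RISE_PER_RESIDUE


def wheel_angles(sequence):
    """Angular position of each side chain when viewed down the helix axis."""
    return [(aa, (i * degrees_per_residue) % 360.0) for i, aa in enumerate(sequence)]


print(f"rotation per residue: {degrees_per_residue:.0f} degrees")
print(f"pitch: {pitch:.1f} angstroms per turn")
print(f"residues needed to span the bilayer core: {residues_to_span:.0f}")
for aa, angle in wheel_angles("AELKALAKELG"):   # arbitrary example peptide
    print(aa, round(angle))
```

Because each residue is rotated about 100° from the previous one, residues three or four positions apart end up on the same face of the helix, which is how a helix can present one hydrophobic face and one polar face.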
let pdb_data = "HEADER LYSOZYME\nATOM 1 N ALA A 1 1.000 2.000 3.000 1.00 30.00\nATOM 2 CA ALA A 1 2.000 3.000 4.000 1.00 30.00\nATOM 3 C ALA A 1 3.000 4.000 5.000 1.00 30.00\nATOM 4 N LEU A 2 4.000 5.000 6.000 1.00 28.00\nATOM 5 CA LEU A 2 5.000 6.000 7.000 1.00 28.00\nATOM 6 C LEU A 2 6.000 7.000 8.000 1.00 28.00\nATOM 7 N GLU A 3 7.000 8.000 9.000 1.00 25.00\nATOM 8 CA GLU A 3 8.000 9.000 10.000 1.00 25.00\nATOM 9 C GLU A 3 9.000 10.000 11.000 1.00 25.00\nEND"
print("Secondary structure prediction:")
print(Struct.secondary_structure(pdb_data))
Protein Domains
Most proteins larger than about 200 amino acids are organized into domains — compact, semi-independent folding units, each typically 50–350 residues. Domains are the fundamental units of protein structure, function, and evolution.
Each domain folds independently and often has a specific function:
- DNA-binding domains — recognize specific DNA sequences (e.g., zinc fingers, helix-turn-helix, leucine zippers)
- Catalytic domains — contain the enzyme active site
- Protein-interaction domains — mediate binding to other proteins (e.g., SH2 recognizes phosphotyrosine, SH3 recognizes proline-rich motifs)
- Membrane-binding domains — anchor proteins to lipid bilayers (e.g., PH domains)
Crucially, some domains form stable structures that can function independently when isolated from the rest of the protein. This modularity means that evolution can create new proteins by shuffling existing domains — combining a DNA-binding domain from one protein with a catalytic domain from another to produce a new transcription-regulated enzyme.
Protein Families and Classification
The universe of known protein structures reveals that proteins can be classified into a manageable number of families based on sequence similarity, and into a smaller number of fold types based on the arrangement of secondary structure elements.
Few of the astronomically many possible polypeptide chains would be useful to cells. A random sequence of 300 amino acids (20³⁰⁰ ≈ 10³⁹⁰ possibilities) would almost certainly not fold into a stable, functional structure. Evolution has explored only a tiny fraction of sequence space, but through gene duplication and divergence, it has generated families of related proteins that share detectable sequence similarity.
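The size of sequence space quoted above can be verified in a couple of lines. This is plain arithmetic with no assumptions beyond the 20-letter alphabet and the 300-residue length used in the text.

```python
import math

length = 300     # residues in the chain
alphabet = 20    # standard amino acids

# log10(20^300) = 300 * log10(20)
exponent = length * math.log10(alphabet)
print(f"20^{length} is about 10^{exponent:.1f}")

# Exact integer arithmetic confirms the order of magnitude.
digits = len(str(alphabet ** length))
print(f"20^{length} has {digits} decimal digits")
```

For comparison, the number of atoms in the observable universe is usually estimated at around 10^80, so no conceivable process could have sampled more than a vanishing fraction of these sequences.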
Sequence homology searches, using tools like BLAST and HMMER, can identify relatives of any protein. If two proteins share more than ~30% sequence identity over a substantial length, they are almost certainly homologous and likely share the same fold. Between roughly 20% and 30% identity (the “twilight zone”), sequence similarity alone becomes an unreliable guide, and below that range structural comparisons are generally needed to detect a relationship.
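Percent identity itself is a simple count over aligned positions. The sketch below is one common way to compute it, assuming the two sequences are already aligned to the same length with `-` marking gaps; the `percent_identity` helper is ours, and other tools may use a different denominator (for example, the length of the shorter sequence).

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned, non-gap columns in which both residues match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


wild_type = "MASKDFTEYLQNLIGK"
variant = "MASKDFTQYLQNLIGK"   # single E-to-Q substitution
print(percent_identity(wild_type, variant))  # 93.75
```

Identity thresholds only make sense over a substantial alignment length: a 15-of-16 match in a short peptide like this one says far less about homology than 30% identity sustained over 300 residues.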
Intrinsically Disordered Proteins
Not all functional proteins fold into stable three-dimensional structures. Intrinsically disordered proteins (IDPs) or regions (IDRs) lack a fixed structure under physiological conditions, existing instead as dynamic ensembles of conformations. An estimated 30–40% of eukaryotic proteins contain significant disordered regions.
Disordered regions are often enriched in charged and polar residues (Glu, Lys, Ser, Gln) and depleted in hydrophobic residues that would drive folding. Far from being non-functional, IDPs play critical roles in signaling, transcriptional regulation, and molecular assembly. Their flexibility allows them to bind multiple partners, to fold upon binding (coupled folding and binding, which the “fly-casting” model suggests can speed up partner capture), and to form dynamic hubs in regulatory networks.
Many disease-associated proteins, including p53 (the “guardian of the genome”) and α-synuclein (implicated in Parkinson’s disease), contain extensive disordered regions.
Multi-Subunit Proteins: Quaternary Structure
Many functional proteins are not single polypeptide chains but assemblies of multiple subunits. The arrangement of subunits is called quaternary structure.
Hemoglobin, for example, is a tetramer of two α-subunits and two β-subunits (α₂β₂). The subunits interact through noncovalent bonds at specific interfaces. This multi-subunit architecture enables cooperativity: the binding of oxygen to one subunit increases the affinity of the remaining subunits, producing the sigmoidal oxygen-binding curve essential for efficient oxygen delivery from lungs to tissues.
Other examples include the proteasome (whose 20S core particle alone has 28 subunits), the ribosome (over 50 protein subunits plus rRNA), and viral capsids (built from many copies of one or a few protein types). In each case, the quaternary assembly creates functions that transcend those of the individual subunits.
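The benefit of hemoglobin's sigmoidal binding curve can be illustrated with the Hill equation, a standard empirical approximation of cooperative binding. Everything numeric here is an assumption for illustration rather than a value from the lesson: a half-saturation pressure of about 26 mmHg, a Hill coefficient of about 2.8, and oxygen pressures of 100 mmHg in the lungs and 20 mmHg in working tissue.

```python
def hill_saturation(p_o2: float, p50: float, n: float) -> float:
    """Fractional saturation Y = pO2^n / (P50^n + pO2^n)."""
    return p_o2 ** n / (p50 ** n + p_o2 ** n)


P50 = 26.0                   # mmHg, approximate textbook value (assumption)
HILL_N = 2.8                 # approximate Hill coefficient for hemoglobin (assumption)
LUNGS, TISSUE = 100.0, 20.0  # mmHg, illustrative oxygen pressures


def delivered_fraction(n: float) -> float:
    """Fraction of binding sites that release oxygen between lungs and tissue."""
    return hill_saturation(LUNGS, P50, n) - hill_saturation(TISSUE, P50, n)


for label, n in (("cooperative", HILL_N), ("non-cooperative", 1.0)):
    print(f"{label:16s} n={n}: delivers {delivered_fraction(n):.2f} of capacity")
```

With these assumed parameters the cooperative carrier unloads roughly 0.65 of its capacity, against roughly 0.36 for a hypothetical non-cooperative carrier with the same half-saturation pressure.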
Protein Sequence Databases
Two databases form the foundation of protein bioinformatics:
UniProt is the definitive protein sequence database, containing over 250 million sequences. It has two components: Swiss-Prot (manually reviewed and annotated, ~570,000 entries) and TrEMBL (automatically annotated, the vast majority of entries). Each entry includes the sequence, function, domain architecture, post-translational modifications, disease associations, and cross-references to other databases.
The Protein Data Bank (PDB) stores experimentally determined three-dimensional structures solved by X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy. With over 200,000 structures, the PDB is the primary resource for understanding how proteins look and work at atomic resolution.
PDB files encode atomic coordinates in a structured text format. We can extract summary information — such as the number of residues, chains, and atoms — directly from PDB data:
let pdb_data = "HEADER LYSOZYME\nATOM 1 N ALA A 1 1.000 2.000 3.000 1.00 30.00\nATOM 2 CA ALA A 1 2.000 3.000 4.000 1.00 30.00\nATOM 3 C ALA A 1 3.000 4.000 5.000 1.00 30.00\nATOM 4 N LEU A 2 4.000 5.000 6.000 1.00 28.00\nATOM 5 CA LEU A 2 5.000 6.000 7.000 1.00 28.00\nATOM 6 C LEU A 2 6.000 7.000 8.000 1.00 28.00\nATOM 7 N GLU A 3 7.000 8.000 9.000 1.00 25.00\nATOM 8 CA GLU A 3 8.000 9.000 10.000 1.00 25.00\nATOM 9 C GLU A 3 9.000 10.000 11.000 1.00 25.00\nEND"
let info = Struct.pdb_info(pdb_data)
print("PDB structure summary:")
print(info)
Substitution Matrices: BLOSUM and PAM
When comparing protein sequences, not all amino acid substitutions are equal. A leucine-to-isoleucine change (both hydrophobic) is far more common and less disruptive than a glycine-to-tryptophan change. Substitution matrices capture these differences quantitatively.
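PAM matrices (Point Accepted Mutation) were developed by Margaret Dayhoff and colleagues in the 1970s from observed substitutions in closely related proteins, then extrapolated to longer evolutionary distances. PAM1 corresponds to one accepted point mutation per 100 residues (about 1% divergence); PAM250 corresponds to 250 accepted mutations per 100 residues, by which point roughly 80% of positions are observed to differ because many sites have changed more than once.
BLOSUM matrices (BLOcks SUbstitution Matrix) were derived by Steven and Jorja Henikoff in 1992 from ungapped alignments of conserved protein blocks. BLOSUM62 (built after clustering sequences that are 62% or more identical, so that close relatives are not overcounted) is the default matrix in protein BLAST and the most widely used substitution matrix in bioinformatics.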
Each matrix entry represents the log-odds score for a particular amino acid substitution: positive values indicate substitutions observed more often than expected by chance (conservative changes), negative values indicate substitutions observed less often (disruptive changes).
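A log-odds score compares how often a pair of residues is seen aligned with how often it would be expected by chance. The sketch below uses the half-bit scaling of BLOSUM62, where each score is twice the base-2 logarithm of the ratio, rounded to an integer. The frequencies are invented for illustration and are not taken from any published matrix; `q_ij` is treated here as the frequency of the ordered aligned pair.

```python
import math


def log_odds_half_bits(q_ij: float, p_i: float, p_j: float) -> int:
    """Score = 2 * log2(observed pair frequency / expected by chance), rounded."""
    expected = p_i * p_j
    return round(2 * math.log2(q_ij / expected))


# Hypothetical frequencies, chosen only to give round numbers.
# Conservative pair: observed twice as often as chance predicts.
print(log_odds_half_bits(q_ij=0.014, p_i=0.10, p_j=0.07))    # 2

# Disruptive pair: observed one quarter as often as chance predicts.
print(log_odds_half_bits(q_ij=0.0002, p_i=0.08, p_j=0.01))   # -4
```

Because the scores are logarithms, adding them along an alignment corresponds to multiplying the underlying odds ratios, which is why alignment scores are simple sums of matrix entries.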
let prot_a = "MASKDFTEYLQNLIGK"
let prot_b = "MASKDFTQYLQNLIGK"
print("Protein A (wild-type): " + prot_a)
print("Protein B (E→Q variant): " + prot_b)
print("Properties of wild-type:")
print(Struct.protein_props(prot_a))
print("Properties of E→Q variant:")
print(Struct.protein_props(prot_b))
Exercise: Alpha-Helix vs Beta-Sheet Preferences
Different amino acids have different propensities for forming alpha helices vs beta sheets. Alanine (A), leucine (L), and glutamate (E) strongly favor helices, while valine (V), isoleucine (I), and tyrosine (Y) favor sheets. Compare the properties of a helix-favoring peptide and a sheet-favoring peptide.
let helix_peptide = "AELKALAKELG"
let sheet_peptide = "VIYVSIVTYVI"
print("Helix-favoring peptide: " + helix_peptide)
print(Struct.protein_props(helix_peptide))
print("Sheet-favoring peptide: " + sheet_peptide)
print(Struct.protein_props(sheet_peptide))
// Which peptide has more charged residues — the helix-former or the sheet-former?
let answer = "helix"
print(answer)
The helix-forming peptide contains charged residues (E, K) that can form salt bridges along the helix surface, while the sheet-former is dominated by branched hydrophobic residues (V, I) that pack well in extended conformations.
Secondary Structure Prediction
Given an amino acid sequence, secondary structure prediction algorithms estimate which residues form α helices, β strands, or coils. Modern methods achieve roughly 85% per-residue accuracy on this three-state problem (the Q3 score).
PSIPRED uses position-specific scoring matrices derived from PSI-BLAST searches and neural networks to predict secondary structure. JPred combines multiple alignment profiles with a consensus approach. These tools are typically the first step in analyzing a new protein sequence, providing a rough outline of the structural elements before attempting full 3D prediction.
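Per-residue accuracy for three-state prediction is usually reported as Q3: the percentage of residues whose predicted state (H for helix, E for strand, C for coil) matches the state assigned from the experimental structure. The strings in this sketch are made up for illustration and do not come from any real predictor, and the `q3_accuracy` helper is ours.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions where predicted and observed states agree."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        return 0.0
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 20-residue example (H = helix, E = strand, C = coil).
observed = "CCHHHHHHHHCCEEEEECCC"
predicted = "CCHHHHHHCCCCEEEECCCC"
print(q3_accuracy(predicted, observed))  # 85.0
```

Note that the three errors in this example all fall at the ends of a helix or strand, which is typical: predictors tend to locate elements correctly but misjudge exactly where they start and stop.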
Protein Fold Classification: SCOP and CATH
The relatively small number of distinct protein folds (~1,500 known) has been systematically classified by two major databases:
SCOP (Structural Classification of Proteins) organizes structures hierarchically: Class (all-α, all-β, α/β, α+β) → Fold (similar arrangement of secondary structures) → Superfamily (probable common ancestor) → Family (clear sequence similarity).
CATH (Class, Architecture, Topology, Homologous superfamily) uses a similar hierarchy but with more emphasis on automated classification. Both databases reveal that the same folds appear again and again across unrelated protein families, suggesting that the number of stable folds is limited and much of fold space has been explored by evolution.
From Homology Modeling to AlphaFold
Predicting the three-dimensional structure of a protein from its sequence alone has been one of biology’s grand challenges. Several approaches have been developed:
Homology modeling (comparative modeling) builds a 3D model of a target protein based on the known structure of a homologous protein (the template). If a template with >30% sequence identity is available, homology modeling can produce highly accurate models. Programs like MODELLER and SWISS-MODEL automate this process.
Threading (fold recognition) goes further: it fits a target sequence onto each of the ~1,500 known protein folds and evaluates which fold best accommodates the sequence, even when detectable sequence similarity is low.
Ab initio prediction attempts to predict structure from physical principles alone, without templates. This remained unreliable for decades until the advent of deep learning.
AlphaFold (DeepMind, 2018) and especially its successor AlphaFold2 (2020) represented a breakthrough, with AlphaFold2 reaching near-experimental accuracy for many protein structures. Trained on known structures from the PDB and using evolutionary information from multiple sequence alignments, it predicts 3D coordinates directly from sequence. The AlphaFold Protein Structure Database now provides predicted structures for over 200 million proteins, covering nearly every protein sequence catalogued in UniProt, and has turned structure determination from an experimental bottleneck into a widely available computational resource.
RoseTTAFold and other deep-learning models have similarly advanced the field, making structure prediction accessible to any researcher with a protein sequence.
Structure Visualization and Validation
Protein structures are visualized using molecular graphics software:
- PyMOL — powerful, scriptable, widely used in publications
- Mol* (Molstar) — web-based, used by the PDB and AlphaFold database
- ChimeraX — successor to UCSF Chimera, excellent for cryo-EM data
Structure quality is assessed using Ramachandran plots, which show the backbone dihedral angles (φ and ψ) for each residue. Most residues fall in well-defined allowed regions of the plot (corresponding to helices, sheets, and turns). Residues in disallowed regions often point to errors in the model, although glycine, which lacks a side chain, can legitimately adopt a wider range of angles. Other validation metrics include the R-factor (for crystallographic structures) and per-residue confidence scores (pLDDT, on a 0–100 scale) for AlphaFold predictions.
let pdb_data = "HEADER LYSOZYME\nATOM 1 N ALA A 1 1.000 2.000 3.000 1.00 30.00\nATOM 2 CA ALA A 1 2.000 3.000 4.000 1.00 30.00\nATOM 3 C ALA A 1 3.000 4.000 5.000 1.00 30.00\nATOM 4 N LEU A 2 4.000 5.000 6.000 1.00 28.00\nATOM 5 CA LEU A 2 5.000 6.000 7.000 1.00 28.00\nATOM 6 C LEU A 2 6.000 7.000 8.000 1.00 28.00\nATOM 7 N GLU A 3 7.000 8.000 9.000 1.00 25.00\nATOM 8 CA GLU A 3 8.000 9.000 10.000 1.00 25.00\nATOM 9 C GLU A 3 9.000 10.000 11.000 1.00 25.00\nEND"
print("Ramachandran analysis:")
print(Struct.ramachandran(pdb_data))
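Each point on a Ramachandran plot is a pair of dihedral angles computed from four consecutive backbone atoms: φ from C(i−1), N(i), Cα(i), C(i), and ψ from N(i), Cα(i), C(i), N(i+1). The sketch below shows the underlying geometry in plain Python. It is not how `Struct.ramachandran` is necessarily implemented, the helper names are ours, and the test points are synthetic. The toy coordinates in the snippet above lie on a straight line, so they would not give a defined angle.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3) -> float:
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    if _dot(v, v) < 1e-12 or _dot(w, w) < 1e-12:
        raise ValueError("collinear atoms: dihedral is undefined")
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def phi(c_prev, n, ca, c) -> float:
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next) -> float:
    return dihedral(n, ca, c, n_next)


# Synthetic check: a quarter turn about the central bond.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # 90.0
```

Applying `phi` and `psi` to every residue with complete neighbors produces the scatter of points that validation tools compare against the allowed regions.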
let pdb_data = "HEADER LYSOZYME\nATOM 1 N ALA A 1 1.000 2.000 3.000 1.00 30.00\nATOM 2 CA ALA A 1 2.000 3.000 4.000 1.00 30.00\nATOM 3 C ALA A 1 3.000 4.000 5.000 1.00 30.00\nATOM 4 N LEU A 2 4.000 5.000 6.000 1.00 28.00\nATOM 5 CA LEU A 2 5.000 6.000 7.000 1.00 28.00\nATOM 6 C LEU A 2 6.000 7.000 8.000 1.00 28.00\nATOM 7 N GLU A 3 7.000 8.000 9.000 1.00 25.00\nATOM 8 CA GLU A 3 8.000 9.000 10.000 1.00 25.00\nATOM 9 C GLU A 3 9.000 10.000 11.000 1.00 25.00\nEND"
print("Contact map (8Å threshold):")
print(Struct.contact_map(pdb_data, "8.0"))
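A contact map records which pairs of residues lie within a distance cutoff of each other. The sketch below uses one common definition, the Cα–Cα distance, on the same toy coordinates; whether `Struct.contact_map` uses this definition is not stated in the lesson. The parser assumes whitespace-separated fields, which works for this toy string, whereas real PDB files are fixed-column and are better read by column position or with a dedicated library.

```python
import math

PDB_TEXT = """HEADER LYSOZYME
ATOM 1 N ALA A 1 1.000 2.000 3.000 1.00 30.00
ATOM 2 CA ALA A 1 2.000 3.000 4.000 1.00 30.00
ATOM 3 C ALA A 1 3.000 4.000 5.000 1.00 30.00
ATOM 4 N LEU A 2 4.000 5.000 6.000 1.00 28.00
ATOM 5 CA LEU A 2 5.000 6.000 7.000 1.00 28.00
ATOM 6 C LEU A 2 6.000 7.000 8.000 1.00 28.00
ATOM 7 N GLU A 3 7.000 8.000 9.000 1.00 25.00
ATOM 8 CA GLU A 3 8.000 9.000 10.000 1.00 25.00
ATOM 9 C GLU A 3 9.000 10.000 11.000 1.00 25.00
END"""


def ca_coordinates(text: str) -> dict:
    """Map residue number -> (x, y, z) of its alpha carbon."""
    coords = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 9 and fields[0] == "ATOM" and fields[2] == "CA":
            coords[int(fields[5])] = tuple(float(v) for v in fields[6:9])
    return coords


def contact_pairs(coords: dict, threshold: float = 8.0) -> list:
    """Residue pairs whose alpha carbons are within `threshold` angstroms."""
    residues = sorted(coords)
    return [
        (i, j)
        for index, i in enumerate(residues)
        for j in residues[index + 1:]
        if math.dist(coords[i], coords[j]) <= threshold
    ]


print(contact_pairs(ca_coordinates(PDB_TEXT)))  # [(1, 2), (2, 3)]
```

In this toy chain only sequence neighbors are in contact. In a folded protein, the informative contacts are those between residues far apart in sequence, which show up as off-diagonal features of the map.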
Intrinsically Disordered Region Prediction
Computational tools predict intrinsically disordered regions from sequence alone:
- IUPred uses energy estimation based on pairwise amino acid interactions — regions with unfavorable folding energetics are predicted as disordered
- PONDR (Predictor Of Natural Disordered Regions) uses neural networks trained on known disordered and ordered sequences
- MobiDB aggregates predictions from multiple tools to provide consensus disorder annotations
These predictions are increasingly important for understanding signaling proteins, transcription factors, and the “dark proteome” of proteins that lack stable structures.
Exercise: Hydrophobic Core vs Surface Residues
Proteins have a hydrophobic core (nonpolar residues buried inside) and a hydrophilic surface (charged and polar residues exposed to water). Compare the properties of a sequence representing core residues to one representing surface residues to see how they differ.
let core_residues = "VLLFIAVLIF"
let surface_residues = "EKDRSNQEKT"
print("Hydrophobic core residues:")
print(Struct.protein_props(core_residues))
print("Surface-exposed residues:")
print(Struct.protein_props(surface_residues))
// Which sequence has higher hydrophobicity — "core" or "surface"?
let answer = "core"
print(answer)
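One widely used way to put a number on hydrophobicity is the GRAVY score: the average of the Kyte-Doolittle hydropathy values over all residues in a sequence. The sketch below computes it for the two sequences in this exercise. Whether `Struct.protein_props` uses this particular scale is not stated in the lesson, so treat this as an independent illustration.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence.upper()) / len(sequence)


core_residues = "VLLFIAVLIF"
surface_residues = "EKDRSNQEKT"
print(f"core:    {gravy(core_residues):.2f}")     # 3.62
print(f"surface: {gravy(surface_residues):.2f}")  # -3.13
```

The strongly positive score for the core sequence and the strongly negative score for the surface sequence match the expected answer to the exercise.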
Exercise: Soluble or Aggregation-Prone?
Proteins enriched in hydrophobic residues with few charged residues tend to aggregate, while balanced sequences with mixed polar and nonpolar residues are typically soluble. Analyze two peptide sequences and predict which one is more likely to be soluble.
let peptide_a = "VVFILLVVFILAVVF"
let peptide_b = "AEKDLSKVGEAMRSK"
print("Peptide A (hydrophobic-rich):")
print(Struct.protein_props(peptide_a))
print("Peptide B (mixed polar/nonpolar):")
print(Struct.protein_props(peptide_b))
// Which peptide is more likely soluble — "peptide_a" or "peptide_b"?
let answer = "peptide_b"
print(answer)
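The reasoning behind this exercise can be made explicit by counting charged and hydrophobic residues. The sketch below is a crude compositional heuristic, not a solubility predictor. The residue groupings are simplifying assumptions of ours: histidine is left out of the positive set because it is mostly uncharged near pH 7, and the hydrophobic set is one of several reasonable choices.

```python
POSITIVE = set("KR")          # His omitted: mostly neutral near pH 7 (assumption)
NEGATIVE = set("DE")
HYDROPHOBIC = set("AVLIMFW")  # one reasonable grouping among several (assumption)


def composition(sequence: str) -> dict:
    """Approximate net charge and residue-class fractions for a peptide."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    positive = sum(aa in POSITIVE for aa in sequence)
    negative = sum(aa in NEGATIVE for aa in sequence)
    hydrophobic = sum(aa in HYDROPHOBIC for aa in sequence)
    return {
        "net_charge": positive - negative,
        "charged_fraction": (positive + negative) / n,
        "hydrophobic_fraction": hydrophobic / n,
    }


peptide_a = "VVFILLVVFILAVVF"
peptide_b = "AEKDLSKVGEAMRSK"
print("peptide_a:", composition(peptide_a))
print("peptide_b:", composition(peptide_b))
```

Peptide A is entirely hydrophobic with no charged residues, while nearly half of peptide B's residues carry a charge, which is consistent with peptide B being the more soluble of the two.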
Summary
In this lesson you covered protein shape, structure, and the computational tools for analyzing them:
- 20 amino acids with diverse side chains determine protein properties; sequence specifies structure (Anfinsen’s experiment)
- Proteins fold to minimize free energy — driven primarily by the hydrophobic effect, with chaperones assisting in the cell
- α helices and β sheets are the dominant secondary structures, stabilized by backbone hydrogen bonds
- Domains are modular, independently folding units that can be shuffled by evolution to create new proteins
- Few polypeptide sequences are useful — evolution has explored a small fraction of sequence space, creating protein families through duplication and divergence
- Sequence homology searches (BLAST, HMMER) identify related proteins; >30% identity reliably indicates shared fold
- Intrinsically disordered proteins lack stable structure yet play critical roles in signaling and regulation
- Quaternary structure — multi-subunit assemblies (like hemoglobin) enable cooperativity and complex functions
- UniProt (sequences) and PDB (3D structures) are the foundational protein databases
- Substitution matrices (BLOSUM62, PAM) quantify the likelihood of amino acid changes during evolution
- Secondary structure prediction (PSIPRED, JPred) achieves ~85% per-residue accuracy
- Fold classification (SCOP, CATH) reveals ~1,500 distinct protein folds
- Structure prediction has progressed from homology modeling and threading to AlphaFold, which achieves near-experimental accuracy using deep learning
- Visualization (PyMOL, Mol*, ChimeraX) and validation (Ramachandran plots, pLDDT) assess structure quality
- Disorder prediction (IUPred, PONDR) identifies intrinsically disordered regions from sequence
References
- Alberts B, Johnson A, Lewis J, Morgan D, Raff M, Roberts K, Walter P. Molecular Biology of the Cell, 7th ed. New York: W.W. Norton; 2022. Chapter 3: Proteins.
- Anfinsen CB. Principles that govern the folding of protein chains. Science. 1973;181(4096):223–230.
- Berman HM, Westbrook J, Feng Z, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–242. https://www.rcsb.org/
- Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589.
- Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577–2637.
- Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–3402.
- Mészáros B, Erdős G, Dosztányi Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 2018;46(W1):W329–W337.
- Varadi M, Anyango S, Deshpande M, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50(D1):D439–D444.