Eukaryotic Genome Organization

Beginner Cell Biology ~35 min

Discover how eukaryotic genomes are structured — from endosymbiosis and hybrid genomes to introns, regulatory DNA, repetitive elements, and the computational methods used to assemble and annotate them.

Introduction

The previous two lessons explored what all cells share and how genomes differ across the three domains of life. This lesson focuses on the eukaryotic branch — the domain that includes fungi, plants, animals, and protists. Eukaryotic genomes are vastly larger than those of prokaryotes, but the extra DNA is not simply “more genes.” Instead, eukaryotic genomes are distinguished by their complex architecture: interrupted genes, vast regulatory regions, armies of repetitive elements, and a hierarchical organization that enables a single genome to direct the development of hundreds of distinct cell types.

We will also introduce the computational methods used to make sense of these genomes: how raw sequencing reads are assembled into chromosomes, how genes are predicted, and how comparing genomes across model organisms reveals the logic of eukaryotic biology.

Eukaryotic Cells May Have Originated as Predators

How did the first eukaryotic cell arise? One influential hypothesis proposes that the ancestral eukaryote was a predatory cell — larger than typical prokaryotes, capable of engulfing other cells by phagocytosis (cell eating). This predatory lifestyle would have given it access to new food sources and, critically, set the stage for one of the most consequential events in the history of life: endosymbiosis.

Unlike most prokaryotes, which have rigid cell walls, the proto-eukaryote likely had a flexible membrane that could deform and surround prey. The loss of a rigid wall, the acquisition of an internal cytoskeleton for shape and movement, and the development of an endomembrane system (endoplasmic reticulum, Golgi apparatus) may all trace back to this predatory ancestor. The nucleus itself may have evolved as a way to protect the genome from the DNA-damaging enzymes of engulfed prey.

Modern Eukaryotic Cells Evolved from a Symbiosis

About two billion years ago, the predatory proto-eukaryote engulfed an aerobic α-proteobacterium that it failed to digest. Instead, the bacterium survived inside the host and eventually became the mitochondrion — the organelle responsible for oxidative phosphorylation, the process that generates most of a eukaryotic cell’s ATP.

This endosymbiotic origin is supported by compelling evidence: mitochondria have their own circular DNA genome (similar to bacterial chromosomes), their own ribosomes (which resemble bacterial ribosomes, not eukaryotic ones), a double membrane (the inner membrane derived from the bacterium, the outer from the host’s engulfing membrane), and they replicate by binary fission, just like bacteria.

A second endosymbiotic event occurred later in the lineage leading to plants and algae: the engulfment of a cyanobacterium, which became the chloroplast. Like mitochondria, chloroplasts retain their own genome, bacterial-type ribosomes, and double membrane. Some algae have even undergone secondary endosymbiosis — engulfing a eukaryote that already contained chloroplasts — resulting in organelles with three or four membranes.

Eukaryotes Have Hybrid Genomes

Over evolutionary time, most of the genes originally carried by the endosymbiont genomes have been transferred to the nucleus. The human mitochondrial genome retains only 37 genes (13 protein-coding), while the nuclear genome encodes roughly 1,500 proteins that are synthesized in the cytoplasm and imported into mitochondria. Plant nuclear genomes similarly contain hundreds of genes of chloroplast origin.

This makes eukaryotic genomes fundamentally hybrid: a mosaic of genes inherited vertically from the archaeal-like host and genes acquired horizontally from the bacterial endosymbiont. Phylogenetic analysis of individual genes often reveals this dual ancestry — informational genes (those involved in DNA replication, transcription, and translation) tend to resemble archaeal homologs, while operational genes (those involved in metabolism) frequently resemble bacterial homologs.

Eukaryotic Genomes Are Big

The most immediately striking feature of eukaryotic genomes is their size. Prokaryotic genomes are compact — E. coli has 4.6 megabases (Mb) encoding ~4,300 genes, with roughly one gene per kilobase. Eukaryotic genomes are orders of magnitude larger, but the relationship between genome size and gene number is surprisingly weak.

| Organism | Genome size | Protein-coding genes | Gene density |
|---|---|---|---|
| E. coli (bacterium) | 4.6 Mb | ~4,300 | ~1 gene/kb |
| S. cerevisiae (yeast) | 12 Mb | ~6,000 | ~1 gene/2 kb |
| C. elegans (worm) | 100 Mb | ~20,000 | ~1 gene/5 kb |
| D. melanogaster (fruit fly) | 140 Mb | ~14,000 | ~1 gene/10 kb |
| A. thaliana (plant) | 135 Mb | ~27,000 | ~1 gene/5 kb |
| H. sapiens (human) | 3,200 Mb | ~20,000 | ~1 gene/160 kb |
| P. annectens (lungfish) | 43,000 Mb | ~20,000 | ~1 gene/2,000 kb |
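
The gene-density column is simple arithmetic, and it can be reproduced in a few lines. The sketch below is plain Python rather than the lesson's own scripting environment; the dictionary just restates the approximate figures from the table, and `kb_per_gene` is a helper name made up for this illustration.

```python
# Approximate genome sizes (Mb) and protein-coding gene counts from the table.
genomes = {
    "E. coli": (4.6, 4300),
    "Yeast": (12, 6000),
    "Worm": (100, 20000),
    "Fruit fly": (140, 14000),
    "Arabidopsis": (135, 27000),
    "Human": (3200, 20000),
    "Lungfish": (43000, 20000),
}


def kb_per_gene(size_mb, gene_count):
    """Average genomic span per gene in kilobases (1 Mb = 1,000 kb)."""
    return size_mb * 1000 / gene_count


density = {name: kb_per_gene(size, genes) for name, (size, genes) in genomes.items()}
for name, kb in density.items():
    print(f"{name:12s} ~{kb:,.1f} kb per gene")

# How widely do the two quantities range across the table?
sizes = [size for size, _ in genomes.values()]
counts = [genes for _, genes in genomes.values()]
size_fold = max(sizes) / min(sizes)
gene_fold = max(counts) / min(counts)
print(f"Genome size: {size_fold:,.0f}-fold range; gene count: {gene_fold:.1f}-fold range")
```

The calculation treats every gene as if it occupied an equal share of the genome, so it is an average spacing rather than a measured gene length.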

A scatter plot of genome size versus gene count across these organisms makes the C-value paradox unmistakable: genome size spans four orders of magnitude, but gene count varies only about 6-fold (from ~4,300 to ~27,000):

let data = '[{"x": 4.6, "y": 4300, "label": "E. coli"}, {"x": 12, "y": 6000, "label": "Yeast"}, {"x": 100, "y": 20000, "label": "Worm"}, {"x": 140, "y": 14000, "label": "Fruit fly"}, {"x": 135, "y": 27000, "label": "Arabidopsis"}, {"x": 3200, "y": 20000, "label": "Human"}, {"x": 43000, "y": 20000, "label": "Lungfish"}]'
let plot = Viz.scatter(data, '{"title": "Genome Size vs Gene Count — The C-Value Paradox", "x_label": "Genome size (Mb)", "y_label": "Protein-coding genes", "color": "#8b5cf6"}')
print(plot)

Where does all the extra DNA go? In the human genome:

| Component | Approximate share |
|---|---|
| Protein-coding exons | ~1.5% |
| Introns | ~25% |
| Repetitive elements (transposons, SINEs, LINEs) | ~45% |
| Regulatory and other sequences | ~28.5% |

let comp = '[{"label": "Exons", "value": 1.5}, {"label": "Introns", "value": 25}, {"label": "Repeats", "value": 45}, {"label": "Other", "value": 28.5}]'
let chart = Viz.bar(comp, '{"title": "Human Genome Composition (%)", "color": "#8b5cf6"}')
print(chart)

Only about 1.5% of the human genome directly encodes protein. Understanding what the other 98.5% does — and whether some of it is functionally inert — remains one of the central questions of genome biology.
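
To put those percentages in absolute terms, the following Python sketch converts each share into megabases. It assumes the round 3,200 Mb genome size used elsewhere in this lesson and the approximate shares from the table; the variable names are illustrative only.

```python
GENOME_MB = 3200  # approximate human genome size used in this lesson

# Approximate shares (percent) from the composition table.
shares_percent = {
    "Protein-coding exons": 1.5,
    "Introns": 25,
    "Repetitive elements": 45,
    "Regulatory and other sequences": 28.5,
}

# Sanity check: the four categories should account for the whole genome.
assert abs(sum(shares_percent.values()) - 100) < 1e-9

megabases = {name: GENOME_MB * pct / 100 for name, pct in shares_percent.items()}
for name, mb in megabases.items():
    print(f"{name:32s} ~{mb:,.0f} Mb")
```

On these round numbers, protein-coding exons come to only about 48 Mb, while repetitive elements account for well over a gigabase.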

How does genome composition compare across species? A heatmap reveals how the balance between exons, introns, repeats, and other sequences shifts as genomes grow larger:

let composition = '[{"organism": "Yeast", "Exons": 70, "Introns": 5, "Repeats": 3, "Other": 22}, {"organism": "Fruit fly", "Exons": 20, "Introns": 20, "Repeats": 15, "Other": 45}, {"organism": "Human", "Exons": 1.5, "Introns": 25, "Repeats": 45, "Other": 28.5}, {"organism": "Lungfish", "Exons": 0.5, "Introns": 20, "Repeats": 55, "Other": 24.5}]'
let heatmap = Viz.heatmap(composition, '{"title": "Genome Composition Across Species (%)", "x_label": "Component", "y_label": "Organism", "color": "#8b5cf6"}')
print(heatmap)

Exercise: Genome Composition — Exon vs. Repeat

Exonic sequences tend to have higher GC content than repetitive elements in many genomes. Compute the GC content of a short exon sequence and a repetitive element sequence to see which is higher.

let exon = "ATGGCGCCGCTGACCGCGGCGATCCGG"
let repeat_elem = "AATAAAGATAATAAATTTGATACTAAA"
print("Exon GC: " + Seq.gc_content(exon))
print("Repeat GC: " + Seq.gc_content(repeat_elem))
let higher_gc = "Exon"
print(higher_gc)

Introns, Exons, and Alternative Splicing

Most eukaryotic protein-coding genes are interrupted: the coding sequence (exons) is broken up by non-coding intervening sequences (introns). After transcription, the introns are removed by the spliceosome, a large RNA-protein complex, and the exons are joined together to form the mature mRNA.

A typical human gene has 8–10 exons and can span tens to hundreds of kilobases of genomic DNA, even though its coding sequence may be only a few thousand bases. The dystrophin gene, one of the largest known, spans 2.4 Mb but produces an mRNA of only 14 kb.

The split structure of eukaryotic genes enables alternative splicing — the joining of different combinations of exons to produce multiple mRNA variants (and therefore multiple protein isoforms) from a single gene. It is estimated that over 90% of human multi-exon genes undergo some form of alternative splicing. This massively expands the protein repertoire: the ~20,000 human genes can produce an estimated 100,000 or more distinct protein isoforms.

Alternative splicing can include or skip entire exons, retain introns, or use alternative splice sites, generating proteins with different domains, binding properties, or enzymatic activities. This is a key mechanism for generating tissue-specific protein variants — the same gene can produce one isoform in the brain and a different one in the heart.
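
A toy model can make exon joining and exon skipping concrete. The Python sketch below is an illustration under stated assumptions, not a model of the spliceosome: the transcript is invented, exon coordinates are supplied by hand, and `splice` is a hypothetical helper.

```python
def splice(pre_mrna, exons, skip=()):
    """Join exon segments of a transcript, optionally leaving some exons out.

    exons: list of (start, end) half-open, 0-based intervals in transcript order.
    skip:  indices of exons to omit, which models exon skipping.
    """
    return "".join(
        pre_mrna[start:end]
        for i, (start, end) in enumerate(exons)
        if i not in skip
    )


# Invented transcript: uppercase = exon, lowercase = intron.
# Both introns start with "gt" and end with "ag", the canonical boundary bases.
pre_mrna = "ATGGCC" + "gtaagtttttag" + "GAGCTG" + "gtgagtccccag" + "AAATAA"
exons = [(0, 6), (18, 24), (36, 42)]

full_isoform = splice(pre_mrna, exons)               # all three exons
skipped_isoform = splice(pre_mrna, exons, skip={1})  # middle exon skipped

print(full_isoform)
print(skipped_isoform)
```

Both outputs keep the same start and stop codons, but the second lacks the six bases contributed by the middle exon, so the encoded protein would be two amino acids shorter.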

Eukaryotic Genomes Are Rich in Regulatory DNA

Eukaryotic genomes devote a far larger fraction of their sequence to gene regulation than prokaryotes do. A typical eukaryotic gene is controlled by a complex array of regulatory elements:

  • Promoters — immediately upstream of the gene, where the transcription machinery assembles
  • Enhancers — sequences that can be located thousands or even millions of base pairs away but still activate transcription of their target gene by looping through three-dimensional space
  • Silencers — repress transcription of specific genes
  • Insulators — boundary elements that prevent regulatory signals from affecting the wrong genes

This elaborate regulatory architecture is what makes multicellular development possible. A neuron and a liver cell in the same organism contain identical genomes, but their gene expression profiles are radically different. The regulatory DNA, together with the transcription factors and chromatin-modifying enzymes that read it, constitutes the program that directs cells to adopt distinct identities during development.

The Genome Defines the Program of Multicellular Development

A fertilized human egg and a mature neuron contain the same 3.2 billion base pairs of DNA, yet they are utterly different in form and function. The genome does not merely encode proteins — it encodes a developmental program: a set of instructions that specifies which genes are expressed in which cells, at which times, and at what levels.

This program operates through cascades of transcription factors — proteins that bind to regulatory DNA and activate or repress target genes. During development, successive waves of transcription factor activity progressively narrow a cell’s fate, from a totipotent zygote to a terminally differentiated cell type. Master regulators like the Hox genes (conserved from flies to humans) control body patterning, while lineage-specific factors drive the formation of particular tissues.

The nematode Caenorhabditis elegans, with exactly 959 somatic cells in the adult hermaphrodite, provided the first complete map of a developmental lineage: every cell division and every fate decision, traced from egg to adult. This landmark work demonstrated that development is a genetic program executed with remarkable precision.

Many Eukaryotes Live as Solitary Cells

It is tempting to think of eukaryotes as multicellular organisms, but the majority of eukaryotic species are unicellular. Protists — a catchall term for eukaryotes that are not animals, plants, or fungi — include an enormous diversity of single-celled organisms: amoebae, ciliates, flagellates, diatoms, and many more.

These organisms can be astonishingly complex despite being single cells. Paramecium, for example, has a mouth-like oral groove, contractile vacuoles for water regulation, and two kinds of nuclei (a macronucleus for gene expression and a micronucleus for sexual reproduction). The single-celled predator Didinium can hunt and engulf other protists.

Even among organisms we think of as multicellular, many have important unicellular phases. Yeasts are single-celled fungi. Many algae alternate between single-celled and colonial forms. Slime molds spend most of their lives as individual amoebae but can aggregate into a multicellular slug when food is scarce.

Yeast: A Minimal Model Eukaryote

Saccharomyces cerevisiae (baker’s yeast) is one of the simplest and best-studied eukaryotic model organisms. Its 12 Mb genome, distributed across 16 chromosomes, encodes approximately 6,000 genes, a compact toolkit for a free-living eukaryote.

In 1996, yeast became the first eukaryote to have its complete genome sequenced. Because yeast is easy to grow, genetically tractable, and shares fundamental cell biology with all eukaryotes (cell cycle, DNA repair, protein secretion, signal transduction), it has served as a discovery platform for countless basic biological processes. Many human disease genes were first understood through their yeast homologs.

A comprehensive effort to delete every yeast gene one by one (the Yeast Deletion Project) revealed that roughly 1,000 of the 6,000 genes are individually essential for growth under laboratory conditions. The remaining ~5,000 genes are dispensable individually, though many become essential under stress or in combination with other deletions.

let yeast_gene = "ATGTCTGCCCCAGGAACTGCTGTG"
let human_homolog = "ATGTCGGCCCCTGGCACTGCCGTG"
let result = Align.global(yeast_gene, human_homolog)
print("Yeast vs Human homolog:")
print("Score: " + result.score)
print(result.alignment)
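
The lesson does not show how a global alignment score is computed, so here is a minimal Python sketch of the classic Needleman-Wunsch dynamic program. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration and may differ from whatever Align.global uses internally, so the scores are not expected to match.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping only two rows in memory."""
    # Row 0: aligning an empty prefix of `a` against prefixes of `b` costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against an empty prefix of `b`
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


yeast_gene = "ATGTCTGCCCCAGGAACTGCTGTG"
human_homolog = "ATGTCGGCCCCTGGCACTGCCGTG"
score = global_alignment_score(yeast_gene, human_homolog)
print("Toy score:", score)
```

Under this scoring the two 24-base fragments align without gaps, with 20 matches and 4 mismatches. The function returns only the score; recovering the alignment itself would require storing the full matrix and tracing back through it.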

Monitoring Gene Expression: Transcriptomics

One of the most transformative developments in genome biology is the ability to measure the expression levels of all genes simultaneously. This field, known as transcriptomics, reveals which genes are active in a given cell type, developmental stage, or disease state.

Early approaches used DNA microarrays (gene chips): thousands of DNA probes fixed to a glass slide, each corresponding to a different gene. Fluorescently labeled mRNA from a sample is washed over the array, and the intensity of fluorescence at each spot indicates the expression level of that gene.

Modern transcriptomics is dominated by RNA sequencing (RNA-seq), which converts the RNA in a sample to cDNA and sequences it at high depth. RNA-seq has largely replaced microarrays because it does not depend on a predefined probe set and provides a more quantitative and comprehensive view of the transcriptome with greater dynamic range. A more recent advance, single-cell RNA-seq (scRNA-seq), measures gene expression in individual cells, revealing the heterogeneity hidden within seemingly uniform cell populations.

These technologies have revealed that even genetically identical cells can have surprisingly different expression profiles, that most of the genome is transcribed at some level (pervasive transcription), and that non-coding RNAs are far more numerous and diverse than previously appreciated.

Repetitive Elements and Transposons

Nearly half the human genome is composed of repetitive DNA, most derived from transposable elements (transposons) — sequences that can copy themselves and insert into new genomic locations. Major families include:

  • LINEs (Long Interspersed Nuclear Elements): the largest class by share of sequence, dominated by LINE-1 (L1), whose ~500,000 copies account for about 17% of the human genome
  • SINEs (Short Interspersed Nuclear Elements) — including Alu elements, which are primate-specific and number over 1 million copies
  • DNA transposons — move by a cut-and-paste mechanism; mostly inactive in the human genome but active in some other organisms
  • LTR retrotransposons — resemble retroviruses and include human endogenous retroviruses (HERVs)

Most transposable elements in the human genome are now inactive — molecular fossils of ancient transposition events. However, some LINE-1 elements and Alu elements remain active and occasionally cause disease by inserting into genes. Transposon insertions have also been a creative force in evolution, sometimes donating regulatory elements or exons to nearby genes.

The C-Value Paradox and Genome Size Variation

Genome size varies enormously among eukaryotes and does not correlate well with organism complexity — an observation known as the C-value paradox (C for the constant amount of DNA in a haploid genome).

| Organism | Genome size |
|---|---|
| S. cerevisiae (yeast) | 12 Mb |
| A. thaliana (plant) | 135 Mb |
| D. melanogaster (fruit fly) | 140 Mb |
| H. sapiens (human) | 3,200 Mb |
| Allium cepa (onion) | 16,000 Mb |
| Protopterus annectens (lungfish) | 43,000 Mb |
| Paris japonica (plant) | 149,000 Mb |

The lungfish genome is about 13 times larger than the human genome, and that of the plant Paris japonica, at ~149 billion base pairs, is roughly 47 times the size of ours and among the largest genomes known. The variation is explained primarily by differences in repetitive element content and polyploidy (whole-genome duplication), not by differences in gene number. Many multicellular eukaryotes have roughly 15,000–30,000 protein-coding genes regardless of genome size.

let sizes = '[{"label": "Yeast", "value": 12}, {"label": "Arabidopsis", "value": 135}, {"label": "Fruit fly", "value": 140}, {"label": "Human", "value": 3200}, {"label": "Onion", "value": 16000}, {"label": "Lungfish", "value": 43000}]'
let chart = Viz.bar(sizes, '{"title": "Eukaryotic Genome Sizes (Mb) — The C-Value Paradox", "color": "#8b5cf6"}')
print(chart)

Toward Quantitative Biology

Understanding eukaryotic genomes — with their intricate regulatory networks, alternative splicing, pervasive non-coding transcription, and three-dimensional chromatin organization — requires more than qualitative descriptions. It demands mathematics, computational modeling, and large-scale quantitative data.

Systems biology approaches integrate transcriptomic, proteomic, and metabolomic data into computational models of cellular behavior. Machine learning algorithms predict gene regulatory networks from expression data. Statistical methods quantify the significance of biological signals against genomic noise. The era of purely bench-driven biology has given way to an interdisciplinary science where bioinformatics and computational biology are essential partners of experimental work.

Gene Prediction in Eukaryotic Genomes

After a genome is sequenced and assembled, the next challenge is finding the genes within it. Gene prediction in eukaryotes is far more difficult than in prokaryotes because of introns, large intergenic regions, and complex gene structures.

Two complementary approaches are used:

Ab initio (from first principles) prediction uses statistical models trained on known gene structures to identify patterns characteristic of genes: splice site signals, codon usage bias, promoter motifs, and the statistical properties of coding vs. non-coding DNA. Programs like Augustus, GenScan, and SNAP scan the raw sequence for these patterns. Ab initio methods can find genes with no known homologs but have significant false-positive and false-negative rates.
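
To give a feel for signal-based scanning, the Python sketch below looks for the most basic splice signal: candidate introns that begin with GT and end with AG. It is a deliberately naive toy with an arbitrary minimum length, not how Augustus, GenScan, or SNAP work; real predictors score such signals with trained statistical models. The sequence and the `candidate_introns` helper are invented for this example.

```python
import re


def candidate_introns(seq, min_len=10):
    """Return (start, end) half-open spans that start with GT and end with AG."""
    seq = seq.upper()
    # Lookahead patterns find every occurrence, including overlapping ones.
    donors = [m.start() for m in re.finditer("(?=GT)", seq)]
    acceptor_ends = [m.start() + 2 for m in re.finditer("(?=AG)", seq)]
    return [(d, a) for d in donors for a in acceptor_ends if a - d >= min_len]


# Invented fragment: exon, a 12-base intron (lowercase), then another exon.
fragment = "ATGGCC" + "gtaagtttttag" + "GAGCTG"
candidates = candidate_introns(fragment)
for start, end in candidates:
    print(start, end, fragment[start:end])
```

Even on this 24-base fragment the scan reports three candidates, only one of which, (6, 18), is the real intron. That is a small-scale view of why signal matching alone yields false positives and why predictors combine many lines of evidence.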

Homology-based prediction aligns the genome against known proteins or mRNA sequences from the same or related organisms. If a region of the genome aligns well to a known protein, it very likely contains a gene, or a pseudogene derived from one. Programs like GeneWise and Exonerate use protein-to-genome alignment to define exon-intron boundaries. This approach is highly accurate where homologs exist but cannot find truly novel genes.

Modern annotation pipelines combine both approaches, along with transcript evidence from RNA-seq, into a unified gene model. The NCBI RefSeq and Ensembl annotation pipelines are among the most widely used.

Repeat Element Identification: RepeatMasker

Before gene prediction can proceed accurately, repetitive elements must be identified and masked (replaced with N characters or lowercase letters) so that they do not confuse gene-finding algorithms or sequence alignments.
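
The two masking styles mentioned above can be sketched in a few lines of Python. This only illustrates the output format: it assumes the repeat coordinates are already known, whereas finding them is the hard part that a repeat-annotation tool handles. `mask_repeats` and the example interval are hypothetical.

```python
def mask_repeats(seq, repeats, hard=False):
    """Mask repeat intervals in a DNA sequence.

    repeats: iterable of (start, end) half-open, 0-based intervals.
    hard=False gives soft masking (lowercase); hard=True replaces bases with N.
    """
    chars = list(seq.upper())
    for start, end in repeats:
        # Clamp each interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


sequence = "ACGTACGTAC"
repeat_intervals = [(2, 6)]  # invented coordinates for illustration

soft = mask_repeats(sequence, repeat_intervals)
hard = mask_repeats(sequence, repeat_intervals, hard=True)
print(soft)
print(hard)
```

Soft masking keeps the underlying bases recoverable, which is why it is often preferred when downstream tools can be told to ignore lowercase sequence; hard masking discards the information entirely.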

RepeatMasker is the standard tool for this task. It compares the genome sequence against a library of known repeat element sequences (the Repbase or Dfam databases) and annotates each instance of a repeat with its family, class, and degree of divergence from the consensus sequence. The output reveals what fraction of the genome is repetitive and which families of transposable elements are most abundant.

RepeatMasker results are essential metadata for any genome project. They inform genome assembly (repeats cause most assembly errors), gene annotation (repeats within introns must be distinguished from exons), and evolutionary analysis (the age and activity of different transposon families).

Genome Assembly: From Reads to Chromosomes

Modern DNA sequencing instruments do not read entire chromosomes at once. Instead, they produce millions of reads: short reads of typically 100–300 bp for Illumina, or long reads of roughly 10–100 kb for PacBio and Oxford Nanopore. The computational challenge of genome assembly is reconstructing the complete chromosome sequences from these overlapping fragments.

Assembly proceeds in stages:

  1. Reads are compared to find overlaps between them
  2. Overlapping reads are merged into longer contigs (contiguous sequences)
  3. Contigs are ordered and oriented into scaffolds using additional information (mate-pair reads, Hi-C data, or optical maps), with gaps represented by runs of N characters
  4. Scaffolds are assigned to chromosomes using genetic maps or reference genomes
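
The first two stages above (finding overlaps and merging reads into contigs) can be sketched with a greedy toy assembler in Python. This is a teaching sketch under strong assumptions: error-free reads, a single strand, no repeats, and a tiny invented "genome". Production assemblers use overlap graphs or de Bruijn graphs and handle all of those complications; the function names here are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if < min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences that share the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three invented reads sampled from the toy sequence ATGGCGTACCTGA.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

If the toy sequence contained a repeat longer than the overlaps, the greedy choice could join the wrong reads, which is a miniature version of the repeat problem that makes real assembly difficult.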

The quality of an assembly is measured by the N50 statistic: the length of the shortest contig (or scaffold) such that 50% of the total assembly is contained in contigs of that length or longer. A higher N50 indicates a more contiguous assembly. For reference, the human genome assembly GRCh38 has a scaffold N50 of over 67 Mb, approaching complete chromosome arms.
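
The N50 definition is easier to follow as a calculation: sort contigs from longest to shortest and walk down the list until the running total reaches half of the assembly. The Python sketch below follows that definition; the contig lengths are hypothetical and the `n50` function is written for this example.

```python
def n50(lengths):
    """Length L such that contigs of length >= L contain at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for "at least 50%"
            return length
    return 0  # empty input


# Hypothetical contig lengths in kb; total is 300 kb.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", n50(contig_lengths), "kb")
```

Here the two longest contigs (80 + 70 = 150 kb) already cover half of the 300 kb total, so N50 is 70 kb. Note that N50 measures contiguity only; a misassembled genome can still have a high N50.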

Repetitive DNA is the primary obstacle to assembly: when a repeat is longer than the read length, the assembler cannot determine which copy of the repeat a given read belongs to. Long-read sequencing technologies (PacBio HiFi, Oxford Nanopore) have dramatically improved this situation by spanning most repeats in a single read, enabling telomere-to-telomere assemblies of complex genomes.

Comparative Genomics Across Model Organisms

Comparing genomes across eukaryotic model organisms reveals both the deep conservation of core biology and the lineage-specific innovations that make each organism unique.

The major eukaryotic model organisms and their contributions:

| Model organism | Key strengths |
|---|---|
| S. cerevisiae (yeast) | Cell cycle, DNA repair, protein secretion |
| C. elegans (worm) | Development, cell lineage, apoptosis, RNAi |
| D. melanogaster (fruit fly) | Genetics, development, neurobiology |
| D. rerio (zebrafish) | Vertebrate development, organogenesis |
| M. musculus (mouse) | Mammalian physiology, disease models |
| A. thaliana (plant) | Plant biology, photosynthesis, development |

Comparative genomics has shown that the core eukaryotic gene set — cell cycle regulators, signaling pathways, transcription factors — is remarkably conserved. About 50% of human genes have recognizable homologs in yeast, and the fraction rises to over 80% for the mouse. This conservation is why discoveries in model organisms are directly relevant to human biology and medicine.

let yeast = "ATGTCTGCCCCAGGAACTGCTGTG"
let fly = "ATGTCAGCACCGGGCACTGCAGTG"
let human = "ATGTCGGCCCCTGGCACTGCCGTG"
print("Yeast GC:  " + Seq.gc_content(yeast))
print("Fly GC:    " + Seq.gc_content(fly))
print("Human GC:  " + Seq.gc_content(human))
let gc_data = '[{"label": "Yeast", "value": 58.3}, {"label": "Fly", "value": 62.5}, {"label": "Human", "value": 70.8}]'
let gc_chart = Viz.bar(gc_data, '{"title": "GC Content Across Model Organisms (%)", "color": "#8b5cf6"}')
print(gc_chart)

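GC content varies systematically across model organisms, reflecting different mutational biases and selection pressures. The values plotted here describe short illustrative fragments rather than whole genomes, but the same calculation applies at any scale. These differences affect codon usage, gene prediction accuracy, and primer design in experimental work.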
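
GC content is just the fraction of bases that are G or C. The plain-Python sketch below shows the calculation on the same three fragments; it is independent of the lesson's Seq.gc_content function, which may report its result in a different form (for example as a fraction rather than a percentage).

```python
def gc_percent(seq):
    """Percentage of bases in a DNA sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = sum(1 for base in seq if base in ("G", "C"))
    return 100 * gc / len(seq)


fragments = {
    "Yeast": "ATGTCTGCCCCAGGAACTGCTGTG",
    "Fly": "ATGTCAGCACCGGGCACTGCAGTG",
    "Human": "ATGTCGGCCCCTGGCACTGCCGTG",
}

gc_values = {name: gc_percent(seq) for name, seq in fragments.items()}
for name, value in gc_values.items():
    print(f"{name:6s} {value:.1f}% GC")
```

With only 24 bases per fragment, a single substitution shifts the result by about four percentage points, so values from fragments this short should not be read as genome-wide GC content.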

let yeast_gene = "ATGTCTGCCCCAGGAACTGCTGTGTGA"
let human_gene = "ATGTCGGCCCCTGGCACTGCCGTGTGA"
print("Yeast codon usage:")
print(Seq.codon_usage(yeast_gene))
print("Human codon usage:")
print(Seq.codon_usage(human_gene))
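
Counting codons amounts to splitting a coding sequence into consecutive triplets. The Python sketch below does that for the same two fragments; it assumes reading frame 0 and is not meant to reproduce the exact output format of Seq.codon_usage.

```python
from collections import Counter


def codon_usage(cds):
    """Count codons in reading frame 0, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


yeast_gene = "ATGTCTGCCCCAGGAACTGCTGTGTGA"
human_gene = "ATGTCGGCCCCTGGCACTGCCGTGTGA"

yeast_codons = codon_usage(yeast_gene)
human_codons = codon_usage(human_gene)
shared = sorted(set(yeast_codons) & set(human_codons))

print("Yeast:", dict(yeast_codons))
print("Human:", dict(human_codons))
print("Codons used by both:", shared)
```

The two fragments encode the same short peptide but differ at four codons, each time by a synonymous change (for example TCT versus TCG for serine). That is the kind of difference codon usage bias describes.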

Exercise: Comparative K-mer Analysis

Use Seq.kmer_count() to compare dinucleotide usage between the yeast and human gene fragments, then align them to see how sequence-level similarity compares with composition-level similarity. Record in more_cc which fragment contains more CC dinucleotides.

let yeast = "ATGTCTGCCCCAGGAACTGCTGTGTGA"
let human = "ATGTCGGCCCCTGGCACTGCCGTGTGA"
print("Yeast dinucleotides:")
print(Seq.kmer_count(yeast, 2))
print("Human dinucleotides:")
print(Seq.kmer_count(human, 2))
let alignment = Align.global(yeast, human)
print("Alignment score: " + alignment.score)
let more_cc = "Human"
print(more_cc)
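
For readers curious about what a k-mer count involves, here is a plain-Python sketch that slides a window one base at a time, so overlapping k-mers are each counted. Overlapping counting is the usual convention, but whether Seq.kmer_count behaves identically is an assumption.

```python
from collections import Counter


def kmer_count(seq, k):
    """Count overlapping k-mers by sliding a window of width k one base at a time."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


yeast = "ATGTCTGCCCCAGGAACTGCTGTGTGA"
human = "ATGTCGGCCCCTGGCACTGCCGTGTGA"

yeast_dinucs = kmer_count(yeast, 2)
human_dinucs = kmer_count(human, 2)

print("Yeast CC:", yeast_dinucs["CC"])
print("Human CC:", human_dinucs["CC"])
print("Most common yeast dinucleotides:", yeast_dinucs.most_common(3))
```

A sequence of length n contains n - k + 1 overlapping k-mers, so each 27-base fragment yields 26 dinucleotides. A run such as CCCC contributes three CC counts, not two.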

Exercise: Identify a Conserved Gene Across Species

Align a gene fragment from yeast against two candidate homologs — one from a fly and one from a bacterium. Which shows higher similarity, indicating the closer evolutionary relationship?

let yeast = "ATGTCTGCCCCAGGAACTGCTGTG"
let fly = "ATGTCAGCACCGGGCACTGCAGTG"
let bacterium = "ATGAGCGCGCCGGAAACCGCGGTG"
let score_fly = Align.global(yeast, fly).score
let score_bact = Align.global(yeast, bacterium).score
print("Yeast vs Fly:       " + score_fly)
print("Yeast vs Bacterium: " + score_bact)
let answer = "Fly"
print(answer)

Exercise: Compare Coding Density

The human genome has ~20,000 genes in 3,200 Mb, while yeast has ~6,000 genes in 12 Mb. Calculate the approximate number of kilobases per gene for each organism to compare their coding densities.

let yeast_kb_per_gene = "2"
let human_kb_per_gene = "160"
print("Yeast: ~" + yeast_kb_per_gene + " kb per gene")
print("Human: ~" + human_kb_per_gene + " kb per gene")
print(human_kb_per_gene)

Summary

In this lesson you covered the organization and analysis of eukaryotic genomes:

  • Eukaryotes may have originated as predators — a flexible, phagocytic ancestor set the stage for endosymbiosis
  • Endosymbiosis gave eukaryotes mitochondria (from an α-proteobacterium) and chloroplasts (from a cyanobacterium)
  • Hybrid genomes — eukaryotic nuclei contain genes of both archaeal and bacterial origin due to endosymbiont gene transfer
  • Eukaryotic genomes are large — the human genome is 3.2 Gb, but only ~1.5% encodes protein
  • Introns and alternative splicing — most genes are interrupted, and differential exon joining produces ~100,000+ protein isoforms from ~20,000 genes
  • Rich regulatory DNA — promoters, enhancers, silencers, and insulators control cell-type-specific gene expression
  • The genome encodes a developmental program — cascades of transcription factors direct cell fate decisions from zygote to adult
  • Many eukaryotes are unicellular — protists display remarkable complexity in single cells
  • Yeast as minimal model — 6,000 genes across 16 chromosomes, with ~1,000 individually essential
  • Transcriptomics (microarrays, RNA-seq, scRNA-seq) monitors the expression of all genes simultaneously
  • Repetitive elements — transposons and their remnants account for ~45% of the human genome
  • The C-value paradox — genome size varies enormously (12 Mb to 149 Gb) but gene number does not
  • Gene prediction uses ab initio methods (Augustus, GenScan) and homology-based approaches, combined in annotation pipelines
  • RepeatMasker identifies and classifies transposable elements before gene annotation
  • Genome assembly reconstructs chromosomes from reads → contigs → scaffolds, measured by N50
  • Comparative genomics across model organisms (yeast, worm, fly, mouse) reveals deep conservation of core eukaryotic biology
