Biological Sequences
Understand DNA, RNA, and protein sequences — the foundational data types of bioinformatics.
What Are Biological Sequences?
At the heart of bioinformatics is a simple idea: biological information can be represented as sequences of characters. Just as a sentence is a string of letters, a gene is a string of nucleotides, and a protein is a string of amino acids.
There are three primary types of biological sequences:
| Type | Alphabet | Example |
|---|---|---|
| DNA | A, T, G, C | ATGGCTAGCAAAGAC |
| RNA | A, U, G, C | AUGGCUAGCAAAGAC |
| Protein | 20 amino acids | MASKD |
DNA Sequences
DNA (deoxyribonucleic acid) uses a four-letter alphabet: Adenine, Thymine, Guanine, and Cytosine. These bases pair in a specific way: A with T, and G with C.
Let’s start by working with a simple DNA sequence:
let dna = "ATGGCTAGCAAAGACTGA"
print("Sequence: " + dna)
print("Length: " + Seq.length(dna))
Complementary Strands
DNA is double-stranded. The complement of a sequence replaces each base with its pair: A↔T and G↔C.
let dna = "ATGGCTAGCAAAGAC"
let comp = Seq.complement(dna)
print("Original: " + dna)
print("Complement: " + comp)
The reverse complement is the complement read in reverse — this is the sequence of the opposite strand in the 5’→3’ direction.
let dna = "ATGGCTAGCAAAGAC"
let rc = Seq.reverse_complement(dna)
print("Original: " + dna)
print("Reverse complement: " + rc)
From DNA to RNA: Transcription
During transcription, DNA is copied into RNA. The key change is that thymine (T) is replaced by uracil (U).
let dna = "ATGGCTAGCAAAGAC"
let rna = Seq.transcribe(dna)
print("DNA: " + dna)
print("RNA: " + rna)
Exercise: Transcribe and Translate
Given a DNA sequence, transcribe it to RNA and then translate it to determine the protein it encodes.
let dna = "ATGGCTAGCATTTGA"
let rna = Seq.transcribe(dna)
let protein = Seq.translate(dna)
print("RNA: " + rna)
print(protein)
From RNA to Protein: Translation
The genetic code maps three-nucleotide codons to amino acids. Translation reads the mRNA in triplets, starting from a start codon (AUG) and ending at a stop codon (UAA, UAG, or UGA).
let dna = "ATGGCTAGCAAAGACTGA"
let protein = Seq.translate(dna)
print("DNA: " + dna)
print("Protein: " + protein)
GC Content
The GC content is the proportion of G and C bases in a DNA sequence. It affects DNA stability (G-C pairs have three hydrogen bonds vs. two for A-T) and is characteristic of different organisms.
let dna = "ATGGCTAGCAAAGACTGA"
let gc = Seq.gc_content(dna)
print("GC content: " + gc)
Exercise: Dinucleotide Frequencies
K-mers are subsequences of length k. Dinucleotides (k=2) reveal patterns that simple base frequencies miss. Use Seq.kmer_count() to find the most common dinucleotide in this AT-rich sequence.
let dna = "AATAAAGATAATAAATTTAA"
let kmers = Seq.kmer_count(dna, 2)
print(kmers)
let most_common = "AA"
print(most_common)
Exercise: Analyze a Sequence
Given a DNA sequence, compute its reverse complement and GC content.
let dna = "ATGGCTAGCGACTGGAAGC"
let rc = Seq.reverse_complement(dna)
print(rc)
Knowledge Check
Summary
In this unit you learned:
- DNA, RNA, and protein are all representable as character sequences
- DNA bases pair: A↔T and G↔C
- The reverse complement gives the opposite strand in 5’→3’ direction
- Transcription replaces T with U (DNA → RNA)
- Translation reads codons (triplets) to produce amino acids
- GC content measures sequence stability and varies across organisms