Sequence File Formats
Learn to read and write FASTA and FASTQ — the two most common file formats in bioinformatics.
Why File Formats Matter
Bioinformatics revolves around data. Before you can analyze a genome, align sequences, or call variants, you need to read the data. Two formats dominate sequence bioinformatics:
| Format | Use case | Stores quality? |
|---|---|---|
| FASTA | Reference genomes, protein databases, any sequence | No |
| FASTQ | Sequencing reads (Illumina, Nanopore, PacBio) | Yes |
FASTA Format
FASTA is the simplest sequence format. Each record has two parts:
- A header line starting with >, containing an identifier and optional description
- One or more lines of sequence data
>gene1 Homo sapiens beta-globin
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGG
CAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGG
>gene2 Another sequence
ATGGCTAGCAAAGAC
Let’s parse a FASTA string:
let fasta = ">seq1 Example gene\nATGGCTAGCAAA\nGACTGATCG\n>seq2 Another\nTTTGGGCCC"
let records = IO.parse_fasta(fasta)
print("Found " + records.length + " records")
print("First: " + records[0].id + " — " + records[0].sequence)
print("Second: " + records[1].id + " — " + records[1].sequence)
Key Rules
- Header lines start with >
- The first word after > is the sequence ID
- Everything else on the header line is the description
- Sequence data can span multiple lines (they are concatenated)
- Blank lines between records are allowed but not required
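To see how these rules turn into code, here is a minimal Python sketch of a FASTA parser. It is not the implementation behind IO.parse_fasta; the function name `parse_fasta` and the dictionary keys are assumptions chosen to mirror the fields used in this lesson.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of dicts (illustrative sketch only)."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are allowed
        if line.startswith(">"):
            # First word after ">" is the ID; the rest is the description.
            parts = line[1:].split(None, 1)
            current = {
                "id": parts[0] if parts else "",
                "description": parts[1] if len(parts) > 1 else "",
                "sequence": "",
            }
            records.append(current)
        elif current is not None:
            # Sequence data may span several lines; concatenate them.
            current["sequence"] += line
    return records


fasta = ">seq1 Example gene\nATGGCTAGCAAA\nGACTGATCG\n\n>seq2 Another\nTTTGGGCCC"
for record in parse_fasta(fasta):
    print(record["id"], record["description"], record["sequence"])
```

Sequence lines that appear before any header are silently skipped here; a stricter parser might raise an error instead.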
FASTQ Format
FASTQ extends FASTA by adding a quality score for each base. In the form that modern sequencers and tools produce, every record has exactly four lines:
- Header starting with @ (sequence identifier)
- Sequence (one line)
- Separator line starting with + (often just +)
- Quality string (same length as sequence)
@read1
ATGGCTAGCAAAGAC
+
IIIIIHHHGGGFFFF
The quality string uses ASCII characters to encode Phred quality scores. In the standard Phred+33 encoding, each character’s ASCII value minus 33 gives the quality score:
| Character | ASCII | Quality (Q) | Error probability |
|---|---|---|---|
| ! | 33 | 0 | 100% |
| + | 43 | 10 | 10% |
| 5 | 53 | 20 | 1% |
| ? | 63 | 30 | 0.1% |
| I | 73 | 40 | 0.01% |
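The "ASCII value minus 33" rule can be checked directly in Python. This is a small sketch that assumes the offset of 33 described above; the helper name `decode_phred33` is hypothetical.

```python
def decode_phred33(quality):
    """Convert a quality string into a list of integer Phred scores.

    Assumes the offset of 33 described in the text.
    """
    return [ord(ch) - 33 for ch in quality]


scores = decode_phred33("IIIIIHHHGGGFFFF")
print(scores)
# [40, 40, 40, 40, 40, 39, 39, 39, 38, 38, 38, 37, 37, 37, 37]
```

Each character maps to exactly one base, which is why the quality string must be the same length as the sequence.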
let fastq = "@read1\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF"
let records = IO.parse_fastq(fastq)
let read = records[0]
print("ID: " + read.id)
print("Seq: " + read.sequence)
print("Qual: " + read.quality)
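For comparison, the four-line record structure can be parsed in a few lines of Python. This sketch is not how IO.parse_fastq works internally; it assumes strict four-line records, and the name `parse_fastq` and the dictionary keys are illustrative.

```python
def parse_fastq(text):
    """Parse four-line FASTQ records into a list of dicts (sketch only)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four non-blank lines")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError("header line must start with '@'")
        if not separator.startswith("+"):
            raise ValueError("separator line must start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        fields = header[1:].split()
        records.append({
            "id": fields[0] if fields else "",
            "sequence": sequence,
            "quality": quality,
        })
    return records


read = parse_fastq("@read1\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF")[0]
print(read["id"], read["sequence"], read["quality"])
```

Grouping lines by position rather than searching for @ matters because @ is itself a valid quality character and can appear at the start of a quality line.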
Quality Score Math
A Phred score of Q means the probability of an incorrect base call is 10^(-Q/10):
| Q score | Error rate | Accuracy |
|---|---|---|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1,000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
Modern Illumina sequencing typically produces reads with Q30+ for most bases.
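The conversion between Q scores and error probabilities is short enough to try yourself. The following Python sketch applies the formula above in both directions; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for error probability p (p must be greater than 0)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

Because the scale is logarithmic, every 10-point increase in Q divides the error probability by 10.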
Comparing Formats
let fasta = ">myseq Human gene fragment\nATGGCTAGCAAA"
let fastq = "@myseq\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF"
let fa_records = IO.parse_fasta(fasta)
let fq_records = IO.parse_fastq(fastq)
print("FASTA record: " + fa_records[0].id)
print("FASTQ record: " + fq_records[0].id)
print("Both have sequence: " + fa_records[0].sequence)
print("Only FASTQ has quality: " + fq_records[0].quality)
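One way to see the relationship between the two formats is to convert FASTQ to FASTA: keep the header and sequence, drop the separator and quality lines. The Python sketch below assumes strict four-line records; the function `fastq_to_fasta` and its `width` parameter are hypothetical names for illustration.

```python
def fastq_to_fasta(fastq_text, width=60):
    """Convert four-line FASTQ records to FASTA text (sketch only).

    Quality information is discarded; sequences are wrapped at `width`
    characters per line. An incomplete trailing record is ignored.
    """
    lines = [ln.strip() for ln in fastq_text.splitlines() if ln.strip()]
    out = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        out.append(">" + header[1:])  # swap the leading "@" for ">"
        for start in range(0, len(sequence), width):
            out.append(sequence[start:start + width])
    return "\n".join(out)


print(fastq_to_fasta("@myseq\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF"))
```

The conversion only works in this direction, since a FASTA record has no quality scores to restore.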
Exercise: Parse and Analyze
Given a multi-record FASTA string, count the total number of sequences.
let fasta = ">seq1\nATGGCTAGC\n>seq2\nTTTGGGCCCAAA\n>seq3\nATGATG"
let records = IO.parse_fasta(fasta)
print(records.length)
Summary
In this unit you learned:
- FASTA stores sequences with headers (>) and carries no quality information
- FASTQ adds per-base quality scores using ASCII-encoded Phred scores
- Phred scores: Q = -10 * log10(P_error); higher Q = higher confidence
- Q30 (99.9% accuracy) is a widely used quality benchmark; modern Illumina reads typically reach Q30 or higher for most bases
- Both formats are plain text and can contain multiple records