Sequence File Formats

Beginner Genomics & Bioinformatics ~20 min


Learn to read and write FASTA and FASTQ, the two most common sequence file formats in bioinformatics.

Why File Formats Matter

Bioinformatics revolves around data. Before you can analyze a genome, align sequences, or call variants, you need to read the data. Two formats dominate sequence bioinformatics:

Format	Use case	Stores quality?
FASTA	Reference genomes, protein databases, any sequence	No
FASTQ	Sequencing reads (Illumina, Nanopore, PacBio)	Yes

FASTA Format

FASTA is the simplest sequence format. Each record has two parts:

A header line starting with >, containing an identifier and optional description
One or more lines of sequence data

>gene1 Homo sapiens beta-globin
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGG
CAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGG
>gene2 Another sequence
ATGGCTAGCAAAGAC

Let’s parse a FASTA string:

let fasta = ">seq1 Example gene\nATGGCTAGCAAA\nGACTGATCG\n>seq2 Another\nTTTGGGCCC"
let records = IO.parse_fasta(fasta)
print("Found " + records.length + " records")
print("First: " + records[0].id + " — " + records[0].sequence)
print("Second: " + records[1].id + " — " + records[1].sequence)

Key Rules

Header lines start with >
The first word after > is the sequence ID
Everything else on the header line is the description
Sequence data can span multiple lines (they are concatenated)
Blank lines between records are tolerated by most parsers but not required
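The rules above can be sketched as a small standalone parser. This is a minimal Python illustration, not the implementation behind `IO.parse_fasta`. The function name `parse_fasta` and the dictionary keys are assumptions chosen for this example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of record dicts (illustrative sketch)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # First word after ">" is the ID; the rest is the description.
            header = line[1:].split(maxsplit=1)
            records.append({
                "id": header[0] if header else "",
                "description": header[1] if len(header) > 1 else "",
                "sequence": "",
            })
        elif records:
            # Sequence lines are concatenated onto the current record.
            records[-1]["sequence"] += line
        else:
            raise ValueError("sequence data found before the first header")
    return records


example = ">seq1 Example gene\nATGGCTAGCAAA\nGACTGATCG\n>seq2 Another\nTTTGGGCCC"
records = parse_fasta(example)
for record in records:
    print(record["id"], "|", record["description"], "|", record["sequence"])
```

Building the sequence by repeated string concatenation keeps the sketch short. For genome-sized inputs, collecting the lines in a list and joining them once would likely be faster.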

FASTQ Format

FASTQ extends FASTA by adding a quality score for each base. In the form produced by modern sequencers, every record has exactly four lines:

Header starting with @ (sequence identifier)
Sequence (one line)
Separator line starting with + (often just +)
Quality string (same length as sequence)

@read1
ATGGCTAGCAAAGAC
+
IIIIIHHHGGGFFFF

The quality string uses ASCII characters to encode Phred quality scores. In the standard Phred+33 encoding (used by Sanger and by Illumina 1.8 and later), each character’s ASCII value minus 33 gives the quality score:

Character	ASCII	Quality (Q)	Error probability
`!`	33	0	100%
`+`	43	10	10%
`5`	53	20	1%
`?`	63	30	0.1%
`I`	73	40	0.01%
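The table can be reproduced with a few lines of Python. This is a minimal sketch that assumes the offset of 33 described above. `decode_quality` and `PHRED_OFFSET` are names introduced for this example only.

```python
PHRED_OFFSET = 33  # assumed Phred+33 encoding


def decode_quality(quality_string):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality_string]


# Quality string from the example record above.
scores = decode_quality("IIIIIHHHGGGFFFF")
print(scores)
```

Older Illumina pipelines used an offset of 64 instead, so the offset is kept as a named constant rather than hard-coded in the function.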

let fastq = "@read1\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF"
let records = IO.parse_fastq(fastq)
let read = records[0]
print("ID: " + read.id)
print("Seq: " + read.sequence)
print("Qual: " + read.quality)
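To show what a parser for the four-line layout has to check, here is a minimal standalone Python sketch. It is not the code behind `IO.parse_fastq`. The function name, the dictionary keys, and the error messages are assumptions made for this illustration, and it handles only records with exactly four lines.

```python
def parse_fastq(text):
    """Parse four-line FASTQ text into a list of record dicts (illustrative sketch)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input should contain a multiple of four lines")

    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator, got: {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        fields = header[1:].split(maxsplit=1)
        records.append({
            "id": fields[0] if fields else "",
            "sequence": sequence,
            "quality": quality,
        })
    return records


reads = parse_fastq("@read1\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF")
print(reads[0]["id"], reads[0]["sequence"], reads[0]["quality"])
```

Reading in fixed groups of four lines, rather than scanning for `@`, matters because `@` is itself a valid quality character (Q31 in Phred+33) and can appear at the start of a quality line.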

Quality Score Math

A Phred score of Q means the probability of an incorrect base call is 10^(-Q/10):

Q score	Error rate	Accuracy
10	1 in 10	90%
20	1 in 100	99%
30	1 in 1,000	99.9%
40	1 in 10,000	99.99%

Modern Illumina sequencing typically produces reads with Q30+ for most bases.
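The conversion between Q and error probability is easy to check numerically. The following Python sketch implements the formula above and its inverse. The function names `error_probability` and `phred_score` are chosen for this example and are not part of the tutorial's API.

```python
import math


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def phred_score(p_error):
    """Inverse of error_probability: Q = -10 * log10(P_error)."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

Because the scale is logarithmic, each increase of 10 in Q means a tenfold drop in the error probability.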

Comparing Formats

let fasta = ">myseq Human gene fragment\nATGGCTAGCAAA"
let fastq = "@myseq\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF"
let fa_records = IO.parse_fasta(fasta)
let fq_records = IO.parse_fastq(fastq)
print("FASTA record: " + fa_records[0].id)
print("FASTQ record: " + fq_records[0].id)
print("Both have sequence: " + fa_records[0].sequence)
print("Only FASTQ has quality: " + fq_records[0].quality)

Exercise: Parse and Analyze

Given a multi-record FASTA string, count the total number of sequences.

let fasta = ">seq1\nATGGCTAGC\n>seq2\nTTTGGGCCCAAA\n>seq3\nATGATG"
let records = IO.parse_fasta(fasta)
print(records.length)

Summary

In this unit you learned:

FASTA stores sequences with headers (>) — no quality information
FASTQ adds per-base quality scores using ASCII-encoded Phred scores
Phred scores: Q = -10 * log10(P_error); higher Q = higher confidence
Q30 (99.9% accuracy) is a widely used quality benchmark for modern Illumina sequencing
Both formats are plain text and can contain multiple records

Powered by

cyanea-seq cyanea-io
