Sequence File Formats

Beginner · Genomics & Bioinformatics · ~20 min

Learn to read and write FASTA and FASTQ — the two most common file formats in bioinformatics.

Why File Formats Matter

Bioinformatics revolves around data. Before you can analyze a genome, align sequences, or call variants, you need to read the data. Two formats dominate sequence bioinformatics:

Format   Use case                                            Stores quality?
FASTA    Reference genomes, protein databases, any sequence  No
FASTQ    Sequencing reads (Illumina, Nanopore, PacBio)       Yes

FASTA Format

FASTA is the simplest sequence format. Each record has two parts:

  1. A header line starting with >, containing an identifier and optional description
  2. One or more lines of sequence data

>gene1 Homo sapiens beta-globin
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGG
CAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGG
>gene2 Another sequence
ATGGCTAGCAAAGAC

Let’s parse a FASTA string:

let fasta = ">seq1 Example gene\nATGGCTAGCAAA\nGACTGATCG\n>seq2 Another\nTTTGGGCCC"
let records = IO.parse_fasta(fasta)
print("Found " + records.length + " records")
print("First: " + records[0].id + " — " + records[0].sequence)
print("Second: " + records[1].id + " — " + records[1].sequence)

Key Rules

  • Header lines start with >
  • The first word after > is the sequence ID
  • Everything else on the header line is the description
  • Sequence data can span multiple lines (they are concatenated)
  • Blank lines between records are allowed but not required
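
To make the rules above concrete, here is a minimal parser sketch in plain Python. It is an illustration only, not the implementation behind IO.parse_fasta; the names FastaRecord and parse_fasta are hypothetical, and the choice to ignore sequence lines that appear before the first header is an assumption.

```python
from typing import NamedTuple


class FastaRecord(NamedTuple):
    id: str           # first word after ">"
    description: str  # rest of the header line (may be empty)
    sequence: str     # all sequence lines joined together


def _make_record(header: str, chunks: list) -> FastaRecord:
    # Split the header into ID and description at the first whitespace.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return FastaRecord(seq_id, description, "".join(chunks))


def parse_fasta(text: str) -> list:
    records = []
    header = None  # header of the record being built, without ">"
    chunks = []    # sequence lines collected for that record
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are allowed anywhere
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # multi-line sequences are concatenated
        # Assumption: sequence text before the first header is ignored.
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


records = parse_fasta(">seq1 Example gene\nATGGCTAGCAAA\nGACTGATCG\n>seq2 Another\nTTTGGGCCC")
for rec in records:
    print(rec.id, rec.description, rec.sequence)
```

Keeping the ID and description as separate fields mirrors the header rules: downstream tools usually match on the ID alone.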

FASTQ Format

FASTQ extends FASTA by adding a quality score for each base. In the standard layout produced by current sequencers, every record has exactly four lines:

  1. Header starting with @ (sequence identifier)
  2. Sequence (one line)
  3. Separator line starting with + (often just +)
  4. Quality string (same length as sequence)

@read1
ATGGCTAGCAAAGAC
+
IIIIIHHHGGGFFFF
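
The four-line layout can be read with a few lines of plain Python. This is a sketch under the assumption that every record uses exactly four lines; it is not the implementation behind IO.parse_fastq, and FastqRecord and parse_fastq are hypothetical names.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    id: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> list:
    # Drop blank lines; quality characters never include whitespace.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        n = i // 4 + 1  # 1-based record number for error messages
        if not header.startswith("@"):
            raise ValueError(f"record {n}: header must start with '@'")
        if not sep.startswith("+"):
            raise ValueError(f"record {n}: separator must start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"record {n}: sequence and quality lengths differ")
        parts = header[1:].split(None, 1)
        records.append(FastqRecord(parts[0] if parts else "", seq, qual))
    return records


read = parse_fastq("@read1\nATGGCTAGCAAAGAC\n+\nIIIIIHHHGGGFFFF")[0]
print(read.id, len(read.sequence), len(read.quality))
```

Reading in fixed groups of four, rather than scanning for @, matters because @ is itself a valid quality character and can appear at the start of a quality line.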

The quality string encodes one Phred quality score per base as an ASCII character. In the common Phred+33 encoding used here, subtracting 33 from a character’s ASCII value gives its quality score:

Character  ASCII  Quality (Q)  Error probability
!          33     0            100%
+          43     10           10%
5          53     20           1%
?          63     30           0.1%
I          73     40           0.01%

let fastq = "@read1\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF"
let records = IO.parse_fastq(fastq)
let read = records[0]
print("ID: " + read.id)
print("Seq: " + read.sequence)
print("Qual: " + read.quality)
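
The quality string from this example can be turned into numeric scores by applying the "ASCII value minus 33" rule to each character. The sketch below uses plain Python; decode_quality and encode_quality are hypothetical helper names, not part of the lesson's IO module.

```python
PHRED_OFFSET = 33  # Phred+33 encoding: score = ASCII value - 33


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> list:
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def encode_quality(scores: list, offset: int = PHRED_OFFSET) -> str:
    """Convert integer Phred scores back into a quality string."""
    return "".join(chr(q + offset) for q in scores)


scores = decode_quality("IIIIIHHHGGGF")
print(scores)  # [40, 40, 40, 40, 40, 39, 39, 39, 38, 38, 38, 37]
print(encode_quality(scores))  # IIIIIHHHGGGF
```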

Quality Score Math

A Phred score of Q means the probability of an incorrect base call is 10^(-Q/10):

Q score  Error rate   Accuracy
10       1 in 10      90%
20       1 in 100     99%
30       1 in 1,000   99.9%
40       1 in 10,000  99.99%

Modern Illumina sequencing typically produces reads with Q30+ for most bases.
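
The conversion between Q scores and error probabilities runs in both directions. The following plain-Python sketch recomputes the rows of the table from the formula; phred_to_error and error_to_phred are hypothetical names chosen for illustration.

```python
import math


def phred_to_error(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score for an error probability p, where 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error(q)
    # Show the error rate as "1 in N" and the accuracy as a percentage.
    print(f"Q{q}: 1 in {round(1 / p):,}, accuracy {1 - p:.2%}")
```

Because the scale is logarithmic, every increase of 10 in Q divides the error probability by 10.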

Comparing Formats

let fasta = ">myseq Human gene fragment\nATGGCTAGCAAA"
let fastq = "@myseq\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF"
let fa_records = IO.parse_fasta(fasta)
let fq_records = IO.parse_fastq(fastq)
print("FASTA record: " + fa_records[0].id)
print("FASTQ record: " + fq_records[0].id)
print("Both have sequence: " + fa_records[0].sequence)
print("Only FASTQ has quality: " + fq_records[0].quality)
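
Because a FASTQ record contains everything a FASTA record needs, converting from FASTQ to FASTA only requires dropping the separator and quality lines. Here is a small plain-Python sketch, assuming four-line FASTQ records; fastq_to_fasta and the default line width of 60 are illustrative choices, not part of the lesson's IO module.

```python
def fastq_to_fasta(fastq_text: str, width: int = 60) -> str:
    """Convert four-line FASTQ text to FASTA text, discarding quality."""
    lines = [ln.strip() for ln in fastq_text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    out = []
    for i in range(0, len(lines), 4):
        header, seq = lines[i], lines[i + 1]
        out.append(">" + header[1:])  # swap the leading "@" for ">"
        # Wrap the sequence, since FASTA allows multi-line sequence data.
        out.extend(seq[j:j + width] for j in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


print(fastq_to_fasta("@myseq\nATGGCTAGCAAA\n+\nIIIIIHHHGGGF"), end="")
```

The reverse conversion is not possible without inventing quality scores, which is why FASTA output is a one-way, lossy export.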

Exercise: Parse and Analyze

Given a multi-record FASTA string, count the total number of sequences.

let fasta = ">seq1\nATGGCTAGC\n>seq2\nTTTGGGCCCAAA\n>seq3\nATGATG"
let records = IO.parse_fasta(fasta)
print(records.length)
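
As a cross-check that does not depend on a parser, the record count equals the number of header lines. This plain-Python sketch counts lines that start with >; count_fasta_records is a hypothetical helper name.

```python
def count_fasta_records(text: str) -> int:
    """Count FASTA records by counting header lines."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


fasta = ">seq1\nATGGCTAGC\n>seq2\nTTTGGGCCCAAA\n>seq3\nATGATG"
print(count_fasta_records(fasta))  # prints 3
```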

Summary

In this unit you learned:

  • FASTA stores sequences with headers (>) — no quality information
  • FASTQ adds per-base quality scores using ASCII-encoded Phred scores
  • Phred scores: Q = -10 * log10(P_error); higher Q = higher confidence
  • Q30 (99.9% accuracy) is a widely used benchmark for high-quality base calls; modern Illumina reads reach it for most bases
  • Both formats are plain text and can contain multiple records