cyanea-ml
v0.1.3 Analysis
Dimensionality reduction, clustering, and classification for biology.
Machine learning primitives for biological data — PCA, UMAP, t-SNE, k-means clustering, random forest classification, and distance metrics.
Overview
cyanea-ml provides the core machine learning algorithms used in modern bioinformatics — from dimensionality reduction (PCA, UMAP, t-SNE) through clustering (k-means) to classification (random forest) and evaluation (confusion matrix, ROC/AUC, precision-recall).
The same algorithms can run entirely in the browser via WebAssembly (WASM), making it possible to explore single-cell datasets, classify sequences, and visualize high-dimensional data without any server infrastructure.
Key Concepts
Dimensionality Reduction
High-dimensional biological data (gene expression, k-mer profiles, protein features) is usually projected into 2 or 3 dimensions for visualization. PCA finds the orthogonal linear projections that capture the most variance; t-SNE and UMAP find non-linear embeddings that preserve local neighborhood structure. UMAP is often preferred over t-SNE because it is typically faster on large datasets and tends to retain more of the global structure.
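To make "linear projection of maximum variance" concrete, here is a minimal pure-Python sketch that finds the first principal component by power iteration on the covariance matrix. It is an illustration only and not how cyanea-ml implements `pca`; the function name `first_principal_component` and the toy data are assumptions made for this example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, projections) for the top principal component."""
    n, d = len(rows), len(rows[0])
    # Center each feature (column) at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    # Assumes the start vector is not orthogonal to the top component.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all points identical; direction is arbitrary
        v = [x / norm for x in w]
    # Coordinates of each centered sample along the component.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections

# Toy data lying exactly on the line y = 2x.
points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = first_principal_component(points)
```

For this toy input the recovered direction is proportional to (1, 2), and `scores` holds each sample's one-dimensional coordinate along it. A full PCA would repeat the procedure on the residual variance to obtain further components.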
K-means Clustering
K-means partitions data into k clusters by iteratively assigning each point to its nearest centroid and then recomputing the centroids. It is fast and, for a fixed seed, deterministic, which makes it suitable for interactive exploration where you want to try different values of k.
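The assign-then-recompute loop described above can be sketched in a few lines of pure Python (Lloyd's algorithm). This is a teaching sketch under stated assumptions, not the cyanea-ml implementation: the parameter names `max_iter` and `seed`, the random-sample initialization, and the empty-cluster handling are choices made for this example.

```python
import random

def kmeans(rows, k, max_iter=100, seed=42):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)  # fixed seed gives reproducible runs
    # Initialize centroids from k distinct input rows.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = [-1] * len(rows)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(r, centroids[c])))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two well-separated toy groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
```

Because the only source of randomness is the seeded initialization, calling `kmeans` twice with the same seed returns identical labels, which is the property that makes trying several values of k repeatable.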
Random Forest Classification
A random forest builds many decision trees on random subsets of the data and features, then aggregates their votes. It handles high-dimensional data well, provides feature importance scores, and is relatively robust to overfitting compared with a single decision tree. Use confusion_matrix and roc_curve to evaluate the classifier's performance.
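As a rough illustration of what the evaluation step measures, the sketch below computes binary confusion-matrix counts and the area under the ROC curve in pure Python. The helper names `confusion_counts` and `roc_auc`, the 0.5 threshold, and the toy labels and scores are assumptions for this example; they do not reflect the signatures of cyanea-ml's `confusion_matrix` or `roc_curve`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def roc_auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    # Count pairwise wins; ties contribute one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and classifier scores (e.g. fraction of trees voting "positive").
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
auc = roc_auc(y_true, scores)
```

The confusion counts depend on the chosen threshold, while the AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.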
Code Examples
Rust
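use cyanea_ml::{pca, kmeans, random_forest_classify};
// `data`, `n_features`, and `labels` are assumed to be defined by the caller.
// Project the data onto 2 principal components.
let embedding = pca(&data, n_features, 2);
// Partition the data into k = 5 clusters.
let clusters = kmeans(&data, n_features, 5, 100, 42);
// Train a random forest classifier on the labeled data.
let model = random_forest_classify(&data, n_features, &labels, 100, 10, 42);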
Python
import cyanea
embedding = cyanea.pca(data, n_features=100, n_components=2)
clusters = cyanea.kmeans(data, n_features=100, k=5)
JavaScript (WASM)
import { pca, umap, kmeans, random_forest_classify } from '/wasm/cyanea_wasm.js';
const embedding = JSON.parse(pca(JSON.stringify(data), n_features, 2));
const clusters = JSON.parse(kmeans(JSON.stringify(data), n_features, 5, 100, 42));
Use Cases
- Single-cell RNA-seq — PCA → UMAP → k-means pipeline for cell type discovery.
- Metagenomics — Classify sequences by k-mer profile using random forests.
- Drug response — Predict cell line drug sensitivity from gene expression features.
- QC — Visualize batch effects in high-dimensional assay data.