cyanea-ml

v0.1.3 Analysis

Dimensionality reduction, clustering, and classification for biology.

Analysis layer | Apache-2.0 | 9 functions | Interactive playground

Machine learning primitives for biological data — PCA, UMAP, t-SNE, k-means clustering, random forest classification, and distance metrics.

Overview

cyanea-ml provides machine learning algorithms widely used in bioinformatics: dimensionality reduction (PCA, UMAP, t-SNE), clustering (k-means), classification (random forest), and evaluation (confusion matrix, ROC/AUC, precision-recall).

These algorithms can run entirely in the browser via WASM, so you can explore single-cell datasets, classify sequences, and visualize high-dimensional data without any server infrastructure.

Key Concepts

Dimensionality Reduction

High-dimensional biological data (gene expression, k-mer profiles, protein features) must be projected into 2 or 3 dimensions for visualization. PCA finds the orthogonal linear directions of maximum variance; t-SNE and UMAP find non-linear embeddings that preserve local neighborhood structure. UMAP is often preferred over t-SNE because it is typically faster and tends to preserve more of the global structure.
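To make the "direction of maximum variance" idea concrete, here is a minimal standard-library sketch that finds the first principal component by power iteration on the covariance matrix. It is an illustration of the technique only, not cyanea-ml's implementation; the helper names first_principal_component and project are hypothetical.

```python
import math
import random

def first_principal_component(rows, iterations=200, seed=0):
    """Return (feature means, unit vector of maximum variance) for equal-length rows."""
    n, d = len(rows), len(rows[0])
    # Center each feature so variance is measured around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, component):
    """Project each centered row onto the component (a 1-D embedding)."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]

# Toy data lying close to the line y = x.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
means, pc1 = first_principal_component(data)
scores = project(data, means, pc1)
```

Because the toy points lie near the diagonal, pc1 comes out close to the direction (1, 1) normalized to unit length, and the scores order the points along that line. Production implementations usually rely on an SVD or a full eigendecomposition instead of power iteration.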

K-means Clustering

K-means partitions data into k clusters by iteratively assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points, until the centroids stabilize or the iteration limit (max_iter) is reached. It is fast and, for a fixed seed, reproducible, which makes it suitable for interactive exploration where you want to try different values of k.
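The assign-and-recompute loop can be sketched in a few lines of standard-library Python. This is a generic version of Lloyd's algorithm under simple assumptions (centroids initialized from randomly sampled data points, squared Euclidean distance); it is not cyanea-ml's code, and the initialization strategy the library uses is not stated here.

```python
import random

def kmeans(points, k, max_iter=100, seed=42):
    """Lloyd's algorithm: returns (labels, centroids) for a list of points."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct data points chosen with the seed.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # Converged: no assignment changed.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # Leave an empty cluster's centroid where it is.
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two well-separated blobs of three points each.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels_a, _ = kmeans(points, k=2, seed=42)
labels_b, _ = kmeans(points, k=2, seed=42)
```

Running it twice with the same seed gives identical labels, which is the reproducibility property described above. A different seed may swap the cluster numbering, since the label values themselves carry no meaning.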

Random Forest Classification

A random forest builds many decision trees, each on a random subset of the data and features, then aggregates their votes. It handles high-dimensional data well, provides feature importance scores, and is less prone to overfitting than a single decision tree. The confusion_matrix and roc_curve functions evaluate the classifier’s performance.
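The two evaluation steps can be illustrated for the binary case with a short standard-library sketch. The helper names binary_confusion_counts and roc_points_and_auc are hypothetical and chosen so they are not mistaken for the library functions; the output shapes here are assumptions and need not match what cyanea-ml returns.

```python
def binary_confusion_counts(actual, predicted):
    """Count tp, fp, tn, fn for 0/1 labels (positive class = 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts

def roc_points_and_auc(scores, labels):
    """ROC points (fpr, tpr) from sweeping the threshold, plus trapezoidal AUC.

    Assumes 0/1 labels with at least one example of each class.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5
counts = binary_confusion_counts(actual, predicted)
points, auc = roc_points_and_auc(scores, actual)
```

In this toy example the 0.5 threshold yields 3 true positives, 1 false positive, 2 true negatives, and no false negatives, and the AUC is 8/9: of the nine positive-negative pairs, eight are ranked correctly.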

Code Examples

Rust

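use cyanea_ml::{pca, kmeans, random_forest_classify};

// `data`, `n_features`, and `labels` are assumed to be defined by the caller.
// Project the data onto 2 principal components.
let embedding = pca(&data, n_features, 2);
// k = 5 clusters, max_iter = 100, seed = 42.
let clusters = kmeans(&data, n_features, 5, 100, 42);
// n_trees = 100, depth = 10, seed = 42.
let model = random_forest_classify(&data, n_features, &labels, 100, 10, 42);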

Python

import cyanea

embedding = cyanea.pca(data, n_features=100, n_components=2)
clusters = cyanea.kmeans(data, n_features=100, k=5)

JavaScript (WASM)

import { pca, kmeans } from '/wasm/cyanea_wasm.js';

// Inputs and outputs cross the WASM boundary as JSON strings.
// `data` and `n_features` are assumed to be defined by the caller.
const embedding = JSON.parse(pca(JSON.stringify(data), n_features, 2));
const clusters = JSON.parse(kmeans(JSON.stringify(data), n_features, 5, 100, 42));
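Every call above passes data together with a separate n_features argument, which suggests that the samples are supplied as one flat array with n_features values per sample. That layout is an assumption inferred from the signatures, not something this page confirms. Under that assumption, a matrix could be prepared as in this sketch; the helper names flatten_row_major and unflatten are hypothetical.

```python
import json

def flatten_row_major(matrix):
    """Flatten a list of equal-length rows into one list, row after row.

    ASSUMPTION: a flat, row-major layout is what the library expects.
    """
    n_features = len(matrix[0])
    if any(len(row) != n_features for row in matrix):
        raise ValueError("all rows must have the same number of features")
    flat = [value for row in matrix for value in row]
    return flat, n_features

def unflatten(flat, n_columns):
    """Inverse of flatten_row_major: split a flat list back into rows."""
    return [flat[i:i + n_columns] for i in range(0, len(flat), n_columns)]

matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2 samples, 3 features
flat, n_features = flatten_row_major(matrix)
payload = json.dumps(flat)  # JSON string, as used across the WASM boundary
```

The same unflatten helper could regroup an embedding returned as a flat list into one row per sample, again assuming the output follows the same layout.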

Use Cases

Single-cell RNA-seq — PCA → UMAP → k-means pipeline for cell type discovery.
Metagenomics — Classify sequences by k-mer profile using random forests.
Drug response — Predict cell line drug sensitivity from gene expression features.
QC — Visualize batch effects in high-dimensional assay data.

API Surface

Function                Signature                                             Description
kmer_count              (seq: &str, k: u32) -> JSON                           Count k-mer frequencies in a sequence
euclidean_distance      (a, b: JSON) -> f64                                   Euclidean distance between two vectors
pca                     (data, n_feat, n_comp) -> JSON                        Principal component analysis
umap                    (data, n_feat, n_comp, ...) -> JSON                   UMAP dimensionality reduction
tsne                    (data, n_feat, n_comp, ...) -> JSON                   t-SNE dimensionality reduction
kmeans                  (data, n_feat, k, max_iter, seed) -> JSON             K-means clustering
random_forest_classify  (data, n_feat, labels, n_trees, depth, seed) -> JSON  Random forest classifier
confusion_matrix        (actual, predicted: JSON) -> JSON                     Compute confusion matrix from predictions
roc_curve               (scores, labels: JSON) -> JSON                        Compute ROC curve points and AUC
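As a rough illustration of what the first two entries compute, the sketch below counts overlapping k-mers in two sequences and measures the Euclidean distance between their count profiles. It uses only the Python standard library; the helper names and the choice to align profiles on a shared sorted vocabulary are assumptions for this example, and the JSON shape returned by kmer_count is not specified here.

```python
import math
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def profile_vector(counts, vocabulary):
    """Turn k-mer counts into a vector ordered by a shared vocabulary."""
    return [counts.get(kmer, 0) for kmer in vocabulary]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = kmer_counts("ACGTAC", 2)  # AC, CG, GT, TA, AC
b = kmer_counts("ACGTTT", 2)  # AC, CG, GT, TT, TT
vocab = sorted(set(a) | set(b))  # shared ordering so positions line up
dist = euclidean(profile_vector(a, vocab), profile_vector(b, vocab))
```

Here the two profiles differ by 1 in AC, 1 in TA, and 2 in TT, so the distance is the square root of 6. Vectors built this way are the kind of k-mer profile features mentioned in the metagenomics use case.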

Depends on

cyanea-core cyanea-stats

Depended on by

cyanea-wasm