CytoBridge Deep Dive: A Comprehensive Tutorial¶

Welcome to the detailed tutorial on using CytoBridge for single-cell spatiotemporal dynamical modeling. This guide walks through the entire process, from data preparation and preprocessing to PCA extraction and model training. It also offers practical guidance on selecting models and components for specific biological questions and computational constraints.

1. Data Requirements¶

CytoBridge is designed for single-cell transcriptomic data (scRNA-seq/spatial transcriptomics) with temporal annotations. Below are the supported formats and critical requirements:

Supported Input Formats¶

Format	Use Case	Required Structure
AnnData (h5ad) (Recommended)	Standard scRNA-seq/spatial data (Scanpy-native; Seurat objects need conversion to h5ad)	`adata.X`: gene expression matrix (cells × genes, raw counts or normalized values); `adata.obs`: metadata with a time column (e.g., `Time_point`, `Day`); optional: cell type annotations
CSV	Lightweight datasets	Rows = cells; columns = gene counts plus one time column (e.g., `samples`, `Time`); no missing values in the time column

Critical Annotations¶

Time Key: A column in adata.obs (or the CSV) that labels each cell with its experimental time point (e.g., categorical: Day0/Day2; numerical: 0/2.0). This is mandatory for modeling temporal dynamics.
Raw Counts (Preferred): CytoBridge’s preprocessing pipeline is optimized for raw scRNA-seq counts (normalization/log1p steps are configurable).
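Categorical time labels need an explicit numeric value before modeling. The sketch below shows one way to derive such a mapping from labels like `Day0`/`Day2`. The helper `build_time_mapping` is hypothetical (it is not part of CytoBridge) and assumes every label ends in a number.

```python
import re

def build_time_mapping(labels):
    """Map labels like 'Day0' or 'Day2.5' to floats, using the trailing number in each label."""
    mapping = {}
    for label in sorted(set(labels)):
        match = re.search(r"(\d+(?:\.\d+)?)\s*$", str(label))
        if match is None:
            raise ValueError(f"Cannot parse a time value from label {label!r}")
        mapping[label] = float(match.group(1))
    return mapping

# Hypothetical per-cell labels, as they might appear in a time column
labels = ["Day0", "Day2", "Day0", "Day5", "Day2"]
time_mapping = build_time_mapping(labels)
print(time_mapping)
```

Labels without a trailing number raise an error instead of being silently assigned an arbitrary order, which is the failure mode an explicit mapping is meant to avoid.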

2. Quick Start¶

Get up and running in 5 minutes with a minimal example (AnnData input):¶

import scanpy as sc
import CytoBridge as cb

# 1. Load data
adata = sc.read_h5ad("raw_scrna_data.h5ad")

# 2. Preprocess (normalize + HVG + PCA)
time_mapping = {"Day0": 0.0, "Day2": 2.0, "Day5": 5.0}  # Map time points to numbers
n_pc_adata = cb.pp.preprocess(
    adata=adata,
    time_key="Time_point",
    time_mapping=time_mapping,
    normalization=True,
    log1p=True,
    select_hvg=True,
    dim_reduction="pca",
    n_pcs=50
)

# 3. Train model (Velocity + Growth + Score)
cb.tl.fit(
    adata=n_pc_adata,
    config="ruot",          # Velocity + Growth + Score
    device="cuda",          # Use "cpu" if no GPU
)

# 4. Save results
n_pc_adata.write_h5ad("trained_model.h5ad")

3. Full Workflow Walkthrough¶

3.1 Preprocessing¶

Preprocessing reduces noise and extracts biologically meaningful features for dynamical modeling. The cb.pp.preprocess function encapsulates a standardized workflow—tune parameters based on your data type.

Key Preprocessing Steps & Parameters¶

Step	Purpose	Parameter	When to Use
Normalization	Scale counts to correct for sequencing depth differences	`normalization=True`	Enable for raw counts; disable for pre-normalized data
Log1p Transformation	Stabilize variance in gene expression	`log1p=True`	Enable for raw counts; disable for already log-transformed data
HVG Selection	Focus on biologically variable genes (reduce noise)	`select_hvg=True`, `n_top_genes=2000`	Enable for scRNA-seq; disable for sparse spatial transcriptomics
Time Mapping	Convert categorical time labels to numbers	`time_mapping={"Day0":0.0}`	Always define explicitly (avoids arbitrary time ordering)
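To make the first two steps concrete, here is a minimal NumPy sketch of depth normalization followed by log1p. It illustrates the general technique rather than CytoBridge's internal implementation; the function name `normalize_log1p` and the target of 10,000 counts per cell are assumptions chosen for illustration.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to the same total count, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # keep all-zero cells at zero instead of dividing by zero
    scaled = counts / totals * target_sum
    return np.log1p(scaled)

raw = np.array([[10, 0, 90],      # shallow cell: 100 counts in total
                [100, 0, 900]])   # deep cell: 1000 counts, same composition
processed = normalize_log1p(raw)
print(np.allclose(processed[0], processed[1]))  # True
```

Two cells with the same expression profile but different sequencing depth end up with identical values, which is why normalization should be disabled for data that has already been normalized.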

Full Preprocessing Code¶

import scanpy as sc
import CytoBridge as cb

# Load data
adata = sc.read_h5ad("raw_scrna_data.h5ad")

# Define time mapping (critical for temporal dynamics)
time_mapping = {"Day0": 0.0, "Day2": 2.0, "Day5": 5.0}

# Run preprocessing
n_pc_adata = cb.pp.preprocess(
    adata=adata,
    time_key="Time_point",          # Column with time labels
    time_mapping=time_mapping,      # Explicit time mapping (avoid auto-mapping)
    normalization=True,             # Raw counts → True; normalized → False
    log1p=True,                     # Raw counts → True; log-transformed → False
    select_hvg=True,                # scRNA-seq → True; spatial → False
    n_top_genes=2000,               # Adjust for data size (2000 for small datasets)
    dim_reduction="pca",            # Use PCA for modeling
    n_pcs=50,                       # 30-50 for scRNA-seq; 20-30 for spatial data
)

# Verify preprocessing output
print(f"Preprocessed PCA data shape: {n_pc_adata.shape}")  # (n_cells, n_pcs)

# The representation used for training is stored in obsm["X_latent"]:
#   dim_reduction="none" → X_latent holds the original (preprocessed) data
#   dim_reduction="pca"  → X_latent holds the PCA coordinates
print(f"Latent space: {n_pc_adata.obsm['X_latent'].shape}")
print(f"Processed time points: {n_pc_adata.obs['time_point_processed'].unique()}")

3.2 PCA Extraction¶

PCA is the recommended linear dimension reduction for CytoBridge: it preserves global dynamical structure, whereas UMAP/t-SNE distort global distances and should be used for visualization only.

Why PCA?¶

Retains linear relationships between cell states (critical for modeling velocity/growth).
Reduces computational cost of training (50 PCs vs. 20k genes).
Minimizes noise while preserving biological signal.
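The points above can be illustrated with a small NumPy sketch of PCA via singular value decomposition. This is a generic illustration on synthetic data, not the routine CytoBridge calls internally; `pca_project` is a hypothetical helper name.

```python
import numpy as np

def pca_project(X, n_pcs):
    """Project rows of X onto the top n_pcs principal components via SVD."""
    X_centered = X - X.mean(axis=0, keepdims=True)
    # Rows of Vt are principal axes, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)   # fraction of variance per component
    return X_centered @ Vt[:n_pcs].T, explained[:n_pcs]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))   # 200 cells x 30 genes (synthetic)
X[:, 0] *= 10.0                  # one dominant axis of variation
X_latent, explained = pca_project(X, n_pcs=5)
print(X_latent.shape)            # (200, 5)
print(explained.round(3))
```

Because the projection is linear, differences between cells in the latent space remain linear combinations of gene-level differences, which is what makes a velocity in PCA space interpretable.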

3.3 Model Training¶

Use the preprocessed PCA data (n_pc_adata) to train a CytoBridge model. Choose a config (component combination) and mode (training method) based on your biological question.

Core Training Parameters¶

Parameter	Description	Options
`config`	Defines model components (velocity/growth/score/interaction)	`dynamical_ot` (velocity only); `unbalanced_ot` (velocity + growth); `ruot` (velocity + growth + score); `crufm` (velocity + growth + score, trained with flow matching); `cytobridge` (all components)
`device`	Compute device	`cuda` (GPU, fast); `cpu` (slow, no GPU)
`mode`	Training algorithm	`flow_matching` (fast); `neural_ode` (comprehensive)

Training Examples¶

Example 1: Stem Cell Differentiation (Velocity + Growth + Score)¶

Biological context: Proliferating stem cells with stochastic fate decisions.

# Train RUOT model (Velocity + Growth + Score)
cb.tl.fit(
    adata=n_pc_adata,
    # Match biology: velocity (differentiation) + growth (proliferation) + score (stochasticity)
    # Section 4.3 lists neural_ode + flow_matching as the training mode for this config
    config="ruot",
    device="cuda",
)

# Inspect outputs
print(f"Latent velocity: {n_pc_adata.obsm['velocity_latent'].shape}")
print(f"Growth rates: {n_pc_adata.obsm['growth_rate'].shape}")
print(f"Stochastic score: {n_pc_adata.obsm['score_latent'].shape}")

Example 2: Large-Scale Steady-State Tissue (Velocity Only)¶

Biological context: Adult liver tissue (stable cell numbers, minimal noise).

# Train Dynamical OT model (Velocity only)
cb.tl.fit(
    adata=n_pc_adata,
    # Match biology: velocity only (stable cell numbers, minimal noise)
    # Section 4.3 lists neural_ode as the training mode for this config
    config="dynamical_ot",
    device="cuda",
)

# Inspect outputs
print(f"Latent velocity: {n_pc_adata.obsm['velocity_latent'].shape}")

4. Model Selection Guide¶

4.1 Component Selection (Biology-Driven)¶

Components map to biological phenomena—velocity is mandatory; add others based on your system.

Component	Biological Meaning	When to Include
Velocity (Mandatory)	Instantaneous direction of cell state transitions (e.g., differentiation/activation)	Always (core of trajectory inference)
Growth	Proliferation/apoptosis rates (changes in cell population size over time)	Developmental expansion; tumor proliferation
Score	Stochasticity in cell state transitions (biological noise)	Stem cell fate plasticity
Interaction	Cell-cell communication effects	Tumor microenvironment (immune-cancer crosstalk)

4.2 Training Mode Selection (Computation-Driven)¶

Choose between flow_matching (speed) and neural_ode (flexibility):

Training Mode	Speed	Flexibility	Best For	Limitations
`flow_matching`	⚡ Fast (simulation-free)	❌ Low (simple loss functions only)	Large datasets; simple dynamics (velocity + growth + score)	Struggles with complex dynamics (interaction)
`neural_ode`	🐢 Slow (numerical ODE solving)	✅ High (all components/losses)	Small/medium datasets; complex dynamics (velocity + growth + score + interaction)	Computationally expensive for large data
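"Simulation-free" means the velocity network is trained by regression on interpolated points, with no ODE solve inside the training loop. The sketch below shows the standard conditional flow matching objective with straight-line paths. It is a generic illustration, not CytoBridge's training code: the function names are hypothetical, and the pairing of `x0` with `x1` is assumed to be given (in practice it would come from a coupling such as optimal transport).

```python
import numpy as np

def flow_matching_batch(x0, x1, rng):
    """Sample training triples (t, x_t, target velocity) along straight paths x0 -> x1."""
    t = rng.uniform(size=(x0.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1    # point on the straight path at time t
    u_t = x1 - x0                    # constant velocity of that path
    return t, x_t, u_t

def flow_matching_loss(velocity_fn, t, x_t, u_t):
    """Mean squared error between predicted and target velocities (no ODE solve needed)."""
    return float(np.mean(np.sum((velocity_fn(t, x_t) - u_t) ** 2, axis=1)))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(128, 2))            # cells at the earlier time point (synthetic)
x1 = x0 + np.array([3.0, 0.0])            # later time point: a pure shift along axis 0
t, x_t, u_t = flow_matching_batch(x0, x1, rng)

true_field = lambda t, x: np.tile([3.0, 0.0], (x.shape[0], 1))
zero_field = lambda t, x: np.zeros_like(x)
loss_true = flow_matching_loss(true_field, t, x_t, u_t)   # close to 0, up to rounding
loss_zero = flow_matching_loss(zero_field, t, x_t, u_t)   # close to 9 (= 3.0 squared)
print(loss_true, loss_zero)
```

A neural ODE approach would instead integrate the velocity field from one time point to the next and compare the simulated cells with the observed ones, which is slower but lets losses act on the full simulated trajectory.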

4.3 Model Selection Cheat Sheet¶

Biological Scenario	Components	Training Mode	CytoBridge Config
State transitions only	Velocity only	`neural_ode`	`dynamical_ot`
Proliferation (no stochasticity)	Velocity + Growth	`neural_ode`	`unbalanced_ot`
Proliferation + Stochasticity	Velocity + Growth + Score	`neural_ode` + `flow_matching`	`ruot`
Proliferation + Stochasticity	Velocity + Growth + Score	`flow_matching`	`crufm`
Proliferation + Stochasticity + Cell-cell communication	Velocity + Growth + Score + Interaction	`neural_ode`	`cytobridge`
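The cheat sheet can be encoded as a small lookup. `choose_config` below is a hypothetical convenience function, not part of CytoBridge; it only returns the config name and training mode listed in the table, and the `prefer_fast` flag is an assumption used to pick between the two velocity + growth + score configs.

```python
def choose_config(growth=False, score=False, interaction=False, prefer_fast=False):
    """Return (config, training_mode) following the cheat sheet in Section 4.3."""
    if interaction:
        # `cytobridge` bundles all components, including growth and score
        return "cytobridge", "neural_ode"
    if score:
        # The table has no score-without-growth row; both options include growth
        if prefer_fast:
            return "crufm", "flow_matching"
        return "ruot", "neural_ode + flow_matching"
    if growth:
        return "unbalanced_ot", "neural_ode"
    return "dynamical_ot", "neural_ode"

# Proliferating, stochastic system on a large dataset
config, mode = choose_config(growth=True, score=True, prefer_fast=True)
print(config, mode)  # crufm flow_matching
```

The returned config string is what would be passed as `config=` to `cb.tl.fit`.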

5. Optional Loss Functions¶

Loss functions in CytoBridge quantify the mismatch between model predictions and biological data. The choice of loss depends on your model components (e.g., growth requires mass-aware losses) and biological goals (e.g., stochasticity requires score matching). Below is a breakdown of all supported losses.

5.1 Overview¶

Each loss addresses a specific biological or computational challenge. Here’s a high-level mapping to model components:

Loss Type	Target Challenge	Essential model components
OT Loss	Align cell state distributions across time	Velocity (all OT-based models)
Mass Loss	Enforce consistent cell population sizes (proliferation/apoptosis)	Velocity + Growth
Score Matching Loss	Model stochasticity in cell state transitions	Velocity + Score
Density Loss	Align local density of cell states	Velocity (neural ODE mode)
PINN Loss	Enforce physical constraints (e.g., continuity of cell density)	Velocity (neural ODE mode)

5.2 OT Loss (Optimal Transport Loss)¶

Core Function¶

Measures the “cost” of transporting cell states from one time point to another—critical for aligning distributions in velocity-based models. Supports both balanced (equal cell counts) and unbalanced (proliferation/apoptosis) scenarios.

Method Options & Use Cases¶

The method parameter in calc_ot_loss dictates how transport costs are computed. Choose based on data size and whether you need gradient flow (for training).

Method	Balanced/Unbalanced	Gradient Support	Speed
`emd_detach`	Balanced	❌ No (weights detached)	Slow
`sinkhorn`	Balanced	✅ Yes	Fast
`sinkhorn_detach`	Balanced	❌ No	Fast
`sinkhorn_knopp_unbalanced`	Unbalanced	✅ Yes	Fast
`sinkhorn_knopp_unbalanced_detach`	Unbalanced	❌ No	Fast

Key Parameters¶

reg (Sinkhorn methods): Regularization strength (default: 0.1). Increase for noisy data; decrease for sharper distributions.
reg_m (Unbalanced Sinkhorn): Controls mass imbalance tolerance (default: 0.01). Larger values allow more extreme proliferation/apoptosis.
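To show what a Sinkhorn-type method computes, here is a plain NumPy sketch of balanced entropy-regularized optimal transport between two point clouds. It is not CytoBridge's `calc_ot_loss`: the function name, the rescaling of the cost matrix, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def sinkhorn_cost(x, y, reg=0.1, n_iters=500):
    """Entropy-regularized OT cost between two uniformly weighted point clouds (balanced case)."""
    a = np.full(x.shape[0], 1.0 / x.shape[0])   # uniform source weights
    b = np.full(y.shape[0], 1.0 / y.shape[0])   # uniform target weights
    M = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # squared Euclidean costs
    K = np.exp(-(M / M.max()) / reg)            # Gibbs kernel on the rescaled cost
    u = np.ones_like(a)
    for _ in range(n_iters):                    # alternately rescale to match both marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]          # transport plan (rows: source, cols: target)
    return float(np.sum(plan * M)), plan

rng = np.random.default_rng(0)
cells_t0 = rng.normal(size=(50, 2))                   # synthetic cells at time 0
cells_t1 = cells_t0 + np.array([4.0, 0.0])            # same cloud shifted along axis 0
cost_same, _ = sinkhorn_cost(cells_t0, cells_t0)
cost_shift, plan = sinkhorn_cost(cells_t0, cells_t1)
print(cost_same < cost_shift)  # True
```

A larger `reg` blurs the plan, spreading each cell's mass over more targets, while a smaller `reg` gives a sharper plan but needs more iterations and is more prone to numerical underflow.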

5.3 Mass Loss¶

Core Function¶

Ensures the model preserves local cell population sizes—critical for models with the Growth component (proliferation/apoptosis). It matches the mass (cell count) of predicted cells to real cells in local neighborhoods.

Use Cases¶

Models with Growth (`unbalanced_ot`, `ruot`, `cytobridge`).
Datasets where cell proliferation/apoptosis is heterogeneous (e.g., tumor cores vs. edges).

Key Parameters¶

relative_mass: The expected cell count at the target time point relative to the reference time point (e.g., 2.0 if the cell count doubles between time points, 0.5 if it halves).
global_mass: If True, adds a global mass constraint (ensures total cell count matches); use for datasets with global proliferation.
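As a rough illustration of how a relative mass relates to a growth rate, the sketch below computes mass ratios from per-time-point cell counts and converts a ratio into a constant net growth rate. Both helpers are hypothetical, and the sketch assumes relative mass is measured as target count over initial count; check the convention used by your CytoBridge version. Observed cell counts also depend on sampling, so they are only a proxy for true population size.

```python
import math

def relative_mass(cell_counts):
    """Cell count at each time point divided by the count at the earliest time point."""
    times = sorted(cell_counts)
    n0 = cell_counts[times[0]]
    return {t: cell_counts[t] / n0 for t in times}

def mean_growth_rate(mass_ratio, dt):
    """Constant net growth rate g satisfying exp(g * dt) == mass_ratio."""
    return math.log(mass_ratio) / dt

counts = {0.0: 1000, 2.0: 2000, 5.0: 1500}   # hypothetical cells per time point
ratios = relative_mass(counts)
print(ratios)                                 # {0.0: 1.0, 2.0: 2.0, 5.0: 1.5}

g_0_to_2 = mean_growth_rate(ratios[2.0] / ratios[0.0], dt=2.0)
print(round(g_0_to_2, 4))                     # 0.3466
```

A positive rate corresponds to net proliferation and a negative rate to net cell loss over the interval.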

5.4 Score Matching Loss¶

Core Function¶

Models stochasticity in cell state transitions. Required for models with the Score component (`ruot`, `crufm`, `cytobridge`). It trains the model to predict the “score” (gradient of the log density) of noisy cell states, capturing biological noise.

Use Cases¶

Models with Score (`ruot`, `crufm`, `cytobridge`).
Datasets with stochastic cell fate decisions (e.g., stem cell differentiation, iPSC reprogramming).

Key Parameters¶

sigma: Noise level added to cell states (default: 0.1). Match to the noise in your data (e.g., higher for heterogeneous datasets).
lambda_penalty: Penalty for positive log-probability (default: 1.0). Prevents unphysical positive scores.
model: Trained DynamicalModel object (must include a score_net).
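The role of `sigma` is easiest to see in the generic denoising score matching objective: perturb each cell with Gaussian noise of scale `sigma`, then regress the predicted score onto the direction pointing back to the clean cell. The NumPy sketch below illustrates that objective only; it omits `lambda_penalty` and does not reproduce CytoBridge's loss. The function `dsm_loss` and the synthetic data are assumptions.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Denoising score matching: regress score_fn(x_noisy) onto -(x_noisy - x) / sigma**2."""
    noise = rng.normal(size=x.shape)
    x_noisy = x + sigma * noise
    target = -noise / sigma            # equals -(x_noisy - x) / sigma**2
    residual = score_fn(x_noisy) - target
    return float(np.mean(np.sum(residual ** 2, axis=1)))

rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 2))         # synthetic "cells" drawn from N(0, I)
sigma = 0.5

# Noisy cells follow N(0, (1 + sigma^2) I), whose score is -z / (1 + sigma^2)
true_score = lambda z: -z / (1.0 + sigma ** 2)
wrong_score = lambda z: np.zeros_like(z)

# Same seed for both calls so the two candidates see identical noise
loss_true = dsm_loss(true_score, x, sigma, np.random.default_rng(1))
loss_wrong = dsm_loss(wrong_score, x, sigma, np.random.default_rng(1))
print(loss_true < loss_wrong)  # True
```

The loss does not reach zero even for the correct score, because the regression target is itself noisy; only the ranking between candidate score functions is meaningful.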

5.5 Density Loss¶

Core Function¶

Aligns the local density of predicted cell states with real cells—useful for scenarios where cell state density carries biological meaning (e.g., dense cell clusters representing cell types).

Use Cases¶

Models where cell clustering is important (e.g., identifying stable cell types in differentiation trajectories).
Spatial transcriptomics data (where local cell density correlates with tissue structure).

Key Parameters¶

hinge_value: Minimum distance threshold for density matching (default: 0.01). Values below this are ignored (reduces noise).
top_k: Number of nearest neighbors to consider for local density (default: 5).
groups: Optional cell groups (e.g., cell types) to compute density loss per group.
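One common way to build a loss from `top_k` and `hinge_value` is a hinge penalty on nearest-neighbor distances: each predicted cell is penalized for being far from its closest real cells, and distances below the hinge are ignored. The sketch below follows that formulation as an assumption; CytoBridge's exact definition may differ, and `density_loss` here is a hypothetical standalone function.

```python
import numpy as np

def density_loss(pred, real, top_k=5, hinge_value=0.01):
    """Mean hinge penalty on distances from each predicted cell to its top_k nearest real cells."""
    dists = np.linalg.norm(pred[:, None, :] - real[None, :, :], axis=-1)  # (n_pred, n_real)
    nearest = np.sort(dists, axis=1)[:, :top_k]       # top_k smallest distances per predicted cell
    return float(np.mean(np.maximum(nearest - hinge_value, 0.0)))

rng = np.random.default_rng(0)
real = rng.normal(size=(300, 2))                               # observed cells (synthetic)
on_manifold = real[:100] + 0.001 * rng.normal(size=(100, 2))   # predictions close to real cells
off_manifold = real[:100] + np.array([6.0, 6.0])               # predictions far from the data

loss_on = density_loss(on_manifold, real)
loss_off = density_loss(off_manifold, real)
print(loss_on < loss_off)  # True
```

To compute the loss per group, the same function could be applied to each cell group separately and the results averaged.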

5.6 PINN Loss (Physics-Informed Neural Network Loss)¶

Core Function¶

Enforces physical constraints on cell dynamics—specifically the continuity equation (conservation of cell density over time). Ideal for models trained with neural_ode mode that require mechanistic rigor.

Use Cases¶

Models with Velocity, Growth, or Interaction (cytobridge).
Mechanistic studies where dynamics must follow physical laws (e.g., tissue growth, cell migration).

Key Parameters¶

sigma: Noise level for density estimation (default: 0.1).
use_mass: If True, includes the Growth component in the continuity equation.
use_interaction: If True, includes the Interaction component (cell-cell communication) in the equation.
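For reference, a continuity equation with a growth term is commonly written as below. The symbols are assumptions introduced here for illustration: $\rho(x,t)$ is the cell density, $v(x,t)$ the velocity field, and $g(x,t)$ the net growth rate. Setting $g = 0$ corresponds to `use_mass=False`; the interaction term is omitted because its form is not specified in this guide.

```latex
\frac{\partial \rho(x,t)}{\partial t}
  + \nabla \cdot \big( \rho(x,t)\, v(x,t) \big)
  = g(x,t)\, \rho(x,t)
```

A physics-informed loss of this kind typically penalizes the squared residual between the two sides of the equation, evaluated at sampled points $(x, t)$.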

Loss Selection Cheat Sheet¶

Model Config	Required Losses	Optional Losses	Training Mode
`dynamical_ot` (Velocity only)	`OT Loss`	`Density Loss`	`neural_ode`
`unbalanced_ot` (Velocity + Growth)	`OT Loss` + `Mass Loss`	`Density Loss`	`neural_ode`
`ruot` (Velocity + Growth + Score)	`OT Loss` + `Mass Loss` + `Score Matching Loss`	`PINN Loss` + `Density Loss`	`neural_ode` + `flow_matching`
`crufm` (Velocity + Growth + Score)	`Score Matching Loss`		`flow_matching`
`cytobridge` (All components)	`OT Loss` + `Mass Loss` + `Score Matching Loss`	`Density Loss` + `PINN Loss`	`neural_ode`