CytoBridge Deep Dive: A Comprehensive Tutorial

Welcome to the detailed tutorial on using CytoBridge for single-cell spatiotemporal dynamical modeling. This guide is designed to provide an in-depth walkthrough of the entire process, from data preparation and preprocessing to PCA extraction and model training. Additionally, it offers practical insights into selecting the most suitable models and components based on specific biological questions and computational limitations.

1. Data Requirements

CytoBridge is designed for single-cell transcriptomic data (scRNA-seq/spatial transcriptomics) with temporal annotations. Below are the supported formats and critical requirements:

Supported Input Formats

| Format | Use Case | Required Structure |
| --- | --- | --- |
| AnnData (h5ad) (Recommended) | Standard scRNA-seq/spatial data (compatible with Scanpy/Seurat) | adata.X: gene expression matrix (cells × genes, raw counts or normalized values); adata.obs: metadata with a time column (e.g., Time_point, Day); optional: cell type annotations |
| CSV | Lightweight datasets | Rows = cells; columns = gene counts + 1 time column (e.g., samples, Time); no missing values in the time column |
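
As a rough illustration of the CSV layout described above, the sketch below reads a toy table with pandas, checks the time column for missing values, and separates the gene counts from the time labels. The column names (cell_id, GeneA, GeneB, Time) are made up for the example, and this is not CytoBridge's own loader.

```python
import io
import pandas as pd

# Toy CSV: two gene columns plus one time column (all names are hypothetical).
csv_text = """cell_id,GeneA,GeneB,Time
c1,5,0,Day0
c2,3,1,Day0
c3,0,7,Day2
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="cell_id")

# The time column must be complete before any temporal modeling.
if df["Time"].isna().any():
    raise ValueError("Time column contains missing values")

counts = df.drop(columns=["Time"])   # cells x genes
times = df["Time"]                   # one time label per cell

print(counts.shape)
print(times.tolist())
```

The two pieces map naturally onto an AnnData object (counts as X, times as a column of obs) if you prefer the recommended format.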

Critical Annotations

  • Time Key: A column in adata.obs (or CSV) that labels each cell with its experimental time point (e.g., categorical: Day0/Day2; numerical: 0/2.0).

    This is mandatory for modeling temporal dynamics.

  • Raw Counts (Preferred): CytoBridge’s preprocessing pipeline is optimized for raw scRNA-seq counts (normalization/log1p steps are configurable).
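
Because the time key is mandatory, it can help to confirm that every label has a numeric value before preprocessing. The helper below is a hypothetical stand-alone check written for this tutorial, not part of the CytoBridge API.

```python
def check_time_mapping(labels, time_mapping):
    """Return the sorted numeric times; raise if any label has no mapping."""
    missing = sorted(set(labels) - set(time_mapping))
    if missing:
        raise KeyError(f"No numeric time defined for labels: {missing}")
    return sorted({float(time_mapping[label]) for label in labels})

labels = ["Day0", "Day2", "Day0", "Day5"]
time_mapping = {"Day0": 0.0, "Day2": 2.0, "Day5": 5.0}
print(check_time_mapping(labels, time_mapping))  # [0.0, 2.0, 5.0]
```

With an AnnData object, the labels would typically come from the time column of adata.obs.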

2. Quick Start

Get up and running in 5 minutes with a minimal example (AnnData input):

import scanpy as sc
import CytoBridge as cb

# 1. Load data
adata = sc.read_h5ad("raw_scrna_data.h5ad")

# 2. Preprocess (normalize + HVG + PCA)
time_mapping = {"Day0": 0.0, "Day2": 2.0, "Day5": 5.0}  # Map time points to numbers
n_pc_adata = cb.pp.preprocess(
    adata=adata,
    time_key="Time_point",
    time_mapping=time_mapping,
    normalization=True,
    log1p=True,
    select_hvg=True,
    dim_reduction="pca",
    n_pcs=50
)

# 3. Train model (Velocity + Growth + Score)
cb.tl.fit(
    adata=n_pc_adata,
    config="ruot",          # Velocity + Growth + Score
    device="cuda",          # Use "cpu" if no GPU
)

# 4. Save results
n_pc_adata.write_h5ad("trained_model.h5ad")

3. Full Workflow Walkthrough

3.1 Preprocessing

Preprocessing reduces noise and extracts biologically meaningful features for dynamical modeling. The cb.pp.preprocess function encapsulates a standardized workflow—tune parameters based on your data type.

Key Preprocessing Steps & Parameters

| Step | Purpose | Parameter | When to Use |
| --- | --- | --- | --- |
| Normalization | Scale counts to correct for sequencing depth differences | normalization=True | Enable for raw counts; disable for pre-normalized data |
| Log1p Transformation | Stabilize variance in gene expression | log1p=True | Enable for raw counts; disable for already log-transformed data |
| HVG Selection | Focus on biologically variable genes (reduce noise) | select_hvg=True, n_top_genes=2000 | Enable for scRNA-seq; disable for sparse spatial transcriptomics |
| Time Mapping | Convert categorical time labels to numbers | time_mapping={"Day0": 0.0} | Always define explicitly (avoids arbitrary time ordering) |
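
To make the first two steps concrete, here is a minimal NumPy sketch of library-size normalization followed by log1p. It assumes the common convention of rescaling each cell to a fixed total (target_sum=1e4 is an assumption), and it is not taken from the CytoBridge source.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell to the same total count, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    depth[depth == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / depth * target_sum)

# Two cells with the same composition but a 10x difference in sequencing depth.
raw = np.array([[10, 0, 90],
                [1, 0, 9]])
norm = normalize_log1p(raw)
print(np.allclose(norm[0], norm[1]))  # True: the depth difference is removed
```

This is why normalization should be disabled for data that has already been normalized: applying it twice changes nothing useful, while applying log1p twice distorts the values.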

Full Preprocessing Code

import scanpy as sc
import CytoBridge as cb

# Load data
adata = sc.read_h5ad("raw_scrna_data.h5ad")

# Define time mapping (critical for temporal dynamics)
time_mapping = {"Day0": 0.0, "Day2": 2.0, "Day5": 5.0}

# Run preprocessing
n_pc_adata = cb.pp.preprocess(
    adata=adata,
    time_key="Time_point",          # Column with time labels
    time_mapping=time_mapping,      # Explicit time mapping (avoid auto-mapping)
    normalization=True,             # Raw counts → True; normalized → False
    log1p=True,                     # Raw counts → True; log-transformed → False
    select_hvg=True,                # scRNA-seq → True; spatial → False
    n_top_genes=2000,               # Adjust for data size (2000 for small datasets)
    dim_reduction="pca",            # Use PCA for modeling
    n_pcs=50                        # 30-50 for scRNA-seq; 20-30 for spatial data
)

# Verify preprocessing output
print(f"Preprocessed PCA data shape: {n_pc_adata.shape}")  # (n_cells, n_pcs)

# The training input is stored in obsm['X_latent']:
#   dim_reduction="none" -> the model trains on the original expression data
#   dim_reduction="pca"  -> the model trains on the PCA coordinates
print(f"Latent space: {n_pc_adata.obsm['X_latent'].shape}")

print(f"Processed time points: {n_pc_adata.obs['time_point_processed'].unique()}")

3.2 PCA Extraction

PCA is the recommended linear dimension reduction in CytoBridge: it preserves the global structure that the dynamics depend on, whereas UMAP and t-SNE are intended for visualization only.

Why PCA?

  • Retains linear relationships between cell states (critical for modeling velocity/growth).

  • Reduces computational cost of training (50 PCs vs. 20k genes).

  • Minimizes noise while preserving biological signal.
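
The points above can be illustrated with a small PCA written directly in NumPy via the singular value decomposition. This is a sketch on synthetic data that shows the linear projection and the reduction in dimensionality; it is not the routine CytoBridge calls internally.

```python
import numpy as np

def pca_project(X, n_pcs):
    """Project centered data onto its top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_pcs] * S[:n_pcs]            # equals Xc @ Vt[:n_pcs].T
    explained = (S ** 2) / (S ** 2).sum()        # variance ratio per component
    return scores, explained[:n_pcs]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))                  # 200 cells x 500 genes (synthetic)
latent, ratio = pca_project(X, n_pcs=50)
print(latent.shape)                              # (200, 50)
print(round(float(ratio.sum()), 3))              # fraction of variance retained
```

Because the projection is a single matrix product, differences between cell states in gene space map linearly into the latent space, which is the property the velocity and growth components rely on.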

3.3 Model Training

Use the preprocessed PCA data (n_pc_adata) to train a CytoBridge model. Choose a config (component combination) and mode (training method) based on your biological question.

Core Training Parameters

| Parameter | Description | Options |
| --- | --- | --- |
| config | Defines model components (velocity/growth/score/interaction) | dynamical_ot (velocity only); unbalanced_ot (velocity + growth); ruot (velocity + growth + score); crufm (velocity + growth + score); cytobridge (all components) |
| device | Compute device | cuda (GPU, fast); cpu (slow, no GPU) |
| mode | Training algorithm | flow_matching (fast); neural_ode (comprehensive) |

Training Examples

Example 1: Stem Cell Differentiation (Velocity + Growth)

Biological context: Proliferating stem cells with stochastic fate decisions.

# Train Unbalanced OT model (Velocity + Growth); this config corresponds to mode="neural_ode"
cb.tl.fit(
    adata=n_pc_adata,
    config="unbalanced_ot",   # Match biology: velocity (differentiation) + growth (proliferation)
    device="cuda",
)
# To also model the stochastic fate decisions, use config="ruot" (adds the score component)

# Inspect outputs
print(f"Latent velocity: {n_pc_adata.obsm['velocity_latent'].shape}")
print(f"Growth rates: {n_pc_adata.obsm['growth_rate'].shape}")

Example 2: Large-Scale Dataset (Velocity + Growth + Score)

Biological context: Large datasets where proliferation and stochasticity both matter and training speed is the priority.

# Train CRUFM model (Velocity + Growth + Score); this config corresponds to mode="flow_matching"
cb.tl.fit(
    adata=n_pc_adata,
    config="crufm",   # Match biology: velocity (differentiation) + growth (proliferation) + score (stochasticity)
    device="cuda",
)

# Inspect outputs
print(f"Latent velocity: {n_pc_adata.obsm['velocity_latent'].shape}")
print(f"Growth rates: {n_pc_adata.obsm['growth_rate'].shape}")
print(f"Stochastic score: {n_pc_adata.obsm['score_latent'].shape}")

4. Model Selection Guide

4.1 Component Selection (Biology-Driven)

Components map to biological phenomena—velocity is mandatory; add others based on your system.

| Component | Biological Meaning | When to Include |
| --- | --- | --- |
| Velocity (Mandatory) | Instantaneous direction of cell state transitions (e.g., differentiation/activation) | Always (core of trajectory inference) |
| Growth | Proliferation/apoptosis rates (changes in cell population size over time) | Developmental expansion; tumor proliferation |
| Score | Stochasticity in cell state transitions (biological noise) | Stem cell fate plasticity |
| Interaction | Cell-cell communication effects | Tumor microenvironment (immune-cancer crosstalk) |

4.2 Training Mode Selection (Computation-Driven)

Choose between flow_matching (speed) and neural_ode (flexibility):

| Training Mode | Speed | Flexibility | Best For | Limitations |
| --- | --- | --- | --- | --- |
| flow_matching | ⚡ Fast (simulation-free) | ❌ Low (simple loss functions only) | Large datasets; simple dynamics (velocity + growth + score) | Struggles with complex dynamics (interaction) |
| neural_ode | 🐢 Slow (numerical ODE solving) | ✅ High (all components/losses) | Small/medium datasets; complex dynamics (velocity + growth + score + interaction) | Computationally expensive for large data |
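
The speed difference comes from what each mode has to compute per training step. The NumPy sketch below contrasts the two ideas on toy data: flow matching regresses a velocity field onto a closed-form target, while a neural-ODE style objective must integrate the field forward in time. The straight-line interpolation, the random pairing of cells, and the fixed-step Euler solver are simplifying assumptions for illustration, not CytoBridge's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(loc=0.0, size=(256, 2))   # cells at the earlier time point
x1 = rng.normal(loc=3.0, size=(256, 2))   # cells at the later time point (randomly paired)

# Flow matching: sample a time, interpolate, regress onto a closed-form target.
t = rng.uniform(size=(256, 1))
x_t = (1.0 - t) * x0 + t * x1
target_velocity = x1 - x0                 # available without any ODE solve

def fm_loss(velocity_fn):
    """Mean squared error between a velocity field and the interpolation target."""
    return float(np.mean(np.sum((velocity_fn(x_t, t) - target_velocity) ** 2, axis=1)))

# Neural-ODE style: predictions require stepping the velocity field through time.
def euler_integrate(velocity_fn, x, n_steps=50):
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t_k = np.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t_k)
    return x

def constant_shift(x, t):
    return np.full_like(x, 3.0)           # a hand-written field standing in for a network

def zero_field(x, t):
    return np.zeros_like(x)

print(fm_loss(constant_shift), fm_loss(zero_field))
x_pred = euler_integrate(constant_shift, x0)
print(x_pred.mean(axis=0))
```

Each flow-matching step costs one evaluation of the field, whereas each ODE-based step costs n_steps evaluations; in exchange, the integrated trajectories can be fed to losses defined on the predicted population.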

4.3 Model Selection Cheat Sheet

| Biological Scenario | Components | Training Mode | CytoBridge Config |
| --- | --- | --- | --- |
| State transitions only | Velocity only | neural_ode | dynamical_ot |
| Proliferation (no stochasticity) | Velocity + Growth | neural_ode | unbalanced_ot |
| Proliferation + Stochasticity | Velocity + Growth + Score | neural_ode + flow_matching | ruot |
| Proliferation + Stochasticity | Velocity + Growth + Score | flow_matching | crufm |
| Proliferation + Stochasticity + Cell-cell communication | Velocity + Growth + Score + Interaction | neural_ode | cytobridge |
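
The cheat sheet can also be written as a small lookup. The helper below is a hypothetical convenience for this tutorial (it is not provided by CytoBridge); it only restates the component-to-config mapping from the table.

```python
# Component sets mapped to the config names listed in the cheat sheet.
CONFIG_BY_COMPONENTS = {
    frozenset({"velocity"}): ["dynamical_ot"],
    frozenset({"velocity", "growth"}): ["unbalanced_ot"],
    frozenset({"velocity", "growth", "score"}): ["ruot", "crufm"],
    frozenset({"velocity", "growth", "score", "interaction"}): ["cytobridge"],
}

def suggest_configs(components):
    """Return the config names whose components match the requested set."""
    key = frozenset(c.lower() for c in components)
    if "velocity" not in key:
        raise ValueError("velocity is mandatory")
    if key not in CONFIG_BY_COMPONENTS:
        raise ValueError(f"No config in the cheat sheet for {sorted(key)}")
    return CONFIG_BY_COMPONENTS[key]

print(suggest_configs(["velocity", "growth"]))           # ['unbalanced_ot']
print(suggest_configs(["Velocity", "Growth", "Score"]))  # ['ruot', 'crufm']
```

When two configs match, as for velocity + growth + score, the training mode column decides between them.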

5. Optional Loss Functions

Loss functions in CytoBridge quantify the mismatch between model predictions and biological data. The choice of loss depends on your model components (e.g., growth requires mass-aware losses) and biological goals (e.g., stochasticity requires score matching). Below is a breakdown of all supported losses.

5.1 Overview

Each loss addresses a specific biological or computational challenge. Here’s a high-level mapping to model components:

| Loss Type | Target Challenge | Essential Model Components |
| --- | --- | --- |
| OT Loss | Align cell state distributions across time | Velocity (all OT-based models) |
| Mass Loss | Enforce consistent cell population sizes (proliferation/apoptosis) | Velocity + Growth |
| Score Matching Loss | Model stochasticity in cell state transitions | Velocity + Score |
| Density Loss | Align local density of cell states | Velocity (neural ODE mode) |
| PINN Loss | Enforce physical constraints (e.g., continuity of cell density) | Velocity (neural ODE mode) |

5.2 OT Loss (Optimal Transport Loss)

Core Function

Measures the “cost” of transporting cell states from one time point to another—critical for aligning distributions in velocity-based models. Supports both balanced (equal cell counts) and unbalanced (proliferation/apoptosis) scenarios.

Method Options & Use Cases

The method parameter in calc_ot_loss dictates how transport costs are computed. Choose based on data size and whether you need gradient flow (for training).

| Method | Balanced/Unbalanced | Gradient Support | Speed |
| --- | --- | --- | --- |
| emd_detach | Balanced | ❌ No (weights detached) | Slow |
| sinkhorn | Balanced | ✅ Yes | Fast |
| sinkhorn_detach | Balanced | ❌ No | Fast |
| sinkhorn_knopp_unbalanced | Unbalanced | ✅ Yes | Fast |
| sinkhorn_knopp_unbalanced_detach | Unbalanced | ❌ No | Fast |

Key Parameters

  • reg (Sinkhorn methods): Regularization strength (default: 0.1). Increase for noisy data; decrease for sharper distributions.

  • reg_m (Unbalanced Sinkhorn): Controls mass imbalance tolerance (default: 0.01). Larger values allow more extreme proliferation/apoptosis.
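
To see what reg does in practice, the following is a bare-bones balanced Sinkhorn iteration in NumPy on toy point clouds. It is a textbook sketch, not the routine behind calc_ot_loss, and the cost rescaling and iteration count are choices made for the example.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, reg, n_iters=1000):
    """Entropy-regularized OT plan between histograms a and b."""
    K = np.exp(-cost / reg)            # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)              # match column marginals
        u = a / (K @ v)                # match row marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 2))                   # cells at time t0 (synthetic)
y = rng.normal(loc=1.0, size=(40, 2))          # cells at time t1 (synthetic)
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()                       # rescale so exp(-cost / reg) stays well-conditioned
a = np.full(30, 1 / 30)
b = np.full(40, 1 / 40)

costs = {}
for reg in (0.1, 1.0):
    plan = sinkhorn_plan(a, b, cost, reg)
    costs[reg] = float((plan * cost).sum())
    print(reg, costs[reg])
```

A larger reg spreads each cell's mass over more targets, which raises the transport cost but smooths out noise; a smaller reg gives a sharper, closer-to-exact coupling.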

5.3 Mass Loss

Core Function

Ensures the model preserves local cell population sizes—critical for models with the Growth component (proliferation/apoptosis). It matches the mass (cell count) of predicted cells to real cells in local neighborhoods.

Use Cases

  • Models with Growth (unbalanced_ot, ruot, cytobridge).

  • Datasets where cell proliferation/apoptosis is heterogeneous (e.g., tumor cores vs. edges).

Key Parameters

  • relative_mass: The expected cell count at the target time point relative to the initial time point (e.g., 2.0 if the cell count doubles between time points).

  • global_mass: If True, adds a global mass constraint (ensures total cell count matches); use for datasets with global proliferation.
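
The link between a growth rate and a change in population size can be sketched in a few lines. The example assumes the usual convention that mass evolves as dm/dt = g(t) · m, so the population ratio over an interval is the exponential of the integrated growth rate; the function name is hypothetical and the convention has not been checked against the CytoBridge source.

```python
import numpy as np

def mass_after(growth_rate, t_span, n_steps=1000):
    """Integrate d(log m)/dt = g(t) with the trapezoid rule, starting from m = 1."""
    t = np.linspace(t_span[0], t_span[1], n_steps + 1)
    g = np.array([growth_rate(ti) for ti in t])
    log_mass = np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(t))
    return float(np.exp(log_mass))

doubling_rate = np.log(2.0) / 2.0   # population doubles every 2 time units
ratio = mass_after(lambda t: doubling_rate, (0.0, 2.0))
print(ratio)                        # close to 2.0
```

A mass constraint compares a ratio like this, predicted by the model's growth component, with the ratio observed in the data.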

5.4 Score Matching Loss

Core Function

Models stochasticity in cell state transitions. It is required for models with the Score component (ruot, crufm, cytobridge). It trains the model to predict the "score" (gradient of the log density) of noisy cell states, capturing biological noise.

Use Cases

  • Models with Score (ruot, crufm, cytobridge).

  • Datasets with stochastic cell fate decisions (e.g., stem cell differentiation, iPSC reprogramming).

Key Parameters

  • sigma: Noise level added to cell states (default: 0.1). Match to the noise in your data (e.g., higher for heterogeneous datasets).

  • lambda_penalty: Penalty for positive log-probability (default: 1.0). Prevents unphysical positive scores.

  • model: Trained DynamicalModel object (must include a score_net).
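
For intuition about what the score network is asked to learn, here is a minimal denoising score matching objective in NumPy. It uses one-dimensional Gaussian data so that the correct score is known in closed form; the setup is a generic textbook sketch and does not reproduce CytoBridge's loss, its lambda_penalty term, or its score_net.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1
x = rng.normal(size=(100_000, 1))      # clean "cell states" drawn from N(0, 1)
eps = rng.normal(size=x.shape)
x_noisy = x + sigma * eps              # perturb each state with Gaussian noise

def dsm_loss(score_fn):
    """Denoising score matching: regress the score onto -eps / sigma."""
    target = -eps / sigma              # score of the Gaussian noise kernel
    return float(np.mean(np.sum((score_fn(x_noisy) - target) ** 2, axis=1)))

def true_score(z):
    return -z / (1.0 + sigma ** 2)     # score of N(0, 1 + sigma^2), the noisy data density

def zero_score(z):
    return np.zeros_like(z)            # a deliberately wrong candidate

loss_true = dsm_loss(true_score)
loss_zero = dsm_loss(zero_score)
print(loss_true, loss_zero)
```

The true score attains the lower loss even though both values are dominated by a large constant of order 1/sigma², which is why small sigma values make the objective noisier to optimize.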

5.5 Density Loss

Core Function

Aligns the local density of predicted cell states with real cells—useful for scenarios where cell state density carries biological meaning (e.g., dense cell clusters representing cell types).

Use Cases

  • Models where cell clustering is important (e.g., identifying stable cell types in differentiation trajectories).

  • Spatial transcriptomics data (where local cell density correlates with tissue structure).

Key Parameters

  • hinge_value: Minimum distance threshold for density matching (default: 0.01). Values below this are ignored (reduces noise).

  • top_k: Number of nearest neighbors to consider for local density (default: 5).

  • groups: Optional cell groups (e.g., cell types) to compute density loss per group.
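
One plausible reading of these parameters is a k-nearest-neighbor hinge loss: for each predicted cell, take the distances to its top_k nearest real cells, ignore anything below hinge_value, and average the rest. The NumPy sketch below implements that reading as an assumption; the actual CytoBridge loss may differ in detail.

```python
import numpy as np

def knn_hinge_density_loss(pred, real, top_k=5, hinge_value=0.01):
    """Mean hinged distance from each predicted cell to its top_k nearest real cells."""
    dists = np.linalg.norm(pred[:, None, :] - real[None, :, :], axis=-1)
    nearest = np.sort(dists, axis=1)[:, :top_k]          # k smallest distances per row
    return float(np.mean(np.maximum(nearest - hinge_value, 0.0)))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 2))                         # observed cells (synthetic)
close = real[:50] + 0.01 * rng.normal(size=(50, 2))      # predictions near real cells
far = real[:50] + 5.0                                    # predictions in an empty region

loss_close = knn_hinge_density_loss(close, real)
loss_far = knn_hinge_density_loss(far, real)
print(loss_close, loss_far)
```

Predictions that land in regions without real cells are penalized, while predictions inside dense clusters contribute little.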

5.6 PINN Loss (Physics-Informed Neural Network Loss)

Core Function

Enforces physical constraints on cell dynamics—specifically the continuity equation (conservation of cell density over time). Ideal for models trained with neural_ode mode that require mechanistic rigor.

Use Cases

  • Models with Velocity, Growth, or Interaction (cytobridge).

  • Mechanistic studies where dynamics must follow physical laws (e.g., tissue growth, cell migration).

Key Parameters

  • sigma: Noise level for density estimation (default: 0.1).

  • use_mass: If True, includes the Growth component in the continuity equation.

  • use_interaction: If True, includes the Interaction component (cell-cell communication) in the equation.
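
The continuity constraint can be checked numerically on a case with a known solution. The sketch below uses a one-dimensional Gaussian density that moves at constant velocity v while its total mass grows at rate g, and evaluates the residual of ∂ρ/∂t + ∂(ρv)/∂x = gρ with finite differences. The function names and the use_mass flag here are illustrative stand-ins; the interaction term and any noise term are left out.

```python
import numpy as np

def density(x, t, v=1.0, g=0.3):
    """Gaussian bump moving at speed v whose total mass grows as exp(g * t)."""
    return np.exp(g * t) * np.exp(-0.5 * (x - v * t) ** 2) / np.sqrt(2 * np.pi)

def continuity_residual(x, t, v=1.0, g=0.3, use_mass=True, h=1e-4):
    """Central-difference residual of the continuity equation with optional growth."""
    d_rho_dt = (density(x, t + h, v, g) - density(x, t - h, v, g)) / (2 * h)
    d_flux_dx = v * (density(x + h, t, v, g) - density(x - h, t, v, g)) / (2 * h)
    source = g * density(x, t, v, g) if use_mass else 0.0
    return d_rho_dt + d_flux_dx - source

x = np.linspace(-3, 4, 50)
res_with_growth = np.abs(continuity_residual(x, t=0.5, use_mass=True)).max()
res_without_growth = np.abs(continuity_residual(x, t=0.5, use_mass=False)).max()
print(res_with_growth, res_without_growth)
```

When the population is growing, leaving the growth term out of the equation produces a clearly nonzero residual, which is the situation the use_mass option is meant to handle.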

Loss Selection Cheat Sheet

| Model Config | Required Losses | Optional Losses | Training Mode |
| --- | --- | --- | --- |
| dynamical_ot (Velocity only) | OT Loss | Density Loss | neural_ode |
| unbalanced_ot (Velocity + Growth) | OT Loss + Mass Loss | Density Loss | neural_ode |
| ruot (Velocity + Growth + Score) | OT Loss + Mass Loss + Score Matching Loss | PINN Loss + Density Loss | neural_ode + flow_matching |
| crufm (Velocity + Growth + Score) | Score Matching Loss | | flow_matching |
| cytobridge (All components) | OT Loss + Mass Loss + Score Matching Loss | Density Loss + PINN Loss | neural_ode |