CytoBridge Deep Dive: A Comprehensive Tutorial
Welcome to the detailed tutorial on using CytoBridge for single-cell spatiotemporal dynamical modeling. This guide is designed to provide an in-depth walkthrough of the entire process, from data preparation and preprocessing to PCA extraction and model training. Additionally, it offers practical insights into selecting the most suitable models and components based on specific biological questions and computational limitations.
1. Data Requirements
CytoBridge is designed for single-cell transcriptomic data (scRNA-seq/spatial transcriptomics) with temporal annotations. Below are the supported formats and critical requirements:
Supported Input Formats
| Format | Use Case | Required Structure |
|---|---|---|
| AnnData (h5ad) (Recommended) | Standard scRNA-seq/spatial data (compatible with Scanpy/Seurat) | - |
| CSV | Lightweight datasets | - Rows = cells, Columns = gene counts + 1 time column (e.g., |
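As a rough illustration of the CSV layout described above (rows are cells, columns are gene counts plus one time column), the following sketch splits a small in-memory CSV into a count matrix and a list of time labels using only the standard library. The column name `time` and the helper `split_counts_and_time` are hypothetical and chosen for illustration; they are not part of CytoBridge.

```python
import csv
import io


def split_counts_and_time(csv_text, time_col="time"):
    """Split a cells-by-genes CSV into gene names, count rows, and time labels.

    `time_col` is a hypothetical column name used only for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if time_col not in reader.fieldnames:
        raise ValueError(f"missing time column: {time_col!r}")
    genes = [name for name in reader.fieldnames if name != time_col]
    counts, times = [], []
    for row in reader:
        counts.append([float(row[g]) for g in genes])  # one row per cell
        times.append(row[time_col])                     # time label per cell
    return genes, counts, times


example_csv = "GeneA,GeneB,time\n3,0,Day0\n1,5,Day2\n"
genes, counts, times = split_counts_and_time(example_csv)
print(genes, counts, times)
```

A check like this can catch a missing time column before any modeling starts.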
Critical Annotations
- Time Key: A column in `adata.obs` (or the CSV) that labels each cell with its experimental time point (e.g., categorical: `Day0`/`Day2`; numerical: `0`/`2.0`). This is mandatory for modeling temporal dynamics.
- Raw Counts (Preferred): CytoBridge’s preprocessing pipeline is optimized for raw scRNA-seq counts (normalization/log1p steps are configurable).
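Categorical time labels need a numeric value before dynamics can be modeled. The sketch below shows one way to derive such a mapping, assuming every label ends in a number (as in `Day0`, `Day2`). The helper `build_time_mapping` is hypothetical and not a CytoBridge function.

```python
import re


def build_time_mapping(labels):
    """Map labels such as 'Day0' or 'Day2.5' to floats, ordered by time."""
    mapping = {}
    for label in set(labels):
        # Take the trailing number of the label as the time value.
        match = re.search(r"(\d+(?:\.\d+)?)\s*$", str(label))
        if match is None:
            raise ValueError(f"cannot parse a time value from {label!r}")
        mapping[label] = float(match.group(1))
    return dict(sorted(mapping.items(), key=lambda item: item[1]))


labels = ["Day5", "Day0", "Day2", "Day0"]
time_mapping = build_time_mapping(labels)
print(time_mapping)
```

Inspect the resulting dictionary by hand before using it as a time mapping, since labels that do not follow this pattern raise an error.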
2. Quick Start
Get up and running in 5 minutes with a minimal example (AnnData input):
```python
import scanpy as sc
import CytoBridge as cb

# 1. Load data
adata = sc.read_h5ad("raw_scrna_data.h5ad")

# 2. Preprocess (normalize + HVG + PCA)
time_mapping = {"Day0": 0.0, "Day2": 2.0, "Day5": 5.0}  # Map time points to numbers
n_pc_adata = cb.pp.preprocess(
    adata=adata,
    time_key="Time_point",
    time_mapping=time_mapping,
    normalization=True,
    log1p=True,
    select_hvg=True,
    dim_reduction="pca",
    n_pcs=50,
)

# 3. Train model (Velocity + Growth + Score)
cb.tl.fit(
    adata=n_pc_adata,
    config="ruot",  # Velocity + Growth + Score
    device="cuda",  # Use "cpu" if no GPU
)

# 4. Save results
n_pc_adata.write_h5ad("trained_model.h5ad")
```
3. Full Workflow Walkthrough
3.1 Preprocessing
Preprocessing reduces noise and extracts biologically meaningful features for dynamical modeling. The `cb.pp.preprocess` function wraps a standardized workflow; tune its parameters to your data type.
Key Preprocessing Steps & Parameters
| Step | Purpose | Parameter | When to Use |
|---|---|---|---|
| Normalization | Scale counts to correct for sequencing depth differences | `normalization` | Enable for raw counts; disable for pre-normalized data |
| Log1p Transformation | Stabilize variance in gene expression | `log1p` | Enable for raw counts; disable for already log-transformed data |
| HVG Selection | Focus on biologically variable genes (reduce noise) | `select_hvg` | Enable for scRNA-seq; disable for sparse spatial transcriptomics |
| Time Mapping | Convert categorical time labels to numbers | `time_mapping` | Always define explicitly (avoids arbitrary time ordering) |
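To make the first two steps concrete, here is a minimal NumPy sketch of depth normalization followed by a log1p transform. It follows the common Scanpy-style recipe with a `target_sum` of 10,000; whether CytoBridge uses the same target internally is an assumption, and `normalize_log1p` is a hypothetical helper.

```python
import numpy as np


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    depth[depth == 0] = 1.0  # leave empty cells as all zeros
    return np.log1p(counts / depth * target_sum)


# Two cells with the same composition but a 10x difference in sequencing depth.
raw = np.array([[10, 0, 30], [1, 0, 3]])
processed = normalize_log1p(raw)
print(np.allclose(processed[0], processed[1]))
```

After normalization the two cells become identical, which is the point of correcting for sequencing depth before comparing expression profiles.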
Full Preprocessing Code
```python
import scanpy as sc
import CytoBridge as cb

# Load data
adata = sc.read_h5ad("raw_scrna_data.h5ad")

# Define time mapping (critical for temporal dynamics)
time_mapping = {"Day0": 0.0, "Day2": 2.0, "Day5": 5.0}

# Run preprocessing
n_pc_adata = cb.pp.preprocess(
    adata=adata,
    time_key="Time_point",      # Column with time labels
    time_mapping=time_mapping,  # Explicit time mapping (avoid auto-mapping)
    normalization=True,         # Raw counts → True; normalized → False
    log1p=True,                 # Raw counts → True; log-transformed → False
    select_hvg=True,            # scRNA-seq → True; spatial → False
    n_top_genes=2000,           # Adjust for data size (2000 for small datasets)
    dim_reduction="pca",        # Use PCA for modeling
    n_pcs=50,                   # 30-50 for scRNA-seq; 20-30 for spatial data
)

# Verify preprocessing output
print(f"Preprocessed PCA data shape: {n_pc_adata.shape}")  # (n_cells, n_pcs)
print(f"Latent space: {n_pc_adata.obsm['X_latent'].shape}")  # Latent representation is stored here
# With dim_reduction="none", the model trains on the original data (used as the latent space).
# With dim_reduction="pca", the model trains on the PCA representation in n_pc_adata.
print(f"Processed time points: {n_pc_adata.obs['time_point_processed'].unique()}")
```
3.2 PCA Extraction
PCA is the recommended linear dimension reduction for CytoBridge modeling because it preserves global dynamical structure. UMAP and t-SNE, by contrast, are suited to visualization only.
Why PCA?
- Retains linear relationships between cell states (critical for modeling velocity/growth).
- Reduces the computational cost of training (50 PCs vs. 20k genes).
- Minimizes noise while preserving biological signal.
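For intuition about what the PCA step produces, the following sketch projects a synthetic matrix onto its top principal components with a plain SVD. It is a generic NumPy illustration, not CytoBridge's internal implementation, and `pca_project` is a hypothetical helper.

```python
import numpy as np


def pca_project(X, n_pcs):
    """Project rows of X onto the top `n_pcs` principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    latent = centered @ Vt[:n_pcs].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return latent, explained[:n_pcs]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
X[:, 0] *= 10.0  # one dominant direction of variation
latent, explained = pca_project(X, n_pcs=5)
print(latent.shape, explained)
```

The explained-variance ratios give a quick check on how many components are worth keeping for a given dataset.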
3.3 Model Training
Use the preprocessed PCA data (`n_pc_adata`) to train a CytoBridge model. Choose a `config` (component combination) and `mode` (training method) based on your biological question.
Core Training Parameters
| Parameter | Description | Options |
|---|---|---|
| `config` | Defines model components (velocity/growth/score/interaction) | e.g., `"ruot"`, `"unbalanced_ot"`, `"CRUFM"` (the values used in this tutorial) |
| `device` | Compute device | `"cuda"`, `"cpu"` |
| `mode` | Training algorithm | `"flow_matching"`, `"neural_ode"` |
Training Examples
Example 1: Stem Cell Differentiation (Velocity + Growth + Score)
Biological context: Proliferating stem cells with stochastic fate decisions.
```python
# Train with config="unbalanced_ot" (corresponds to mode="neural_ode")
cb.tl.fit(
    adata=n_pc_adata,
    # Match biology: velocity (differentiation) + growth (proliferation).
    # For Velocity + Growth + Score, use config="ruot" as in the Quick Start.
    config="unbalanced_ot",
    device="cuda",
)

# Inspect outputs
print(f"Latent velocity: {n_pc_adata.obsm['velocity_latent'].shape}")
print(f"Growth rates: {n_pc_adata.obsm['growth_rate'].shape}")
```
Example 2: Large-Scale Steady-State Tissue (Velocity Only)
Biological context: Adult liver tissue (stable cell numbers, minimal noise).
```python
# Train with config="CRUFM" (corresponds to mode="flow_matching")
cb.tl.fit(
    adata=n_pc_adata,
    # This config fits velocity (differentiation) + growth (proliferation) + score (stochasticity)
    config="CRUFM",
    device="cuda",
)

# Inspect outputs
print(f"Latent velocity: {n_pc_adata.obsm['velocity_latent'].shape}")
print(f"Growth rates: {n_pc_adata.obsm['growth_rate'].shape}")
print(f"Stochastic score: {n_pc_adata.obsm['score_latent'].shape}")
```
4. Model Selection Guide
4.1 Component Selection (Biology-Driven)
Components map to biological phenomena. Velocity is mandatory; add the others based on your system.
| Component | Biological Meaning | When to Include |
|---|---|---|
| Velocity (Mandatory) | Instantaneous direction of cell state transitions (e.g., differentiation/activation) | Always (core of trajectory inference) |
| Growth | Proliferation/apoptosis rates (changes in cell population size over time) | Developmental expansion; tumor proliferation |
| Score | Stochasticity in cell state transitions (biological noise) | Stem cell fate plasticity |
| Interaction | Cell-cell communication effects | Tumor microenvironment (immune-cancer crosstalk) |
4.2 Training Mode Selection (Computation-Driven)
Choose between `flow_matching` (speed) and `neural_ode` (flexibility):
| Training Mode | Speed | Flexibility | Best For | Limitations |
|---|---|---|---|---|
| `flow_matching` | ⚡ Fast (simulation-free) | ❌ Low (simple loss functions only) | Large datasets; simple dynamics (velocity + growth + score) | Struggles with complex dynamics (interaction) |
| `neural_ode` | 🐢 Slow (numerical ODE solving) | ✅ High (all components/losses) | Small/medium datasets; complex dynamics (velocity + growth + score + interaction) | Computationally expensive for large data |
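The rules of thumb in the table above can be encoded as a small decision helper. This is only a sketch: `choose_training_mode` is a hypothetical function, and the `large_threshold` cutoff of 100,000 cells is an arbitrary illustrative value, not a CytoBridge default.

```python
def choose_training_mode(components, n_cells, large_threshold=100_000):
    """Pick a training mode from the rules of thumb in the table above.

    `large_threshold` is an arbitrary illustrative cutoff, not a CytoBridge default.
    """
    components = {c.lower() for c in components}
    if "velocity" not in components:
        raise ValueError("velocity is mandatory")
    if "interaction" in components:
        return "neural_ode"      # flow matching struggles with interaction
    if n_cells >= large_threshold:
        return "flow_matching"   # simulation-free, scales to large data
    return "neural_ode"          # small/medium data: favour flexibility


print(choose_training_mode({"velocity", "growth", "score"}, n_cells=500_000))
print(choose_training_mode({"velocity", "interaction"}, n_cells=500_000))
```

In practice the cutoff depends on available hardware, so treat the threshold as a starting point to tune.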
4.3 Model Selection Cheat Sheet
| Biological Scenario | Components | Training Mode | CytoBridge Config |
|---|---|---|---|
| Only considering Velocity | Velocity only | | |
| Proliferation (no stochasticity) | Velocity + Growth | | |
| Proliferation + Stochasticity | Velocity + Growth + Score | | |
| Proliferation + Stochasticity | Velocity + Growth + Score | | |
| Proliferation + Stochasticity + Cell-Cell Communication | Velocity + Growth + Score + Interaction | | |
5. Optional Loss Functions
Loss functions in CytoBridge quantify the mismatch between model predictions and biological data. The choice of loss depends on your model components (e.g., growth requires mass-aware losses) and biological goals (e.g., stochasticity requires score matching). Below is a breakdown of all supported losses.
5.1 Overview
Each loss addresses a specific biological or computational challenge. Here’s a high-level mapping to model components:
| Loss Type | Target Challenge | Essential Model Components |
|---|---|---|
| OT Loss | Align cell state distributions across time | Velocity (all OT-based models) |
| Mass Loss | Enforce consistent cell population sizes (proliferation/apoptosis) | Velocity + Growth |
| Score Matching Loss | Model stochasticity in cell state transitions | Velocity + Score |
| Density Loss | Align local density of cell states | Velocity (neural ODE mode) |
| PINN Loss | Enforce physical constraints (e.g., continuity of cell density) | Velocity (neural ODE mode) |
5.2 OT Loss (Optimal Transport Loss)
Core Function
Measures the “cost” of transporting cell states from one time point to another, which is critical for aligning distributions in velocity-based models. Supports both balanced (equal cell counts) and unbalanced (proliferation/apoptosis) scenarios.
Method Options & Use Cases
The `method` parameter in `calc_ot_loss` dictates how transport costs are computed. Choose based on data size and whether you need gradient flow (for training).
| Method | Balanced/Unbalanced | Gradient Support | Speed |
|---|---|---|---|
| | Balanced | ❌ No (weights detached) | Slow |
| | Balanced | ✅ Yes | Fast |
| | Balanced | ❌ No | Fast |
| | Unbalanced | ✅ Yes | Fast |
| | Unbalanced | ❌ No | Fast |
Key Parameters
- `reg` (Sinkhorn methods): Regularization strength (default: 0.1). Increase for noisy data; decrease for sharper distributions.
- `reg_m` (Unbalanced Sinkhorn): Controls mass imbalance tolerance (default: 0.01). Larger values allow more extreme proliferation/apoptosis.
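To show what the `reg` parameter controls, here is a minimal NumPy sketch of balanced, entropy-regularized OT solved with Sinkhorn iterations. It is a textbook implementation written for illustration, not the code behind `calc_ot_loss`; the function `sinkhorn_plan`, the cost rescaling, and the iteration count are assumptions of this sketch, and the unbalanced variant governed by `reg_m` is not covered.

```python
import numpy as np


def sinkhorn_plan(a, b, cost, reg=0.1, n_iters=1000):
    """Entropy-regularized balanced OT plan via Sinkhorn iterations."""
    K = np.exp(-cost / reg)  # Gibbs kernel; smaller reg gives a sharper plan
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)    # rescale to match column marginals
        u = a / (K @ v)      # rescale to match row marginals
    return u[:, None] * K * v[None, :]


rng = np.random.default_rng(0)
x0 = rng.normal(size=(20, 2))            # cells at the earlier time point
x1 = rng.normal(loc=1.0, size=(30, 2))   # cells at the later time point
cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()  # rescale so exp(-cost / reg) stays well-conditioned

a = np.full(20, 1 / 20)   # uniform weights on source cells
b = np.full(30, 1 / 30)   # uniform weights on target cells
plan = sinkhorn_plan(a, b, cost, reg=0.1)
ot_cost = float((plan * cost).sum())
print(ot_cost)
```

The row and column sums of `plan` recover `a` and `b`, which is what "balanced" means here: total mass is conserved between the two time points.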
5.3 Mass Loss
Core Function
Ensures the model preserves local cell population sizes, which is critical for models with the Growth component (proliferation/apoptosis). It matches the mass (cell count) of predicted cells to real cells in local neighborhoods.
Use Cases
- Models with `Growth` (Unbalanced OT, RUOT, cytobridge).
- Datasets where cell proliferation/apoptosis is heterogeneous (e.g., tumor cores vs. edges).
Key Parameters
- `relative_mass`: The expected proportion of cells in the target time point (e.g., 0.5 if the cell count halves between time points).
- `global_mass`: If `True`, adds a global mass constraint (ensures the total cell count matches); use for datasets with global proliferation.
5.4 Score Matching Loss
Core Function
Models stochasticity in cell state transitions and is required for models with the Score component (RUOT, cytobridge). It trains the model to predict the “score” (gradient of the log density) of noisy cell states, capturing biological noise.
Use Cases
- Models with `Score` (RUOT, cytobridge).
- Datasets with stochastic cell fate decisions (e.g., stem cell differentiation, iPSC reprogramming).
Key Parameters
- `sigma`: Noise level added to cell states (default: 0.1). Match to the noise in your data (e.g., higher for heterogeneous datasets).
- `lambda_penalty`: Penalty for positive log-probability (default: 1.0). Prevents unphysical positive scores.
- `model`: Trained `DynamicalModel` object (must include a `score_net`).
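The idea of predicting the score of noisy cell states can be illustrated with a generic denoising score matching loss. The sketch below is a standard formulation in NumPy and is not CytoBridge's loss: it omits `lambda_penalty`, uses a closed-form score in place of a `score_net`, and uses `sigma = 0.5` rather than the default so the effect is easy to see. All function names are hypothetical.

```python
import numpy as np


def dsm_loss(score_fn, x, sigma, rng):
    """Denoising score matching loss for Gaussian noise of scale `sigma`."""
    eps = rng.normal(size=x.shape)
    x_noisy = x + sigma * eps
    target = -eps / sigma  # score of the Gaussian perturbation kernel
    residual = score_fn(x_noisy) - target
    return float(np.mean(np.sum(residual**2, axis=1)))


sigma = 0.5
x = np.random.default_rng(0).normal(size=(5000, 2))  # toy data drawn from N(0, I)


def true_score(z):
    # For N(0, I) data, the noisy density is N(0, (1 + sigma^2) I).
    return -z / (1.0 + sigma**2)


def zero_score(z):
    # A deliberately wrong score model for comparison.
    return np.zeros_like(z)


loss_true = dsm_loss(true_score, x, sigma, np.random.default_rng(1))
loss_zero = dsm_loss(zero_score, x, sigma, np.random.default_rng(1))
print(loss_true < loss_zero)
```

The correct score attains the lower loss, which is the property that lets a network learn the score from noisy samples alone.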
5.5 Density Loss
Core Function
Aligns the local density of predicted cell states with real cells. This is useful where cell state density carries biological meaning (e.g., dense cell clusters representing cell types).
Use Cases
- Models where cell clustering is important (e.g., identifying stable cell types in differentiation trajectories).
- Spatial transcriptomics data (where local cell density correlates with tissue structure).
Key Parameters
- `hinge_value`: Minimum distance threshold for density matching (default: 0.01). Values below this are ignored (reduces noise).
- `top_k`: Number of nearest neighbors to consider for local density (default: 5).
- `groups`: Optional cell groups (e.g., cell types) to compute density loss per group.
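One plausible reading of `hinge_value` and `top_k` is a hinged nearest-neighbor distance: each predicted cell is penalized by its distance to its `top_k` closest observed cells, with distances below the hinge ignored. The NumPy sketch below implements that reading; it is an assumption about the general technique, not CytoBridge's actual code, and `density_loss` is a hypothetical name.

```python
import numpy as np


def density_loss(predicted, observed, top_k=5, hinge_value=0.01):
    """Mean hinged distance from each predicted cell to its top_k nearest observed cells."""
    diffs = predicted[:, None, :] - observed[None, :, :]
    dists = np.sqrt((diffs**2).sum(axis=-1))             # shape (n_pred, n_obs)
    nearest = np.sort(dists, axis=1)[:, :top_k]          # k smallest distances per row
    hinged = np.clip(nearest - hinge_value, 0.0, None)   # ignore distances below the hinge
    return float(hinged.mean())


rng = np.random.default_rng(0)
observed = rng.normal(size=(100, 2))
on_manifold = observed[:10] + 0.001 * rng.normal(size=(10, 2))  # close to real cells
off_manifold = on_manifold + 5.0                                # far from real cells
print(density_loss(on_manifold, observed) < density_loss(off_manifold, observed))
```

Predictions that stay near observed cells incur a small loss, while predictions that drift into empty regions of state space are penalized.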
5.6 PINN Loss (Physics-Informed Neural Network Loss)
Core Function
Enforces physical constraints on cell dynamics, specifically the continuity equation (conservation of cell density over time). Ideal for models trained with `neural_ode` mode that require mechanistic rigor.
Use Cases
- Models with `Velocity`, `Growth`, or `Interaction` (cytobridge).
- Mechanistic studies where dynamics must follow physical laws (e.g., tissue growth, cell migration).
Key Parameters
- `sigma`: Noise level for density estimation (default: 0.1).
- `use_mass`: If `True`, includes the `Growth` component in the continuity equation.
- `use_interaction`: If `True`, includes the `Interaction` component (cell-cell communication) in the equation.
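For reference, a common way to write the constraint described above is the continuity equation with an optional growth source term. The symbols are notation assumed here: $\rho$ is the cell density, $v$ the velocity field, and $g$ the growth rate. The exact residual CytoBridge penalizes, including how `sigma` and the interaction term enter, is not specified in this tutorial.

```latex
\frac{\partial \rho(x, t)}{\partial t}
  + \nabla \cdot \big( \rho(x, t)\, v(x, t) \big)
  = g(x, t)\, \rho(x, t)
```

Without growth (which presumably corresponds to `use_mass=False`), the right-hand side is zero and total cell density is conserved. A PINN-style loss typically penalizes the squared difference between the two sides at sampled points.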
Loss Selection Cheat Sheet
| Model Config | Required Losses | Optional Losses | Training Mode |
|---|---|---|---|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |