Artificial Intelligence for Single-Cell Biology: From Representation Learning to Predictive Modeling

https://dx.doi.org/10.71373/YAGA2665


Summary

Single-cell sequencing technologies offer a high-resolution view of cellular heterogeneity, but their data are inherently high-dimensional, sparse, and noisy, are strongly affected by batch effects, and come with limited annotations. These properties make artificial intelligence (AI), particularly deep generative and probabilistic models, well suited to single-cell analysis. Recent AI frameworks, including the variational inference–based model scVI and its extensions, support unified pipelines for normalization, representation learning, batch correction, multimodal integration, and downstream analyses. Learning transferable latent representations extends these pipelines to specialized downstream tasks such as scalable cell-type annotation, trajectory and dynamic inference, cell–cell communication analysis, spatial mapping, and the prediction of responses to genetic or pharmacological perturbations. Emerging self-supervised foundation models promise reusable cellular representations across tasks, tissues, and species. Open challenges in benchmarking, interpretability, uncertainty quantification, and robust generalization define the next frontiers in the development of predictive and causal single-cell models.

Why Single-Cell Omics Naturally Requires Artificial Intelligence

Single-cell sequencing technologies, especially scRNA-seq, scATAC-seq, and joint multi-omics profiling, allow researchers to systematically characterize the heterogeneity of tissues and diseases at single-cell resolution. However, these data differ in statistical structure from traditional omics: they are high-dimensional, their expression matrices are highly sparse, technical noise and dropout events are pervasive, and batch effects and experimental biases are strong. In addition, annotations are scarce, and distributions differ substantially across technologies, tissues, and even species [1, 2, 3].

These intrinsic properties limit the performance of linear models and empirical rules. AI-based approaches such as deep learning, probabilistic generative models, and self-supervised learning are therefore better suited for unified representation learning and robust inference. In recent years, the scVI ecosystem, centered on variational inference and deep generative models, has gradually established a new analytical paradigm: single-cell data are viewed as the output of a stochastic process driven jointly by latent biological states and technical noise, and end-to-end probabilistic modeling carries inference from raw count data through to downstream biological interpretation [1, 4, 5].

From Workflow-Based Analysis to Probabilistic Generative Paradigm

Traditional single-cell analysis typically follows a workflow of quality control, normalization, selection of highly variable genes, dimensionality reduction, clustering, and differential analysis [6, 7]. In the generative modeling framework, these steps are reformulated as modeling different aspects of the data-generating process [1, 8].

For example, the normalization and variance stabilization approach used in Seurat, based on regularized negative binomial regression, has been widely applied to large-scale datasets and provides a reliable foundation for robust analysis [9, 7]. Complementarily, scVI models expression distributions directly at the level of gene counts: it learns a low-dimensional latent space with a variational autoencoder that preserves biological variation, separates out technical biases, and naturally supports uncertainty quantification [1, 11].
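To make "modeling expression at the gene count level" concrete, the following minimal sketch evaluates a negative binomial log-likelihood for one cell. It is an illustration under stated assumptions, not the scVI implementation: the mean is taken as library size times a per-gene expression fraction, and `rho`, `library_size`, and `theta` are hypothetical stand-ins for quantities that a trained decoder would supply.

```python
import math


def nb_log_pmf(x: int, mu: float, theta: float) -> float:
    """Log-probability of count x under a negative binomial with mean mu
    and inverse dispersion theta (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def cell_log_likelihood(counts, rho, library_size, theta):
    """Sum of per-gene NB log-likelihoods for a single cell.

    counts:       observed integer counts, one per gene
    rho:          assumed decoder output, expression fractions summing to 1
    library_size: scaling factor for the cell (for example, total counts)
    theta:        assumed per-gene inverse dispersion
    """
    return sum(
        nb_log_pmf(x, library_size * r, t)
        for x, r, t in zip(counts, rho, theta)
    )


# Toy cell with four genes; all values are made up for illustration.
counts = [0, 3, 12, 0]
rho = [0.05, 0.20, 0.70, 0.05]
theta = [2.0, 2.0, 2.0, 2.0]
print(cell_log_likelihood(counts, rho, library_size=15.0, theta=theta))
```

In a variational autoencoder, a term of this kind would serve as the reconstruction part of the training objective, with `rho` produced from the latent representation and batch covariates.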

Latent Representation Learning and Multi-Modal Unified Representation

Learning transferable latent representations is one of the core objectives of single-cell AI [1, 11]. This latent space is used not only for visualization and clustering but also as the foundation for cross-dataset integration, annotation transfer, and dynamic modeling.

In multi-omics scenarios, totalVI achieves end-to-end denoising, integration, and missing-modality inference by jointly modeling RNA and protein (CITE-seq) data in a probabilistic framework [4]. For multi-modal or mosaic data such as scRNA-seq and scATAC-seq, MultiVI and subsequent frameworks further expand the applicability of joint representation learning, improving the reliability of integrative analysis when modalities are partially missing [12, 13].

Batch Correction and Cross-Dataset Integration: Toward Reusable Reference Atlases

Batch effect correction is a long-standing challenge in single-cell research. Ideally, technical differences should be removed while preserving true biological variation to the greatest extent possible [11].

To address this challenge, Harmony iteratively learns correction terms in a low-dimensional embedding, balancing computational efficiency and integration quality, and has become a commonly used solution for large-scale data integration [11]. Seurat provides an alternative integration strategy based on anchor identification: it aligns cell states between datasets, which enables integration across experimental conditions and sequencing technologies as well as reference mapping [10, 14].

Building on this, scArches introduces transfer learning into generative models, allowing new data to be mapped into an existing reference latent space with only lightweight fine-tuning and providing a scalable solution for continuously expanding cell atlases [15].

Automated Cell Type Annotation and Uncertainty Management

As single-cell atlases grow in scale, marker-based manual annotation becomes increasingly inadequate, and automated annotation with controllable uncertainty becomes critical [16]. scANVI extends the scVI framework with semi-supervised learning, using partially labeled cells to guide latent space structure, improving annotation quality for unlabeled cells while maintaining probabilistic consistency in the presence of batch effects [5]. scNym further combines semi-supervised learning with domain adaptation and adversarial training, enhancing generalization across experimental conditions and platforms [17].

In practical applications, such models are often combined with hierarchical labeling systems, rare cell recognition, and rejection mechanisms to reduce the risk of overconfident mislabeling [16, 17].
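A rejection mechanism of the kind mentioned above can be sketched as a confidence threshold on predicted class probabilities. This is a generic illustration, not the procedure of scANVI or scNym; the function names, the `"Unassigned"` label, and the 0.7 threshold are assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def annotate_with_rejection(logits, labels, min_confidence=0.7):
    """Return (label, confidence); reject the call when confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return "Unassigned", probs[best]
    return labels[best], probs[best]


labels = ["T cell", "B cell", "Monocyte"]
confident = annotate_with_rejection([4.0, 0.5, 0.2], labels)
ambiguous = annotate_with_rejection([1.0, 0.9, 0.8], labels)
print(confident)  # a clear winner, so the label is kept
print(ambiguous)  # near-uniform scores, so the call is rejected
```

The threshold trades coverage against error rate, so in practice it would need to be tuned on held-out data, and raw softmax scores may need calibration before they can be read as probabilities.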

Continuous States, Trajectories, and Cellular Dynamics Modeling

Single-cell states often form continuous spectra rather than discrete categories; therefore, trajectory inference and dynamics modeling are crucial for understanding development, activation, and disease progression [18, 19].

Manifold learning methods such as UMAP excel at preserving local structure and often serve as the basis for neighborhood graphs and trajectory analysis [18]. RNA velocity adds directional information to static expression data through splicing kinetics; scVelo generalizes it to transient cell states and improves robustness through likelihood-based dynamical modeling [19].
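The idea behind RNA velocity can be illustrated with the simple steady-state formulation, in which velocity is the unspliced abundance minus a degradation rate times the spliced abundance. The sketch below is a simplification and not scVelo's likelihood-based dynamical model: it assumes the splicing rate is absorbed into the scaling and fits one rate per gene by least squares through the origin over all cells. The function name and toy values are hypothetical.

```python
import numpy as np


def steady_state_velocity(unspliced, spliced):
    """Estimate per-gene degradation rates and per-cell velocities.

    unspliced, spliced: arrays of shape (n_cells, n_genes).
    Assumes u ~ gamma * s at steady state, so velocity = u - gamma * s.
    """
    u = np.asarray(unspliced, dtype=float)
    s = np.asarray(spliced, dtype=float)
    # Least-squares slope through the origin, one gamma per gene.
    gamma = (u * s).sum(axis=0) / np.maximum((s * s).sum(axis=0), 1e-12)
    velocity = u - gamma * s
    return gamma, velocity


# One gene, four cells; the last cell carries excess unspliced transcripts.
spliced = np.array([[1.0], [2.0], [3.0], [4.0]])
unspliced = np.array([[0.5], [1.0], [1.5], [3.0]])
gamma, velocity = steady_state_velocity(unspliced, spliced)
print(gamma)
print(velocity)
```

A positive velocity suggests that a gene is being induced in that cell and a negative one that it is being repressed; in the toy data only the last cell has a positive value.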

From an AI perspective, these problems can be formulated as learning continuous-time generative processes or directed graph structures in latent space, which makes them naturally compatible with neural ODEs, graph neural networks, and probabilistic state-space models [19, 20].

Cell–Cell Communication: From Co-Expression Inference to Mechanistic Links

Cell–cell communication analysis usually infers potential interaction networks between cell populations, using ligand–receptor databases as priors. CellPhoneDB provides systematic ligand–receptor resources and statistical inference pipelines, while CellChat emphasizes hierarchical features and pattern analysis of communication networks [21, 22].
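The statistical core of ligand–receptor inference can be sketched as a label-permutation test. The example below is a simplified illustration and does not call CellPhoneDB or CellChat: it assumes the interaction score is the mean of the average ligand expression in the sender cluster and the average receptor expression in the receiver cluster, and it builds the null distribution by shuffling cluster labels. All names and data are hypothetical.

```python
import numpy as np


def lr_score(ligand, receptor, clusters, sender, receiver):
    """Mean of average ligand (sender) and average receptor (receiver) expression."""
    lig_mean = ligand[clusters == sender].mean()
    rec_mean = receptor[clusters == receiver].mean()
    return 0.5 * (lig_mean + rec_mean)


def lr_permutation_pvalue(ligand, receptor, clusters, sender, receiver,
                          n_perm=1000, seed=0):
    """Empirical p-value of the observed score against shuffled cluster labels."""
    rng = np.random.default_rng(seed)
    observed = lr_score(ligand, receptor, clusters, sender, receiver)
    null = np.array([
        lr_score(ligand, receptor, rng.permutation(clusters), sender, receiver)
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the p-value strictly positive.
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value


# Toy data: the ligand is expressed only in cluster A, the receptor only in B.
clusters = np.array(["A"] * 50 + ["B"] * 50)
ligand = np.where(clusters == "A", 5.0, 0.0)
receptor = np.where(clusters == "B", 5.0, 0.0)
score, p = lr_permutation_pvalue(ligand, receptor, clusters, "A", "B")
print(score, p)
```

A small p-value indicates only that the pair is more cluster-specific than expected by chance; it does not by itself show that signaling occurs.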

NicheNet advances communication inference from “whether interactions exist” to “downstream transcriptional responses,” predicting ligand–target gene relationships by integrating signal transduction and regulatory network priors, yielding results closer to mechanistic interpretation [23]. With the development of spatial transcriptomics, incorporating spatial proximity constraints and graph models into communication analysis has become an important trend.

Generative Prediction of Perturbation Responses and Drug Effects

Single-cell perturbation experiments (CRISPR, drugs, infection, etc.) provide important information for causal mechanism studies, but high experimental costs and enormous combinatorial spaces make perturbation response prediction a key AI task [24, 25].

scGen learns latent differential vectors between control and perturbed states through conditional variational autoencoders, enabling extrapolation of perturbation effects across cell types and studies [24]. CellOT integrates optimal transport with neural networks to learn mappings between unpaired distributions for distribution-level counterfactual prediction [25].
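The latent differential vector described for scGen amounts to simple arithmetic once cells have been encoded. The sketch below illustrates only that arithmetic step under stated assumptions: the latent codes are simulated with random numbers rather than produced by a trained encoder, the decoder is omitted, and the function names are hypothetical.

```python
import numpy as np


def perturbation_vector(z_control, z_perturbed):
    """Difference between mean latent codes of perturbed and control cells."""
    return z_perturbed.mean(axis=0) - z_control.mean(axis=0)


def predict_perturbed(z_query_control, delta):
    """Shift control cells of another cell type by the learned latent difference."""
    return z_query_control + delta


# Simulated 3-dimensional latent codes (stand-ins for encoder outputs).
rng = np.random.default_rng(0)
true_shift = np.array([2.0, -1.0, 0.0])
z_ctrl_a = rng.normal(size=(200, 3))               # cell type A, control
z_pert_a = rng.normal(size=(200, 3)) + true_shift  # cell type A, perturbed
z_ctrl_b = rng.normal(loc=3.0, size=(150, 3))      # cell type B, control only

delta = perturbation_vector(z_ctrl_a, z_pert_a)
z_pred_b = predict_perturbed(z_ctrl_b, delta)
print(delta)
print(z_pred_b.mean(axis=0) - z_ctrl_b.mean(axis=0))
```

The prediction assumes that the perturbation acts as the same shift in every cell type, which is exactly the kind of assumption that cross-donor and cross-platform evaluation is meant to test.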

These methods have potential value in drug screening and personalized therapy, but require stricter cross-batch, cross-donor, and cross-platform evaluation to avoid overly optimistic generalization estimates [26].

Spatial Information Integration: Placing Cellular States Back into Tissue Coordinates

Spatial transcriptomics provides information on tissue structure but remains limited in resolution and sequencing depth, whereas scRNA-seq is information-rich but lacks spatial localization [27, 28]. Tangram aligns single-cell expression with spatial measurements through deep learning, providing a probabilistic mapping from single cells to spatial positions and reconstructing the spatial expression patterns of unmeasured genes [27].
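A probabilistic mapping from cells to spatial positions can be pictured as a matrix whose rows are distributions over spots. The sketch below shows only the forward computation under stated assumptions: it does not use the Tangram package or reproduce its training objective, the mapping logits are fixed by hand rather than learned, and agreement is scored with a per-gene cosine similarity chosen for illustration.

```python
import numpy as np


def softmax_rows(a):
    """Row-wise softmax so each cell's placement probabilities sum to 1."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)


def project_to_space(mapping_logits, sc_expr):
    """Project single-cell expression onto spots.

    mapping_logits: (n_cells, n_spots) unnormalized placement scores
    sc_expr:        (n_cells, n_genes) single-cell expression
    Returns the mapping matrix M and predicted spot expression M.T @ sc_expr.
    """
    M = softmax_rows(np.asarray(mapping_logits, dtype=float))
    return M, M.T @ np.asarray(sc_expr, dtype=float)


def mean_gene_cosine(pred, observed):
    """Average cosine similarity between predicted and observed gene patterns."""
    num = (pred * observed).sum(axis=0)
    den = np.linalg.norm(pred, axis=0) * np.linalg.norm(observed, axis=0) + 1e-12
    return float((num / den).mean())


# Three cells, two spots, two genes; logits place cells 0 and 2 in spot 0.
sc_expr = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
logits = np.array([[10.0, -10.0], [-10.0, 10.0], [10.0, -10.0]])
spatial_observed = np.array([[2.0, 1.0], [0.0, 1.0]])

M, spatial_pred = project_to_space(logits, sc_expr)
print(spatial_pred)
print(mean_gene_cosine(spatial_pred, spatial_observed))
```

In a learned setting the logits would be optimized so that predicted and measured spatial expression agree on shared genes, after which the same matrix could project genes that were not measured spatially.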

Recent studies further introduce graph structures, tissue morphological images, and multi-modal priors to improve spatial deconvolution and cell localization accuracy in complex tissues [28, 29].

Single-Cell Foundation Models and Self-Supervised Pretraining

With the rapid growth of public single-cell atlases and cell numbers, the field has begun exploring “foundation models” based on self-supervised pretraining to learn universal cellular representations for downstream tasks [30, 31, 32].

scGPT, inspired by Transformer architectures in natural language processing, treats the gene–cell relationship as analogous to the word–sentence relationship; it is pretrained on large-scale datasets and shows potential in annotation, perturbation prediction, and cross-species mapping tasks [30].

Core challenges in this direction include distribution shift, cross-tissue generalization, model interpretability, and data governance and privacy protection [31, 33].

Evaluation, Interpretability, and Conclusion Credibility

Single-cell AI analyses are often affected by batch leakage, label inconsistency, and donor overlap, which may lead offline evaluation to overestimate true generalization performance [26, 34]. Therefore, standardized evaluation strategies that split data by donor, batch, or experiment, together with external validation across tissues and platforms, have become increasingly important [34].
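Splitting by donor, batch, or experiment means that every cell from a given group falls entirely in training or entirely in testing. The following minimal sketch shows one way to do this with the standard library; the function name, the 25% test fraction, and the donor labels are assumptions for illustration.

```python
import random


def split_by_group(groups, test_fraction=0.25, seed=0):
    """Split cell indices so that no group (e.g. donor) spans train and test.

    groups: one group label per cell, such as a donor or batch identifier.
    Returns (train_indices, test_indices).
    """
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole groups, at least one.
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Eight cells from four donors, two cells each.
donors = ["d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4"]
train_idx, test_idx = split_by_group(donors)
print(train_idx, test_idx)
```

A random per-cell split of the same data would place cells from every donor on both sides, which is the leakage that inflates offline performance estimates.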

For interpretability, traditional marker and pathway enrichment analyses are now complemented by model-intrinsic approaches, such as attention mechanisms, feature attribution, latent factor constraints, and generative counterfactual experiments, that help convert “black-box representations” into testable biological hypotheses [23, 35, 36].

Summary and Outlook

Overall, artificial intelligence is advancing single-cell analysis from a “toolbox-style workflow” to a “transferable, generative, and joint-inference” paradigm of system-level modeling, with particular strengths in representation learning, reference mapping, spatial integration, and foundation models [1, 4, 15, 27, 30]. Unified modeling of heterogeneous single-cell data drives analysis toward holistic insights into cellular states, dynamics, and interactions. Moreover, these approaches facilitate scalable integration across experiments, tissues, and even species, laying a foundation for reusable reference atlases and predictive models of perturbation or disease responses.

Key future directions include stricter cross-domain extrapolation evaluation, causal inference–based perturbation prediction, multi-modal models integrating spatial and imaging data, and improved uncertainty quantification and interpretability frameworks to support high-stakes applications such as clinical use and drug development [25, 29, 34, 37]. In addition, self-supervised foundation models show potential for broadly applicable cell representations, which could reduce reliance on extensive manual annotation and enable transfer learning across diverse biological contexts. Coupling these models with mechanistic priors and multi-scale spatial information may further enhance their predictive accuracy and biological interpretability. Ultimately, these advances are expected to transform single-cell research from descriptive atlases to predictive, hypothesis-driven frameworks capable of guiding experimental design, precision medicine, and therapeutic discovery.

References
