Pseudobulking in scRNA-seq for Differential Expression Analysis

Differential Expression

scRNA-seq

P-value

DESeq2

This blog explores how pseudobulking aggregates single-cell data to reduce variability and improve the reliability of differential expression analysis in scRNA-seq studies.

Author: Wajid Iqbal

Affiliation: International Islamic University, Islamabad

Do Cells from the Same Sample Behave Independently of Each Other?

The advent of single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to dissect cellular heterogeneity within complex tissues, revealing insights that were previously obscured in bulk RNA sequencing. By examining gene expression at the resolution of individual cells, scRNA-seq has uncovered novel biological phenomena with significant implications for understanding disease mechanisms, developmental biology, and therapeutic targets. However, despite the powerful potential of scRNA-seq, differential expression (DE) analysis at the single-cell level presents unique challenges. One effective approach to mitigate these challenges is pseudobulking.

Differential expression in scRNA-seq has so far been approached in two ways. The sample-level view aggregates expression within each sample to create pseudobulks and then applies methods developed for bulk data, such as edgeR and DESeq2. The cell-level view models cells individually using generalized mixed-effects models such as MAST (Finak et al. 2015) or glmmTMB (Brooks et al. 2017). Recent studies have concluded that pseudobulk methods perform best, although whether sum or mean aggregation works better requires further investigation (Zimmerman, Espeland, and Langefeld 2021; Murphy and Skene 2022).

The phrase “cells from the same sample are not independent” means that individual cells within a single sample (such as a tissue biopsy or a blood sample) are likely to have correlated gene expression profiles because of shared biological and technical factors, which are discussed in detail below.

1. Biological Similarity:

Shared Environment: Cells from the same sample originate from the same organism, tissue, or microenvironment. This means they have experienced similar conditions, such as the same nutrient availability, oxygen levels, and signaling molecules, which can influence their gene expression in similar ways.

Common Lineage: In many cases, cells from the same sample may be derived from a common progenitor cell or cell type. This shared lineage can lead to similar expression patterns because the cells have similar differentiation states or functions.

2. Technical Similarity:

Sample Processing: When cells are collected and processed together, they undergo the same technical procedures (e.g., dissociation, RNA extraction, library preparation). These shared technical steps can introduce batch effects, where technical variation is mistaken for biological differences. For example, if a certain gene is consistently upregulated due to the processing method, all cells in that sample might show similar expression of that gene, making them correlated.

Sequencing Batch Effects: If all cells from a sample are sequenced together, they might share sequencing artifacts or biases. This can further reduce their independence in terms of their measured gene expression profiles.

3. Statistical Implications:

Correlation Among Cells: Because of the reasons above, the expression levels of genes in cells from the same sample are likely to be correlated. This correlation violates the assumption of independence that underlies many statistical tests. When cells are treated as independent, these correlations can lead to an overestimation of the significance of observed differences, resulting in artificially small p-values.

False Positives: If cells are treated as independent units, the statistical models might incorrectly identify genes as differentially expressed (false positives), simply because they are capturing the shared variance within a sample, rather than true differences between conditions.
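The inflation described above can be illustrated with a small simulation. The following is a minimal sketch, not a recommended analysis pipeline: it simulates a single gene with no true difference between two conditions, gives every sample a shared random shift so that its cells are correlated, and then compares a t-test that treats cells as independent with a t-test on per-sample means. All parameters (3 samples per condition, 50 cells per sample, the standard deviations, 500 repetitions) are assumptions chosen for illustration, and the critical values are the standard two-sided 5% thresholds of the t distribution for 298 and 4 degrees of freedom.

```python
import random
import statistics


def simulate_once(rng, n_samples=3, n_cells=50, sample_sd=0.5, cell_sd=1.0):
    """Simulate one gene under the null: no true difference between conditions."""
    groups = []
    for _ in range(2):  # two conditions
        samples = []
        for _ in range(n_samples):
            shift = rng.gauss(0, sample_sd)  # shared by every cell in this sample
            samples.append([shift + rng.gauss(0, cell_sd) for _ in range(n_cells)])
        groups.append(samples)
    return groups


def t_stat(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (pooled * (1 / nx + 1 / ny)) ** 0.5


rng = random.Random(0)  # fixed seed so the run is reproducible
n_sims = 500
cell_hits = 0
sample_hits = 0
for _ in range(n_sims):
    g1, g2 = simulate_once(rng)

    # Cell level: pool all cells and pretend they are independent (df = 298).
    cells1 = [c for s in g1 for c in s]
    cells2 = [c for s in g2 for c in s]
    if abs(t_stat(cells1, cells2)) > 1.968:
        cell_hits += 1

    # Sample level: one mean per sample, so n = 3 per condition (df = 4).
    means1 = [statistics.mean(s) for s in g1]
    means2 = [statistics.mean(s) for s in g2]
    if abs(t_stat(means1, means2)) > 2.776:
        sample_hits += 1

cell_fpr = cell_hits / n_sims
sample_fpr = sample_hits / n_sims
print(f"cell-level false positive rate:   {cell_fpr:.2f}")
print(f"sample-level false positive rate: {sample_fpr:.2f}")
```

Because every simulated gene is null, any rejection is a false positive. Under these assumptions the cell-level test should reject far more often than the nominal 5%, while the sample-level test should stay close to it.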

What is Pseudobulking?

Pseudobulking is a technique that addresses these challenges by aggregating or averaging gene expression data across multiple cells within a sample, creating a “bulk-like” expression profile for each sample.
Instead of treating each cell as an independent observation, pseudobulking considers each sample as the unit of analysis, similar to how bulk RNA-seq data is analyzed.
This approach reduces technical noise by averaging out the variability among individual cells and mitigates the issue of non-independence by focusing on differences between samples rather than between cells.
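The aggregation step itself is simple. The following is a minimal Python sketch, assuming counts are stored as one list per cell and that each cell is labeled with its sample. The function name `pseudobulk`, the cell and sample IDs, and the toy counts are all hypothetical and used only for illustration.

```python
from collections import defaultdict


def pseudobulk(counts, cell_to_sample, how="sum"):
    """Aggregate per-cell gene counts into one expression profile per sample.

    counts:         dict mapping cell ID -> list of counts (one entry per gene)
    cell_to_sample: dict mapping cell ID -> sample ID
    how:            "sum" or "mean"
    """
    if how not in ("sum", "mean"):
        raise ValueError("how must be 'sum' or 'mean'")
    totals = {}
    n_cells = defaultdict(int)
    for cell, vec in counts.items():
        sample = cell_to_sample[cell]
        if sample not in totals:
            totals[sample] = [0] * len(vec)
        # Add this cell's counts gene by gene to its sample's running total.
        totals[sample] = [t + v for t, v in zip(totals[sample], vec)]
        n_cells[sample] += 1
    if how == "mean":
        return {s: [t / n_cells[s] for t in vec] for s, vec in totals.items()}
    return totals


# Toy data (hypothetical): 5 cells, 3 genes, 2 samples.
counts = {
    "cell1": [0, 3, 1],
    "cell2": [2, 0, 0],
    "cell3": [1, 1, 0],
    "cell4": [0, 5, 2],
    "cell5": [0, 1, 0],
}
cell_to_sample = {
    "cell1": "A", "cell2": "A", "cell3": "A",
    "cell4": "B", "cell5": "B",
}

bulk_sum = pseudobulk(counts, cell_to_sample, how="sum")
bulk_mean = pseudobulk(counts, cell_to_sample, how="mean")
print(bulk_sum)   # one summed profile per sample
print(bulk_mean)  # one averaged profile per sample
```

Sum aggregation keeps the values as integer counts, which is the input that bulk tools such as DESeq2 and edgeR expect. In practice the aggregation is usually done separately for each cell type within each sample, so that the resulting profiles are compared like with like.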

Why Perform Pseudobulk Analysis?

Single-cell RNA-seq data tend to exhibit an abundance of zero counts, a complicated distribution, and substantial heterogeneity.
This heterogeneity within and between cell populations poses major challenges for differential gene expression analysis of scRNA-seq data.
Single-cell methods tend to identify highly expressed genes as DE and exhibit low sensitivity for genes with low expression.
Single-cell methods often inflate significance, producing artificially small p-values, because each cell is treated as a sample; when cells are treated as samples, the variation across the population of samples is not truly investigated.
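The effect of aggregation on zero counts can be seen in a toy example. This is a minimal sketch with a made-up count matrix for five cells from one sample. The numbers are hypothetical and only show that summing across cells fills in most of the zeros seen at the single-cell level.

```python
def zero_fraction(matrix):
    """Fraction of entries in a list-of-rows matrix that are exactly zero."""
    values = [v for row in matrix for v in row]
    return sum(v == 0 for v in values) / len(values)


# Hypothetical counts: rows are cells from one sample, columns are genes.
cells = [
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]

# Sum each gene (column) across cells to get a single pseudobulk profile.
pseudobulk_profile = [[sum(col) for col in zip(*cells)]]

cell_zero = zero_fraction(cells)
bulk_zero = zero_fraction(pseudobulk_profile)
print(f"zeros at cell level:       {cell_zero:.2f}")  # 15 of 20 entries
print(f"zeros at pseudobulk level: {bulk_zero:.2f}")  # 1 of 4 entries
```

A gene remains zero in the pseudobulk profile only if it is undetected in every cell of the sample, which is why the aggregated data look much more like bulk RNA-seq counts.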

R Code Example for Performing Pseudobulk Analysis

The R example code for performing pseudo-bulk analysis can be found here.

References

Brooks, Mollie E., Kasper Kristensen, Koen J. van Benthem, Arni Magnusson, Casper W. Berg, Anders Nielsen, Hans J. Skaug, Martin Mächler, and Benjamin M. Bolker. 2017. “glmmTMB Balances Speed and Flexibility Among Packages for Zero-Inflated Generalized Linear Mixed Modeling.” The R Journal 9 (2): 378. https://doi.org/10.32614/rj-2017-066.

Finak, Greg, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk, Alex K. Shalek, Chloe K. Slichter, et al. 2015. “MAST: A Flexible Statistical Framework for Assessing Transcriptional Changes and Characterizing Heterogeneity in Single-Cell RNA Sequencing Data.” Genome Biology 16 (1). https://doi.org/10.1186/s13059-015-0844-5.

Murphy, Alan E., and Nathan G. Skene. 2022. “A Balanced Measure Shows Superior Performance of Pseudobulk Methods in Single-Cell RNA-Sequencing Analysis.” Nature Communications 13 (1). https://doi.org/10.1038/s41467-022-35519-4.

Zimmerman, Kip D., Mark A. Espeland, and Carl D. Langefeld. 2021. “A Practical Solution to Pseudoreplication Bias in Single-Cell Studies.” Nature Communications 12 (1). https://doi.org/10.1038/s41467-021-21038-1.