Quantifying Expression in Single Cell

Scrnaseq

RNAseq

Genomics

Seurat

Measuring gene expression levels within individual cells.

Author

Affiliation

Wajid Iqbal

International Islamic University, Islamabad

Part2:

Disclaimer: ⚠️ This post may be lengthy for some readers, so please adjust accordingly. Welcome to the second blog in our series on single-cell RNA sequencing (scRNA-seq). In this blog, we will discuss various transcript quantification methods and some of the problems associated with quantification. Most of the content discussed here relates to the droplet-based method of scRNA-seq.

Transcript Quantification:

Transcript quantification is the process of measuring the abundance of RNA transcripts (messenger RNA or mRNA) within a sample. In the context of single-cell RNA sequencing (scRNA-seq), it involves determining the number of RNA molecules corresponding to each gene in individual cells.The general workflow for single cell library construction is shown in Figure 1

Figure 1: General Workflow for single cell Library Construction.

There are two major version of Transcript Quantification: Full-length and tag-based.

Full length:

This method makes an attempt to cover the whole transcript with sequencing reads.

Based on the type of methods used for transcript quantification, can have a serious impact on the number of genes captured, so try to asses all the available methods and use the approach that best suits your study.

Only plate-based protocols can perform Full-length sequencing.
Full-length protocols do not always ensure even coverage of transcripts, leading to potential biases in specific regions across the gene body.
The only advantage that full length based approach provides, is the detection of splice site variants.

Tag-based Protocols:

Tag-based protocols sequence only the 3’ or 5’ ends of transcripts, which limits full gene coverage and complicates the alignment of reads to specific transcripts and the differentiation of isoforms. However, these protocols enable the use of unique molecular identifiers (UMIs), which help correct biases in transcript amplification. Although this is beyond the scope of this blog to discuss in detail each approach and platform used for sc-RNASeq but down below is a summarized table that contains all the information related to the platform name and separation methods used in each approach.

Want to see the code

library(gt)
scRNA_seq_data <- data.frame(
  Platform_Name = c("VASA-seq", "Smart-seq3", "DNBelabC4", "Seq-Well", "MATQ-seq", "10× Genomics", "Cyto-Seq", "SC3-seq", "inDrop-seq", "Drop-seq", "MARS-seq", "STRT-seq", "Quartz-seq", "Fluidigm C1", "Smart-seq2", "Smart-seq", "CEL-seq", "Tang-2009"),
  Separation_Method = c("FANS", "Microfluidics", "Microfluidics", "Microfluidics", "FACS", "Microfluidics", "Microfluidics", "Micromanipulation", "Microfluidics", "Microfluidics", "FACS", "Microfluidics", "FACS", "Microfluidics", "FACS", "FACS", "FACS", "FACS"),
  Amplification_Method = c("PCR", "PCR", "PCR", "PCR", "PCR", "PCR", "PCR", "PCR", "IVT", "PCR", "IVT", "PCR", "PCR", "PCR", "PCR", "PCR", "IVT", "PCR"),
  Using_UMI = c("YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES", "NO", "YES", "NO", "NO", "NO", "YES", "NO"),
  Amplification_Range = c("All transcripts", "5′ end", "All transcripts", "3′ end", "All transcripts", "3′ end", "3′ end", "3′ end", "3′ end", "3′ end", "3′ end", "All transcripts", "3′ end", "All transcripts", "All transcripts", "All transcripts", "3′ end", "3′ end"),
  Advantages = c("Low cost and accurate dosing", "High sensitivity", "Precise quantification", "Low cost and precise quantification", "Precise quantification", "High cell capture efficiency, fast cycle time, high cell suitability, and reproducibility", "Low cost and high throughput", "Good reproducibility and accurate quantification", "Low cost and linear amplification", "Low cost and high throughput", "High specificity", "Accurate positioning of transcripts at the 5′ end to reduce amplification bias", "High sensitivity, reproducibility, and operational simplicity", "Simple process", "Full-length cDNA detects structural and RNA shear variants", "High sensitivity to reduce the rates of nucleic acid loss", "Good reproducibility and highly sensitive", "Good reproducibility"),
  Disadvantages = c("/", "Time-consuming", "/", "Unsuitable for variable splicing and allelic expression", "Low cell throughput", "Sequencing can be performed only for the 3′ end", "Cross-contamination between RNAs", "Recognize DNA at the 3′ end", "Long operating time and high initial cell concentration", "Low cell capture rate", "Low amplification efficiency", "Low sensitivity, only available for identification of 5′ end DNA", "Higher noise levels", "High cost and low throughput", "High cost, low throughput, and time-consuming", "Low throughput and the existence of transcript length bias", "Low throughput and amplification efficiency, library biased toward the 3′ end of the gene", "High cost and low throughput"),
  Release_Date = c("2022", "2020", "2019", "2017", "2017", "2016", "2015", "2015", "2015", "2015", "2014", "2014", "2013", "2013", "2013", "2012", "2012", "2009")
)

scRNA_seq_table <- gt(scRNA_seq_data) %>%
  tab_header(
    title = md("**Comparison of Different ScRNA-Seq Technology Approaches**")
  ) %>%
  cols_label(
    Platform_Name = "Platform Name",
    Separation_Method = "Separation Method",
    Amplification_Method = "Amplification Method",
    Using_UMI = "Using UMI",
    Amplification_Range = "Amplification Range",
    Advantages = "Advantages",
    Disadvantages = "Disadvantages",
    Release_Date = "Release Date"
  ) %>%
 tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    locations = cells_column_labels(everything())
  ) %>% 
  tab_source_note(source_note = md("**Reference**: Shou wang et al., The Evolution of Single-Cell RNA Sequencing Technology and Application: Progress and Perspectives.")) %>% 
  opt_stylize(style = 6, color ='gray') %>% 
  tab_options(heading.align = 'center',
              heading.background.color = 'darkgreen',
              heading.title.font.size = px(20))
scRNA_seq_table |> as_raw_html()

Comparison of Different ScRNA-Seq Technology Approaches
Platform Name	Separation Method	Amplification Method	Using UMI	Amplification Range	Advantages	Disadvantages	Release Date
VASA-seq	FANS	PCR	YES	All transcripts	Low cost and accurate dosing	/	2022
Smart-seq3	Microfluidics	PCR	YES	5′ end	High sensitivity	Time-consuming	2020
DNBelabC4	Microfluidics	PCR	YES	All transcripts	Precise quantification	/	2019
Seq-Well	Microfluidics	PCR	YES	3′ end	Low cost and precise quantification	Unsuitable for variable splicing and allelic expression	2017
MATQ-seq	FACS	PCR	YES	All transcripts	Precise quantification	Low cell throughput	2017
10× Genomics	Microfluidics	PCR	YES	3′ end	High cell capture efficiency, fast cycle time, high cell suitability, and reproducibility	Sequencing can be performed only for the 3′ end	2016
Cyto-Seq	Microfluidics	PCR	YES	3′ end	Low cost and high throughput	Cross-contamination between RNAs	2015
SC3-seq	Micromanipulation	PCR	YES	3′ end	Good reproducibility and accurate quantification	Recognize DNA at the 3′ end	2015
inDrop-seq	Microfluidics	IVT	YES	3′ end	Low cost and linear amplification	Long operating time and high initial cell concentration	2015
Drop-seq	Microfluidics	PCR	YES	3′ end	Low cost and high throughput	Low cell capture rate	2015
MARS-seq	FACS	IVT	YES	3′ end	High specificity	Low amplification efficiency	2014
STRT-seq	Microfluidics	PCR	NO	All transcripts	Accurate positioning of transcripts at the 5′ end to reduce amplification bias	Low sensitivity, only available for identification of 5′ end DNA	2014
Quartz-seq	FACS	PCR	YES	3′ end	High sensitivity, reproducibility, and operational simplicity	Higher noise levels	2013
Fluidigm C1	Microfluidics	PCR	NO	All transcripts	Simple process	High cost and low throughput	2013
Smart-seq2	FACS	PCR	NO	All transcripts	Full-length cDNA detects structural and RNA shear variants	High cost, low throughput, and time-consuming	2013
Smart-seq	FACS	PCR	NO	All transcripts	High sensitivity to reduce the rates of nucleic acid loss	Low throughput and the existence of transcript length bias	2012
CEL-seq	FACS	IVT	YES	3′ end	Good reproducibility and highly sensitive	Low throughput and amplification efficiency, library biased toward the 3′ end of the gene	2012
Tang-2009	FACS	PCR	NO	3′ end	Good reproducibility	High cost and low throughput	2009
Reference: Shou wang et al., The Evolution of Single-Cell RNA Sequencing Technology and Application: Progress and Perspectives.

Since Tag-based methods involves sequencing only specific parts of each RNA molecule such as 3’ or 5’ ends, this allows for the efficient capture and sequencing of RNA from a large number of cell, allowing better identity of cell type population.
Tag-based methods have usually cheaper per cell cost.

Figure 2: Whole picture of how the reads are synthesized

To capture the mRNA molecule, we hybridize it to a poly-dT sequence that has unique molecular identifier (UMI) attached to it. UMIs are randomly synthesized sequences, typically between 8 and 12 base pairs in length, depending on the technology and its version, providing over 50,000 possible combinations(Archer et al. 2016).

Possible combination with 8 and 12 bp:

Since there are 4 nucleotides (A, T, C, G)

For a UMI of 8 base pairs, there are 4^8 (65,536) possible combinations. For 12 base pairs, the number of combinations is 4^12(16,777,216).

Theoretically, each mRNA molecule would have a unique barcode. However, in practice, randomness in synthesis does not guarantee equal probability for all sequences, leading to biases in UMI distribution. To address this, we combine the UMI with the genomic location of the mRNA transcript to uniquely identify captured mRNA molecules.

Additionally, we incorporate a cell barcode, a sequence between 14 and 18 base pairs, the number varies with the technology used. These barcodes are deliberately generated from predetermined pools, ensuring each barcode differs by at least two base pairs from any other to allow for error correction.

More explanation on barcodes:

When we design barcodes to identify each cell, we make sure every barcode is different from every other barcode by at least two letters. This way, even if there’s a small mistake in reading the barcode (like reading an ‘A’ as a ‘T’), we can still tell which cell it’s supposed to be from. If two barcodes are only one letter different, we assume the difference is just a mistake and group them together as coming from the same cell. This helps us correct for small errors and ensures we correctly identify each cell.

Sequencing errors can occur, and having such errors in cell barcodes could misassign reads to incorrect cell types, such as assigning a read from a T-cell to an endothelial cell. The two base pair differences between barcodes enable us to aggregate barcodes with single base pair differences, treating them as sequencing errors, and ensuring they all originated from the same cell (Aird et al. 2011).

Raw data to count-matrix:

The general workflow that is followed in scrnaseq is shown in Figure 5.

Figure 5: General workflow of scRNA-seq. Image obtained from here.

Once you get the reads from sequencing facility, they will either be in BCL or FASTQ format or count matrix. Although most of the sequencing facilities nowadays would usually gives the sequencing file in FASTQ format. In case you are given the file in BCL format the first step would be to convert them to FASTQ format which can easily be done using Illumina bcl2fastq or for instance if one is using CellRanger pipeline then they have provide the useful function in the pipeline cellranger mkfastq which is preferred option for converting BCLs to FASTQs. However, if one wants to use the CellRanger pipeline then using their function cellranger mkfastq is the preferred one because it will generate cell-Ranger compatible FASTQs. More details on Cell-Ranger pipeline is here.

Quality Check:

After getting FASTQs, the first step generally that is common in almost all the high-throughput sequencing is running a quality control on FASTQ file and usually this is done utilizing FASTQC user guide page . Overall, FASTQC generate an html or this. Here:i am attaching a link to one of the very useful resources that can get you started with FASTQC ,how to interpret it and what is good and bad quality. This material was taken from this course:Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets that was offered by wellcome connecting science and is one of the best course if someone is interested in analyzing genomics dataset and variant interpretation. Nevertheless, FASTQC generates a QC report for each input FASTQ file, essentially summarizes the quality score, base content, and certain relevant statistics to spot potential problems originating from library preparation or sequencing. Additionally, another tools that can give more comprehensive report is MultiQC, the output file of which looks something like this.

Mapping:

Alignment or mapping refers to the process of matching the sequenced RNA reads to a reference genome or transcriptome. This step is crucial for identifying the origins of fragments especially RNA in the context of scRNA within each cell, enabling the quantification of gene or transcript expression levels. This step is usually computationally intensive in any high-throughput sequencing analysis.
The alignment process uses specialized algorithms to locate the positions of RNA reads on the reference genome, thus determining which genes are expressed and at what levels.
The output of this process typically includes several key files: an aligned reads file in BAM (Binary Alignment Map) or SAM (Sequence Alignment Map) format, which contains detailed information about the read sequences, their alignment positions, and mapping quality; a gene expression matrix in tabular format, which provides counts of reads or transcripts mapped to each gene for every single cell, essential for downstream analyses like identifying differentially expressed genes and clustering cells based on their gene expression profiles.
Summary statistics that assess the quality of the alignment, including metrics such as the percentage of successfully aligned reads, the distribution of reads across genomic features, and overall coverage depth. Through alignment or mapping, researchers can accurately quantify gene expression at the single-cell level, facilitating insights into cellular heterogeneity and the discovery of novel cell types and states.
As of now, several tools has been developed to align or map scRNA reads to either reference transcriptome, genome or an augmented transcriptome. Tools like Cell Ranger (Zheng et al. 2017), zUMIs (Parekh et al. 2018), alevin (Srivastava et al. 2016), Raindrop (Niebler et al. 2020), kallisto Bustools (Melsted et al. 2021), STARsolo (Kaminow, Yunusov, and Dobin 2021), and Allevin fry (He et al. 2022) are some of the tools that have been extensively used to carry out alignment or mapping.

Different reference sequences can be used for mapping. There are three main types of reference sequences:

The full, annotated reference genome.
The annotated transcriptome.
Mapping against an augmented transcriptome.

However, it’s important to note that not all combinations of mapping algorithms and reference sequences are currently possible. For instance, there are no lightweight mapping algorithms available that can perform spliced mapping of reads against a reference genome.

Mapping using reference Genome:

Mapping in scRNA-seq involves aligning sequenced RNA reads to the entire genome of the target organism, including annotated transcripts.
Tools like zUMIs, Cell Ranger, and STARsolo use this whole-genome approach.
These tools employ splice-aware alignment algorithms to handle reads from spliced transcripts, meaning they can align reads split across splicing junctions.
The advantage of this method is that it accounts for reads from any location in the genome, not just known transcripts.
This approach allows capturing reads within introns and other non-annotated regions, enabling comprehensive gene detection and quantification.
Constructing a genome-wide index is required, which allows reporting reads that map to both known spliced transcripts and intronic regions.
Such methods can also capture reads outside annotated transcripts, which helps in post hoc augmentation of quantified loci, increasing gene detection sensitivity.
However, this method requires substantial memory, especially during the index construction and storage phase, often exceeding hundreds of gigabytes.
Using a sparse suffix array can reduce the index size but at the cost of alignment speed.
Spliced alignment is more computationally expensive than contiguous alignment due to the complexity of explicitly computing alignment scores for each read.
This approach necessitates the availability of the organism’s genome, which can be challenging for non-model organisms that only have a transcriptome assembly available.

Mapping using Reference Transcriptome:

To reduce the computational load of spliced alignment to a genome, an alternative approach is to use only the sequences of annotated transcripts as the reference.
Most single-cell experiments use model organisms (like mice or humans) that have well-annotated transcriptomes, making transcriptome-based methods provide similar read coverage to genome-based methods.
Transcriptome sequences are smaller than genome sequences, which reduces the computational resources needed for the mapping process.
Using transcript sequences, which already include splicing patterns, eliminates the need for complex spliced alignment. Instead, reads are mapped through contiguous alignments.
This approach works well with both alignment and lightweight mapping techniques, and is used in many popular tools.
However, this method doesn’t capture reads from outside the spliced transcriptome, making it unsuitable for single-nucleus data.
In single-cell experiments, reads from outside the spliced transcriptome can be significant and should be included in the analysis.
Using lightweight mapping approaches can sometimes lead to incorrect read mappings and unrealistic estimates of gene expression.

Mapping against an augmented transcriptome:

An augmented transcriptome is an updated version of a standard transcriptome reference that includes additional sequences beyond the typical spliced transcripts. This augmentation can involve the inclusion of:

Full-Length Unspliced Transcripts: These are the complete RNA sequences before splicing occurs, which include both exons and introns.
Intronic Sequences: These are the sequences of introns, which are usually removed during the splicing process in eukaryotic cells.
Other Non-Coding Regions: Additional sequences that may give rise to RNA reads, such as untranslated regions (UTRs) or intergenic regions.
To account for reads that may come from outside of spliced transcripts, spliced transcript sequences can be augmented with additional reference sequences, such as full-length unspliced transcripts or intronic sequences.
Augmented transcriptome references, which include introns, create reference indices that are smaller than those needed for the full genome while still enabling the search for contiguous read alignments.
This approach can lead to faster and less memory-intensive alignment compared to mapping against the full genome, while also capturing many reads from outside the spliced transcriptome.
Mapping to an expanded set of reference sequences recovers more reads than mapping to the spliced transcriptome alone, and reduces spurious mappings when using lightweight mapping-based methods.
Augmented transcriptomes are useful for quantifying single-nucleus data or preparing data for RNA-velocity analysis, offering a balance between computational efficiency and comprehensive read mapping.
(He et al. 2022) recommend constructing an augmented splici (spliced + intronic) reference even for standard single-cell RNA-seq data. This reference combines spliced transcript sequences with intronic sequences of annotated genes. Using this splici reference allows lightweight mapping methods, reduces spurious mappings, and improves sensitivity by detecting both spliced and unspliced reads.
During quantification, splicing status is tracked and reported separately, unifying the preprocessing pipeline for single-cell, single-nucleus, and RNA-velocity analysis.

Considerations Regarding Mapping:

Here I am attaching a shell file: prepare_reference.sh that contains all the script on downloading reference genome and annotation file and then some minor changes to make it more cellRanger-compatible. Additionally if you are working on mouse data then go to this link and you will find all the relevant information regarding mouse data. When mapping reads to transcriptome one needs to consider:

If a Knock-in mouse has a novel human gene inserted into its genome, it is necessary to add that gene to the reference genome to accurately quantify its expression.
Annotation quality needs to be considered, especially when working with non-model species. For mouse or human, this is generally not an issue. However, with non-model species, incomplete or inaccurate annotations can cause problems, particularly when sequencing only one end of the transcript. If many transcripts have long untranslated regions (UTRs) that have not been identified, it could result in the loss of a significant number of reads and genes. This is because many reads might not map to the annotated gene body. Therefore, it may be necessary to update the annotation if working with a less well-characterized genome of a non-typical species.
Consider whether to include introns and exons or just exons. Previously, the answer was to use exons only for single-cell RNA sequencing and both introns and exons for single-nucleus RNA sequencing. However, now Cell Ranger includes both by default, as it captures 20% more reads even for single-cell RNA sequencing. There are also methods that utilize intronic versus exonic information, such as scVelo,(Soneson et al. 2021) which requires separate quantification of intron and exon reads.

For assigning reads to cells, the correct whitelist for cell barcodes is needed, and UMIs (unique molecular identifiers) are counted. One challenge is dealing with multi-mapping reads, where a read maps to multiple genes so which gene gets UMI count for that read. Cell Ranger’s current rule is to exclude all those reads that are multimapping, which can lead to an undercounting bias for genes with close homologs.

Cell Ranger also addresses sequencing errors in UMIs by allowing for up to one base pair error in the UMI sequence and correcting for it. While Cell Ranger is commonly used for 10x Genomics data, there are alternatives like STARsolo, which mimics Cell Ranger but is open-source and allows for easier modification of parameters.
I’ve been discussing sequencing errors, which might seem unusual to those familiar with sequencing technology, as it is typically 99.9% accurate these days. However, when I mention sequencing errors, I’m actually referring to all the errors that can occur from the mRNA molecules to our sequencing reads. Errors can arise from reverse transcriptase during cDNA synthesis, from DNA polymerase during sequence amplification, and from the sequencing machine itself. While the sequencing machine’s errors are very low, errors at other stages are not necessarily as minimal.

What other things that could possibly go wrong😧:

Before explaining what other things could possibly go wrong, its time to define few terminologies.

Ambient RNA: Ambient RNA is cell-free mRNA that is released during preparation of single-cell suspensions for scRNA-Seq analysis.

Doublets: A doublet in scRNA-seq refers to an event where two cells are captured in the same droplet or well, leading to their RNA being sequenced together.

Lets explore the ambient RNA thing in detail now, but first we need to know how the droplets are generated.

How are droplets generated in an ideal world:

To explain the process of empty droplet generation in single-cell assays, we need to look at how these droplets are created. We produce these droplets by simultaneously flowing barcoded beads and cells, which are then encased in droplets by oil in our 10x Chromium machine as shown in Figure 6.

This process generates hundreds of droplets per second. Ideally, each droplet would contain one cell and one barcoded bead, but in reality, only about 5% of the droplets achieve this ideal condition. Additionally, all droplets contain ambient RNA. This occurs because some cells get damaged or break during handling and extraction from tissue, particularly when passing through microfluidic devices at high speeds. This results in RNA being released into the surrounding solution.

When droplets are formed, they capture both the cells and the ambient RNA floating in the supernatant. Therefore, ambient RNA is present in nearly all droplets. In practice, about 95% of droplets are considered empty, with approximately 20% containing only ambient RNA. Around 7% of droplets will have a barcoded bead with capture nucleotides along with ambient RNA as shown in Figure 7.

Figure 7: Possible droplets Configuration

This means the barcoded bead can capture ambient RNA, generate cDNA, and include unique molecular identifiers (UMIs) that will appear in the sequencing library and subsequently in the sequencing data results.

When ambient RNA is present at moderate to high levels, it can cause problems in later parts of the analyses for two main reasons:

Ambient RNA can mix with the genuine gene expression profiles of cells, making it difficult to accurately identify cell types.
Differences observed between experimental conditions might be due to variations in ambient RNA levels rather than actual biological differences.
For the solution of this problem (Lun et al. 2019) proposed an efficient method in this paper and that cell Ranger Version 3 and above has adapted in their pipeline. For the detail method here is the link to the paper (Lun et al. 2019). The function emptydrops is embedded in DropletUtils R package. Another useful resource for When should I correct for ambient RNA?

2nd Problem: Processed pseudogenes are mRNAs that have undergone reverse transcription within the cell and been reintegrated into the genome. They possess the entire exonic sequence but lack promoters, enhancers, or any regulatory elements, making them incapable of being expressed since transcription factors or RNA polymerase cannot bind to them.

Despite this, we still detect them due to mismatched reads. Ideally, all reads should align with the gene that is actively transcribed, but a small portion of reads actually map to the processed pseudogene. This mismapping accounts for about 4% of the reads. Considering a 1% sequencing error rate, a 100 base pair read would result in approximately 4% of the reads having three or more sequencing errors, causing mismapping. This analysis only includes uniquely mapping genes, excluding multimapping reads.

To address this mismapping issue, the default genome provided by cellRanger masks all pseudogenes and processed pseudogenes. This masking is done to prevent errors caused by mismatched reads.

Doublet Problem:

Doublets in single-cell RNA sequencing (scRNA-seq) arise when two cells are captured together in the same droplet or well during the encapsulation process as shown in Figure 7. Multiple cells with multiple beads but that is less common.
Doublets are divided further into two categories:
Heterotypic Doublets: When one bead capture two different cells where one could be an epithelial cell and one could be a T-cell.
Homotypic Doublets: When one bead capture two same cells, where both the cell are of the same type such as T-cells.
Sadly, at this stage we cannot do anything about homotypic doublet and this is an active area of research. Andrews Computation Biology Lab is working actively on this side of the single cell and ambient RNA as well. Anyone interested in this can go and check out their lab.
But their are a bunch of tools that have been developed which works efficiently on identifying these heterotypic doublets i.e Doublet finder, Scrublet.

Be ashamed to die until you have won some victory for humanity: Horace Mann

References

Aird, Daniel, Michael G Ross, Wei-Sheng Chen, Maxwell Danielsson, Timothy Fennell, Carsten Russ, David B Jaffe, Chad Nusbaum, and Andreas Gnirke. 2011. “Analyzing and Minimizing PCR Amplification Bias in Illumina Sequencing Libraries.” Genome Biology 12 (2): R18. https://doi.org/10.1186/gb-2011-12-2-r18.

Archer, Nathan, Mark D Walsh, Vahid Shahrezaei, and Daniel Hebenstreit. 2016. “Modeling Enzyme Processivity Reveals That RNA-Seq Libraries Are Biased in Characteristic and Correctable Ways.” Cell Systems 3 (5): 467–79.

He, Dongze, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, and Rob Patro. 2022. “Alevin-Fry Unlocks Rapid, Accurate and Memory-Frugal Quantification of Single-Cell RNA-Seq Data.” Nature Methods 19 (3): 316–22. https://doi.org/10.1038/s41592-022-01408-3.

Kaminow, Benjamin, Dinar Yunusov, and Alexander Dobin. 2021. “STARsolo: Accurate, Fast and Versatile Mapping/Quantification of Single-Cell and Single-Nucleus RNA-Seq Data.” http://dx.doi.org/10.1101/2021.05.05.442755.

Lun, Aaron T. L., Samantha Riesenfeld, Tallulah Andrews, The Phuong Dao, Tomas Gomes, and John C. Marioni. 2019. “EmptyDrops: Distinguishing Cells from Empty Droplets in Droplet-Based Single-Cell RNA Sequencing Data.” Genome Biology 20 (1). https://doi.org/10.1186/s13059-019-1662-y.

Melsted, Páll, A. Sina Booeshaghi, Lauren Liu, Fan Gao, Lambda Lu, Kyung Hoi Min, Eduardo da Veiga Beltrame, Kristján Eldjárn Hjörleifsson, Jase Gehring, and Lior Pachter. 2021. “Modular, Efficient and Constant-Memory Single-Cell RNA-Seq Preprocessing.” Nature Biotechnology 39 (7): 813–18. https://doi.org/10.1038/s41587-021-00870-2.

Niebler, Stefan, André Müller, Thomas Hankeln, and Bertil Schmidt. 2020. “RainDrop: Rapid Activation Matrix Computation for Droplet-Based Single-Cell RNA-Seq Reads.” BMC Bioinformatics 21 (1). https://doi.org/10.1186/s12859-020-03593-4.

Parekh, Swati, Christoph Ziegenhain, Beate Vieth, Wolfgang Enard, and Ines Hellmann. 2018. “zUMIs - A Fast and Flexible Pipeline to Process RNA Sequencing Data with UMIs.” GigaScience 7 (6). https://doi.org/10.1093/gigascience/giy059.

Soneson, Charlotte, Avi Srivastava, Rob Patro, and Michael B. Stadler. 2021. “Preprocessing Choices Affect RNA Velocity Results for Droplet scRNA-Seq Data.” Edited by Min Li. PLOS Computational Biology 17 (1): e1008585. https://doi.org/10.1371/journal.pcbi.1008585.

Srivastava, Avi, Hirak Sarkar, Nitish Gupta, and Rob Patro. 2016. “RapMap: A Rapid, Sensitive and Accurate Tool for Mapping RNA-Seq Reads to Transcriptomes.” Bioinformatics 32 (12): i192–200. https://doi.org/10.1093/bioinformatics/btw277.

Zheng, Grace X. Y., Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8 (1). https://doi.org/10.1038/ncomms14049.