1. Cell Ranger and Sample Aggregation

We use Cell Ranger to process the single-cell transcriptome data of each sample. Cell Ranger carries out several key tasks on the paired-end reads. First, it aligns the reads to the reference genome obtained from 10x Genomics, filters them, and counts barcodes and unique molecular identifiers (UMIs). This provides read mapping, quality control, and quantification of gene expression. Next, Cell Ranger uses the cellular barcodes introduced on the Chromium platform to generate feature-barcode matrices. It determines clusters based on these matrices and performs gene expression analysis, which enables the identification of distinct cell populations and the exploration of gene expression patterns.

We aggregate multiple samples using Cell Ranger's aggr pipeline. This pipeline combines data from multiple samples into an experiment-wide feature-barcode matrix, which allows a unified, comparative analysis across all samples. All subsequent analyses were performed on the aggregated data.
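As a rough illustration of how samples are handed to the aggr pipeline, the sketch below writes an aggregation CSV with the `sample_id` and `molecule_h5` columns used by recent Cell Ranger releases, and assembles the matching command line. The sample names follow Table 1. The file paths, the CSV file name `aggr.csv`, and the run ID `aggregated` are hypothetical placeholders, not the values used in this project.

```python
import csv
import io

# Hypothetical per-sample outputs of `cellranger count`; replace with real paths.
samples = {
    "Sample-01": "/path/to/Sample-01/outs/molecule_info.h5",
    "Sample-02": "/path/to/Sample-02/outs/molecule_info.h5",
}


def build_aggr_csv(sample_to_h5):
    """Return the text of an aggregation CSV with one row per sample."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", "molecule_h5"])
    for sample_id, h5_path in sample_to_h5.items():
        writer.writerow([sample_id, h5_path])
    return buffer.getvalue()


aggr_csv_text = build_aggr_csv(samples)

# Command that would be run in a shell once the CSV is saved as aggr.csv.
# --normalize=mapped subsamples reads so that samples end up with a
# comparable number of mapped reads per cell.
aggr_command = [
    "cellranger", "aggr",
    "--id=aggregated",
    "--csv=aggr.csv",
    "--normalize=mapped",
]

print(aggr_csv_text)
print(" ".join(aggr_command))
```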

2. Raw Data QC and Preprocessing

2.1. Sequencing Stats and Barcode Rank Plot

To assess the quality of single-cell RNA sequencing (scRNA-seq) data, we first examine the sequencing statistics presented in Table 1. These statistics include important metrics such as UMI counts, Q30 rates (indicating the percentage of high-quality base calls), and mapping rates to various genomic regions. These measures provide insights into the quality and reliability of the sequencing data.

Additionally, we review the barcode rank plot for each sample (not shown here), which is a useful tool for evaluating scRNA-seq data based on the UMI distribution. The plot ranks all barcodes by their total UMI count, from highest to lowest, on log-scaled axes. Barcodes on the upper plateau carry many UMIs and typically correspond to intact cells, whereas barcodes in the low-count tail mostly reflect background such as empty droplets. The knee point marks a sharp change in slope and signifies the transition from high-abundance to low-abundance barcodes. This point is crucial in determining the appropriate threshold for filtering out low-quality or background barcodes from the dataset. In an ideal sample, a plateau followed by a distinct knee and a steep cliff is expected, indicating that the cell-calling algorithm effectively distinguished intact cells from background barcodes.
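The idea of locating the transition between cell-associated and background barcodes can be sketched with a deliberately simple heuristic: rank barcodes by UMI count and look for the largest drop in log10 count between neighbouring ranks. This is an illustration only and is not the cell-calling algorithm implemented in Cell Ranger; the function name and the toy counts are assumptions made for the example.

```python
import math


def estimate_cell_count(umi_counts):
    """Rank barcodes by UMI count and return how many sit above the steepest drop.

    Simple heuristic: the cut is placed at the largest decrease in log10 UMI
    count between two neighbouring ranks.
    """
    ranked = sorted((c for c in umi_counts if c > 0), reverse=True)
    if len(ranked) < 2:
        return len(ranked)
    drops = [
        math.log10(ranked[i]) - math.log10(ranked[i + 1])
        for i in range(len(ranked) - 1)
    ]
    steepest = max(range(len(drops)), key=drops.__getitem__)
    return steepest + 1  # number of barcodes ranked above the cliff


# Toy data: five cell-like barcodes followed by low-count background barcodes.
toy_counts = [5000, 4800, 4500, 4200, 4000, 60, 50, 40, 30, 20]
n_cells = estimate_cell_count(toy_counts)
print(n_cells)
```

On real data the curve is noisier, so a production method would need smoothing or a statistical model of the background rather than a single largest-drop rule.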

By considering these sequencing statistics and analyzing the barcode rank plot, we can thoroughly evaluate the quality and characteristics of the scRNA-seq data, which is essential for subsequent analyses and interpretations.

Table 1 Sequencing and Mapping Stats

Aggregation

Pre-Normalization Total Number of Reads 401,690,202
Post-Normalization Total Number of Reads 389,670,221
Pre-Normalization Mean Reads per Cell 26,788
Post-Normalization Mean Reads per Cell 25,987
Fraction of Reads Kept (Sample-01) 100%
Fraction of Reads Kept (Sample-02) 94.4%
Pre-Normalization Total Reads per Cell (Sample-01) 25,790
Pre-Normalization Total Reads per Cell (Sample-02) 27,709
Pre-Normalization Confidently Mapped Barcoded Reads per Cell (Sample-01) 14,874
Pre-Normalization Confidently Mapped Barcoded Reads per Cell (Sample-02) 15,750

Cells

Estimated Number of Cells 14,995
Fraction Reads in Cells 92.4%
Median Genes per Cell 1,587
Median UMI Counts per Cell 4,292

Chemistry Batch Correction

Batch Effect Score Before Correction 1.35863
Batch Effect Score After Correction 1.653769
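The aggregation figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below assumes that depth normalization downsamples every sample to the confidently mapped reads per cell of the shallowest sample; under that assumption it reproduces the reported fractions of reads kept (100% and 94.4%) and the mean reads per cell before and after normalization. Variable names are illustrative.

```python
# Values taken from Table 1.
mapped_reads_per_cell = {"Sample-01": 14_874, "Sample-02": 15_750}
total_reads_pre = 401_690_202
total_reads_post = 389_670_221
estimated_cells = 14_995

# Assumption: every sample is downsampled to the depth of the shallowest one.
target_depth = min(mapped_reads_per_cell.values())
fraction_kept = {
    sample: target_depth / depth
    for sample, depth in mapped_reads_per_cell.items()
}

mean_reads_pre = total_reads_pre / estimated_cells
mean_reads_post = total_reads_post / estimated_cells

for sample, fraction in fraction_kept.items():
    print(f"{sample}: {fraction:.1%} of reads kept")
print(f"Mean reads per cell: {mean_reads_pre:.0f} (pre), {mean_reads_post:.0f} (post)")
```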


2.2. The Percentage of Mitochondrial Genes

Next, we calculated, for each cell, the percentage of counts that originate from mitochondrial genes (%mito) using Seurat. Mitochondrial genes code for proteins involved in cellular respiration, and an elevated share of mitochondrial transcripts can indicate cell stress, cell damage, or technical artifacts introduced during library preparation. A higher %mito may therefore suggest lower data quality or the presence of compromised cells. Conversely, a lower %mito suggests better data quality, as a larger proportion of transcripts originates from nuclear genes.
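To make the %mito definition concrete, the following sketch computes it for a single cell from a gene-to-count mapping. It assumes mouse gene symbols, where mitochondrial genes carry the `mt-` prefix; in Seurat the same quantity is typically obtained with `PercentageFeatureSet`. The function name and the toy counts are invented for illustration and are not taken from this dataset.

```python
def percent_mito(gene_counts, prefix="mt-"):
    """Percentage of a cell's UMI counts that come from mitochondrial genes."""
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0
    mito = sum(
        count
        for gene, count in gene_counts.items()
        if gene.lower().startswith(prefix)
    )
    return 100.0 * mito / total


# Toy cells with made-up counts.
healthy_cell = {"Actb": 120, "Gapdh": 70, "mt-Co1": 8, "mt-Nd1": 2}
stressed_cell = {"Actb": 30, "Gapdh": 20, "mt-Co1": 35, "mt-Nd1": 15}

print(percent_mito(healthy_cell))
print(percent_mito(stressed_cell))
```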

Figure 1 Violin plots with dots representing the number of genes identified per cell, the number of UMIs, and the percent of reads that are indicated to be mitochondrial in origin.

3. Dimensionality Reduction and Cell-type Assignment

After conducting quality checks on the data, we perform two common types of dimensionality reduction using the Seurat R package, namely t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). As shown in Figure 2, both t-SNE and UMAP generate plots in which each data point represents an individual cell, and cells with similar gene expression profiles are located closer to each other.
Afterward, we employ our algorithm, developed with the assistance of the CellMarker 2.0 database, to identify gene markers associated with your gene of interest, and we use Seurat to identify the corresponding clusters. This approach enables the discovery of genes that are specifically expressed within cell types or subpopulations, providing insight into the biological characteristics and functions of these cells.

Figure 2 Dimensionality Reduction using UMAP (top) and t-SNE (bottom)

4. Differential Expression Analysis

After cell-type assignment, differential expression (DE) analysis was performed with Seurat for each cell type. Only genes meeting two criteria were retained for further analysis: an absolute log fold change of 0.2 or greater, and detection in at least 10% of cells in either of the two populations being compared. Figure 3 shows the up- and down-regulated genes for each cell type.
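The two retention criteria can be expressed as a small filter. The sketch below is an illustration in plain Python rather than the Seurat call used in the analysis; the record fields (`log_fc`, `pct_1`, `pct_2`), the function name, and the gene names and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DEResult:
    gene: str
    log_fc: float  # log fold change between the two populations
    pct_1: float   # fraction of cells expressing the gene in population 1
    pct_2: float   # fraction of cells expressing the gene in population 2


def passes_filter(result, min_abs_log_fc=0.2, min_pct=0.10):
    """Keep genes with |logFC| >= 0.2 that are detected in >= 10% of cells
    in at least one of the two populations."""
    strong_enough = abs(result.log_fc) >= min_abs_log_fc
    detected_enough = max(result.pct_1, result.pct_2) >= min_pct
    return strong_enough and detected_enough


# Made-up results for four placeholder genes.
results = [
    DEResult("GeneA", 0.85, 0.60, 0.20),   # kept, up-regulated
    DEResult("GeneB", -0.40, 0.05, 0.30),  # kept, down-regulated
    DEResult("GeneC", 0.10, 0.50, 0.50),   # dropped: fold change too small
    DEResult("GeneD", 1.20, 0.04, 0.02),   # dropped: detected in too few cells
]

kept = [r for r in results if passes_filter(r)]
up_regulated = [r.gene for r in kept if r.log_fc > 0]
down_regulated = [r.gene for r in kept if r.log_fc < 0]

print(up_regulated, down_regulated)
```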

Figure 3 Up-Regulated (Left) and Down-Regulated (Right) Differentially Expressed Genes

5. Heatmap and Enrichment Analysis

After the differential expression analysis, we identified a set of significantly differentially expressed genes for each cell type and selected the top N genes from this set for further analysis. Figure 4 depicts the expression values of these top N genes in each cell type. We then used the same gene set for enrichment analysis, carried out with the R package clusterProfiler with a specific focus on KEGG pathways. Figure 5 presents the outcome: the pathways found to be significantly enriched in the selected cell types, together with their -log10 P-values, which allow statistical significance to be assessed (larger values indicate stronger evidence of enrichment). This visualization highlights the pathways that are significantly over-represented among the top genes of specific cell types.
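Over-representation tests of this kind are commonly based on the hypergeometric distribution: given a background of genes, a pathway of a certain size, and a selected gene list, the P-value is the probability of seeing at least the observed overlap by chance. The sketch below works through that calculation and the -log10 transform used for plotting. It illustrates the general principle and is not a reimplementation of clusterProfiler; the function name and all counts are invented for the example.

```python
from math import comb, log10


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn from n_background
    genes, of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 in the pathway,
# 4 selected genes of which 3 fall in the pathway.
p_value = hypergeom_enrichment_p(
    n_background=20, n_pathway=5, n_selected=4, n_overlap=3
)
neg_log10_p = -log10(p_value)

print(f"P = {p_value:.4f}, -log10(P) = {neg_log10_p:.2f}")
```

In practice the P-values are also adjusted for multiple testing across pathways before significance is called.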

Figure 4 Expression value for top N genes in each cell-type


Figure 5 Enrichment Analysis using KEGG Pathways

6. Pseudotime and Trajectory Analysis

Single-cell pseudotime and trajectory analysis are computational methods used to infer the developmental progression and lineage relationships of individual cells within a population. These approaches use single-cell transcriptomic data to infer the temporal ordering of cells and the relationships between them, enabling the reconstruction of developmental trajectories and the identification of key regulatory events during cellular development. Pseudotime analysis orders cells along a hypothetical trajectory that represents their progression from an undifferentiated state to a mature cell type. The left plot in Figure 6 displays the pseudotime analysis performed with the R package Slingshot. Trajectory analysis maps the branching patterns and interconnections between different cell lineages, providing insight into cellular differentiation and dynamic processes such as cell fate decisions and cell state transitions. We performed the trajectory analysis with the Monocle2 package; the result is shown in the right plot in Figure 6.


Figure 6 Pseudotime (Left) and Trajectory (Right) Analysis

7. Conclusions

Overall, the analysis was completed successfully. All supporting documents, including the raw data, have been transferred to you; we expect these to be useful for further validation and for pursuing your key research questions.

8. Citation

Appendix

Table 2: List of software used in the analysis pipeline.

Software Version
Cellranger 7.1.10
R-Seurat 3.2.2
R-clusterProfiler 3.18.1
R-slingshot 1.8.0
Reference genome and annotation mm10
R 4.2.0

Table 3: List of important files from the data analysis

Path Description
01.CellRange Cellranger output files
02.QC-Analysis QC plots and tables
03.HeatMap-Analysis Heatmap analysis of the top 10 genes
04.Clustering-Analysis Plots and tables for the clustering analysis performed using t-SNE and UMAP
05.DE-Analysis Plots and tables related to the differential expression analysis, including KEGG
05.DE-Analysis/*-DE-genes.csv Differentially expressed genes per cell type

9. Contact Us

Address: 126 Corporate Boulevard, South Plainfield, New Jersey 07080
Email:
Phone: 908-222-0533