Analysis Workflow

Here is the workflow of our method following the best practice of RNA-seq data analysis:


1. Trimming and Read alignment

Initially, Trimmomatic was used to trim adapter sequences, remove low-quality reads, and cut off poor-quality bases. The resulting high-quality reads were saved in the “00.TrimmedData” directory. Subsequently, FastQC was utilized to evaluate the quality of the raw sequencing reads as well as the clean reads. The quality control (QC) reports are located in the directory “01.FastqQualityCheck”. MutilQC was used to aggregate all the QC results into a single report named 01.FastqQualityCheck-multiqc_report.html. An example of the multiQC result can be checked Here.

After cleaning the reads, they were aligned to a reference transcriptome using STAR. The alignment files can be found in the “02.BamFiles” directory, named in the format ‘sample_ID.bam.’

2. Counting and normalization

Before counting and normalization, we employed picard to assess the quality of the data further by collecting metrics describing the distribution of the bases within the transcripts. The result of this step were stored in “03.rnaMetrics”. MultiQC was used to aggregate results from many samples into a single report named as 03.rnaMetrics-multiqc_report. An example of the multiQC report can be checked here.

The alignment BAM files were analyzed using StringTie to derive normalized FPKM/TPM counts. These resulting count matrices are stored as files in the “04.FPKM” folder. Concurrently, HTseq was used to generate raw read counts for genes. The raw count matrix were stored in folder “05.HTseq_count”. Counts from all samples within a project were aggregated to create count matrices respectively, which can be used for downstream analysis. An example of the aggregated TPM count file is as follows:

Table 1.example of a TPM matrix.


An example of the aggregated raw read count file is as follows:

Table 2.example of a raw count matrix.


3, Differential gene expression analysis

Differential gene expression analysis was conducted by DEseq2 with a default cutoff of log2 fold change 1 and p value 0.05. The results were stored as Treatment_vs_Control_DE folder under “06.DE-GO-KEGG”. An example of the significantly differential expressed gene summary file is as follows:

Table 3. Example of significantly differential expressed gene file

The contents of each column is as follows.

Column Content
EnsemblID Ensembl ID
gene Gene symbol
baseMean Mean of all samples
baseMean_Group1 Mean of Group1
baseMean_Group2 Mean of Group2
log2FoldChange log2 fold change between case and treatment
lfcSE Logfoldchange Standard Error
stat The value of the test statistic for the gene or transcript
pvalue P-value of the test for the gene or transcript
padj Adjusted P-value for multiple testing for the gene or transcript


4. Visualization of differential gene expression

We provided visualization of differential expression analysis in multiple ways. All the figures are stored in “06.DE-GO-KEGG”. The following are the main figures we provided, for additional figures you could check the folder for detail. We could also plot more customized figures if you need.

a. Principal component analysis(PCA) plot of the samples

PCA transforms the gene expression data into a set of linearly uncorrelated variables known as principal components (PCs). Each sample is represented as a point in a two-dimensional space defined by the first and second principal components, which capture the largest amounts of variance in the dataset. The positioning of the samples in this PCA plot reflects their similarities or differences in gene expression profiles. An example of PCA figure is as follows:

Figure 2. Example PCA plot


b. Sample-to-sample distance heatmap

A sample-to-sample heatmap illustrates the similarities and dissimilarities between samples based on their gene expression profiles. Both rows and columns represent the samples being analyzed. Each cell in the heatmap indicates the degree of similarity or difference in gene expression between a pair of samples. An example figure is as follows:

Figure 3. Example of Sample-to-sample distance heatmap


c. Volcano plot for all genes

Volcano plot is a type of scatter plot that summarizes both the fold-change in gene expression and the statistical significance derived from a gene-specific test. On this plot, the x-axis represents the log2-transformed fold changes in gene expression, while the y-axis shows the negative log10-transformed p-values. Data points with low p-values, indicating high statistical significance, and significant dysregulation in gene expression, tend to appear towards the upper left or upper right regions of the plot. An example figure is as follows:

Figure 4. Example of volcano plot


d. Gene level expression heatmap for all samples

A gene-level expression heatmap displays the expression levels of differential expressed genes across multiple samples. In this heatmap, rows represent differential expressed genes, while columns correspond to different samples.The color scale, ranging from blue (low expression) to red (high expression), provides a quick and intuitive way to identify genes that are differentially expressed across the samples. An example figure is as follows:

Figure 5. Example of gene level expression heatmap for all samples


We also provide gene-level expression heatmap for top100 differential expressed genes. An example figure is as follows:

Figure 5. Example of top100 gene level expression heatmap for all samples


e. MA-Plot

An MA-Plot is a scatter plot used for comparing two groups of samples in an experiment. It plots M-values (log-fold changes in gene expression levels between two samples) on the y-axis against A-values (average gene expression levels across both samples) on the x-axis. This plot is valuable for assessing the reproducibility of results between samples. It helps in determining whether normalization of data is necessary. An example figure is as follows:

Figure 1. Example of MA-Plot


5. Functional enrichment analysis

Once differential expressed genes have been identified, functional enrichment anlaysis were conducted by ClusterProfiler, which identify biological terms or pathways that are significantly overrepresented in the list of differentially expressed genes. Gene Ontology (GO) or KEGG pathways enrichment anlaysis were conducted. The resulting significantly enriched GO/KEGG pathways information are stored in “06.DE-GO-KEGG”. An example of enriched GO Biological process plot is as follows:

Figure 6. Example of enriched GO terms


An example of enriched KEGG pathway term plot is as follows:

Figure 7. Example of enriched KEGG pathway


6. Differential alternative splicing events analysis (Customized)

RNA splicing is the process of editing pre-mRNA by removing non-coding regions (introns) and connecting the coding sequences (exons), which is crucial for translating mRNA into proteins. For detecting alternative splicing events, we utilized rMATS (Multivariate Analysis of Transcript Splicing). This tool analyzes data to identify various types of alternative splicing events in the RNA. The output from rMATS includes several categories of Alternative Splicing events, identified from gene transfer format (GTF) files and RNA-seq data. The events include differential skipped exon(SE), alterinative 5’ splice site(A5SS), alterinative 3’ splice site(A3SS), mutually exclusive exons(MXE) and retained intron(RI). This analysis is customized with the result stored in 07.CustomizedAnalyses.

a. Example of the output that contains the list of events and read counts. Only splice junction reads are counted.


b. Example of the output that contains the list of events and read counts. Both splice junction reads and exon body reads are counted.


c. Example of the output of skipped exon (SE) events derived from GTF and RNA.


d. Example of the output of alternative 5’ splice sites (A5SS) events derived from GTF and RNA.


e. Example of the output of alternative 3’ splice sites (A3SS) events derived from GTF and RNA.


f. Example of the output of mutually exclusive exons (MXE) events derived from GTF and RNA.


g. Example of the output of retained intron (RI) events derived from GTF and RNA.

The content of shared columns is as follows.

Column Content
ID rMATS event id
GeneID Gene ID
geneSymbol Gene name
chr Chromosome
IJC_SAMPLE_1 Inclusion counts for sample 1. Replicates are comma separated
SJC_SAMPLE_1 Skipping counts for sample 1. Replicates are comma separated
IJC_SAMPLE_2 Inclusion counts for sample 2. Replicates are comma separated
SJC_SAMPLE_2 Skipping counts for sample 2. Replicates are comma separated
IncFormLen Length of inclusion form, used for normalization
SkipFormLen Length of skipping form, used for normalization
PValue Significance of splicing difference between the two sample groups. (Only available if the statistical model is on)
FDR False Discovery Rate calculated from p-value. (Only available if statistical model is on)
IncLevel1 Inclusion level for sample 1. Replicates are comma separated. Calculated from normalized counts
IncLevel2 Inclusion level for sample 2. Replicates are comma separated. Calculated from normalized counts
IncLevelDifference average(IncLevel1) - average(IncLevel2)


Event specific columns (event coordinates):

7. Appendix

Table 4. The List of software used in the analysis pipeline.

Software Version
FastQC 0.11.8
multiQC 1.18
Trimmomatic 0.38
STAR 2.7.1a
Picard 2.20.4
StringTie 2.0.4
HTSeq -
DEseq2 1.14.1
ClusterProfiler 3.18
rMATS 4.0.2


8. Citation

9, Contact us

Address: 126 Corporate Boulevard, South Plainfield, New Jersey 07080
Email:
Phone: 908-222-0533