RNAseq_SampleReport.knit

1. Trimming and Read alignment

3. Differential gene expression analysis

4. Visualization of differential gene expression

5. Functional enrichment analysis

6. Differential alternative splicing events analysis (Customized)

Analysis Workflow

Here is the workflow of our method following the best practice of RNA-seq data analysis:

1. Trimming and Read alignment

Initially, Trimmomatic was used to trim adapter sequences, remove low-quality reads, and cut off poor-quality bases. The resulting high-quality reads were saved in the “00.TrimmedData” directory. Subsequently, FastQC was utilized to evaluate the quality of the raw sequencing reads as well as the clean reads. The quality control (QC) reports are located in the directory “01.FastqQualityCheck”. MutilQC was used to aggregate all the QC results into a single report named 01.FastqQualityCheck-multiqc_report.html. An example of the multiQC result can be checked Here.

After cleaning the reads, they were aligned to a reference transcriptome using STAR. The alignment files can be found in the “02.BamFiles” directory, named in the format ‘sample_ID.bam.’

2. Counting and normalization

Before counting and normalization, we employed picard to assess the quality of the data further by collecting metrics describing the distribution of the bases within the transcripts. The result of this step were stored in “03.rnaMetrics”. MultiQC was used to aggregate results from many samples into a single report named as 03.rnaMetrics-multiqc_report. An example of the multiQC report can be checked here.

The alignment BAM files were analyzed using StringTie to derive normalized FPKM/TPM counts. These resulting count matrices are stored as files in the “04.FPKM” folder. Concurrently, HTseq was used to generate raw read counts for genes. The raw count matrix were stored in folder “05.HTseq_count”. Counts from all samples within a project were aggregated to create count matrices respectively, which can be used for downstream analysis. An example of the aggregated TPM count file is as follows:

Table 1.example of a TPM matrix.

An example of the aggregated raw read count file is as follows:

Table 2.example of a raw count matrix.

3, Differential gene expression analysis

Differential gene expression analysis was conducted by DEseq2 with a default cutoff of log2 fold change 1 and p value 0.05. The results were stored as Treatment_vs_Control_DE folder under “06.DE-GO-KEGG”. An example of the significantly differential expressed gene summary file is as follows:

Table 3. Example of significantly differential expressed gene file

The contents of each column is as follows.

Column	Content
EnsemblID	Ensembl ID
gene	Gene symbol
baseMean	Mean of all samples
baseMean_Group1	Mean of Group1
baseMean_Group2	Mean of Group2
log2FoldChange	log2 fold change between case and treatment
lfcSE	Logfoldchange Standard Error
stat	The value of the test statistic for the gene or transcript
pvalue	P-value of the test for the gene or transcript
padj	Adjusted P-value for multiple testing for the gene or transcript

4. Visualization of differential gene expression

We provided visualization of differential expression analysis in multiple ways. All the figures are stored in “06.DE-GO-KEGG”. The following are the main figures we provided, for additional figures you could check the folder for detail. We could also plot more customized figures if you need.

a. Principal component analysis(PCA) plot of the samples

PCA transforms the gene expression data into a set of linearly uncorrelated variables known as principal components (PCs). Each sample is represented as a point in a two-dimensional space defined by the first and second principal components, which capture the largest amounts of variance in the dataset. The positioning of the samples in this PCA plot reflects their similarities or differences in gene expression profiles. An example of PCA figure is as follows:

Figure 2. Example PCA plot

b. Sample-to-sample distance heatmap

A sample-to-sample heatmap illustrates the similarities and dissimilarities between samples based on their gene expression profiles. Both rows and columns represent the samples being analyzed. Each cell in the heatmap indicates the degree of similarity or difference in gene expression between a pair of samples. An example figure is as follows:

Figure 3. Example of Sample-to-sample distance heatmap

c. Volcano plot for all genes

Volcano plot is a type of scatter plot that summarizes both the fold-change in gene expression and the statistical significance derived from a gene-specific test. On this plot, the x-axis represents the log2-transformed fold changes in gene expression, while the y-axis shows the negative log10-transformed p-values. Data points with low p-values, indicating high statistical significance, and significant dysregulation in gene expression, tend to appear towards the upper left or upper right regions of the plot. An example figure is as follows:

Figure 4. Example of volcano plot

d. Gene level expression heatmap for all samples

A gene-level expression heatmap displays the expression levels of differential expressed genes across multiple samples. In this heatmap, rows represent differential expressed genes, while columns correspond to different samples.The color scale, ranging from blue (low expression) to red (high expression), provides a quick and intuitive way to identify genes that are differentially expressed across the samples. An example figure is as follows:

Figure 5. Example of gene level expression heatmap for all samples

We also provide gene-level expression heatmap for top100 differential expressed genes. An example figure is as follows:

Figure 5. Example of top100 gene level expression heatmap for all samples

e. MA-Plot

An MA-Plot is a scatter plot used for comparing two groups of samples in an experiment. It plots M-values (log-fold changes in gene expression levels between two samples) on the y-axis against A-values (average gene expression levels across both samples) on the x-axis. This plot is valuable for assessing the reproducibility of results between samples. It helps in determining whether normalization of data is necessary. An example figure is as follows:

Figure 1. Example of MA-Plot

5. Functional enrichment analysis

Once differential expressed genes have been identified, functional enrichment anlaysis were conducted by ClusterProfiler, which identify biological terms or pathways that are significantly overrepresented in the list of differentially expressed genes. Gene Ontology (GO) or KEGG pathways enrichment anlaysis were conducted. The resulting significantly enriched GO/KEGG pathways information are stored in “06.DE-GO-KEGG”. An example of enriched GO Biological process plot is as follows:

Figure 6. Example of enriched GO terms

An example of enriched KEGG pathway term plot is as follows:

Figure 7. Example of enriched KEGG pathway

6. Differential alternative splicing events analysis (Customized)

RNA splicing is the process of editing pre-mRNA by removing non-coding regions (introns) and connecting the coding sequences (exons), which is crucial for translating mRNA into proteins. For detecting alternative splicing events, we utilized rMATS (Multivariate Analysis of Transcript Splicing). This tool analyzes data to identify various types of alternative splicing events in the RNA. The output from rMATS includes several categories of Alternative Splicing events, identified from gene transfer format (GTF) files and RNA-seq data. The events include differential skipped exon(SE), alterinative 5’ splice site(A5SS), alterinative 3’ splice site(A3SS), mutually exclusive exons(MXE) and retained intron(RI). This analysis is customized with the result stored in 07.CustomizedAnalyses.

a. Example of the output that contains the list of events and read counts. Only splice junction reads are counted.

b. Example of the output that contains the list of events and read counts. Both splice junction reads and exon body reads are counted.

c. Example of the output of skipped exon (SE) events derived from GTF and RNA.

d. Example of the output of alternative 5’ splice sites (A5SS) events derived from GTF and RNA.

e. Example of the output of alternative 3’ splice sites (A3SS) events derived from GTF and RNA.

f. Example of the output of mutually exclusive exons (MXE) events derived from GTF and RNA.

g. Example of the output of retained intron (RI) events derived from GTF and RNA.

The content of shared columns is as follows.

Column	Content
ID	rMATS event id
GeneID	Gene ID
geneSymbol	Gene name
chr	Chromosome
IJC_SAMPLE_1	Inclusion counts for sample 1. Replicates are comma separated
SJC_SAMPLE_1	Skipping counts for sample 1. Replicates are comma separated
IJC_SAMPLE_2	Inclusion counts for sample 2. Replicates are comma separated
SJC_SAMPLE_2	Skipping counts for sample 2. Replicates are comma separated
IncFormLen	Length of inclusion form, used for normalization
SkipFormLen	Length of skipping form, used for normalization
PValue	Significance of splicing difference between the two sample groups. (Only available if the statistical model is on)
FDR	False Discovery Rate calculated from p-value. (Only available if statistical model is on)
IncLevel1	Inclusion level for sample 1. Replicates are comma separated. Calculated from normalized counts
IncLevel2	Inclusion level for sample 2. Replicates are comma separated. Calculated from normalized counts
IncLevelDifference	average(IncLevel1) - average(IncLevel2)

Event specific columns (event coordinates):

SE: exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE
- The inclusion form includes the target exon (exonStart_0base, exonEnd)
MXE: 1stExonStart_0base 1stExonEnd 2ndExonStart_0base 2ndExonEnd upstreamES upstreamEE downstreamES downstreamEE
- If the strand is +, then the inclusion form includes the 1st exon (1stExonStart_0base, 1stExonEnd) and skips the 2nd exon If the strand is -, then the inclusion form includes the 2nd exon (2ndExonStart_0base, 2ndExonEnd) and skips the 1st exon
A3SS, A5SS: longExonStart_0base longExonEnd shortES shortEE flankingES flankingEE
- The inclusion form includes the long exon (longExonStart_0base, longExonEnd) instead of the short exon (shortES shortEE)
RI: riExonStart_0base riExonEnd upstreamES upstreamEE downstreamES downstreamEE
- The inclusion form includes (retains) the intron (upstreamEE, downstreamES)

7. Appendix

Table 4. The List of software used in the analysis pipeline.

Software	Version
FastQC	0.11.8
multiQC	1.18
Trimmomatic	0.38
STAR	2.7.1a
Picard	2.20.4
StringTie	2.0.4
HTSeq	-
DEseq2	1.14.1
ClusterProfiler	3.18
rMATS	4.0.2

8. Citation

9, Contact us

Address: 126 Corporate Boulevard, South Plainfield, New Jersey 07080
Email: custom-services@admerahealth.com
Phone: 908-222-0533

RNA Analysis Report

November 22, 2023

Content

Analysis Workflow

1. Trimming and Read alignment

2. Counting and normalization

3. Differential gene expression analysis

4. Visualization of differential gene expression

5. Functional enrichment analysis

6. Differential alternative splicing events analysis (Customized)

7. Appendix

8. Contact US

Analysis Workflow

1. Trimming and Read alignment

2. Counting and normalization

3, Differential gene expression analysis

4. Visualization of differential gene expression

a. Principal component analysis(PCA) plot of the samples

b. Sample-to-sample distance heatmap

c. Volcano plot for all genes

d. Gene level expression heatmap for all samples

e. MA-Plot

5. Functional enrichment analysis

6. Differential alternative splicing events analysis (Customized)

7. Appendix

8. Citation

9, Contact us