RNA Analysis Report
Here is the workflow of our method following the best practice of
RNA-seq data analysis:
Initially, Trimmomatic was used to trim adapter sequences, remove low-quality reads, and cut off poor-quality bases. The resulting high-quality reads were saved in the “00.TrimmedData” directory. Subsequently, FastQC was utilized to evaluate the quality of the raw sequencing reads as well as the clean reads. The quality control (QC) reports are located in the directory “01.FastqQualityCheck”. MutilQC was used to aggregate all the QC results into a single report named 01.FastqQualityCheck-multiqc_report.html. An example of the multiQC result can be checked Here.
After cleaning the reads, they were aligned to a reference
transcriptome using STAR. The alignment files can be found in the
“02.BamFiles” directory, named in the format
‘sample_ID.bam.’
Before counting and normalization, we employed picard to assess the quality of the data further by collecting metrics describing the distribution of the bases within the transcripts. The result of this step were stored in “03.rnaMetrics”. MultiQC was used to aggregate results from many samples into a single report named as 03.rnaMetrics-multiqc_report. An example of the multiQC report can be checked here.
The alignment BAM files were analyzed using StringTie to derive normalized FPKM/TPM counts. These resulting count matrices are stored as files in the “04.FPKM” folder. Concurrently, HTseq was used to generate raw read counts for genes. The raw count matrix were stored in folder “05.HTseq_count”. Counts from all samples within a project were aggregated to create count matrices respectively, which can be used for downstream analysis. An example of the aggregated TPM count file is as follows:
Table 1.example of a TPM matrix.
An example of the aggregated raw read count file is as
follows:
Table 2.example of a raw count matrix.
Differential gene expression analysis was conducted by DEseq2 with a default cutoff of log2 fold change 1 and p value 0.05. The results were stored as Treatment_vs_Control_DE folder under “06.DE-GO-KEGG”. An example of the significantly differential expressed gene summary file is as follows:
Table 3. Example of significantly differential expressed gene file
The contents of each column is as follows.
Column | Content |
---|---|
EnsemblID | Ensembl ID |
gene | Gene symbol |
baseMean | Mean of all samples |
baseMean_Group1 | Mean of Group1 |
baseMean_Group2 | Mean of Group2 |
log2FoldChange | log2 fold change between case and treatment |
lfcSE | Logfoldchange Standard Error |
stat | The value of the test statistic for the gene or transcript |
pvalue | P-value of the test for the gene or transcript |
padj | Adjusted P-value for multiple testing for the gene or transcript |
We provided visualization of differential expression analysis in multiple ways. All the figures are stored in “06.DE-GO-KEGG”. The following are the main figures we provided, for additional figures you could check the folder for detail. We could also plot more customized figures if you need.
PCA transforms the gene expression data into a set of linearly uncorrelated variables known as principal components (PCs). Each sample is represented as a point in a two-dimensional space defined by the first and second principal components, which capture the largest amounts of variance in the dataset. The positioning of the samples in this PCA plot reflects their similarities or differences in gene expression profiles. An example of PCA figure is as follows:
Figure 2. Example PCA plot
A sample-to-sample heatmap illustrates the similarities and dissimilarities between samples based on their gene expression profiles. Both rows and columns represent the samples being analyzed. Each cell in the heatmap indicates the degree of similarity or difference in gene expression between a pair of samples. An example figure is as follows:
Figure 3. Example of Sample-to-sample distance heatmap
Volcano plot is a type of scatter plot that summarizes both the fold-change in gene expression and the statistical significance derived from a gene-specific test. On this plot, the x-axis represents the log2-transformed fold changes in gene expression, while the y-axis shows the negative log10-transformed p-values. Data points with low p-values, indicating high statistical significance, and significant dysregulation in gene expression, tend to appear towards the upper left or upper right regions of the plot. An example figure is as follows:
Figure 4. Example of volcano plot
A gene-level expression heatmap displays the expression levels of differential expressed genes across multiple samples. In this heatmap, rows represent differential expressed genes, while columns correspond to different samples.The color scale, ranging from blue (low expression) to red (high expression), provides a quick and intuitive way to identify genes that are differentially expressed across the samples. An example figure is as follows:
Figure 5. Example of gene level expression heatmap for all samples
Figure 5. Example of top100 gene level expression heatmap for all samples
An MA-Plot is a scatter plot used for comparing two groups of samples in an experiment. It plots M-values (log-fold changes in gene expression levels between two samples) on the y-axis against A-values (average gene expression levels across both samples) on the x-axis. This plot is valuable for assessing the reproducibility of results between samples. It helps in determining whether normalization of data is necessary. An example figure is as follows:
Figure 1. Example of MA-Plot
Once differential expressed genes have been identified, functional enrichment anlaysis were conducted by ClusterProfiler, which identify biological terms or pathways that are significantly overrepresented in the list of differentially expressed genes. Gene Ontology (GO) or KEGG pathways enrichment anlaysis were conducted. The resulting significantly enriched GO/KEGG pathways information are stored in “06.DE-GO-KEGG”. An example of enriched GO Biological process plot is as follows:
Figure 6. Example of enriched GO terms
An example of enriched KEGG pathway term plot is as follows:
Figure 7. Example of enriched KEGG pathway
RNA splicing is the process of editing pre-mRNA by removing
non-coding regions (introns) and connecting the coding sequences
(exons), which is crucial for translating mRNA into proteins. For
detecting alternative splicing events, we utilized rMATS (Multivariate
Analysis of Transcript Splicing). This tool analyzes data to identify
various types of alternative splicing events in the RNA. The output from
rMATS includes several categories of Alternative Splicing events,
identified from gene transfer format (GTF) files and RNA-seq data. The
events include differential skipped exon(SE), alterinative 5’ splice
site(A5SS), alterinative 3’ splice site(A3SS), mutually exclusive
exons(MXE) and retained intron(RI). This analysis is customized with the
result stored in 07.CustomizedAnalyses.
a.
Example of the output that contains the list of events and read counts.
Only splice junction reads are counted.
b. Example of the output that contains the list of events and
read counts. Both splice junction reads and exon body reads are
counted.
c. Example of the output of skipped exon (SE) events derived
from GTF and RNA.
d. Example of the output of alternative 5’ splice sites (A5SS)
events derived from GTF and RNA.
e. Example of the output of alternative 3’ splice sites (A3SS)
events derived from GTF and RNA.
f. Example of the output of mutually exclusive exons (MXE)
events derived from GTF and RNA.
g. Example of the output of retained intron (RI) events derived
from GTF and RNA.
The content of shared columns is as follows.
Column | Content |
---|---|
ID | rMATS event id |
GeneID | Gene ID |
geneSymbol | Gene name |
chr | Chromosome |
IJC_SAMPLE_1 | Inclusion counts for sample 1. Replicates are comma separated |
SJC_SAMPLE_1 | Skipping counts for sample 1. Replicates are comma separated |
IJC_SAMPLE_2 | Inclusion counts for sample 2. Replicates are comma separated |
SJC_SAMPLE_2 | Skipping counts for sample 2. Replicates are comma separated |
IncFormLen | Length of inclusion form, used for normalization |
SkipFormLen | Length of skipping form, used for normalization |
PValue | Significance of splicing difference between the two sample groups. (Only available if the statistical model is on) |
FDR | False Discovery Rate calculated from p-value. (Only available if statistical model is on) |
IncLevel1 | Inclusion level for sample 1. Replicates are comma separated. Calculated from normalized counts |
IncLevel2 | Inclusion level for sample 2. Replicates are comma separated. Calculated from normalized counts |
IncLevelDifference | average(IncLevel1) - average(IncLevel2) |
Event specific columns (event coordinates):
Table 4. The List of software used in the analysis pipeline.
Software | Version |
---|---|
FastQC | 0.11.8 |
multiQC | 1.18 |
Trimmomatic | 0.38 |
STAR | 2.7.1a |
Picard | 2.20.4 |
StringTie | 2.0.4 |
HTSeq | - |
DEseq2 | 1.14.1 |
ClusterProfiler | 3.18 |
rMATS | 4.0.2 |
Address: 126 Corporate Boulevard, South Plainfield, New Jersey 07080
Email: custom-services@admerahealth.com
Phone:
908-222-0533