Pooled CRISPR Screening Analysis Report
We analyzed the sequencing data with our pooled CRISPR screening
single guide RNA (sgRNA) analysis pipeline, which combines established
computational tools with our in-house computational methods. The
workflow is as follows:
Tips: We provide analysis services for CRISPR screens that use either a single-sgRNA or a dual-sgRNA CRISPR interference (CRISPRi) library.
Initially, the paired FASTQ files were merged and adapters were trimmed using Cutadapt. Read quality was then assessed with FastQC. For guidance on interpreting the FastQC report, see http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Each sample has two FastQC output files: ‘sample_ID.fastqc.zip’ for the visualization files and ‘sample_ID.fastqc.html’ for the summary. The QC reports are in the directory “01.fastqc”.
The cleaned and processed reads were stored in “02.cleanReads”. For
data from single-sgRNA CRISPRi libraries, the reads were mapped to a
reference of sgRNA sequences using Bowtie2. For data from dual-sgRNA
CRISPRi libraries, the mapping was done with our in-house Python
script. The alignments are provided as ‘sample_ID.bam’ files in the
directory “03.bam”, which also contains the mapping statistics.
The BAM files were processed with Samtools and our custom scripts to count the reads mapped to each sgRNA for single-sgRNA library data. For dual-sgRNA library data, an in-house Python script counted the reads for each sgRNA pair. The resulting count matrices are stored as sample_ID.gRNA.count.frac.txt in the “04.count” folder.
Specifically, each output file reports the number of perfectly matched reads, the number of reads with at most one mismatch, and the number of reads with at most two mismatches, together with the fraction corresponding to each of these counts. An example of the sgRNA read count file is as follows:
Table 1. Example of sgRNA count matrix
The contents of each column are as follows.
Column | Content |
---|---|
sgRNAID | sgRNA ID |
sequence | sequence of sgRNA |
mm0 | count of perfectly matched reads |
mm1 | count of reads with one or no mismatch |
mm2 | count of reads with two or fewer mismatches |
mm0_frac | fraction of perfectly matched reads |
mm1_frac | fraction of reads with one or no mismatch |
mm2_frac | fraction of reads with two or fewer mismatches |
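The cumulative mismatch columns can be illustrated with a short Python sketch. It is not the pipeline's implementation (which relies on Bowtie2, Samtools, and in-house scripts); it is a brute-force toy that assigns each read to its closest sgRNA by Hamming distance. The function names, the toy library, and the toy reads are hypothetical, and the definition of the `*_frac` columns used here (each sgRNA's count divided by the column total) is an assumption, since the report does not state the denominator.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def count_with_mismatches(reads, library, max_mm=2):
    """Count reads per sgRNA at cumulative mismatch thresholds 0..max_mm.

    `library` maps sgRNA ID -> sgRNA sequence. A read is assigned to its
    single closest sgRNA; reads that tie between two sgRNAs, or whose best
    match has more than `max_mm` mismatches, are not counted.
    """
    counts = {sgrna_id: [0] * (max_mm + 1) for sgrna_id in library}
    for read in reads:
        # Distance to every sgRNA of the same length, closest first.
        dists = sorted(
            (hamming(read, seq), sgrna_id)
            for sgrna_id, seq in library.items()
            if len(seq) == len(read)
        )
        if not dists or dists[0][0] > max_mm:
            continue
        if len(dists) > 1 and dists[1][0] == dists[0][0]:
            continue  # ambiguous between two sgRNAs
        best_mm, best_id = dists[0]
        # A read with d mismatches counts toward every threshold >= d.
        for k in range(best_mm, max_mm + 1):
            counts[best_id][k] += 1
    return counts


def fractions(counts):
    """Divide each count by its column total (ASSUMED meaning of *_frac)."""
    n_cols = len(next(iter(counts.values())))
    totals = [sum(row[k] for row in counts.values()) for k in range(n_cols)]
    return {
        sgrna_id: [c / t if t else 0.0 for c, t in zip(row, totals)]
        for sgrna_id, row in counts.items()
    }


# Toy library and reads, made up for illustration only.
library = {"sgA": "ACGTACGTAC", "sgB": "TTTTGGGGCC"}
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTAGGGGCA", "GGGGGGGGGG"]
counts = count_with_mismatches(reads, library)  # lists are [mm0, mm1, mm2]
fracs = fractions(counts)
```

Because the thresholds are cumulative, mm0 <= mm1 <= mm2 holds for every sgRNA, which is a quick sanity check when inspecting a count file.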
We ranked sgRNAs and their target genes by their effect on the phenotype of interest, which is typically defined by the stimuli and conditions in the study design. The ranking was done with Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout (MAGeCK). MAGeCK uses a modified robust ranking aggregation (RRA) algorithm to identify positively and negatively selected genes between two conditions. The central quantity is the RRA score: a smaller score means that more of a gene's sgRNAs rank near the top of the sgRNA list than expected by chance, indicating a stronger effect on the phenotype. The significance of each RRA score is quantified by a corresponding p-value.
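The rank-aggregation idea behind the RRA score can be illustrated with a short Python sketch. It implements the original, unmodified RRA statistic (the minimum order-statistic p-value over a gene's sgRNA percentile ranks), not MAGeCK's modified version or its permutation-based p-values. The function names and the example percentile values are hypothetical. In this formulation, a smaller value means the gene's sgRNAs sit closer to the top of the sgRNA list.

```python
from math import comb


def order_stat_pvalue(u: float, k: int, n: int) -> float:
    """P(the k-th smallest of n independent Uniform(0,1) values is <= u)."""
    return sum(comb(n, j) * u**j * (1 - u) ** (n - j) for j in range(k, n + 1))


def rra_rho(percentiles):
    """Unmodified RRA statistic for one gene.

    `percentiles` are the gene's sgRNA ranks divided by the total number
    of sgRNAs (values in (0, 1], small = near the top of the list).
    """
    ranked = sorted(percentiles)
    n = len(ranked)
    return min(order_stat_pvalue(u, k, n) for k, u in enumerate(ranked, start=1))


# Hypothetical genes with four sgRNAs each (illustrative values only).
clustered_at_top = [0.01, 0.02, 0.03, 0.50]
spread_out = [0.20, 0.40, 0.60, 0.80]

rho_top = rra_rho(clustered_at_top)
rho_spread = rra_rho(spread_out)
```

Here `rho_top` is much smaller than `rho_spread`, because three of the four sgRNAs of the first gene fall within the top 3% of the list.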
The sgRNA ranking result is stored as Treatment_vs_Control.sgRNA_summary.txt in “05.rankResult”; for each sgRNA it contains the read counts in the selected samples, the log2 fold change, the p-value, and related statistics. The gene ranking result is stored as Treatment_vs_Control.gene_summary.txt in “05.rankResult”. An example of the gene summary file is as follows:
Table 2. Example of gene summary file
The contents of each column are as follows.
Column | Content |
---|---|
id | Gene ID |
num | The number of targeting sgRNAs for each gene |
neg|score | The RRA lo value of this gene in negative selection |
neg|p-value | The raw p-value (using permutation) of this gene in negative selection |
neg|fdr | The false discovery rate of this gene in negative selection |
neg|rank | The ranking of this gene in negative selection |
neg|goodsgrna | The number of “good” sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in negative selection |
neg|lfc | The log2 fold change of this gene in negative selection. The way to calculate gene lfc is controlled by the --gene-lfc-method option |
pos|score | The RRA lo value of this gene in positive selection |
pos|p-value | The raw p-value (using permutation) of this gene in positive selection |
pos|fdr | The false discovery rate of this gene in positive selection |
pos|rank | The ranking of this gene in positive selection |
pos|goodsgrna | The number of “good” sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in positive selection |
pos|lfc | The log2 fold change of this gene in positive selection |
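As a minimal sketch of how the gene summary file can be used downstream, the Python example below reads a tab-separated table with the column names listed above and returns the genes that pass an FDR cutoff, ordered by rank. The function name, the 0.05 cutoff, the gene names, and all numeric values are hypothetical and for illustration only; the toy table also contains only a subset of the columns.

```python
import csv
import io


def selected_genes(handle, direction="neg", fdr_cutoff=0.05):
    """Return gene IDs with <direction>|fdr <= cutoff, ordered by rank.

    `direction` is "neg" for negative selection or "pos" for positive
    selection, matching the column prefixes of the gene summary file.
    """
    fdr_col, rank_col = f"{direction}|fdr", f"{direction}|rank"
    reader = csv.DictReader(handle, delimiter="\t")
    hits = [row for row in reader if float(row[fdr_col]) <= fdr_cutoff]
    hits.sort(key=lambda row: int(row[rank_col]))
    return [row["id"] for row in hits]


# Toy table with made-up values; real files contain all columns above.
header = ["id", "num", "neg|score", "neg|fdr", "neg|rank",
          "pos|score", "pos|fdr", "pos|rank"]
rows = [
    ["GENE_A", "4", "1e-08", "0.001", "1", "0.95", "0.99", "3"],
    ["GENE_B", "4", "0.002", "0.04", "2", "0.60", "0.90", "2"],
    ["GENE_C", "4", "0.70", "0.95", "3", "1e-05", "0.01", "1"],
]
toy_tsv = "\n".join("\t".join(r) for r in [header] + rows)

neg_hits = selected_genes(io.StringIO(toy_tsv), direction="neg")
pos_hits = selected_genes(io.StringIO(toy_tsv), direction="pos")
```

For a real result, the same function can be called on an open file handle for Treatment_vs_Control.gene_summary.txt instead of the in-memory toy table.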
We provide several visualizations of the comparison results and the top-ranked genes. All figures are stored in “05.rankResult”.
Volcano plots show gRNA log fold changes against their associated p-values, highlighting differences between the two conditions. The top 10 positively selected and top 10 negatively selected genes are labeled. An example figure is as follows:
Figure 1. Example of volcano plot
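The coordinates behind a volcano plot can be sketched in a few lines of Python. This is an illustrative sketch only, not the plotting code used for the figures: the function names, the p-value floor, the rule for choosing which points to label (smallest p-values on each side of zero), and the example values are all assumptions.

```python
import math


def volcano_points(rows, p_floor=1e-300):
    """Map (name, log2 fold change, p-value) to (name, x, y) coordinates.

    x is the log2 fold change and y is -log10(p-value). P-values are
    floored at `p_floor` so that p = 0 does not raise a math error.
    """
    return [(name, lfc, -math.log10(max(p, p_floor))) for name, lfc, p in rows]


def labels_to_draw(points, n=10):
    """Pick the n most significant points on each side of zero."""
    by_significance = sorted(points, key=lambda pt: -pt[2])
    negative = [pt[0] for pt in by_significance if pt[1] < 0][:n]
    positive = [pt[0] for pt in by_significance if pt[1] > 0][:n]
    return negative, positive


# Made-up example values for illustration only.
rows = [
    ("g1", -2.0, 1e-6),
    ("g2", -0.5, 0.2),
    ("g3", 1.5, 1e-4),
    ("g4", 0.1, 0.9),
    ("g5", -1.0, 1e-3),
]
points = volcano_points(rows)
neg_labels, pos_labels = labels_to_draw(points, n=2)
```

The resulting `points` can be passed to any plotting library as x and y values, with `neg_labels` and `pos_labels` used for the text annotations.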
The RRA score distribution plot shows the RRA scores of all genes, with the top 10 selected genes labeled. RRA scores for negative and positive selection are plotted separately. An example figure is as follows:
Figure 2. Example of distribution of RRA scores of all genes for negative selection
Individual sgRNA read counts are plotted for the top-ranked genes, with each dot color corresponding to one sgRNA targeting the gene. An example figure is as follows:
Figure 3. Example of sgRNA read counts (normalized) of top-ranked genes in selected samples
The rank view of sgRNAs targeting top-ranked genes shows where each of these sgRNAs falls in the overall sgRNA ranking, which makes it easy to compare the effects of individual sgRNAs on the genes of interest. An example figure is as follows:
Figure 4. Example of sgRNA rank view in top-ranked genes
After the target genes were ranked by their enrichment or depletion scores, the top hits were assessed for plausibility and biological relevance by gene set enrichment analysis (GSEA) against MSigDB signatures. The GSEA plot is also stored in the “05.rankResult” folder. An example figure is as follows:
Figure 5. Example of enriched GSEA signature for top hits
Table 3. List of software used in the analysis pipeline
Software | Version |
---|---|
Cutadapt | 3.4 |
Bowtie2 | 2.5.1 |
Samtools | 1.17 |
seqtk | 1.3-r106 |
MAGeCK | - |
MAGeCKFlute | - |
clusterProfiler | 3.18 |
sgRNA sequence | Custom (Provided) |
Address: 126 Corporate Boulevard, South Plainfield, New Jersey 07080
Email: custom-services@admerahealth.com
Phone: 908-222-0533