Analysis Workflow

We analyzed the sequencing data with our pooled CRISPR screening single guide RNA (sgRNA) analysis pipeline, which combines established computational tools with our in-house computational methods. The workflow is as follows:


Tips: We provide services for CRISPR screening with both single-sgRNA and dual-sgRNA CRISPR interference (CRISPRi) libraries.

1. Quality Assessment and Sequence Alignment

First, the paired FASTQ files were merged and adapters were trimmed with Cutadapt. Read quality was then assessed with FastQC. For guidance on interpreting the FastQC report, see http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. For each sample, FastQC produces two files: ‘sample_ID.fastqc.zip’ containing the visualization files and ‘sample_ID.fastqc.html’ containing the summary. The QC reports are in the directory “01.fastqc”.

The cleaned reads were stored in “02.cleanReads”. For single-sgRNA CRISPRi library data, they were mapped to a reference of sgRNA sequences with Bowtie2. For dual-sgRNA CRISPRi library data, mapping was done with our in-house Python script. The alignments are provided as ‘sample_ID.bam’ in the directory “03.bam”, which also contains the mapping statistics.
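As an illustration of how a read pair can be assigned to a guide pair, the sketch below uses an exact-match dictionary lookup. It is not the in-house script: the function name `assign_pair`, the assumption that each read begins with its protospacer, the `guide_len` parameter, and the toy sequences are all hypothetical.

```python
def assign_pair(read1, read2, library, guide_len=20):
    """Return the pair ID for a read pair, or None if the pair is not in the library.

    Assumptions (hypothetical, for illustration only):
    - read1 starts with the first guide and read2 starts with the second guide;
    - both guides have length `guide_len`;
    - only exact matches are accepted.
    `library` maps (guide_a_sequence, guide_b_sequence) -> pair ID.
    """
    key = (read1[:guide_len], read2[:guide_len])
    return library.get(key)


# Toy library with 4-nt "guides" (made-up sequences, not real sgRNAs).
toy_library = {("ACGT", "TTGA"): "pairX"}

mapped = assign_pair("ACGTGGGG", "TTGACCCC", toy_library, guide_len=4)
unmapped = assign_pair("ACGAGGGG", "TTGACCCC", toy_library, guide_len=4)
```

A real dual-guide read layout may require trimming, reverse-complementing one read, or tolerating mismatches; none of that is modeled here.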

2. Derivation of sgRNA count matrices

For single-sgRNA library data, the BAM alignment files were processed with Samtools and our custom scripts to count the reads mapped to each sgRNA. For dual-sgRNA library data, an in-house Python script counted the reads for each sgRNA pair. The resulting count matrices are stored as sample_ID.gRNA.count.frac.txt in the “04.count” folder.

Specifically, each output file contains the counts of perfectly matched reads, reads with at most one mismatch, and reads with at most two mismatches, together with the corresponding fractions. An example of the sgRNA read count file is as follows:

Table 1. Example of sgRNA count matrix

The contents of each column are as follows.

Column Content
sgRNAID sgRNA ID
sequence sequence of sgRNA
mm0 count of perfectly matched reads
mm1 count of reads with at most one mismatch
mm2 count of reads with at most two mismatches
mm0_frac fraction of perfectly matched reads
mm1_frac fraction of reads with at most one mismatch
mm2_frac fraction of reads with at most two mismatches
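The columns above are cumulative: mm1 includes the mm0 reads, and mm2 includes the mm1 reads. A minimal sketch of how one row could be built is shown below. The helper names, the toy sequences, and in particular the choice of the sample's total read count as the denominator for the fractions are assumptions for illustration, not a description of the pipeline's scripts.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def count_row(sgrna_id, sequence, reads, total_reads):
    """Build one count-table row from the reads assigned to a single sgRNA.

    `total_reads` is the denominator for the *_frac columns; using the total
    number of reads in the sample is an assumption made for this sketch.
    """
    by_mismatch = Counter(hamming(sequence, read) for read in reads)
    mm0 = by_mismatch[0]            # perfect matches
    mm1 = mm0 + by_mismatch[1]      # at most one mismatch (cumulative)
    mm2 = mm1 + by_mismatch[2]      # at most two mismatches (cumulative)
    row = {"sgRNAID": sgrna_id, "sequence": sequence,
           "mm0": mm0, "mm1": mm1, "mm2": mm2}
    for key in ("mm0", "mm1", "mm2"):
        row[key + "_frac"] = row[key] / total_reads if total_reads else 0.0
    return row


# Toy data (made up): 3 perfect reads, 1 read with one mismatch,
# 1 read with two mismatches, 1 read with three mismatches.
toy_reads = ["ACGTACGT"] * 3 + ["ACGTACGA", "ACGTACTA", "TCGTACTA"]
example = count_row("toy_sgRNA", "ACGTACGT", toy_reads, total_reads=10)
```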


3. Gene ranking by comparison analysis

We ranked sgRNAs and their target genes according to their effect on the phenotype of interest, which is typically defined by the stimuli and conditions in the study design. Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout (MAGeCK) was used for the ranking. MAGeCK uses a modified robust rank aggregation (RRA) algorithm to identify positively and negatively selected genes between two conditions. A key output of this method is the RRA score: a lower score (equivalently, a higher -log10 score) indicates that more of the sgRNAs targeting a gene are ranked near the top of the sgRNA list than expected by chance, and therefore that the gene has a notable influence on the phenotype. The significance of each RRA score is quantified by a corresponding p-value.
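To make the idea behind rank aggregation concrete, the sketch below computes the score of the original (unmodified) RRA method for one gene from the normalized ranks of its sgRNAs (rank divided by the total number of sgRNAs). It is an illustration only: MAGeCK's modification, which restricts the calculation to sgRNAs below an alpha cutoff, and its permutation-based p-values are not reproduced, and the function names and toy ranks are hypothetical.

```python
from math import comb


def order_stat_cdf(u, k, n):
    """P(the k-th smallest of n independent uniform(0, 1) values is <= u).

    This equals P(Binomial(n, u) >= k), i.e. the Beta(k, n - k + 1) CDF at u.
    """
    return sum(comb(n, j) * u**j * (1 - u) ** (n - j) for j in range(k, n + 1))


def rra_rho(normalized_ranks):
    """RRA score for one gene: the minimum order-statistic probability.

    Values close to zero arise when more sgRNAs sit near the top of the
    list than expected if ranks were uniformly distributed.
    """
    ranks = sorted(normalized_ranks)
    n = len(ranks)
    return min(order_stat_cdf(u, k, n) for k, u in enumerate(ranks, start=1))


# Toy normalized ranks (made up): four sgRNAs per gene.
consistent = rra_rho([0.01, 0.02, 0.03, 0.04])   # all sgRNAs near the top
scattered = rra_rho([0.2, 0.5, 0.7, 0.9])        # sgRNAs spread over the list
```

In this sketch the gene whose sgRNAs all rank near the top receives a much smaller score than the gene whose sgRNAs are scattered.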

The sgRNA ranking result is stored as Treatment_vs_Control.sgRNA_summary.txt in “05.rankResult”. For each sgRNA, it contains the read counts in the selected samples, the log2 fold change, the p-value, and related statistics. The gene ranking result is stored as Treatment_vs_Control.gene_summary.txt in “05.rankResult”. An example of the gene summary file is as follows:

Table 2. Example of gene summary file

The contents of each column are as follows.

Column Content
id Gene ID
num The number of targeting sgRNAs for each gene
neg|score The RRA lo value of this gene in negative selection
neg|p-value The raw p-value (using permutation) of this gene in negative selection
neg|fdr The false discovery rate of this gene in negative selection
neg|rank The ranking of this gene in negative selection
neg|goodsgrna The number of “good” sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in negative selection
neg|lfc The log2 fold change of this gene in negative selection. The way the gene lfc is calculated is controlled by the --gene-lfc-method option
pos|score The RRA lo value of this gene in positive selection
pos|p-value The raw p-value (using permutation) of this gene in positive selection
pos|fdr The false discovery rate of this gene in positive selection
pos|rank The ranking of this gene in positive selection
pos|goodsgrna The number of “good” sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in positive selection
pos|lfc The log2 fold change of this gene in positive selection
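For downstream use, the gene summary file can be read as a tab-delimited table. The sketch below selects genes that pass an FDR cutoff in one direction and orders them by rank. The function name `select_hits`, the 0.05 cutoff, and the toy table (placeholder gene names, made-up values, only a subset of the columns) are assumptions for illustration.

```python
import csv
import io


def select_hits(handle, direction="neg", fdr_cutoff=0.05):
    """Return gene IDs with FDR <= cutoff in the given direction, ordered by rank.

    `direction` is "neg" or "pos" and selects the neg|... or pos|... columns.
    The 0.05 default is a hypothetical choice, not a pipeline setting.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    hits = [row for row in reader
            if float(row[f"{direction}|fdr"]) <= fdr_cutoff]
    hits.sort(key=lambda row: int(row[f"{direction}|rank"]))
    return [row["id"] for row in hits]


# Toy table (made-up values; the real file has more columns).
TOY_SUMMARY = (
    "id\tnum\tneg|fdr\tneg|rank\tpos|fdr\tpos|rank\n"
    "GENE_A\t5\t0.001\t1\t0.99\t3\n"
    "GENE_B\t5\t0.40\t2\t0.30\t2\n"
    "GENE_C\t5\t0.95\t3\t0.01\t1\n"
)

negative_hits = select_hits(io.StringIO(TOY_SUMMARY), direction="neg")
positive_hits = select_hits(io.StringIO(TOY_SUMMARY), direction="pos")
```

With a real file, the same function can be called on `open("Treatment_vs_Control.gene_summary.txt")` from the “05.rankResult” folder.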


4. Visualization of comparison result and top-ranked genes

We provide visualizations of the comparison results and top-ranked genes in several forms. All figures are stored in “05.rankResult”.

a. Volcano plot for all genes

The volcano plot shows log fold changes against their associated p-values, highlighting differences between the two conditions. The top 10 positively selected and top 10 negatively selected genes are labeled. An example figure is as follows:

Figure 1. Example of volcano plot
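The coordinates of such a plot can be derived as sketched below: the x-axis is the log fold change and the y-axis is -log10 of the p-value. This is a minimal sketch, not the plotting code of the pipeline. The function name, the toy values, and the choice to pick labels by p-value within each direction (rather than by the MAGeCK rank) are assumptions.

```python
import math


def volcano_points(features, top_n=10):
    """Compute volcano coordinates and choose which features to label.

    `features` is a list of (name, log_fold_change, p_value) tuples.
    Returns (points, up_labels, down_labels), where each point is
    (name, log_fold_change, -log10(p_value)).
    """
    # Clamp p-values away from zero so the logarithm is always defined.
    points = [(name, lfc, -math.log10(max(p, 1e-300)))
              for name, lfc, p in features]
    by_significance = sorted(points, key=lambda pt: pt[2], reverse=True)
    up_labels = [pt[0] for pt in by_significance if pt[1] > 0][:top_n]
    down_labels = [pt[0] for pt in by_significance if pt[1] < 0][:top_n]
    return points, up_labels, down_labels


# Toy input (placeholder names, made-up values).
toy_features = [
    ("GENE_A", -2.0, 0.001),
    ("GENE_B", 0.1, 0.5),
    ("GENE_C", 1.5, 0.01),
    ("GENE_D", -0.5, 0.2),
]
points, up_labels, down_labels = volcano_points(toy_features, top_n=1)
```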


b. Distribution of RRA scores and p values of all genes

This plot displays the distribution of RRA scores for all genes, with the top 10 selected genes labeled. RRA scores for negative and positive selection are plotted separately. An example figure is as follows:

Figure 2. Example of distribution of RRA scores of all genes for negative selection


c. Individual sgRNA read counts of top-ranked genes

Individual sgRNA read counts are plotted for the top-ranked genes, where each dot color corresponds to one sgRNA targeting the gene. An example figure is as follows:

Figure 3. Example of sgRNA read counts (normalized) of top-ranked genes in selected samples


d. Rank View of sgRNA targeting top-ranked genes

The rank view shows where the sgRNAs targeting each top-ranked gene fall in the overall sgRNA ranking, which makes it easy to compare the effects of individual sgRNAs on genes of interest. An example figure is as follows:

Figure 4. Example of sgRNA rank view in top-ranked genes


5. Hit analysis with GSEA

After the target genes were ranked by their enrichment/depletion score, the top hits were assessed for plausibility and biological relevance by gene set enrichment analysis (GSEA) against MSigDB signatures. The GSEA plot is also stored in the “05.rankResult” folder. An example figure is as follows:

Figure 5. Example of enriched GSEA signature for top hits
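The curve in a GSEA plot is a running sum over the ranked gene list: it increases at genes that belong to the signature and decreases otherwise, and the enrichment score is its largest deviation from zero. The sketch below implements this running-sum statistic in its standard weighted form. It is an illustration, not the implementation used in the pipeline; significance testing is omitted, and the function name and toy data are hypothetical.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score of `gene_set` in a ranked gene list.

    `ranked_genes` and `scores` are parallel lists ordered from most
    enriched to most depleted. Hits are weighted by |score| ** p and
    misses by 1 / (number of genes outside the set).
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(in_set)
    n_miss = len(ranked_genes) - n_hit
    norm = sum(abs(s) ** p for s, hit in zip(scores, in_set) if hit)
    if n_hit == 0 or n_miss == 0 or norm == 0:
        raise ValueError("gene set must partially overlap the ranked list "
                         "and have non-zero total weight")
    running, best = 0.0, 0.0
    for s, hit in zip(scores, in_set):
        running += abs(s) ** p / norm if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running          # keep the largest deviation from zero
    return best


# Toy ranking (placeholder gene names, made-up scores).
toy_genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
toy_scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(toy_genes, toy_scores, {"G1", "G2"})
es_bottom = enrichment_score(toy_genes, toy_scores, {"G5", "G6"})
```

A signature concentrated at the top of the list gives a positive score, and one concentrated at the bottom gives a negative score.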


6. Appendix

Table 3. List of software used in the analysis pipeline

Software Version
Cutadapt 3.4
Bowtie2 2.5.1
Samtools 1.17
seqtk 1.3-r106
MAGeCK -
MAGeCKFlute -
clusterProfiler 3.18
sgRNA sequence Custom (Provided)


7. Citation

8. Contact us

Address: 126 Corporate Boulevard, South Plainfield, New Jersey 07080
Email:
Phone: 908-222-0533