ChIP-Seq/ATAC-Seq/CUT&RUN Analysis Report

Introduction
Bioinformatics Procedure and Data Processing
Peak Calling
Clustering (DiffBind R package)
Peak Overlay
Differential Analysis
Functional Analysis
Combined Density Profile Analysis
Motif Analysis
Citation

1. Introduction

ChIP-Seq is a method for studying epigenetic interactions between DNA and proteins by identifying the proteins' target binding sites. For ChIP-Seq, we employ an in-house bioinformatics pipeline that maps reads, calls peaks, performs differential analysis, and detects motifs using established computational resources. We also offer ATAC-Seq and CUT&RUN analysis. For ATAC-Seq (Yan et al., 2020) and CUT&RUN (Yu et al., 2021), the ENCODE ATAC-seq pipeline and the CUT&RUNTools 2.0 pipeline, respectively, are run through the peak detection step and are then followed by our downstream analysis.

2. Bioinformatics Procedure and Data Processing

In our in-house bioinformatics pipeline, FastQC was first used to check the quality of raw and trimmed reads. Trimmomatic was used to remove adapters and trim low-quality bases with default settings. After mapping reads to the reference genome, mapped reads with a MAPQ score < 10 were removed (filters may be set conditionally). Duplicates were also removed. deepTools was used to normalize BAM files and generate BigWig (BW) files for visualization. MACS2 was used to call peaks. Annotation was performed using ChIPseeker (R package). If there were no replicates, MAnorm (R package) was used for sample comparison. If there were replicates, their called peaks were merged and DiffBind (R package) was used for the comparison. Other downstream procedures included GO/KEGG enrichment (R package clusterProfiler), combined density profiles (deepTools), and motif detection (MEME).
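The filtering logic described above (MAPQ threshold, then duplicate removal) can be illustrated with a minimal sketch. This is not the pipeline's actual implementation, which operates on BAM files with the tools listed in the Appendix. The `Read` record, the `filter_reads` helper, and the rule of keeping the first read seen at each (chromosome, position, strand) are assumptions made for illustration only.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified stand-in for one aligned read."""
    name: str
    chrom: str
    pos: int      # leftmost mapped position
    strand: str   # "+" or "-"
    mapq: int     # mapping quality


def filter_reads(reads, min_mapq=10):
    """Drop reads with MAPQ < min_mapq, then keep one read per (chrom, pos, strand).

    Assumption: the first read seen at a position is kept; real duplicate
    marking tools use more elaborate criteria.
    """
    seen = set()
    kept = []
    for read in reads:
        if read.mapq < min_mapq:
            continue  # low-confidence alignment
        key = (read.chrom, read.pos, read.strand)
        if key in seen:
            continue  # duplicate of an already kept read
        seen.add(key)
        kept.append(read)
    return kept


reads = [
    Read("r1", "chr1", 100, "+", 42),
    Read("r2", "chr1", 100, "+", 30),  # same position and strand as r1
    Read("r3", "chr1", 250, "-", 5),   # MAPQ below the threshold
    Read("r4", "chr2", 100, "+", 10),  # MAPQ of exactly 10 is kept
]
print([r.name for r in filter_reads(reads)])  # ['r1', 'r4']
```

The threshold is a parameter because, as noted above, filters may be set conditionally.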

File availability:
Raw / trimmed FastQ
Mapping statistics
FastQC reports / MultiQC reports
BAM
BigWig (BW)

3. Peak Calling

Peak calling is a computational method used to identify areas in the genome that are enriched with aligned reads. The MACS algorithm can be used to identify transcription factor binding sites (narrow peaks) and histone-modification-enriched regions (broad peaks). It outputs key files such as peak files (which contain the peak locations along with the peak summit, p-value, and q-value) and summit files (which contain peak summit locations for motif analysis).
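As an illustration of how a peak file can be read, the sketch below parses lines in the 10-column narrowPeak layout (chrom, start, end, name, score, strand, signalValue, -log10 p-value, -log10 q-value, summit offset from start). The example lines, the `Peak` record, and the 0.05 q-value cutoff are hypothetical and are not taken from this report's data.

```python
import math
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    neg_log10_p: float
    neg_log10_q: float
    summit: int  # absolute coordinate of the summit


def parse_narrowpeak(line):
    """Parse one tab-separated narrowPeak line into a Peak."""
    f = line.rstrip("\n").split("\t")
    start = int(f[1])
    # Column 10 holds the summit as an offset from the peak start.
    return Peak(f[0], start, int(f[2]), f[3],
                float(f[7]), float(f[8]), start + int(f[9]))


def significant(peaks, q_cutoff=0.05):
    """Keep peaks whose q-value is at or below q_cutoff (columns are -log10)."""
    threshold = -math.log10(q_cutoff)
    return [p for p in peaks if p.neg_log10_q >= threshold]


# Hypothetical example lines, for illustration only.
lines = [
    "chr1\t1000\t1400\tpeak_1\t250\t.\t12.5\t8.1\t5.2\t180",
    "chr1\t5000\t5300\tpeak_2\t40\t.\t2.1\t1.5\t0.8\t120",
]
peaks = [parse_narrowpeak(line) for line in lines]
print([(p.name, p.summit) for p in significant(peaks)])  # [('peak_1', 1180)]
```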

4. Clustering (DiffBind R package)

Sample Cluster Heatmap. Replicate samples. Clustering of samples helps to segregate data into similar groups.

Correlation heatmap for all peak-calling samples.
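The values behind a correlation heatmap can be sketched as a pairwise Pearson correlation matrix over per-peak signal vectors. The sample names and counts below are hypothetical, and the exact scoring and correlation settings used by DiffBind for this report are not stated here, so this is an assumption-laden illustration rather than a reproduction.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_matrix(samples):
    """Return a nested dict of pairwise correlations between named samples."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


# Hypothetical read counts over the same four peaks in three samples.
samples = {
    "ctrl_1": [10, 20, 30, 40],
    "ctrl_2": [12, 22, 29, 41],
    "treat_1": [40, 30, 20, 10],
}
matrix = correlation_matrix(samples)
for name, row in matrix.items():
    print(name, [round(row[other], 3) for other in samples])
```

Replicates of the same condition are expected to correlate strongly with each other, which is what the clustering in the heatmap reflects.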

PCA plot. Replicate samples. PCA is a procedure in which principal components are obtained by orthogonally transforming a set of possibly correlated variables (high-dimensional data) into a smaller number of linearly uncorrelated variables (a few dimensions).

Dimensional reduction of peak datasets using PCA.
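The idea of PCA can be sketched without external libraries: center the data, form the covariance matrix, and find its leading eigenvector. The sketch below uses power iteration as a simple stand-in for the full decomposition used by standard PCA routines; the function name and the toy data are assumptions made for illustration.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction, per-sample scores) of the first principal component.

    rows: list of samples, each a list of feature values (e.g. peak signals).
    Power iteration is used here for brevity; it finds only the leading component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Toy data: the second feature is exactly twice the first, so all
# variance lies along the direction (1, 2) / sqrt(5).
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
print([round(x, 4) for x in direction])
print([round(s, 4) for s in scores])
```

In a PCA plot of peak datasets, each sample is one row and its scores on the first two components give its plotted position.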

5. Peak Overlay

Peak overlay plots show the number of peaks that are common and different between comparable datasets.

Venn diagram of binding site overlaps of replicate samples in the comparison groups.
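The counts behind such a Venn diagram can be sketched by testing interval overlap between two peak sets. The peak coordinates below are hypothetical, and intervals are assumed to be half-open (BED-style). Note that tools which first merge overlapping peaks report a single shared count, whereas this simplified sketch counts shared peaks from each side separately.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def venn_counts(peaks_a, peaks_b):
    """Count peaks that are unique to, or shared between, two peak sets."""
    a_shared = sum(any(overlaps(p, q) for q in peaks_b) for p in peaks_a)
    b_shared = sum(any(overlaps(q, p) for p in peaks_a) for q in peaks_b)
    return {
        "a_only": len(peaks_a) - a_shared,
        "a_shared": a_shared,
        "b_shared": b_shared,
        "b_only": len(peaks_b) - b_shared,
    }


# Hypothetical peak sets for two samples.
peaks_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
peaks_b = [("chr1", 150, 250), ("chr1", 600, 700), ("chr3", 100, 200)]
counts = venn_counts(peaks_a, peaks_b)
print(counts)
```

The peaks at chr1:500-600 and chr1:600-700 only touch at a boundary and are therefore not counted as overlapping under the half-open convention.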

6. Differential Analysis

Replicate samples. The differential analysis between replicate sample groups was performed using DiffBind (R package). This package includes functions that help to process peak sets, including merging overlapping peak sets in replicates, counting merged peak sets, and identifying statistically-significant differentially bound sites based on binding affinity between two groups. There are five outputs:

I. MA plot

MA plot for data normalization. Red dots represent sites that are significantly differentially bound (p < 0.05). Blue line: zero log fold change.

II. Volcano plot

Volcano plot showing significantly differentially bound sites. If log2(Target) - log2(Control) < 0, binding affinity at the site is decreased; if log2(Target) - log2(Control) > 0, binding affinity at the site is increased.
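The sign convention can be made concrete with a short sketch. The pseudocount is an assumption added here to avoid taking the logarithm of zero; it is not part of the formula stated above, and the input values are hypothetical.

```python
import math


def log2_fold_change(target, control, pseudocount=1.0):
    """log2(Target) - log2(Control), with an assumed pseudocount to avoid log2(0)."""
    return math.log2(target + pseudocount) - math.log2(control + pseudocount)


def direction(fold):
    """Describe the change in binding affinity implied by a log2 fold change."""
    if fold > 0:
        return "increased"
    if fold < 0:
        return "decreased"
    return "unchanged"


# Hypothetical normalized signals at one site.
fold = log2_fold_change(target=7, control=1)
print(fold, direction(fold))  # 2.0 increased
```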

III. Box plot

Box plot of read distributions for differentially bound sites.

IV. Heatmap

Heatmap between replicate samples of comparison groups.

V. Binding affinity analysis report
(Table in CSV format)

Non-Replicate Samples. The MAnorm method was used when samples had no replicates. ChIP-Seq data sets often differ in their signal-to-background (S/N) ratios. This method focuses on the common peaks and uses them as a reference for normalization.

MA plot before normalization using common peaks (left) and after normalization displaying all peaks (right). M: log2 fold changes. A: average expression signal. Green line: robust regression; red line: LOWESS regression; blue line: M = 0.
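The M and A values and the idea of normalizing against common peaks can be sketched as follows. M is the log2 ratio of the two samples' signals and A is the mean of their log2 signals. An ordinary least-squares line is used here as a simple stand-in for the robust regression mentioned in the caption, and all counts are hypothetical.

```python
import math


def ma_values(signal_1, signal_2):
    """Return (M, A) for one peak; both signals must be positive."""
    m = math.log2(signal_1) - math.log2(signal_2)
    a = 0.5 * (math.log2(signal_1) + math.log2(signal_2))
    return m, a


def fit_line(points):
    """Least-squares fit of M = intercept + slope * A over (M, A) points."""
    n = len(points)
    mean_a = sum(a for _, a in points) / n
    mean_m = sum(m for m, _ in points) / n
    sxx = sum((a - mean_a) ** 2 for _, a in points)
    sxy = sum((a - mean_a) * (m - mean_m) for m, a in points)
    slope = sxy / sxx
    return mean_m - slope * mean_a, slope


def normalize(points, intercept, slope):
    """Subtract the fitted trend so that common peaks center on M = 0."""
    return [(m - (intercept + slope * a), a) for m, a in points]


# Hypothetical common peaks where sample 1 is uniformly twice as deep.
common = [ma_values(8, 4), ma_values(32, 16), ma_values(128, 64)]
intercept, slope = fit_line(common)
normalized = normalize(common, intercept, slope)
print(intercept, slope)
print([round(m, 6) for m, _ in normalized])
```

The line is fitted on common peaks only and then applied to all peaks, so that residual M values reflect differences beyond the global scaling between samples.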

7. Functional Analysis

Functional analysis helps to understand the functions and roles of the biological systems associated with the peaks. ChIPseeker (R package) was used for peak annotation. clusterProfiler (R package) was used for GO and KEGG pathway enrichment analysis of individual sample peaks and comparison peaks.

Gene Ontology (GO) enrichment analysis. Dot plot of significant Biological Process GO terms.

KEGG pathway enrichment analysis. Bar graph of significant KEGG terms.
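Over-representation analysis of this kind is commonly based on a hypergeometric test: given a background of N genes of which K carry a term, it asks how likely it is to see at least k term-carrying genes in a list of n genes. The sketch below illustrates that calculation only; the numbers are hypothetical, and multiple-testing correction is omitted.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    k: genes in the list annotated with the term
    n: size of the gene list
    K: genes in the background annotated with the term
    N: size of the background
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical toy case: all 3 genes in the list carry a term
# that only 3 of 10 background genes carry.
p = enrichment_p_value(k=3, n=3, K=3, N=10)
print(p)
```

In this toy case only one of the 120 possible 3-gene lists consists entirely of annotated genes, so the probability is 1/120.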

8. Combined Density Profile Analysis

Combined Density profile plots of peak regions were generated using deepTools.

Density plots and genomic heatmaps show the respective peak intensities centered at the transcription start site (TSS) and flanked by 3 kb regions. The x-axis represents the genomic location relative to the TSS. The y-axis represents signal intensity.
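How a combined density profile is built can be sketched as binning the signal in a window around each TSS and averaging each bin across all TSSs. This is a simplified illustration and not the deepTools implementation: strand is ignored, the signal is a hypothetical position-to-coverage dictionary, and 2 * flank is assumed to be divisible by the bin size.

```python
def mean_profile(signal, tss_positions, flank=3000, bin_size=100):
    """Average binned signal in [tss - flank, tss + flank) across all TSSs.

    signal: dict mapping genomic position to coverage (missing positions are 0).
    Returns one mean value per bin, ordered from upstream to downstream.
    """
    n_bins = 2 * flank // bin_size
    totals = [0.0] * n_bins
    for tss in tss_positions:
        for b in range(n_bins):
            start = tss - flank + b * bin_size
            bin_sum = sum(signal.get(pos, 0.0) for pos in range(start, start + bin_size))
            totals[b] += bin_sum / bin_size
    return [t / len(tss_positions) for t in totals]


# Hypothetical coverage: a strong signal around position 1000 and
# a weaker one just downstream of position 5000.
signal = {pos: 4.0 for pos in range(995, 1005)}
signal.update({pos: 2.0 for pos in range(5000, 5005)})
profile = mean_profile(signal, [1000, 5000], flank=10, bin_size=5)
print(profile)  # [0.0, 2.0, 3.0, 0.0]
```

Each row of the genomic heatmap corresponds to one region's binned signal, and the density plot is the column-wise mean shown here.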

9. Motif Analysis

MEME was used to analyze motif sequences around the peak regions. MEME-ChIP performs motif discovery, enrichment, search, comparison, and visualization.

Motif analysis. The top two motifs found within the peak regions (+100 nucleotides on each side) of a sample are shown. The first motif (E-value = 4.4e-182) was related to that of CTCF. The second motif (E-value = 1.9e-138) was related to that of SPIB or SPI1.
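Reading "+100 nucleotides on each side" as each peak interval being extended by 100 nucleotides in both directions, the window preparation can be sketched as below. That reading, the `extend_peak` helper, and the `chrom_sizes` lookup are assumptions made for illustration; the coordinates are hypothetical.

```python
def extend_peak(chrom, start, end, chrom_sizes, flank=100):
    """Extend a peak by `flank` bases on each side, clipped to the chromosome."""
    new_start = max(0, start - flank)
    new_end = min(chrom_sizes[chrom], end + flank)
    return chrom, new_start, new_end


# Hypothetical chromosome length and peaks.
chrom_sizes = {"chr1": 10_000}
print(extend_peak("chr1", 1000, 1400, chrom_sizes))  # ('chr1', 900, 1500)
print(extend_peak("chr1", 50, 9950, chrom_sizes))    # ('chr1', 0, 10000)
```

The sequences under the extended intervals would then be extracted from the reference genome and passed to the motif discovery tool.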

10. Citation

Appendix

A list of software, tools, and libraries used in the analysis pipeline:
Trimmomatic (v0.38)
FastQC (v0.11.8)
MultiQC (v1.11)
SAMtools (v0.1.19)
Picard (v2.20.4)
bedTools (v2.29.2)
deepTools (v3.4.1)
MACS (v2.2.6)
MEME (v5.1.1) (https://meme-suite.org/meme/)

R libraries:
ChIPseeker (v1.22.1)
clusterProfiler (v3.14.3)
DiffBind (v3.8.4)

For ChIP-Seq:
BWA (v0.7.10)

For ATAC-Seq:
ENCODE ATAC-seq pipeline (https://github.com/ENCODE-DCC/atac-seq-pipeline)
Bowtie2 (v2.3.4.1)

For CUT&RUN:
CUT&RUNTools 2.0 (https://github.com/fl-yu/CUT-RUNTools-2.0)
Bowtie2 (v2.3.4.1)

Language:
Shell
R (v4.3.0)
Python (v3.7.13)
Perl (v5.26.2)

Abbreviations

ATAC-Seq    Assay for Transposase-Accessible Chromatin with Sequencing
ChIP-Seq    Chromatin Immunoprecipitation with Sequencing
CUT&RUN    Cleavage Under Targets and Release Using Nuclease
GO    Gene Ontology
KEGG    Kyoto Encyclopedia of Genes and Genomes
MACS    Model-based Analysis of ChIP-Seq
PCA    Principal Component Analysis

2023-06-22