# 3.ChIP-seq

* 染色质免疫共沉淀测序（Chromatin immunoprecipitation sequencing，简称ChIP-seq）是一种用于分析蛋白质与DNA相互作用的高通量实验技术。常见的ChIP-seq实验主要用来确定特定一种转录因子的结合位点，以及特定组蛋白修饰在基因组上的的分布。ChIP-seq是现代生物学研究基因调控的一个重要的手段。本节中我们对ChIP-seq数据的基本分析方法进行介绍。
* ChIP-seq实验通常会有一组针对感兴趣的蛋白或表观修饰利用特定的抗体进行富集的样本，还有一组不做富集的control样本(一般被称为"input")。ChIP-seq数据分析最主要的目的就是通过比较这两组数据，推断出基因组的哪些位置是我们感兴趣的位点。
* 我们对于ChIP-seq数据的处理也是从mapping开始。chip-seq中基因组上被富集到的区域，在reads coverage上会体现为一个尖峰(peak)。由于高通量测序数据往往有比较大的噪声，我们需要利用一些算法把peak和背景区分开来，这个过程一般被称为"peak calling"。peak calling的结果就是一组genomic intervals。我们认为这样一些genomic intervals就对应着转录因子的结合位点(或者存在特定表观修饰的区域)。
* 转录因子的结合区域一般比较短，对应的peaks比较狭窄，被称为"narrow peaks"。存在组蛋白修饰的区域通常更长一些，对应的peaks也比较宽，被称为"broad peaks"。
* 转录因子的结合通常有比较强的序列偏好性，这种序列偏好性可以用"motif"来描述。在后续章节中，我们将进行更细致的介绍。
* 我们本节主要介绍的[homer](http://homer.ucsd.edu/homer/)这个工具，也会对[MACS](https://github.com/macs3-project/MACS)和R包[ChIPseeker](https://bioconductor.org/packages/release/bioc/html/ChIPseeker.html)进行简要介绍。

## 1) Pipeline

在[mapping](https://book.ncrnalab.org/teaching/part-iii.-ngs-data-analyses/1.mapping)一章我们已经对DNA测序的reads向基因组的比对，这对于chip-seq同样是适用的。所以我们跳过mapping一步，从实验组和input对应的两个bam文件出发开始分析。这两个bam文件的产生过程请参考[这里](#4a-preparation-of-bam-file)。本节主要参考[homer](http://homer.ucsd.edu/homer/)的文档，感兴趣的同学可以直接阅读他们提供的文档。

![](https://4115668567-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LPVsf5VZbQ7h14X29qW%2F-LP_9Iz3MDBXzTZokXaU%2F-LP_9OTudzcwhyVh5agj%2Fchip-seq-pipeline2.png?alt=media\&token=f99d0f1e-cfdb-412c-b76e-4f2d8d6213e5)

## 2) Data Structure

### 2a) getting software & data

* 本流程所有步骤都是用homer提供的工具完成。
* [GSE61210](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61210)利用chip-seq在酵母中对转录调控因子SNF2的结合位点进行了研究。我们这里用到了其中的[GSM1499619](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gsm1499619)(`input.bam`)和[GSM1499607](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gsm1499607)(`ip.bam`)两个样本。

#### 方法1: 使用docker

docker images的下载链接如[附表](https://book.ncrnalab.org/teaching/appendix/appendix-iv.-teaching#teaching-docker)所示，docker中我们已经准备好所需文件，例如 `.bam` 文件位于 Docker 中的 `/home/test/chip-seq/input`。

#### 方法2: 自行下载和安装

1. Install software：[HOMER](http://homer.ucsd.edu/homer/)
2. Download data: 从 \*\*\*\* [**Files needed** ](https://courses.ncrnalab.org/files)中的**Files/** 路径下的相应文件夹中下载

### 2b) input

| Format | Description               | Notes |
| ------ | ------------------------- | ----- |
| `.bam` | 将CHIP-seq的 Reads 比对到参考基因组 | -     |

### 2c) output

#### 2c.1) 主要输出文件如下所示：

|                | File format         | File description                         |
| -------------- | ------------------- | ---------------------------------------- |
| Peak calling   | peak file           | each row contatins information of a peak |
| Motif analysis | `homerResults.html` | de novo motif table in HTML format       |

peak table in peak file

```
# Column Headers:
#PeakID chr     start   end     strand  Normalized Tag Count    focus ratio   findPeaks Score  Total Tags      Control Tags (normalized to IP Experiment)    Fold Change vs Control   p-value vs Control      Fold Change vs Local    p-value vs Local       Clonal Fold Change
chrIII-1        chrIII  78346   78578   +       69987.1 0.862   5971.000000   5966.0   201.6   29.60   0.00e+00        26.54   0.00e+00        0.50
chrIII-2        chrIII  133     365     +       61226.9 0.775   5364.000000   5227.0   116.3   44.93   0.00e+00        59.92   0.00e+00        0.61
chrI-1  chrI    141663  141895  +       41225.6 0.854   3515.000000     3514.0169.1    20.78   0.00e+00        17.09   0.00e+00        0.50
chrII-1 chrII   165145  165377  +       35334.5 0.845   3015.000000     3018.0171.1    17.64   0.00e+00        14.47   0.00e+00        0.50
chrII-2 chrII   555827  556059  +       34817.1 0.790   2973.000000     2970.0159.6    18.61   0.00e+00        10.12   0.00e+00        0.50
```

![](https://4115668567-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LPVsf5VZbQ7h14X29qW%2F-LPVv7obRlTivTDgBNhr%2F-LPVvFOC4GAP0TOAcSDr%2Fchipseq-motif-output.png?generation=1540298185745137\&alt=media)

图1. motif table in `homerResults.html`

#### 2c.2) detailed description of output peaks

peak 文件每一列的含义描述如下:

| column | information               | description                                                                                                                                                                                                 |
| ------ | ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1      | PeakID                    | a *unique* name for each peak                                                                                                                                                                               |
| 2      | chr                       | chromosome where peak is located                                                                                                                                                                            |
| 3      | starting position of peak |                                                                                                                                                                                                             |
| 4      | ending position of peak   |                                                                                                                                                                                                             |
| 5      | Strand (+/-)              |                                                                                                                                                                                                             |
| 6      | Normalized Tag Counts     | number of tags found at the peak, normalized to 10 million total mapped tags (or defined by the user)                                                                                                       |
| 7      | Focus Ratio               | fraction of tags found appropriately upstream and downstream of the peak center                                                                                                                             |
| 8      | Peak score                | position adjusted reads from initial peak region - reads per position may be limited                                                                                                                        |
| 9      | total Tags                | number of tags found at the peak                                                                                                                                                                            |
| 10     | Control Tags              | normalized to IP Experiment                                                                                                                                                                                 |
| 11     | Fold Change vs Control    | putative peaks have 4-fold more normalized tags in the target experiment than the control                                                                                                                   |
| 12     | p-value vs Control        | HOMER uses the poisson distribution to determine the chance that the differences in tag counts are statistically significant (sequencing-depth dependent), requiring a cumulative poisson p-value of 0.0001 |
| 13     | Fold Change vs Local      | HOMER requires the tag density at peaks to be 4-fold greater than in the surrounding 10 kb region                                                                                                           |
| 14     | p-value vs Local          | the comparison must also pass a poisson p-value threshold of 0.0001                                                                                                                                         |
| 15     | Clonal Fold Change        | The fold threshold can be set with the `-C <#>` option (default: `-C 2`), if this ratio gets too high, the peak may arise from expanded repeats, and should be discarded                                    |

#### 2c.3) detailed description of Motif analysis output

Detailed output files of Motif analysis will produce many files, we only explain the main output -- `homerResults.html` in above. Here we will briefly introduce other files.

| file name                    | description                                                                                                                            |
| ---------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `homerMotifs.all.motifs`     | Simply the concatenated file composed of all the `homerMotifs.motifs<#>` files.                                                        |
| `motifFindingParameters.txt` | command used to execute `findMotifsGenome.pl`.                                                                                         |
| `knownResults.txt`           | text file containing statistics about known motif enrichment (open in EXCEL).                                                          |
| `seq.autonorm.tsv`           | autonormalization statistics for lower-order oligo normalization.                                                                      |
| `homerResults.html`          | HTML formatted output of de novo motif finding:                                                                                        |
| `knownResults.html`          | HTML formatted output of known motif finding.                                                                                          |
| `knownResults/ directory`    | contains files for the knownResults.html webpage, including `known<#>.motif` files for use in finding specific instance of each motif. |

## 3) Running Steps

首先进入到容器（在宿主机的终端中运行，详情请参见 [这里](https://lulab.gitbooks.io/teaching/getting-started.html#use-container)）：

```bash
docker exec -it bioinfo_tsinghua bash
```

以下步骤均在 `/home/test/chip-seq/` 下进行:

```bash
cd /home/test/chip-seq/
```

准备输出目录

```bash
mkdir output
```

### 3a) Peak Calling

我们首先使用`makeTagDirectory`命令产生peak calling所需的中间文件:

```bash
makeTagDirectory input/ip    input/ip.part.bam
makeTagDirectory input/input input/input.part.bam
```

这里的"tag"指的就是二代测序的read,只是CHIP-seq研究中的一种习惯叫法。

After this step, `input/ip` & `input/input` will contain several `.tags.tsv` files, as well as a file named `tagInfo.txt`. This file contains information about your sequencing run, including the total number of tags considered. This file is used by later peak-calling programs to quickly reference information about the experiment.

然后我们从这两组中间文件出发进行peak calling:

```bash
findPeaks input/ip/ -style factor -o output/part.peak -i input/input/
```

参数解释:

* `-style`: 有factor,histone两种选择。如前所述，转录因子和组蛋白修饰的CHIP-seq peak有不同的特性，所以在peak calling中也会使用不同的策略。
* `-o` : 输出文件路径
* `-i` : 存储input样本中间文件的"tag directory"

输出文件为 `/home/test/chip-seq/output/part.peak`, 示例如下

```
#PeakID chr     start   end     strand  Normalized Tag Count    focus ratio     findPeaks Score Total TagControl Tags (normalized to IP Experiment)       Fold Change vs Control  p-value vs Control      Fold Change vs Local      p-value vs Local        Clonal Fold Change
chrIII-1        chrIII  78346   78578   +       69987.1 0.862   5971.000000     5966.0  201.6   29.60   0.00e+00  26.54   0.00e+00        0.50
chrIII-2        chrIII  133     365     +       61226.9 0.775   5364.000000     5227.0  116.3   44.93   0.00e+00  59.92   0.00e+00        0.61
chrI-1  chrI    141663  141895  +       41225.6 0.854   3515.000000     3514.0  169.1   20.78   0.00e+00 17.09    0.00e+00        0.50
chrII-1 chrII   165145  165377  +       35334.5 0.845   3015.000000     3018.0  171.1   17.64   0.00e+00 14.47    0.00e+00        0.50
chrII-2 chrII   555827  556059  +       34817.1 0.790   2973.000000     2970.0  159.6   18.61   0.00e+00 10.12    0.00e+00        0.50
chrIII-3        chrIII  163527  163759  +       31266.1 0.826   2662.000000     2670.0  186.0   14.35   0.00e+00  14.16   0.00e+00        0.51
```

### 3b) Motif Analysis

* `findMotifsGenome.pl`是HOMER用来做motif finding的一个perl脚本。

```bash
findMotifsGenome.pl output/part.peak sacCer2 output/part.motif.output -len 8
```

参数解释:

* `-len`: motif长度，可以设置多个长度，用逗号隔开。默认为8,10,12。

还有两个比较重要的参数我们使用的是默认值:

* Region Size (`-size <#>`, `-size <#>,<#>`, `-size given`, default: 200) The size of the region used for motif finding is important. If analyzing ChIP-Seq peaks from a transcription factor, Chuck would recommend 50 bp for establishing the primary motif bound by a given transcription factor and 200 bp for finding both primary and "co-enriched" motifs for a transcription factor. When looking at histone marked regions, 500-1000 bp is probably a good idea。
* Number of motifs to find (`-S <#>`, default 25) Specifies the number of motifs of each length to find. 25 is already quite a bit.

最重要的输出文件为 `/home/test/chip-seq/output/part.motif.output/homerResults.html`, 示例输出参见 [这里](http://homer.ucsd.edu/homer/motif/HomerMotifDB/homerResults.html)

## 4) Tips/Utilities

### 4a) Preparation of bam file

前面我们直接从bam文件出发进行后续分析，大家也可以从fastq文件出发自己mapping获得bam文件:

* （1）下载CHIP-seq数据
  * The fastq data for yeast ChIP-seq was downloaded from [GSE61210](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61210) .
  * Input data was downloaded from [GSM1499619](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gsm1499619);
  * IP data was downloaded from [GSM1499607](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gsm1499607).
  * 建立酵母参考基因组的bowtie index

```bash
# download yeast sacCer2 reference genome
wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer2/bigZips/chromFa.tar.gz
tar -xvf chromfa.tar.gz
# concatenate fasta files for each chromosome into a single fasta file
cat *.fa > yeast.allchrom.fa
# build bowtie index
mkdir bowtie_index_yeast
bowtie-build yeast.allchrom.fa bowtie_index_yeast/sacCer2
```

* （2）mapping

```bash
bowtie -p 4  -m 1  -v 3  --best --strata bowtie_index_yeast/sacCer2 -q input/ip.fastq -S input/ip.sam
samtools sort input/ip.sam > input/ip.sorted.bam
samtools index input/ip.sorted.bam
```

* （3）sampling

为降低计算开销供学习之用，我们只sampling一部分alignment record用于后续分析:

```bash
samtools view input/ip.sorted.bam chrI chrII chrIII -b > input/ip.part.bam
```

### 4b) peak calling using MACS <a href="#chip-seq-macs" id="chip-seq-macs"></a>

[MACS](https://github.com/macs3-project/MACS)是CHIP-seq数据分析的另外一个常用工具。

我们的docker容器已经安装了MACS2。在容器的`/home/test/chip-seq`目录下，我们可以用如下命令进行peak calling:

```bash
macs2 callpeak -t input/ip.part.bam -c input/input.part.bam --outdir output/macs_peak \
    --name=yeast_macs_p05 --format=BAM --gsize=1.2e7 --tsize=50 --pvalue=1e-5
```

输出的结果都在`output/macs_peak`目录中, 其中`yeast_macs_p05_peaks.xls`包含了显著的peak的相关信息。这个文件虽然以'.xls'为后缀，实际上却是一个文本文件。文件中每行对应着一个peak，每列的含义如下:

* chromosome name
* start position of peak (1-based)
* end position of peak (1-based)
* length of peak region
* absolute peak summit position
* pileup height at peak summit, -log10(pvalue) for the peak summit (e.g. pvalue =1e-10, then this value should be 10)
* fold enrichment for this peak summit against random Poisson distribution with local lambda, -log10(qvalue) at peak summit

### 4c) Visualize peaks

[ChIPseeker](https://bioconductor.org/packages/release/bioc/html/ChIPseeker.html)是一个国人开发的R包，主要用来做chip-seq peaks的注释和可视化。在获得了peak之后，我们会希望对这些peak在基因组上的分布有一些直观的认识，例如peak的genomic interval相对于转录起始位点(transcription start site,TSS)的分布，peak在基因组不同区域(intergenic region)的分布等等。ChIPseeker提供了一些方便使用的函数进行这些可视化，我们可以直接调用，用不着自己去手动实现了。

下面的脚本提供了一个简单的例子。这个脚本需要用到ChIPseeker，GenomicFeatures和org.Sc.sgd.db三个bioconductor的R packages，请大家在宿主机上用`BiocManager::install`自行安装。所需的输入是我们用MACS2做peak calling在输出目录产生的`yeast_macs_p05_peaks.narrowPeak`文件。ChIPseeker还提供了很多其他的功能，有兴趣的同学请自行参考它的文档。

```r
library("ChIPseeker")
library("GenomicFeatures")
library("org.Sc.sgd.db")

# makeTxDbFromUCSC function is provided by GenomicFeatures package
# download gene annotation from UCSC, and save the transcript model in txdb object
# it may take a while ...
txdb <- makeTxDbFromUCSC(genome ="sacCer2", tablename = "ensGene") 
# Read peak file generated by MACS
# readPeakFile fucntion is provided by ChIPseeker
yeast <- readPeakFile("yeast_macs_p05_peaks.narrowPeak")

# ChIP peaks binding TSS region
# getPromoters function is provided by GenomicFeatures
# define promoters as upstream 3000 nt and downstream 3000 nt flanking TSS
promoter <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000)
tagMatrix <- getTagMatrix(peak = yeast, windows = promoter)
tagHeatmap(tagMatrix = tagMatrix, xlim = c(-3000, 3000), color = "red")

# Average profile of ChIP peaks binding to TSS region
plotAvgProf(tagMatrix, xlim = c(-3000, 3000), conf = 0.95, resample = 1000,
            xlab = "Genomone Region (5'->3')", ylab = "Read Count Frequency")
```

`tagHeatmap`输出一个heatmap，每一行对应一个转录本，每一列对应相对TSS的一个位置(每个转录本被根据TSS的位置align到一起)，颜色表示落在这个位置的peaks的数量:

![Tag Heatmap](https://4115668567-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LPVsf5VZbQ7h14X29qW%2Fuploads%2Fgit-blob-eab712050644049e83f5ed813ff8ac1a45db5611%2F3.3.tag.heatmap.png?alt=media)

`plotAvgProf`输出peaks在不同的转录本中相对于TSS总体的coverage profile:

![Average Profile](https://4115668567-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LPVsf5VZbQ7h14X29qW%2Fuploads%2Fgit-blob-f3abb1ffeaa271ef0f0f0801dc6e2c68b4a90129%2F3.3.tag.profile.png?alt=media)

## 5) Homework

1. 请解释在ChIP-seq实验中为什么一般都要平行做一个 control （通常叫 input）的实验。
2. 请解释 `findPeaks` 和 `findMotifsGenome.pl` 主要参数的含义。
3. 我们在容器的`/home/test/chip-seq/homework`目录中提供了酵母Snf1蛋白CHIP-seq的bam文件，`ip.chrom_part.bam`为IP实验数据，`input.chrom\_part.bam`为背景数据。请大家从这两个文件出发，用homer重复本章中介绍的peak calling和motif finding分析。请大家提交找到的motif的截图，以及`Fold Change (vs Control)` >=8且`p-value (vs Control)` < $$10^{-8}$$\`的peaks(建议放在同一个文件中提交)。

## 6) Teaching Video

* see Videos in the \*\*\*\* [**Files needed**](https://courses.ncrnalab.org/files)

## 7) 延伸阅读

<https://github.com/crazyhottommy/ChIP-seq-analysis> 收集了大量关于CHIP-seq分析的参考资源，推荐阅读
