3.ChIP-seq

  • 染色质免疫共沉淀测序(Chromatin immunoprecipitation sequencing,简称ChIP-seq)是一种用于分析蛋白质与DNA相互作用的高通量实验技术。常见的ChIP-seq实验主要用来确定特定一种转录因子的结合位点,以及特定组蛋白修饰在基因组上的的分布。ChIP-seq是现代生物学研究基因调控的一个重要的手段。本节中我们对ChIP-seq数据的基本分析方法进行介绍。

  • ChIP-seq实验通常会有一组针对感兴趣的蛋白或表观修饰利用特定的抗体进行富集的样本,还有一组不做富集的control样本(一般被称为"input")。ChIP-seq数据分析最主要的目的就是通过比较这两组数据,推断出基因组的哪些位置是我们感兴趣的位点。

  • 我们对于ChIP-seq数据的处理也是从mapping开始。chip-seq中基因组上被富集到的区域,在reads coverage上会体现为一个尖峰(peak)。由于高通量测序数据往往有比较大的噪声,我们需要利用一些算法把peak和背景区分开来,这个过程一般被称为"peak calling"。peak calling的结果就是一组genomic intervals。我们认为这样一些genomic intervals就对应着转录因子的结合位点(或者存在特定表观修饰的区域)。

  • 转录因子的结合区域一般比较短,对应的peaks比较狭窄,被称为"narrow peaks"。存在组蛋白修饰的区域通常更长一些,对应的peaks也比较宽,被称为"broad peaks"。

  • 转录因子的结合通常有比较强的序列偏好性,这种序列偏好性可以用"motif"来描述。在后续章节中,我们将进行更细致的介绍。

  • 我们本节主要介绍的homer这个工具,也会对MACS和R包ChIPseeker进行简要介绍。

1) Pipeline

mapping一章我们已经对DNA测序的reads向基因组的比对,这对于chip-seq同样是适用的。所以我们跳过mapping一步,从实验组和input对应的两个bam文件出发开始分析。这两个bam文件的产生过程请参考这里。本节主要参考homer的文档,感兴趣的同学可以直接阅读他们提供的文档。

2) Data Structure

2a) getting software & data

  • 本流程所有步骤都是用homer提供的工具完成。

  • GSE61210利用chip-seq在酵母中对转录调控因子SNF2的结合位点进行了研究。我们这里用到了其中的GSM1499619(input.bam)和GSM1499607(ip.bam)两个样本。

方法1: 使用docker

docker images的下载链接如附表所示,docker中我们已经准备好所需文件,例如 .bam 文件位于 Docker 中的 /home/test/chip-seq/input

方法2: 自行下载和安装

  1. Install software:HOMER

  2. Download data: 从 **** Files needed 中的Files/ 路径下的相应文件夹中下载

2b) input

FormatDescriptionNotes

.bam

将CHIP-seq的 Reads 比对到参考基因组

-

2c) output

2c.1) 主要输出文件如下所示:

File formatFile description

Peak calling

peak file

each row contatins information of a peak

Motif analysis

homerResults.html

de novo motif table in HTML format

peak table in peak file

# Column Headers:
#PeakID chr     start   end     strand  Normalized Tag Count    focus ratio   findPeaks Score  Total Tags      Control Tags (normalized to IP Experiment)    Fold Change vs Control   p-value vs Control      Fold Change vs Local    p-value vs Local       Clonal Fold Change
chrIII-1        chrIII  78346   78578   +       69987.1 0.862   5971.000000   5966.0   201.6   29.60   0.00e+00        26.54   0.00e+00        0.50
chrIII-2        chrIII  133     365     +       61226.9 0.775   5364.000000   5227.0   116.3   44.93   0.00e+00        59.92   0.00e+00        0.61
chrI-1  chrI    141663  141895  +       41225.6 0.854   3515.000000     3514.0169.1    20.78   0.00e+00        17.09   0.00e+00        0.50
chrII-1 chrII   165145  165377  +       35334.5 0.845   3015.000000     3018.0171.1    17.64   0.00e+00        14.47   0.00e+00        0.50
chrII-2 chrII   555827  556059  +       34817.1 0.790   2973.000000     2970.0159.6    18.61   0.00e+00        10.12   0.00e+00        0.50

图1. motif table in homerResults.html

2c.2) detailed description of output peaks

peak 文件每一列的含义描述如下:

columninformationdescription

1

PeakID

a unique name for each peak

2

chr

chromosome where peak is located

3

starting position of peak

4

ending position of peak

5

Strand (+/-)

6

Normalized Tag Counts

number of tags found at the peak, normalized to 10 million total mapped tags (or defined by the user)

7

Focus Ratio

fraction of tags found appropriately upstream and downstream of the peak center

8

Peak score

position adjusted reads from initial peak region - reads per position may be limited

9

total Tags

number of tags found at the peak

10

Control Tags

normalized to IP Experiment

11

Fold Change vs Control

putative peaks have 4-fold more normalized tags in the target experiment than the control

12

p-value vs Control

HOMER uses the poisson distribution to determine the chance that the differences in tag counts are statistically significant (sequencing-depth dependent), requiring a cumulative poisson p-value of 0.0001

13

Fold Change vs Local

HOMER requires the tag density at peaks to be 4-fold greater than in the surrounding 10 kb region

14

p-value vs Local

the comparison must also pass a poisson p-value threshold of 0.0001

15

Clonal Fold Change

The fold threshold can be set with the -C <#> option (default: -C 2), if this ratio gets too high, the peak may arise from expanded repeats, and should be discarded

2c.3) detailed description of Motif analysis output

Detailed output files of Motif analysis will produce many files, we only explain the main output -- homerResults.html in above. Here we will briefly introduce other files.

file namedescription

homerMotifs.all.motifs

Simply the concatenated file composed of all the homerMotifs.motifs<#> files.

motifFindingParameters.txt

command used to execute findMotifsGenome.pl.

knownResults.txt

text file containing statistics about known motif enrichment (open in EXCEL).

seq.autonorm.tsv

autonormalization statistics for lower-order oligo normalization.

homerResults.html

HTML formatted output of de novo motif finding:

knownResults.html

HTML formatted output of known motif finding.

knownResults/ directory

contains files for the knownResults.html webpage, including known<#>.motif files for use in finding specific instance of each motif.

3) Running Steps

首先进入到容器(在宿主机的终端中运行,详情请参见 这里):

docker exec -it bioinfo_tsinghua bash

以下步骤均在 /home/test/chip-seq/ 下进行:

cd /home/test/chip-seq/

准备输出目录

mkdir output

3a) Peak Calling

我们首先使用makeTagDirectory命令产生peak calling所需的中间文件:

makeTagDirectory input/ip    input/ip.part.bam
makeTagDirectory input/input input/input.part.bam

这里的"tag"指的就是二代测序的read,只是CHIP-seq研究中的一种习惯叫法。

After this step, input/ip & input/input will contain several .tags.tsv files, as well as a file named tagInfo.txt. This file contains information about your sequencing run, including the total number of tags considered. This file is used by later peak-calling programs to quickly reference information about the experiment.

然后我们从这两组中间文件出发进行peak calling:

findPeaks input/ip/ -style factor -o output/part.peak -i input/input/

参数解释:

  • -style: 有factor,histone两种选择。如前所述,转录因子和组蛋白修饰的CHIP-seq peak有不同的特性,所以在peak calling中也会使用不同的策略。

  • -o : 输出文件路径

  • -i : 存储input样本中间文件的"tag directory"

输出文件为 /home/test/chip-seq/output/part.peak, 示例如下

#PeakID chr     start   end     strand  Normalized Tag Count    focus ratio     findPeaks Score Total TagControl Tags (normalized to IP Experiment)       Fold Change vs Control  p-value vs Control      Fold Change vs Local      p-value vs Local        Clonal Fold Change
chrIII-1        chrIII  78346   78578   +       69987.1 0.862   5971.000000     5966.0  201.6   29.60   0.00e+00  26.54   0.00e+00        0.50
chrIII-2        chrIII  133     365     +       61226.9 0.775   5364.000000     5227.0  116.3   44.93   0.00e+00  59.92   0.00e+00        0.61
chrI-1  chrI    141663  141895  +       41225.6 0.854   3515.000000     3514.0  169.1   20.78   0.00e+00 17.09    0.00e+00        0.50
chrII-1 chrII   165145  165377  +       35334.5 0.845   3015.000000     3018.0  171.1   17.64   0.00e+00 14.47    0.00e+00        0.50
chrII-2 chrII   555827  556059  +       34817.1 0.790   2973.000000     2970.0  159.6   18.61   0.00e+00 10.12    0.00e+00        0.50
chrIII-3        chrIII  163527  163759  +       31266.1 0.826   2662.000000     2670.0  186.0   14.35   0.00e+00  14.16   0.00e+00        0.51

3b) Motif Analysis

  • findMotifsGenome.pl是HOMER用来做motif finding的一个perl脚本。

findMotifsGenome.pl output/part.peak sacCer2 output/part.motif.output -len 8

参数解释:

  • -len: motif长度,可以设置多个长度,用逗号隔开。默认为8,10,12。

还有两个比较重要的参数我们使用的是默认值:

  • Region Size (-size <#>, -size <#>,<#>, -size given, default: 200) The size of the region used for motif finding is important. If analyzing ChIP-Seq peaks from a transcription factor, Chuck would recommend 50 bp for establishing the primary motif bound by a given transcription factor and 200 bp for finding both primary and "co-enriched" motifs for a transcription factor. When looking at histone marked regions, 500-1000 bp is probably a good idea。

  • Number of motifs to find (-S <#>, default 25) Specifies the number of motifs of each length to find. 25 is already quite a bit.

最重要的输出文件为 /home/test/chip-seq/output/part.motif.output/homerResults.html, 示例输出参见 这里

4) Tips/Utilities

4a) Preparation of bam file

前面我们直接从bam文件出发进行后续分析,大家也可以从fastq文件出发自己mapping获得bam文件:

  • (1)下载CHIP-seq数据

    • The fastq data for yeast ChIP-seq was downloaded from GSE61210 .

    • Input data was downloaded from GSM1499619;

    • IP data was downloaded from GSM1499607.

    • 建立酵母参考基因组的bowtie index

# download yeast sacCer2 reference genome
wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer2/bigZips/chromFa.tar.gz
tar -xvf chromfa.tar.gz
# concatenate fasta files for each chromosome into a single fasta file
cat *.fa > yeast.allchrom.fa
# build bowtie index
mkdir bowtie_index_yeast
bowtie-build yeast.allchrom.fa bowtie_index_yeast/sacCer2
  • (2)mapping

bowtie -p 4  -m 1  -v 3  --best --strata bowtie_index_yeast/sacCer2 -q input/ip.fastq -S input/ip.sam
samtools sort input/ip.sam > input/ip.sorted.bam
samtools index input/ip.sorted.bam
  • (3)sampling

为降低计算开销供学习之用,我们只sampling一部分alignment record用于后续分析:

samtools view input/ip.sorted.bam chrI chrII chrIII -b > input/ip.part.bam

4b) peak calling using MACS

MACS是CHIP-seq数据分析的另外一个常用工具。

我们的docker容器已经安装了MACS2。在容器的/home/test/chip-seq目录下,我们可以用如下命令进行peak calling:

macs2 callpeak -t input/ip.part.bam -c input/input.part.bam --outdir output/macs_peak \
    --name=yeast_macs_p05 --format=BAM --gsize=1.2e7 --tsize=50 --pvalue=1e-5

输出的结果都在output/macs_peak目录中, 其中yeast_macs_p05_peaks.xls包含了显著的peak的相关信息。这个文件虽然以'.xls'为后缀,实际上却是一个文本文件。文件中每行对应着一个peak,每列的含义如下:

  • chromosome name

  • start position of peak (1-based)

  • end position of peak (1-based)

  • length of peak region

  • absolute peak summit position

  • pileup height at peak summit, -log10(pvalue) for the peak summit (e.g. pvalue =1e-10, then this value should be 10)

  • fold enrichment for this peak summit against random Poisson distribution with local lambda, -log10(qvalue) at peak summit

4c) Visualize peaks

ChIPseeker是一个国人开发的R包,主要用来做chip-seq peaks的注释和可视化。在获得了peak之后,我们会希望对这些peak在基因组上的分布有一些直观的认识,例如peak的genomic interval相对于转录起始位点(transcription start site,TSS)的分布,peak在基因组不同区域(intergenic region)的分布等等。ChIPseeker提供了一些方便使用的函数进行这些可视化,我们可以直接调用,用不着自己去手动实现了。

下面的脚本提供了一个简单的例子。这个脚本需要用到ChIPseeker,GenomicFeatures和org.Sc.sgd.db三个bioconductor的R packages,请大家在宿主机上用BiocManager::install自行安装。所需的输入是我们用MACS2做peak calling在输出目录产生的yeast_macs_p05_peaks.narrowPeak文件。ChIPseeker还提供了很多其他的功能,有兴趣的同学请自行参考它的文档。

library("ChIPseeker")
library("GenomicFeatures")
library("org.Sc.sgd.db")

# makeTxDbFromUCSC function is provided by GenomicFeatures package
# download gene annotation from UCSC, and save the transcript model in txdb object
# it may take a while ...
txdb <- makeTxDbFromUCSC(genome ="sacCer2", tablename = "ensGene") 
# Read peak file generated by MACS
# readPeakFile fucntion is provided by ChIPseeker
yeast <- readPeakFile("yeast_macs_p05_peaks.narrowPeak")

# ChIP peaks binding TSS region
# getPromoters function is provided by GenomicFeatures
# define promoters as upstream 3000 nt and downstream 3000 nt flanking TSS
promoter <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000)
tagMatrix <- getTagMatrix(peak = yeast, windows = promoter)
tagHeatmap(tagMatrix = tagMatrix, xlim = c(-3000, 3000), color = "red")

# Average profile of ChIP peaks binding to TSS region
plotAvgProf(tagMatrix, xlim = c(-3000, 3000), conf = 0.95, resample = 1000,
            xlab = "Genomone Region (5'->3')", ylab = "Read Count Frequency")

tagHeatmap输出一个heatmap,每一行对应一个转录本,每一列对应相对TSS的一个位置(每个转录本被根据TSS的位置align到一起),颜色表示落在这个位置的peaks的数量:

plotAvgProf输出peaks在不同的转录本中相对于TSS总体的coverage profile:

5) Homework

  1. 请解释在ChIP-seq实验中为什么一般都要平行做一个 control (通常叫 input)的实验。

  2. 请解释 findPeaksfindMotifsGenome.pl 主要参数的含义。

  3. 我们在容器的/home/test/chip-seq/homework目录中提供了酵母Snf1蛋白CHIP-seq的bam文件,ip.chrom_part.bam为IP实验数据,input.chrom\_part.bam为背景数据。请大家从这两个文件出发,用homer重复本章中介绍的peak calling和motif finding分析。请大家提交找到的motif的截图,以及Fold Change (vs Control) >=8且p-value (vs Control) < 10810^{-8}`的peaks(建议放在同一个文件中提交)。

6) Teaching Video

7) 延伸阅读

https://github.com/crazyhottommy/ChIP-seq-analysis 收集了大量关于CHIP-seq分析的参考资源,推荐阅读

Last updated