3.ChIP-seq
- 染色质免疫共沉淀测序(Chromatin immunoprecipitation sequencing,简称ChIP-seq)是一种用于分析蛋白质与DNA相互作用的高通量实验技术。常见的ChIP-seq实验主要用来确定特定一种转录因子的结合位点,以及特定组蛋白修饰在基因组上的的分布。ChIP-seq是现代生物学研究基因调控的一个重要的手段。本节中我们对ChIP-seq数据的基本分析方法进行介绍。
- ChIP-seq实验通常会有一组针对感兴趣的蛋白或表观修饰利用特定的抗体进行富集的样本,还有一组不做富集的control样本(一般被称为"input")。ChIP-seq数据分析最主要的目的就是通过比较这两组数据,推断出基因组的哪些位置是我们感兴趣的位点。
- 我们对于ChIP-seq数据的处理也是从mapping开始。chip-seq中基因组上被富集到的区域,在reads coverage上会体现为一个尖峰(peak)。由于高通量测序数据往往有比较大的噪声,我们需要利用一些算法把peak和背景区分开来,这个过程一般被称为"peak calling"。peak calling的结果就是一组genomic intervals。我们认为这样一些genomic intervals就对应着转录因子的结合位点(或者存在特定表观修饰的区域)。
- 转录因子的结合区域一般比较短,对应的peaks比较狭窄,被称为"narrow peaks"。存在组蛋白修饰的区域通常更长一些,对应的peaks也比较宽,被称为"broad peaks"。
- 转录因子的结合通常有比较强的序列偏好性,这种序列偏好性可以用"motif"来描述。在后续章节中,我们将进行更细致的介绍。

- 本流程所有步骤都是用homer提供的工具完成。
- GSE61210利用chip-seq在酵母中对转录调控因子SNF2的结合位点进行了研究。我们这里用到了其中的GSM1499619(
input.bam
)和GSM1499607(ip.bam
)两个样本。
- 1.
- 2.
Format | Description | Notes |
---|---|---|
.bam | 将CHIP-seq的 Reads 比对到参考基因组 | - |
Text | File format | File description |
---|---|---|
Peak calling | peak file | each row contatins information of a peak |
Motif analysis | homerResults.html | de novo motif table in HTML format |
peak table in peak file
# Column Headers:
#PeakID chr start end strand Normalized Tag Count focus ratio findPeaks Score Total Tags Control Tags (normalized to IP Experiment) Fold Change vs Control p-value vs Control Fold Change vs Local p-value vs Local Clonal Fold Change
chrIII-1 chrIII 78346 78578 + 69987.1 0.862 5971.000000 5966.0 201.6 29.60 0.00e+00 26.54 0.00e+00 0.50
chrIII-2 chrIII 133 365 + 61226.9 0.775 5364.000000 5227.0 116.3 44.93 0.00e+00 59.92 0.00e+00 0.61
chrI-1 chrI 141663 141895 + 41225.6 0.854 3515.000000 3514.0169.1 20.78 0.00e+00 17.09 0.00e+00 0.50
chrII-1 chrII 165145 165377 + 35334.5 0.845 3015.000000 3018.0171.1 17.64 0.00e+00 14.47 0.00e+00 0.50
chrII-2 chrII 555827 556059 + 34817.1 0.790 2973.000000 2970.0159.6 18.61 0.00e+00 10.12 0.00e+00 0.50

图1. motif table in
homerResults.html
peak 文件每一列的含义描述如下:
column | information | description |
---|---|---|
1 | PeakID | a unique name for each peak |
2 | chr | chromosome where peak is located |
3 | starting position of peak | |
4 | ending position of peak | |
5 | Strand (+/-) | |
6 | Normalized Tag Counts | number of tags found at the peak, normalized to 10 million total mapped tags (or defined by the user) |
7 | Focus Ratio | fraction of tags found appropriately upstream and downstream of the peak center |
8 | Peak score | position adjusted reads from initial peak region - reads per position may be limited |
9 | total Tags | number of tags found at the peak |
10 | Control Tags | normalized to IP Experiment |
11 | Fold Change vs Control | putative peaks have 4-fold more normalized tags in the target experiment than the control |
12 | p-value vs Control | HOMER uses the poisson distribution to determine the chance that the differences in tag counts are statistically significant (sequencing-depth dependent), requiring a cumulative poisson p-value of 0.0001 |
13 | Fold Change vs Local | HOMER requires the tag density at peaks to be 4-fold greater than in the surrounding 10 kb region |
14 | p-value vs Local | the comparison must also pass a poisson p-value threshold of 0.0001 |
15 | Clonal Fold Change | The fold threshold can be set with the -C <#> option (default: -C 2 ), if this ratio gets too high, the peak may arise from expanded repeats, and should be discarded |
Detailed output files of Motif analysis will produce many files, we only explain the main output --
homerResults.html
in above. Here we will briefly introduce other files.file name | description |
---|---|
homerMotifs.all.motifs | Simply the concatenated file composed of all the homerMotifs.motifs<#> files. |
motifFindingParameters.txt | command used to execute findMotifsGenome.pl . |
knownResults.txt | text file containing statistics about known motif enrichment (open in EXCEL). |
seq.autonorm.tsv | autonormalization statistics for lower-order oligo normalization. |
homerResults.html | HTML formatted output of de novo motif finding: |
knownResults.html | HTML formatted output of known motif finding. |
knownResults/ directory | contains files for the knownResults.html webpage, including known<#>.motif files for use in finding specific instance of each motif. |
docker exec -it bioinfo_tsinghua bash
以下步骤均在
/home/test/chip-seq/
下进行:cd /home/test/chip-seq/
准备输出目录
mkdir output
我们首先使用
makeTagDirectory
命令产生peak calling所需的中间文件:makeTagDirectory input/ip input/ip.part.bam
makeTagDirectory input/input input/input.part.bam
这里的"tag"指的就是二代测序的read,只是CHIP-seq研究中的一种习惯叫法。
After this step,
input/ip
& input/input
will contain several .tags.tsv
files, as well as a file named tagInfo.txt
. This file contains information about your sequencing run, including the total number of tags considered. This file is used by later peak-calling programs to quickly reference information about the experiment.然后我们从这两组中间文件出发进行peak calling:
findPeaks input/ip/ -style factor -o output/part.peak -i input/input/
参数解释:
-style
: 有factor,histone两种选择。如前所述,转录因子和组蛋白修饰的CHIP-seq peak有不同的特性,所以在peak calling中也会使用不同的策略。-o
: 输出文件路径-i
: 存储input样本中间文件的"tag directory"
输出文件为
/home/test/chip-seq/output/part.peak
, 示例如下#PeakID chr start end strand Normalized Tag Count focus ratio findPeaks Score Total TagControl Tags (normalized to IP Experiment) Fold Change vs Control p-value vs Control Fold Change vs Local p-value vs Local Clonal Fold Change
chrIII-1 chrIII 78346 78578 + 69987.1 0.862 5971.000000 5966.0 201.6 29.60 0.00e+00 26.54 0.00e+00 0.50
chrIII-2 chrIII 133 365 + 61226.9 0.775 5364.000000 5227.0 116.3 44.93 0.00e+00 59.92 0.00e+00 0.61
chrI-1 chrI 141663 141895 + 41225.6 0.854 3515.000000 3514.0 169.1 20.78 0.00e+00 17.09 0.00e+00 0.50
chrII-1 chrII 165145 165377 + 35334.5 0.845 3015.000000 3018.0 171.1 17.64 0.00e+00 14.47 0.00e+00 0.50
chrII-2 chrII 555827 556059 + 34817.1 0.790 2973.000000 2970.0 159.6 18.61 0.00e+00 10.12 0.00e+00 0.50
chrIII-3 chrIII 163527 163759 + 31266.1 0.826 2662.000000 2670.0 186.0 14.35 0.00e+00 14.16 0.00e+00 0.51
findMotifsGenome.pl
是HOMER用来做motif finding的一个perl脚本。
findMotifsGenome.pl output/part.peak sacCer2 output/part.motif.output -len 8
参数解释:
-len
: motif长度,可以设置多个长度,用逗号隔开。默认为8,10,12。
还有两个比较重要的参数我们使用的是默认值:
- Region Size (
-size <#>
,-size <#>,<#>
,-size given
, default: 200) The size of the region used for motif finding is important. If analyzing ChIP-Seq peaks from a transcription factor, Chuck would recommend 50 bp for establishing the primary motif bound by a given transcription factor and 200 bp for finding both primary and "co-enriched" motifs for a transcription factor. When looking at histone marked regions, 500-1000 bp is probably a good idea。 - Number of motifs to find (
-S <#>
, default 25) Specifies the number of motifs of each length to find. 25 is already quite a bit.
前面我们直接从bam文件出发进行后续分析,大家也可以从fastq文件出发自己mapping获得bam文件:
- (1)下载CHIP-seq数据
- 建立酵母参考基因组的bowtie index
# download yeast sacCer2 reference genome
wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer2/bigZips/chromFa.tar.gz
tar -xvf chromfa.tar.gz
# concatenate fasta files for each chromosome into a single fasta file
cat *.fa > yeast.allchrom.fa
# build bowtie index
mkdir bowtie_index_yeast
bowtie-build yeast.allchrom.fa bowtie_index_yeast/sacCer2
- (2)mapping
bowtie -p 4 -m 1 -v 3 --best --strata bowtie_index_yeast/sacCer2 -q input/ip.fastq -S input/ip.sam
samtools sort input/ip.sam > input/ip.sorted.bam
samtools index input/ip.sorted.bam
- (3)sampling
为降低计算开销供学习之用,我们只sampling一部分alignment record用于后续分析:
samtools view input/ip.sorted.bam chrI chrII chrIII -b > input/ip.part.bam
我们的docker容器已经安装了MACS2。在容器的
/home/test/chip-seq
目录下,我们可以用如下命令进行peak calling:macs2 callpeak -t input/ip.part.bam -c input/input.part.bam --outdir output/macs_peak \
--name=yeast_macs_p05 --format=BAM --gsize=1.2e7 --tsize=50 --pvalue=1e-5
输出的结果都在
output/macs_peak
目录中, 其中yeast_macs_p05_peaks.xls
包含了显著的peak的相关信息。这个文件虽然以'.xls'为后缀,实际上却是一个文本文件。文件中每行对应着一个peak,每列的含义如下:- chromosome name
- start position of peak (1-based)
- end position of peak (1-based)
- length of peak region
- absolute peak summit position
- pileup height at peak summit, -log10(pvalue) for the peak summit (e.g. pvalue =1e-10, then this value should be 10)
- fold enrichment for this peak summit against random Poisson distribution with local lambda, -log10(qvalue) at peak summit
ChIPseeker是一个国人开发的R包,主要用来做chip-seq peaks的注释和可视化。在获得了peak之后,我们会希望对这些peak在基因组上的分布有一些直观的认识,例如peak的genomic interval相对于转录起始位点(transcription start site,TSS)的分布,peak在基因组不同区域(intergenic region)的分布等等。ChIPseeker提供了一些方便使用的函数进行这些可视化,我们可以直接调用,用不着自己去手动实现了。
下面的脚本提供了一个简单的例子。这个脚本需要用到ChIPseeker,GenomicFeatures和org.Sc.sgd.db三个bioconductor的R packages,请大家在宿主机上用
BiocManager::install
自行安装。所需的输入是我们用MACS2做peak calling在输出目录产生的yeast_macs_p05_peaks.narrowPeak
文件。ChIPseeker还提供了很多其他的功能,有兴趣的同学请自行参考它的文档。library("ChIPseeker")
library("GenomicFeatures")
library("org.Sc.sgd.db")
# makeTxDbFromUCSC function is provided by GenomicFeatures package
# download gene annotation from UCSC, and save the transcript model in txdb object
# it may take a while ...
txdb <- makeTxDbFromUCSC(genome ="sacCer2", tablename = "ensGene")
# Read peak file generated by MACS
# readPeakFile fucntion is provided by ChIPseeker
yeast <- readPeakFile("yeast_macs_p05_peaks.narrowPeak")
# ChIP peaks binding TSS region
# getPromoters function is provided by GenomicFeatures
# define promoters as upstream 3000 nt and downstream 3000 nt flanking TSS
promoter <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000)
tagMatrix <- getTagMatrix(peak = yeast, windows = promoter)
tagHeatmap(tagMatrix = tagMatrix, xlim = c(-3000, 3000), color = "red")
# Average profile of ChIP peaks binding to TSS region
plotAvgProf(tagMatrix, xlim = c(-3000, 3000), conf = 0.95, resample = 1000,
xlab = "Genomone Region (5'->3')", ylab = "Read Count Frequency")
tagHeatmap
输出一个heatmap,每一行对应一个转录本,每一列对应相对TSS的一个位置(每个转录本被根据TSS的位置align到一起),颜色表示落在这个位置的peaks的数量:
Tag Heatmap
plotAvgProf
输出peaks在不同的转录本中相对于TSS总体的coverage profile:
Average Profile
- 1.请解释在ChIP-seq实验中为什么一般都要平行做一个 control (通常叫 input)的实验。
- 2.请解释
findPeaks
和findMotifsGenome.pl
主要参数的含义。 - 3.我们在容器的
/home/test/chip-seq/homework
目录中提供了酵母Snf1蛋白CHIP-seq的bam文件,ip.chrom_part.bam
为IP实验数据,input.chrom\_part.bam
为背景数据。请大家从这两个文件出发,用homer重复本章中介绍的peak calling和motif finding分析。请大家提交找到的motif的截图,以及Fold Change (vs Control)
>=8且p-value (vs Control)
<`的peaks(建议放在同一个文件中提交)。