# 4.1.Sequence Motif

## 0) 背景

在[CHIP-seq](https://book.ncrnalab.org/teaching/part-iii.-ngs-data-analyses/3.chip-seq)一节中，我们用[homer](http://homer.ucsd.edu/homer/)中已经用提供的工具从ChIP-seq的peak对应的序列出发进行过motif discovery的分析。homer主要是针对CHIP-seq的分析开发的，本节介绍的[meme suite](https://meme-suite.org/)是一套更通用的专门针对motif分析开发的工具。meme最早是1994年开发出来的一种用EM算法进行motif discovery的程序。后来meme和其他多个相关的工具被整合到了一个工具包[meme suite](https://meme-suite.org/)里面。[meme suite](https://meme-suite.org/)是目前用来做motif相关分析的最流行的一套工具。

本节中，我们用meme suite提供的[meme](https://meme-suite.org/meme/doc/meme.html)和[ame](https://meme-suite.org/meme/doc/ame.html)两个工具分别解决以下两个问题:

* motif discovery: 假设有一组核酸序列，生物学知识告诉我们这组序列中很可能有相当一部分共享了一个调控因子的结合位点(例如ChIP-seq和CLIP-seq的peaks，共表达基因的启动子，翻译存在共调控的转录本的UTR区域等等)。我们希望从这组序列出发，计算出一组PWMs，用来描述这种可能存在的调控因子的序列偏好性。
* motif enrichment: 假设已知一组PWMs，我们希望判断这组PWMs在给定的一组序列中是否相对于另外一组背景序列有富集。

meme suite里面的大多数工具都已经非常成熟我们只需要选择合适的参数，用一行命令就可以实现motif discovery。相对而言，准备输入序列反而更需要一些生物学知识，在编码上也更需要花费精力。所以在本例中我们对准备输入序列的过程，以及这一过程用到的R包[GenomicFeatures](https://bioconductor.org/packages/release/bioc/html/GenomicFeatures.html)进行介绍。

## 1) workflow

![](https://4115668567-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LPVsf5VZbQ7h14X29qW%2F-Ll_k3bEdWyNMbJ7KL0f%2F-Ll_k4iTT8LueXdHnuVu%2Fseq_motif.pipeline.png?generation=1565075202446213\&alt=media)

在我们给出的这个例子中，假设我们通过某种方式得到了一个gene list，对这些基因的调控序列(promoter, 5' UTR和3' UTR)，我们现在考虑两个问题：

1. 它们中是否存在共有的motif
2. 它们是否富集了已知的motif

本节的操作主要包括两个步骤，一是准备输入序列，二是进行motif finding和motif enrichment。

* 在第一个步骤中，我们需要获得感兴趣的基因的promoter, 5' UTR和3' UTR，作为motif discovery的输入。motif的"enrichment"是相对于背景序列而言的，所以我们还需要准备一些背景序列。背景序列有很多种合理的选取方法，如shuffle输入序列，从基因组上随机选取一些genomic intervals等。我们在docker镜像中已经提供了准备好的序列文件，可以跳过这一步。
* 在第二个步骤中，我们调用[meme](https://meme-suite.org/meme/doc/meme.html)和[ame](https://meme-suite.org/meme/doc/ame.html)，产生所需的结果。

## 2) running steps

### 2a) 准备环境和文件

#### docker

使用docker的同学可以从[共享目录](https://cloud.tsinghua.edu.cn/d/ad22768345664924b202/?p=%2FDocker\&mode=list)下载`bioinfo_motif.2.0.tar.gz`镜像，然后操作如下:

```bash
docker load -i bioinfo_motif.2.0.tar.gz
docker run -dt --name motif --restart unless-stopped -v {host.machine.shared.directory}:/data bioinfo/motif:2.0 /bin/bash
docker exec -it -u test motif /bin/bash
```

如果希望自己准备输入序列，还需要从[这里](https://cloud.tsinghua.edu.cn/d/ad22768345664924b202/?p=%2FFiles%2FPART_III%2F4.Motif%2Fsequence_motif%2Fgenome\&mode=list)下载人的hg38参考基因组和gencode的注释文件，放在共享目录的`genome`子目录下。

```bash
cd /home/test/motif/sequence_motif
ln -s /data/genome . #这一步相当于建立一个快捷方式，方便访问
gunzip genome/GRCh38.p10.genome.fa.gz #解压
gunzip genome/gencode.v27.annotation.gtf.gz
```

### 2b) 准备输入序列

如前所述，我们提供了准备好的序列文件，如果只希望了解motif分析本身可以跳过这一节。 我们提供了下面两个脚本，根据提供的gene list和gtf注释文件从基因组中提取这些基因的一些可能具有调控作用的序列(promoter, 5'UTR和3' UTR)作为motif discovery的输入。

* 在第一个脚本中，我们利用R包[GenomicFeatures](https://bioconductor.org/packages/release/bioc/html/GenomicFeatures.html)读取gtf注释文件，对于感兴趣的基因，用`fiveUTRsByTranscript`，`threeUTRsByTranscript`和`promoters`三个函数获取相应的genomic intervals。对于一个基因产生多个isoform的情况，我们这里只考虑最长的一个转录本。

```r
#!/usr/bin/env Rscript

message("Load required packages ...")
suppressPackageStartupMessages(library("GenomicFeatures"))

message("Load trasncripts in gencode human genome annotation ...")

# specify the location of your genome annotation in gtf format
gtf.path <- "genome/gencode.v27.annotation.gtf"

# GenomicFeatures package use "TxDb" to store annotation for each transcripts
tx.db <- makeTxDbFromGFF(gtf.path, format="gtf")

# Here we only consider longest trasncript of each gene to avoid over representation of genes with multiple isoform
tx.lengths <- transcriptLengths(tx.db, with.cds_len=FALSE,with.utr5_len=FALSE, with.utr3_len=FALSE)
# tx.lengths is a data.frame with 5 fields:
# tx_id, tx_name, gene_id, nexon (numer of exons), tx_len
gene.id.2.longest.tx.id <- lapply(split(tx.lengths,tx.lengths$gene_id),
                                  function(isoform.length.table){isoform.length.table$tx_name[which.max(isoform.length.table$tx_len)];})
# get a look up table: gene id to longest transcript id  
gene.id.2.longest.tx.id <- unlist(gene.id.2.longest.tx.id)


message("Load gene ids of interest ...")
gene.ids <- read.delim("gene.ensembl.ids.txt",sep="\n",stringsAsFactors = F,header = F)$V1
message(length(gene.ids)," genes were loaded .")

considered.tx.ids <- gene.id.2.longest.tx.id[gene.ids]


# longest transcript id to gene id look up
names(gene.ids) <- considered.tx.ids


# fetch 5' UTR sequences 
message("Fetch 5' UTR genomic intervals ...")
five.prime.utrs = fiveUTRsByTranscript(tx.db, use.names=T)
five.prime.utrs <- data.frame(five.prime.utrs)
# five.prime.utrs is now a data.frame like this:
#  group      group_name       seqnames  start    end     width strand exon_id     exon_name            exon_rank
#1   1     ENST00000641515.1     chr1    65419    65433    15      +      21	   ENSE00003812156.1         1
#2   1     ENST00000641515.1     chr1    65520    65573    54      +      22       ENSE00003813641.1         2
#3   1     ENST00000641515.1     chr1    69037    69090    54      +      23       ENSE00003813949.1         3

message("Saving 5' UTR genomic intervals in bed format ...")
fields <- c("seqnames","start","end","group_name","exon_id","strand")
five.prime.utrs <- five.prime.utrs[five.prime.utrs$group_name %in% considered.tx.ids,fields]
five.prime.utrs$group_name <- paste0(gene.ids[five.prime.utrs$group_name],"-",five.prime.utrs$group_name)
five.prime.utrs$start <- five.prime.utrs$start - 1
write.table(five.prime.utrs,"five.prime.utrs.bed",quote=FALSE,sep="\t",col.names=FALSE,row.names=FALSE)

# fetch 3' UTR sequences
message("Fetch 3' UTR genomic intervals ...")
three.prime.utrs = threeUTRsByTranscript(tx.db, use.names=T)
three.prime.utrs <- data.frame(three.prime.utrs)
message("Saving 3' UTR genomic intervals in bed format ...")
three.prime.utrs <- three.prime.utrs[three.prime.utrs$group_name %in% considered.tx.ids,fields]
three.prime.utrs$group_name <- paste0(gene.ids[three.prime.utrs$group_name],"-",three.prime.utrs$group_name)
three.prime.utrs$start <- three.prime.utrs$start - 1
write.table(three.prime.utrs , "three.prime.utrs.bed",quote=FALSE,sep="\t",col.names=FALSE,row.names=FALSE)


# fetch promoter sequences 
message("Fetch promoter genomic intervals ...")
# how are so called "promoters" defined in GenomicFeatures package? please see its manual.
promoters.ivs <- promoters(tx.db)
promoters.ivs <- data.frame(promoters.ivs)
message("Saving promoter genomic intervals ...")
fields <- c("seqnames","start","end","tx_name","tx_id","strand")
promoters.ivs <- promoters.ivs[promoters.ivs$tx_name %in% considered.tx.ids,fields]
promoters.ivs$start <- promoters.ivs$start - 1
write.table(promoters.ivs, "promoters.bed",quote=FALSE,sep="\t",col.names=FALSE,row.names=FALSE)

message("All done .")
```

* 以上脚本输出了三个文件:`promoters.bed`，`five.prime.utrs.bed`，`three.prime.utrs.bed`。promoter是基因组上的序列，而5'UTR和3'UTR是剪接后的mRNA上的序列。以上获取的对应5'UTR和3'UTR的genomic interval实际上是组成5'UTR和3'UTR的exons在基因组上的坐标。所以对于promoter序列，一个genomic interval就对应一个转录本的promoter；但是为了5'UTR和3'UTR序列，我们还需要把属于一个转录本的exon序列拼接到一起。我们把这三个文件放在`bed`目录下:

```bash
mkdir bed
mv *.bed bed
```

* 我们先用`bedtools getfasta`命令获取promoter以及5'UTR和3'UTR的exon序列:

```bash
mkdir -p sequences
bedtools getfasta -s -name+ -fi genome/GRCh38.p10.genome.fa -bed bed/promoters.bed -fo sequences/promoters.fa
bedtools getfasta -s -name+ -fi genome/GRCh38.p10.genome.fa -bed bed/five.prime.utrs.bed -fo sequences/five.prime.utrs.fa
bedtools getfasta -s -name+ -fi genome/GRCh38.p10.genome.fa -bed bed/three.prime.utrs.bed -fo sequences/three.prime.utrs.fa
```

* 接下来我们把3'UTR的exons连接到一起。为了比较方便的读写fasta文件，我们使用了[Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html) package提供的`readBStringSet`和`writeXStringSet`两个函数。

```r
#!/usr/bin/env Rscript

message("Load required packages ...")
suppressPackageStartupMessages(library("Biostrings"))


input.sequences.path <- "sequences/three.prime.utrs.fa"
output.sequences.path <- "sequences/three.prime.utrs.spliced.fa"

message("Load sequences ...")
# load fasta file with readBStringSet function in Biostrings package
sequences <- readBStringSet(input.sequences.path, format="fasta")

#example: ENSG00000257315.2-ENST00000550078.2::chr1:203848330-203848407(+)
seq.names.long <- names(sequences)

#example: ENSG00000257315.2-ENST00000550078.2
seq.names <- unlist(lapply(seq.names.long,function(s){strsplit(s,"::")[[1]][1]}))

#example: chr1:203848330-203848407(+)
regions <- unlist(lapply(seq.names.long,function(s){strsplit(s,"::")[[1]][2]}))

#example: 203848330
start.positions <- as.numeric(unlist(lapply(regions,function(s){strsplit(strsplit(s,":")[[1]][2],"-")[[1]][1]})))
strands <- unlist(lapply(regions,function(s){l=nchar(s);substr(s,l-1,l-1);}))


region.table <- data.frame(seq.names.long=seq.names.long,start.positions=start.positions,strands=strands)

intervals.by.tx <- split(region.table,seq.names)

message("Concatenate sequences ...")

string.set <- DNAStringSet()
for(tx.id in names(intervals.by.tx)){
 intervals <- intervals.by.tx[[tx.id]]
  if(intervals$strands[1]=="+"){
   intervals <- intervals[order(intervals$start.positions,decreasing=F),];
 }else{
   intervals <- intervals[order(intervals$start.positions,decreasing=T),];
 }
 # concatenate exons that belongs to same UTR into a single spliced sequence
 spliced.sequence <- paste(apply(intervals,1,function(x){toString(sequences[x[1]]);}),collapse="")
 string.set[[tx.id]] = spliced.sequence
}

# save results in fasta format

message("Saving results ...")
writeXStringSet(string.set,output.sequences.path)
message("All done.")
```

* 把以上脚本的输入输出做一点改动，再重新运行，我们可以得到剪接后的5'UTR的序列:

```r
input.sequences.path <- "sequences/five.prime.utrs.fa"
output.sequences.path <- "sequences/five.prime.utrs.spliced.fa"
```

* 我们用meme suite提供的`fasta-dinucleotide-shuffle`脚本通过shuffle输入序列产生背景序列:

```bash
fasta-dinucleotide-shuffle -f sequences/promoters.fa > sequences/promoters.dinuc.shuffled.fa
fasta-dinucleotide-shuffle -f sequences/five.prime.utrs.spliced.fa > sequences/five.prime.utrs.spliced.dinuc.shuffled.fa
fasta-dinucleotide-shuffle -f sequences/three.prime.utrs.spliced.fa > sequences/three.prime.utrs.spliced.dinuc.shuffled.fa
```

### 2c) motif 分析

#### motif discovery

* `-dna`: 如果不加这个参数，meme会默认输入的是蛋白序列...
* `-minw`, `-maxw`: motif的最小宽度和最大宽度
* `-nmotifs`: 需要discovery多少个motif

```bash
meme -dna -minw 6 -maxw 10 -oc promoters.motif.discovery -nmotifs 5 sequences/promoters.fa
```

* 我们可以把输出目录`promoters.motif.discovery`复制到宿主机的共享目录下。输出目录中会有一个html文件，用浏览器打开后我们可以看到类似下面这样的内容:

![](https://4115668567-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LPVsf5VZbQ7h14X29qW%2Fuploads%2Fgit-blob-c6b44c316e62a1278664ed98e3cdeff5cd415374%2Fpromoter.motifs.PNG?alt=media)

#### motif enrichment

```bash
ame --control sequences/promoters.dinuc.shuffled.fa --oc promoter.motif.enrichment sequences/promoters.fa known-motifs/JASPAR2018_CORE_vertebrates_non-redundant.meme 
```

* 我们可以把输出目录\`promoter.motif.enrichment\`\`复制到宿主机的共享目录下。用浏览器打开后我们可以看到类似下面这样的内容:

![](https://4115668567-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LPVsf5VZbQ7h14X29qW%2Fuploads%2Fgit-blob-f4b74b9bfaef0f0ce64def69f83bffb2f1165879%2Fpromoter.motifs.known.PNG?alt=media)

## 3) Homework

1. 阅读文档回答，在`GenomicFeatures` package中，默认情况下`promoters`函数是如何定义promoter对应的genomic intervals的?
2. meme的`-mod`参数有oops，zoops，anr三个可选项，阅读文档回答，它们分别是什么含义?
3. 在5'UTR序列上重复motif discovery分析，对`known-motifs/Ray2013_rbp_Homo_sapiens.meme`中提供的motif在5'UTR序列上进行motif enrichment分析(提示:meme要求每一条输入序列的长度都要大于指定的motif宽度，所以需要事先去除长度过短的序列，大家可以写脚本实现，也可以手动删除)。
4. 对我们在[CHIP-seq](https://book.ncrnalab.org/teaching/part-iii.-ngs-data-analyses/3.chip-seq)一节的作业中获得的peaks对应的序列，用meme进行motif finding的分析。

{% hint style="info" %}
利用homer的如下命令可以获得peak对应的序列，再利用这些序列作为meme的输入:

```
# CHIP.peak为homer findPeaks输出的peak文件, /home/test/homer/data/genomes/sacCer2/为homer使用的酵母基因组所在的目录
homerTools extract CHIP.peak /home/test/homer/data/genomes/sacCer2/ -fa
```

{% endhint %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://book.ncrnalab.org/teaching/part-iii.-ngs-data-analyses/4.motif/sequence_motif.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
