7.1.Translation: Ribo-seq

1) Background

Ribo-seq(有时又称为ribosome profiling)是2009年Weissman课题组首次发表的研究细胞内蛋白翻译组的二代测序技术。这种技术选择性的对有RNA上有核糖体结合的片段进行测序,这样就能获得很多翻译组的信息。

Ribo-seq大致步骤为:

  • 裂解细胞,富集结合着核糖体的mRNA

  • 用核酸酶消化掉mRNA上没有核糖体保护的片段

  • 富集核糖体,进而纯化得到有核糖体所保护的片段(ribosome protected fragment, RPF)

  • 对RPF建库测序

  • 生物信息学的分析

Ribo-seq数据测得的RNA片段长短与small RNA-seq相似,大约分布在25~35nt区间。由于Ribo-seq选择性的捕获正在翻译的序列,所以其数据的测序片段大多比对到基因组的CDS(coding sequence)。此外,Ribo-seq还有一个明显区别于其他RNA-seq的特点,即Ribo-seq的信号在CDS区域往往呈现3-nt的周期性(图1)。这主要因为一个codon是3 nt长,翻译过程中核糖体的移动模式也具有3 nt的周期性。

图1

Ribo-seq数据可以用来做的分析主要有以下两种:

  • 对翻译组进行定量。Ribo-seq数据可以像RNA-seq数据一样得到表达矩阵用作差异分析。如果同样的样本还有RNA-seq的数据,那么还可以进行翻译效率的分析

  • ORF calling。基于ribo-seq数据,我们可以注释一些novel的ORF。

2) Workflow

本文将介绍两个Ribo-seq处理软件(Ribocode和Ribowave),以及一个专门用来对翻译效率进行差异分析的R包(Xtail)。我们主要介绍Ribocode和Xtail的使用,Ribowave作为补充学习。

2a)Ribocode

Ribocode是一种包含Ribo-seq数据预处理、可视化、识别翻译ORF的的Ribo-seq全流程分析工具。Ribocode的分析流程如下:

2b)Xtail

Xtail是一个用于计算翻译效率(TE)和差异翻译效率的R package。计算流程如下:

2c)Ribowave

RiboWave是一种基于小波变换对Ribo-seq数据的进行预处理,从而识别翻译ORF,计算翻译效率的Ribo-seq全流程分析工具。RiboWave的分析流程如下:

3) RiboCode

3a) 准备工作环境

  • 使用docker 启动ribo-seq所用的 Docker,进入工作目录

cd /home/test/rna_regulation
cd /home/test/rna_regulation/ribo-code
  • 下载软件和文件 1.Install software:Ribocode 2.Download data: link

3b) 准备注释文件

mkdir RiboCode_annot
/home/test/software/miniconda3/bin/prepare_transcripts \
-g /home/test/rna_regulation/ribo-code/GTF/Arabidopsis_thaliana.TAIR10.34.gtf \
-f /home/test/rna_regulation/ribo-code/GTF/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa \
-o /home/test/rna_regulation/ribo-code/RiboCode_annot

产生transcripts annotation文件,用于后续分析。

3c) 确定P-site的位置

  • 我们知道,一个read对应一个RPF,一个RPF对应一个核糖体,而一个核糖体有A(aminoacyl-site),P(peptidyl-site)和E(exit-site)三个位点。我们通常用P位点的位置来代表翻译的信号。

  • 推测P site相对于RPF的位置有不同的策略。RiboCode的做法是,对于长度相同的RPF统一设定一个offset作为P site的位置,使得这个位置可以产生周期性最强的翻译信号。这样我们就能根据一个read的长度直接推测出P site的位置。

  • 请大家注意,这里的bam文件是位于转录本的坐标上的

RiboCode_annot=/home/test/rna_regulation/ribo-code/RiboCode_annot
OUT=/home/test/rna_regulation/ribo-code/wtuvb1/
mkdir -p ${OUT}
mkdir -p ${OUT}/metaplots

/home/test/software/miniconda3/bin/metaplots -a ${RiboCode_annot} \
-r /home/test/rna_regulation/ribo-code/wtuvb1/wtuvb1.Aligned.toTranscriptome.out.sorted.bam \
-o ${OUT}/metaplots/wtuvb1. \
-m 26 -M 50 -s yes -pv1 1 -pv2 1

Output:

  • wtuvb1.wtuvb1.Aligned.toTranscriptome.out.sorted.pdf:a PDF file plots the aggregate profiles of the distance from the 5'-end of reads to the annotated start codons (or stop codons), which is used for examining the P-site periodicity of RPF reads on CDS regions.

  • wtuvb1._pre_config.txt:The P-site config file, which defines the read lengths with strong 3-nt periodicity and the associated P-site locations for each length.

3d) ORFs calling

RiboCode_annot=/home/test/rna_regulation/ribo-code/RiboCode_annot
OUT=/home/test/rna_regulation/ribo-code/wtuvb1/
config=${OUT}/metaplots/wtuvb1._pre_config.txt
mkdir -p ${OUT}/ORF
/home/test/software/miniconda3/bin/RiboCode -a ${RiboCode_annot} -l no -g -c ${config} -o ${OUT}/ORF/ORF

这一步可以获取有翻译数据支持的ORF。 Output:

  • ORF.gtf/ORF.txt:contains the information of all predicted ORFs in each transcript.

  • ORF_collapsed.gtf /RF_collapsed.txt :file combines the ORFs having the same stop codon in different transcript isoforms: the one harboring the most upstream in-frame ATG will be kept.

Some column names of the result file:

- ORF_ID: The identifier of predicated ORF.
- ORF_type: The type of predicted ORF, which is annotated according to its location to associated CDS. The following ORF categories are reported:

 "annotated" (overlapping with annotated CDS, have the same stop codon with annotated CDS)
 "uORF" (upstream of annotated CDS, not overlapping with annotated CDS)
 "dORF" (downstream of annotated CDS, not overlapping with annotated CDS)
 "Overlap_uORF" (upstream of annotated CDS and overlapping annotated with CDS)
 "Overlap_dORF" (downstream of annotated CDS and overlapping annotated CDS"
 "Internal" (internal ORF of annotated CDS, but in a different reading frame)
 "novel" (from non-coding genes or non-coding transcripts of the coding genes).

- ORF_tstart, ORF_tstop: the start and end position of ORF relative to its transcript (1-based coordinate)
- ORF_gstart, ORF_gstop: the start and end position of ORF in the genome (1-based coordinate)
- pval_frame0_vs_frame1: significance levels of P-site densities of frame0 greater than of frame1
- pval_frame0_vs_frame2: significance levels of P-site densities of frame0 greater than of frame2
- pval_combined: integrated P-value by combining pval_frame0_vs_frame1 and pval_frame0_vs_frame2

3e) 计算ORF的count matrix

"ORFcount" command which will call the HTSeq-count program. Only the reads of a given length will be counted. For those ORF with length longer than a specified value (set by "-e"), the RPF reads located in first few and last few codons can be excluded by adjusting the parameters "-f" and "-l".

RiboCode_annot=/home/test/rna_regulation/ribo-code/RiboCode_annot
OUT=/home/test/rna_regulation/ribo-code/wtuvb1/
config=${OUT}/metaplots/wtuvb1._pre_config.txt
mkdir -p ${OUT}/ORF_count
ORFcount -g ${OUT}/ORF/ORF.gtf \
-r ${OUT}/wtuvb1.Aligned.sortedByCoord.out.bam \
-f 15 -l 5 -e 100 -m 24 -M 35 -s yes \
-o ${OUT}/ORF_count/data_all.txt

The reads with length between 24-35 nt aligned on predicted ORF.The reads aligned to first 15 codons and last 5 codons of ORFs and had the length longer than 100 nt will be excluded.

4) Xtail

Xtail是对翻译效率进行差异检验的R包,它内部调用了DESeq2来实现负二项分布检验。

4a) Pre-processing

cd /home/test/rna_regulation
cd /home/test/rna_regulation/xtail
  • 自行下载和安装

1.下载RNA-seq表达矩阵(RNA_count.txt)和Ribo-seq表达矩阵(Ribo_count.txt),在这里

2.Xtail

  • Users should install DESeq2 before using Xtail: start R and enter:

    source("https://bioconductor.org/biocLite.R")
    biocLite("DESeq2")
#For Windows:
#Before install, you need install "Rtools", "MainGW";
#Then using the following command to install: 
install.packages("xtail_1.1.5-source.tar.gz",repo=NULL,type="source")
#For Linux or Mac:
install.packages("xtail_1.1.5-source.tar.gz")

4b) Calculate differential TE

library(xtail)
ribo <- read.table('Ribo_count.txt',header=T, quote='',check.names=F, sep='\t',row.names=1)
mrna <- read.table('RNA_count.txt',header=T, quote='',check.names=F, sep='\t',row.names=1)

ribo <- ribo[,c("wtnouvb1","wtnouvb2","wtnouvb3","wtuvb1","wtuvb2","wtuvb3")]
mrna <- mrna[c("CD1_1","CD1_2","CD1_3","CD0_1","CD0_2","CD0_3")]

condition <- c("control","control","control","treat","treat","treat")
results <- xtail(mrna,ribo,condition,minMeanCount=1,bins=10000)
results_tab <- resultsTable(results,sort.by="pvalue.adjust",log2FCs=TRUE, log2Rs=TRUE)
write.table(results_tab,"TE.xls",quote=F,sep="\t")

5) RiboWave

5a) Pre-processing

  • 使用docker 启动ribo-seq所用的 Docker,进入工作目录

cd /home/test/rna_regulation
cd /home/test/rna_regulation/ribo-wave
  • 下载软件和文件 1.Install software:Ribowave 2.Download data: link

5b) Create annotation

# bedtools2没有添加到环境变量中,需要临时添加
export PATH=$PATH:test@bioinfo_docker:~/software/bedtools2/bin
# 运行annotation脚本(这一步时间会很久,生成文件已经提前跑好放在了annotation_fly目录下,可以跳过直接进行下一步)
script/create_annotation.sh \
-G annotation_fly/dmel-all-r6.18.gtf \
-f annotation_fly/dmel-all-chromosome-r6.18.fasta  \
-o annotation_fly  \
-s script

input files

  • annotation.gtf: the annotation gtf should contain start_codon and stop_codon information, eg: dmel-all-r6.18.gtf

  • genome.fasta: genome fasta, eg: dmel-all-chromosome-r6.18.fasta

  • annotation_dir: the directory for all the annotation output

  • scripts_dir: the directory of all the scripts in the package

output files

annotation directory, including :

  • start_codon.bed: the bed file annotating start codon

  • final.ORFs: all identified ORFs, eg: FBtr0300105_0_31_546 where FBtr0300105 refers to the transcript, 0 refers to the reading frame relative to the start of transcript, 31 refers to the start site, 546 refers to the stop codon.

5c) P-site determination

This step determines the P-site position for each Ribo-seq reads length by overlapping with the annotated start codons from previous step

# 运行P-site_determination.sh脚本
script/P-site_determination.sh \
-i GSE52799/SRR1039770.sort.bam \
-S annotation_fly/start_codon.bed \
-o GSE52799 \
-n SRR1039770 \
-s script;

input files

  • Ribo_bam: secondary alignment removed to ensure one genomic position per aligned read and sorted

  • start_codon.bed: annotated start site start_codon.bed. It is generated in the create_annotation.sh step

  • out_dir: the directory of the output result, eg: GSE52799

  • study_name: the name of all the output file, default: test. eg: SRR1039770

  • scripts_dir: the directory of all the scripts in the package

查看输出

$ cd /home/test/rna_regulation/ribo-wave/GSE52799/P-site
$ ls
SRR1039770.psite1nt.txt  SRR1039770.psite.pdf  SRR1039770.psite.txt
#正常时输出这三个文件
$ cp *pdf /home/test/share/
# 拷贝pdf文件到容器与计算机互通的文件夹,可以用pdf阅读工具打开查看pdf

output files

P-site directory, including :

  • name.psite1nt.txt: the Ribo-seq reads length and its corresponding P-sites position(= offset + 1). It may look this this :

    30  13
  • name.psite.pdf: the PDF displaying the histogram of aggregated reads

    我们收集了所有已被注释的起始密码子并将 这些起始密码子和Ribo-seq 比对上的序列进行重合,分别计算Ribo-seq序列的5’端偏离起始密码子的第一个碱基A的距离(offset)。根据 Ribo-seq测序片段长度的不同,我们进一步将Ribo-seq片段分成多个组分。在每个长度对应的组分里,作出Ribo-seq片段5’端偏离起始密码子A的距离(offset)的直方图。

每一行代表不同长度的 Ribo-seq 测序片段的直方图。该数据中,30nt的reads数目最多,在30nt长度的Ribo-seq片段中,我们可以明显的看到在距离为13nt的位点含有一个峰值(peak)。鉴于大部分核糖体会在翻译起始位点停滞较多的时间,因此对于30nt长的Ribo-seq片段,其P-site位点的定义应该代表直方图中绝大多数的核糖体,因此我们将P-site位点应该定义为峰值最高的第13个碱基(13nt)的位置。

5d) Generating P-site track

基于Ribo-seq序列及其确定的P-site位点,将规律推广到所有Ribo-seq的片段中,直接根据Ribo-seq序列的长度推断其对应的P-site位点。根据这种方法,我们可以将每一条转录本上所有的Ribo-seq片段转化为对应的P-site位点的信号点并获得转录组水平的Ribo-seq信号轨迹(signal track)。由于是由P-site位点定义出的信号轨迹,通常也被叫做P-site信号轨迹(P-sites track)。转录本上每一个位点的信号丰度代表了有多少Ribo-seq片段对应的P-site位点落在该位置上。

This step creats the P-site track for transcripts of interests using determined P-sites position from previous step. look at transcripts from chromosome X :

查看输出

script/create_track_Ribo.sh \
-i GSE52799/SRR1039770.sort.bam \
-G annotation_fly/X.exons.gtf \
-g annotation_fly/genome \
-P GSE52799/P-site/SRR1039770.psite1nt.txt \
-o GSE52799 \
-n SRR1039770 \
-s script

input files

  • Ribo_bam

  • exons.gtf: a gtf file for only the exons from transcripts of interest, eg: X.exons.gtf

  • chromosome_size: the file including all the chromosomes and its genome size. Noted: genome can be obtained by using samtools faidx function with the input of fasta file. genome may look like this:

    2L    23513712
    2R    25286936
    3L    28110227
    3R    32079331
  • P-site_position: the file listing the P-site position for each read length. This file can be found in the output of previous step, eg: name.psite1nt.txt

  • out_dir: the directory of the output result, eg: GSE52799

  • study_name: the name of all the output file, default: test. eg: SRR1039770

  • scripts_dir: the directory of all the scripts in the package

查看生成文件

$ cd /home/test/rna_regulation/ribo-wave/GSE52799/bedgraph/SRR1039770
$ ls
final.psite
$ less final.psite

output files

  • bedgraph/name directory, including :

    final.psite : P-site track at transcriptome wide. It may look like this :

    FBtr0070533    0,0,0,0,0,0,0,0,0,0,0,0,6,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,1,0,2,1,0,0,0,0,0,0,4,8,0,0,3,0,5,12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    FBtr0073886    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0
    FBtr0070604    0,0,0,0,0,0,0,0,0,0,0,0,59,6,0,1,0,0,2,6,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    FBtr0070603    0,0,0,0,0,0,0,0,0,0,0,0,75,2,7,10,7,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

5e) RiboWave main function

This step can achieve multiple functions :

  • denoising [denoise]

降噪:在降噪过程运用了小波变换来去除原始信号中非3-nt周期性的信号使得保留的信号均具有翻译的3-nt周期性,即P-site信号。

  • providing predicted p.value for each given ORF to identify its translation status [pvalue,-P]

鉴定ORF的翻译潜能:根据该转录本上经降噪后的P-sites是否富集在该ORF所在的阅读框内,判断ORF的翻译潜能。P<0.05即具有翻译活性。

  • providing reads density (P-site/PF P-site) for each given ORF [density,-D]

reads density = ORF上reads数/ORF长度,反映翻译水平的abundance。

  • providing translation efficiency (TE) estimation for each given ORF [TE,-T]

TE = 翻译水平的abundance/转录水平的abundance,反映翻译效率。

  • providing frameshift potential (CRF score) for each given ORF [CRF,-F]

鉴定潜在核糖体移码现象。

It might take hours to perform the analysis if the input is large. It is recommended to specify the number of CPU cores through the -p option.

Run Ribowave on example:

Denoise the P-site track

# 在/home/test/rna_regulation/ribo-wave目录下
mkdir -p /home/test/rna_regulation/ribo-wave/GSE52799/Ribowave
script/Ribowave  \
-a GSE52799/bedgraph/SRR1039770/final.psite \
-b annotation_fly/final.ORFs \
-o GSE52799/Ribowave \
-n SRR1039770 \
-s script \
-p 8

Identifying translated ORF

mkdir -p /home/test/rna_regulation/ribo-wave/GSE52799/Ribowave
script/Ribowave \
-P \
-a GSE52799/bedgraph/SRR1039770/final.psite \
-b annotation_fly/final.ORFs \
-o GSE52799/Ribowave \
-n SRR1039770 \
-s script \
-p 8

Estimating abundance

mkdir -p /home/test/rna_regulation/ribo-wave/GSE52799/Ribowave
script/Ribowave \
-D \
-a GSE52799/bedgraph/SRR1039770/final.psite \
-b annotation_fly/final.ORFs \
-o GSE52799/Ribowave \
-n SRR1039770 \
-s script \
-p 8

Estimating TE

IMPORTANT : when estimating TE, user should input the sequenced depth of Ribo-seq and the FPKM value from paired RNA-seq

mkdir -p /home/test/rna_regulation/ribo-wave/GSE52799/Ribowave
script/Ribowave \
-T 9012445  GSE52799/mRNA/SRR1039761.RPKM \
-a GSE52799/bedgraph/SRR1039770/final.psite \
-b annotation_fly/final.ORFs \
-o GSE52799/Ribowave \
-n SRR1039770 \
-s script \
-p 8

Calculating frameshift potential

on annotated ORFs

mkdir -p /home/test/rna_regulation/ribo-wave/GSE52799/Ribowave
awk -F '\t' '$3=="anno"'  annotation_fly/final.ORFs  >   annotation_fly/aORF.ORFs;
script/Ribowave \
-F \
-a GSE52799/bedgraph/SRR1039770/final.psite \
-b annotation_fly/aORF.ORFs \
-o GSE52799/Ribowave \
-n SRR1039770 \
-s script \
-p 8

Multiple functions

mkdir -p /home/test/rna_regulation/ribo-wave/GSE52799/Ribowave
script/Ribowave \
-PD \
-T 9012445  GSE52799/mRNA/SRR1039761.RPKM \
-a GSE52799/bedgraph/SRR1039770/final.psite \
-b annotation_fly/final.ORFs \
-o GSE52799/Ribowave \
-n SRR1039770 \
-s script \
-p 8

input files

  • bedgraph/name:

    P-site track: output from the previous step, containing the P-site track of transcripts of interest, eg: final.psite

  • ORF_list: ORFs of interest ,eg : final.ORFs. It is generated in the step of create_annotation.sh

  • Ribo-seq sequenced depth: the sequenced depth of Ribo-seq to calculate FPKM , eg: 9012445

  • RNA FPKM: FPKM table. It may look like this :

 FBtr0100871    22262
 FBtr0070604    18682
 FBtr0100231    14746.5
 FBtr0100874    14024.5
 FBtr0100864    11475.6

output files

  • name.PF_psite: the denoised signal track(PF P-sites signal track) at transcriptome wide. It looks similar as the input final psite.

  • including chi-square P-value information. It may look like this :

    column1-4: basic information about the ORF
    column5: reads coverage within the ORF
    column6: P-value predicted by RiboWave
    column7: Values estimating the relative abundance of PF P-sites outside of the studied ORF
    column8: Reads intensity at the current start codon

    result directory, including :

  • name.95%.mx: RiboWave makes the prediction on the translation initiation sites and gives the final translated product output (p.value < 0.05) . It may look like this :

    FBtr0070007_2_93_1028
    FBtr0070008_1_128_943
    FBtr0070025_2_135_1094
  • name.density: reads density ( PF P-site ) of given ORFs. It may look like this :

    column1-4: basic information about the ORF
    column5: number of PF P-sites in transcript
    column6: number of PF P-sites in given ORF
    column7: density of PF P-sites in given ORF
  • name.TE: TE of given ORFs. It may look like this :

    column1: transcript
    column2: ORF
    column3: TE
  • name.CRF.final: ORFs that might experience reading frame translocation. It may look like this :

    column1: ORF
    column2: start of frameshift
    column3: stop of frameshift
    column4: PF P-sites' reading frames after the change point ,eg: 2_2,0_1 where 2_2 refers to continuous two PF P-sites of frame 2 followed by continuous one PF P-sites of frame 0.
    column5: Relative position of PF P-sites after the shift ,eg : 1413,1440;1789 where 1413,1440 corresponds to the exact position of 2_2 within the transcript. Discontinuity in the reading frame is separated by ;
    column6: CRF score describing the potential of frameshift

6) More Software

PreprocessingORF Iden.Quantification TETranslation elongationIntegrative & visualization

Rfoot

DeepRibo

scikit-ribo

diricore

Plastid

RiboWaltz

FLOSS

RiboDiff

Iχnos

Rfeet

RiboWave

ORF-RATER

riborex

PausePred

RiboGalaxy

PRICE

Xtail

ROSE

RiboProfiling

RiboCode

RiboMiner

riboSeqR

RiboHMM

RiboTools

RibORF

RIVIT

RiboTaper

Shoelaces

RP-BP

tRanslatome

SPECtre

TITER

7) Homework

1)解释翻译效率(translation efficiency,TE)的含义。

2)利用RiboWave过程中示例文件算出TE,并画出TE的分布。

3)我们给出作业文件wtuvb2(在/home/test/rna_regulation/ribo-code/wtuvb2,或者在这里下载),利用Ribocode,画出该样本的3nt周期性示意图,计算翻译ORF,并计算每类ORF的个数,尝试解释ORF不同类别的含义。

4)我们给出作业文件RNA_count.txt和Ribo_count_2.txt(在这里下载)利用Xtail过程计算差异翻译,并绘制火山图。

Last updated