2.2.Ribo-seq Analysis

本章为Ribo-seq数据处理的说明，分为Prepare Data Matrix和Data analysis两大部分。

Part I.Prepare Data Matrix：完成样本的Reads Processing、Remove RNA and Mapping工作，得到Mapped reads(bam)并绘制质量控制相关图，计算Ribo-seq reads count matrix。
Part II.Data analysis：完成差异翻译效率分析。
- 利用Xtail基于Ribo-seq的count matrix 和RNA-seq的count matrix计算差异翻译效率。为了方便对比RNAseq只统计CDS区域的reads。

Part I.Prepare Data Matrix

0) 数据说明

Ribo-seq测序数据位于/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/Ribo-seq 这里我们以(CR1_1,CR1_2,CR1_3)为例。

1) Mapping

这里我们通过脚本进行样本的批量处理，分为fastqc，trimmed，remove_rRNA，mapping，read_count_HTSeq 5个脚本，每个脚本路径我们在下面详细说明。

1.1) QC of raw data

这步操作主要是检查数据的质量，脚本参考/data/TA_QUIZ_RNA_regulation/script/PartII.Ribo-seq_analysis/1.fastqc.sh

Input：

data type

path

raw data

/CR1*.fq.gz

Software/Parameters：

fastqc

options

function

-q --quiet

禁止在标准输出上显示所有进度，仅报告错误；安静模式

-o --outdir

指定输出目录

-h --help

选项的详细介绍

-t 4

设置线程数

--noextract

指定结果文件压缩

Output：

\*.html：质控报告解读的网页文件。
\*.zip：网页文件中画图的数值及图片。

因此fastqc结果仅看html即可。

1.2) Cut adaptor & trim long read

这里使用fastp默认参数，对数据自动进行全方位质控，脚本参考：/data/TA_QUIZ_RNA_regulation/script/PartII.Ribo-seq_analysis/2.trimmed.sh

Software/Parameters：

fastp

options

function

-i

read1 输入文件名

-o

read1 输出文件名

-I

read2 输入文件名

-O

read2 输出文件名

--thread=4

指定线程数为4

-l --length_required 15

短于指定长度的reads将被丢弃，默认为15，即长度<15的read被去掉

--length_limit

设置read最大长度，默认为0，即没有最大长度限制。

-j --json filename

设置输出的json格式的质控结果文件名，不设置则默认json文件名为fastp.json

-h --html filename

设置输出html格式的质控结果文件名，不设置则默认html 文件名为fastp.html

usage: fastp -i <in1> -o <out1> [-I <in1> -O <out2>] [options...]

Output :

eg:

CR1_1.clean.fq.gz fastp质控后的数据。
CR1_1.html/CR1_1.json:质控结果报告文件。

1.3) Clean rRNA reads

使用bowtie将上一步中得到的fastp质控后的*.clean.[1/2].fastq.gz文件比对到rRNA index上，除去这部分比对到rRNA index的部分，从而得到不含rRNA reads的文件.fastq文件。参考脚本:/data/TA_QUIZ_RNA_regulation/script/PartII.Ribo-seq_analysis/3.remove_rRNA.sh

Input :

1.2)操作结束后的\*.clean.fq.gz。

Software/Parameters :

bowtie 可以Clean rRNA reads得到不含rRNA reads的.fastq文件。

options with Parameter Setting

function

-n N

代表在高保真区内的错配不能超过N个，这里设置为0，即不允许任何错配

-a

在允许的错配碱基数量下的全部可能的比对情况都输出

-m1 --best -strata

只报告最好的比对情况的那一个输出

-p 4

指定使用4个线程

-l L

代表序列高保真区的长度，最短不能少于5，这里我们设置为L=15

--un *rm_rRNA.fq

输出不能map到指定基因组上的reads，fasta格式。即我们所需要的去除rRNA后的文件

-S

输出文件名，格式为.sam

-1/-2

对于PE测序数据，-1指定输入的read1文件，-2指定输入的read2文件

reference：

bowtie -n 0 -y -a --norc --best --strata -S -p 4 -l 15 \
--un ~/*.rm_rRNA.fq \
~/Arabidopsis_thaliana.TAIR10.34.rRNA \
-1 ~/*.clean.fastq.gz \
-2 ~/*.clean.fastq.gz \
~/*.aligned_rRNA.txt

Output

*rm_rRNA.fq：不含rRNA reads的.fastq文件,位于fastq文件夹。
map到rRNA index上的*aligned_rRNA.txt。

gzip压缩*rm_rRNA.fq文件，得到*rm_rRNA.fq.gz文件。

1.4) Mapping

1.4a) Generating genome index

由于PartI.RNA-seq analysis中我们已经创建了参考基因组目录。这里直接使用已经创建好的genome directory即可，不做详细描述。

1.4b) Running mapping jobs

我们使用STAR软件对1c)操作得到的*rm_rRNA_[1/2].fq进行比对。参考脚本：/data/TA_QUIZ_RNA_regulation/script/PartII.Ribo-seq_analysis/4.mapping.sh

Software/Parameters :

STAR

options with Parameter Setting

function

--runThreadN Number_of_Threads

set number of threads

--limitBAMsortRAM 20000000000

maximum available RAM for sorting BAM

--outFilterType BySJout

reduces the number of "spurious" junctions

--outFilterMismatchNmax 10

alignment will be output only if it has no more mismatches than this value

--genomeDir /path/to/genomeDir

path to genom directory

--readFilesIn /path/to/read1 /path/to/read2

path to input files

--readFilesCommand 'zcat'

If the read files are compressed, use the --readFilesCommand UncompressionCommand option,for gzipped files (.gz) use --readFilesCommand zcat

--outFileNamePrefix

output files name prefix

--outSAMtype BAM Unsorted

output unsorted Aligned.out.bam file

--quantMode TranscriptomeSAM GeneCounts

With --quantMode TranscriptomeSAM option STAR will output alignments translated into transcript coordinates in the Aligned.toTranscriptome.out.bam file

--outSAMattributes All

The SAM attributes can be specified by the user using --outSAMattributes

--outSAMstrandField intronMotif

For unstranded RNA-seq data, Cufflinks/Cuffdiff require spliced alignments with XS strand attribute, which STAR will generate with --outSAMstrandField intronMotif option

--outBAMcompression 6

int:-1 to 10 BAM compression level

--outReadsUnmapped Fastx

output of unmapped and partially mapped reads in separate file;Fastx:output in separate fasta/fastq files, Unmapped.out.mate1/2

reference：

以CR1_1样本为例：

STAR \
--runThreadN 4 \
--outFilterType BySJout \
--outFilterMismatchNmax 2 \
--outFilterMultimapNmax 1 \
--genomeDir genomedir \
--readFilesIn ~/CR1_1.rm_rRNA.fq.gz \
--readFilesCommand 'zcat' \
--outFileNamePrefix CR1_1. \
--outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM GeneCounts \
--outSAMattributes All \
--outSAMattrRGline ID:1 LB:ribo_seq PL:ILLUMINA SM:CR1_1 \
--outBAMcompression 6 \
--outReadsUnmapped Fastx

outFilterMultimapNmax=1，是因为不考虑Multimap reads
注意部分参数的选择与differential expression的选择不同，为什么不同，请大家自行思考。

output：

.Aligned.out.sorted :比对结果文件。
.toTranscriptome.out.sorted：比对到转录本上的reads组成的文件。

1.4c) samtools sort &index

这里我们使用samtools sort -T按TAG值排序，随后使用samtools index对排序后的序列建立索引。输出为bai文件。

output :

.Aligned.sortedByCoord.out.bam
.Aligned.toTranscriptome.out.sorted.bam
.Aligned.sortedByCoord.out.bam.bai
.Aligned.toTranscriptome.out.sorted.bam.bai

1.4d) Reads counts

这里我们使用mapping后并且经过samtools排序索引后的比对结果文件作为输入。使用htseq-count软件来计算reads数。参考脚本：/data/TA_QUIZ_RNA_regulation/script/PartII.Ribo-seq_analysis/5.read_count_HTSeq.sh

input：

.Aligned.sortedByCoord.out.bam

Software/Parameters :

htseq-count

options with Parameter Setting

function

-f bam

设置输入文件的格式

-s no

设置是否是链特异性测序

-i gene_id

设置feature ID是由gtf/gff文件第9列哪一个标签决定的

-t CDS

设置指定的feature进行表达量计算

-m union

设置表达量计算模式，可有的参数值有union, intersection-strict and intersection-nonempty。真核生物推荐使用union模式

reference：

htseq-count \
-f bam \
-s no \
-i gene_id \
-t CDS \
-m union \
~/*.Aligned.sortedByCoord.out.bam \
~/Arabidopsis_thaliana.TAIR10.34.gtf > ~/.read_count.HTSeq.txt

注意，由于Ribo-seq测序数据，仅测到被核糖体保护的片段，因此这里我们仅设置CDS模式。

得到的read_count.HTSeq.txt中包含着以__开头的注释信息。因此需要进行以下处理。

# 提取其中的__开头的信息。
grep __ ~/*.read_count.HTSeq.txt > ~/*.read_count.HTSeq.txt.summary

# 除去其中的__开头的信息
sed -i '/^__/d' ~/*.read_count.HTSeq.txt

output：

read_count.HTSeq.txt：包含gene_id与reads counts的信息。
read_count.HTSeq.txt.summary：汇总信息。

Part II.Data analysis

0) 数据说明

在这里我们给出Ribo-seq的6个样本的bam和reads count结果，以及RNA-seq的6个样本的reads counts在统计CDS模式下的结果。

Ribo-seq 6个样本的Bam文件：/data/TA_QUIZ_RNA_regulation/result/PartII.Ribo-seq_analysis/3.mapping
Ribo-seq 6个样本合并后reads matrix文件是：/data/TA_QUIZ_RNA_regulation/result/PartII.Ribo-seq_analysis/6.Differential_translation_efficiency/Ribo-seq
RNA-seq 6个样本CDS区域reads matrix文件是:/data/TA_QUIZ_RNA_regulation/result/PartII.Ribo-seq_analysis/6.Differential_translation_efficiency/RNA-seq-CDS/count_CDS.txt,这里我们只用每个RNA的CDS区域的reads

1) 周期性和ORF分析

Ribosome profiling data with a good quality tend to have a good 3-nt periodicity. 这里我们使用RiboCode中的metaplots来获得3-nt periodicity分析报告。参考脚本为：/data/TA_QUIZ_RNA_regulation/script/PartII.Ribo-seq_analysis/6.RiboCode_metaplot.sh

input：

.Aligned.toTranscriptome.out.bam：1.d.3)中mapping后并且经过samtools排序索引后的比对结果文件。
/data/TA_QUIZ_RNA_regulation/data/Ribocode/RiboCode_annot:由RiboCode产生的一些注释文件目录。

Software/Parameters：

metaplots

options with Parameter Setting

function

-a

设置RiboCode注释文件夹路径

-r

设置输入的sam文件路径

-o

设置输出路径

-m 26

设置最小值

-M 34

设置最大值

reference：

metaplots \
-a /data/TA_QUIZ_RNA_regulation/data/Ribocode/RiboCode_annot \
-r ~/*.Aligned.toTranscriptome.out.sorted.bam \
-o ~/RiboCode/metaplot \
-m 26 \
-M 34 \
-s no \
-pv1 0.05 \
-pv2 0.05

output：

~/RiboCode/metaplot：输出目录，目录中包含3-nt periodicity分析的pdf文件。
*.Aligned.toTranscriptome.out.sorted.pdf：pdf文件。

2) Differential translation efficiency

我们利用Xtail基于Riboseq的count matrix和RNAseq的count matrix计算差异翻译效率。为了方便对比RNAseq只统计CDS区域的reads。

使用R包xtail进行差异翻译分析。参考脚本：/data/TA_QUIZ_RNA_regulation/script/PartII.Ribo-seq_analysis/9.xtail.R

注：/data/zhaoyizi/software/anaconda3/envs/Riboshape/bin/下的R未安装xtail软件，安装了xtail的R脚本为：/BioII/lulab_b/liuxiaofan/software/conda2/bin/Rscript

运行xtail软件步骤如下：

##### 以下操作均在服务器命令行窗口运行 #####
### 进入文件传输通道，未进入文件传输通道中是无法访问/BioII文件夹的
ssh tmgt 
### 运行脚本
/BioII/lulab_b/liuxiaofan/software/conda2/bin/Rscript /path/to/your/9.xtail.R

input：

WT_count.txt
count_CDS.txt

reference：

# 读取riboseq和mrna的reads counts matrix。
ribo <- read.table('~/wt_count.txt',header=T,quote='',check.names=F, sep='\t',row.names=1)
mrna<- read.table('~/count_CDS.txt',header=T,quote='',check.names=F, sep='\t',row.names=1)

# 样本信息
condition <- c("control","control","treat","treat","treat")

# 获得xtail分析结果，按照q-value排序，选择输出log2FCs和log2R
results <- xtail(mrna,ribo,condition,minMeanCount=1,bins=10000)
results_tab <-resultsTable(results,sort.by="pvalue.adjust",log2FCs=TRUE,log2Rs=TRUE)

# 输出xtail分析结果
write.table(results_tab,"~/wt.0-vs-1.TE_new.xls",quote=F,sep="\t")

output：

输出结果为/data/TA_QUIZ_RNA_regulation/result/PartII.Ribo-seq_analysis/7.TE/wt.0-vs-1.TE_new.csv 输出结果未进行差异基因过滤。

Previous2.1.RNA-seq Analysis Next2.3.SHAPE Data Analysis

Last updated 10 months ago

Was this helpful?