这章主要内容是不同类型数据之间的整合分析 (data integration):
1 转录本丰度变化和翻译效率(TE)变化程度之间关系
绘制Log2FoldChange(TE)和Log2FoldChange(RNA-seq)的关系图。
2 转录本结构改变程度和翻译效率TE变化程度之间关系
通过假设检验计算结构改变程度是否影响翻译效率TE变化程度。
3 翻译效率变化基因的motif分析
TE上/下调的3'UTR和5'UTR的motif富集。
1) 转录本丰度变化和翻译效率(TE)变化程度之间关系
我们通过散点图表示转录本丰度光照前后变化和翻译效率(TE)光照前后变化程度之间关系,横轴为log2FC(RNA-seq),这里仅比较WT。如下图
1.1) 计算所需文件
2.1.RNA-seq Analysis 中(DESeq)计算得到WT的差异表达数据(使用自己计算得到文件),参考文件/data/TA_QUIZ_RNA_regulation/result/PartI.RNA-seq_analysis/differential_expression/5.DESeq2/wt/wt_rawdata.csv
2.2.Ribo-seq Analysis 中计算得到的差异翻译矩阵(使用自己计算得到文件),参考文件/data/TA_QUIZ_RNA_regulation/result/PartII.Ribo-seq_analysis/7.TE/wt.0-vs-1.TE_new.csv
1.2) 计算过程
这里我们给出参考脚本。
import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
RF_data=pd.read_csv('/data/TA_QUIZ_RNA_regulation/result/PartII.Ribo-seq_analysis/7.TE/wt.0-vs-1.TE_new.csv',sep='\t')
RF_data=RF_data.reset_index()
RF_data=RF_data.rename(columns={'index':'gene'})
RS_data=pd.read_csv('/data/TA_QUIZ_RNA_regulation/result/PartI.RNA-seq_analysis/differential_expression/5.DESeq2/wt/wt_rawdata.csv',sep=',')
RS_data=RS_data.rename(columns={'Row.names':'gene'})
data=pd.merge(RS_data[['gene','pvalue','padj','log2FoldChange']],RF_data[['gene','pvalue_final','pvalue.adjust','log2FC_TE_final']],on='gene',how='right')
data.columns=['gene','pvalue(RNA-seq)','padj(RNA-seq)','log2FoldChange(RNA-seq)','pvalue(TE)','padj(TE)','log2FoldChange(TE)']
data['group'] = 'darkgray'
result_1=data.loc[(data['pvalue(TE)']<0.05)&(data['log2FoldChange(TE)']>0)&(data['padj(RNA-seq)']>0.05),:]
result_2=data.loc[(data['pvalue(TE)']<0.05)&(data['log2FoldChange(TE)']<0)&(data['padj(RNA-seq)']>0.05),:]
data_= data[["log2FoldChange(RNA-seq)","log2FoldChange(TE)"]]
data_corr = data_.corr().iloc[0,1]
#确定坐标轴显示范围(自己定义调整)
xmin=-3
xmax=6
ymin=-8
ymax=8
#绘制散点图
fig = plt.figure(figsize=plt.figaspect(5/6)) #确定fig比例(h/w)
ax = fig.add_subplot()
ax.set(xlim=(xmin, xmax), ylim=(ymin, ymax), title='')
ax.scatter(data['log2FoldChange(RNA-seq)'], data['log2FoldChange(TE)'], s=15, c=data['group'])
ax.scatter(result_1['log2FoldChange(RNA-seq)'], result_1['log2FoldChange(TE)'], s=20, marker='.',c='#cc0000',label = str(len(result_1))+' TE up mRNAs')
ax.scatter(result_2['log2FoldChange(RNA-seq)'], result_2['log2FoldChange(TE)'], s=20, marker='.',c='steelblue',label = str(len(result_2))+' TE down mRNAs')
ax.spines['right'].set_visible(False) #去掉右边框
ax.spines['top'].set_visible(False) #去掉上边框
ax.spines['bottom'].set_linewidth(2)###设置底部坐标轴的粗细
ax.spines['left'].set_linewidth(2)####设置左边坐标轴的粗细
plt.tick_params(labelsize=15)
ax.set_xticks(range(xmin,xmax,1)) #设置x轴刻度起点和步长
ax.set_yticks(range(ymin,ymax,2)) #设置y轴刻度起点和步长
# font2 = {'family': 'Times New Roman','weight': 'normal','size': 15}
font2 = {'weight': 'normal','size': 18}
plt.xlabel('log2FoldChange(RNA-seq)',font2)
plt.ylabel('log2FoldChange(TE)',font2)
plt.legend(fontsize=10)
font3 = {'weight': 'normal','size': 16}
plt.text(3,4, 'r = '+str(round(data_corr,2)),font3)
plt.tight_layout()
# plt.show()
# 图片输出路径
plt.savefig('logFC_TE_corr.png')
plt.close()
Questions:
1)你得到的转录本丰度变化和翻译效率(TE)变化关系图是怎样的?
2)二者是否存在相关关系,若存在,试解释产生这种相关性的原因。
提示:可以尝试利用Riboseq count matrix计算Riboseq丰度的差异,然后画出了logFC(RNA-seq)和logFC(Ribo-seq)之间的关系。 提示:可思考翻译效率计算方式解释。
2) 结构改变程度是否影响翻译效率TE变化程度
这里我们利用假设检验来检验结构改变程度是否影响翻译效率TE变化程度。
Whether translation up-regulation is related to structural down-regulation
Whether translation down-regulation is related to structural up-regulation
我们需定义结构改变基因为hit level>1且Have structure changed region(|delta gini index|>0.1)。当delta gini index>0.1时为structural up-regulation,当delta gini index<-0.1时为structural down-regulation。
我们需定义翻译效率改变基因为P0.5。
这里我们仅计算WT数据。
2.1) 计算所需文件
2.2.Ribo-seq Analysis计算得到的差异翻译矩阵(使用自己计算得到文件),参考文件/data/TA_QUIZ_RNA_regulation/result/PartII.Ribo-seq_analysis/7.TE/wt.0-vs-1.TE_new.csv
2.3.SHAPE Data Analysis计算得到的结构改变区域数据(使用自己计算得到文件),参考文件/data/TA_QUIZ_RNA_regulation/result/PartIII.SHAPE-seq_analysis/merge/merge_data_WT.csv
2.2) 计算过程
这里我们给出参考脚本。
import pandas as pd # Data analysis
import numpy as np # Scientific computing
import matplotlib.pyplot as plt # Plotting
import matplotlib.colors as colors # Coloring
from scipy.stats import chi2_contingency
from scipy.stats import fisher_exact
#####翻译改变###################
exp_data_wt = pd.read_csv('/data/TA_QUIZ_RNA_regulation/result/PartI.RNA-seq_analysis/differential_expression/5.DESeq2/wt/wt_rawdata.csv',sep=',')
exp_data_wt =exp_data_wt.rename(columns={'Row.names':'gene','log2FoldChange':'log2FoldChange(EXP)','pvalue':'pvalue(EXP)','padj':'FDR(EXP)'})
exp_data_wt = exp_data_wt[['gene','log2FoldChange(EXP)','pvalue(EXP)','FDR(EXP)']]
ribo_wt = pd.read_csv('/data/TA_QUIZ_RNA_regulation/result/PartII.Ribo-seq_analysis/7.TE/wt.0-vs-1.TE_new.csv',sep='\t')
ribo_wt = ribo_wt.reset_index()
ribo_wt =ribo_wt.rename(columns={'index':'gene','log2FC_TE_final':'log2FoldChange(TE)','pvalue_final':'pvalue(TE)','pvalue.adjust':'FDR(TE)'})
ribo_wt = ribo_wt[['gene','log2FoldChange(TE)','pvalue(TE)','FDR(TE)']]
result = pd.merge(ribo_wt,exp_data_wt,on='gene',how='left')
gene_all_=set(result.loc[((result['log2FoldChange(TE)']>0.5)|(result['log2FoldChange(TE)']<-0.5))&(result['pvalue(TE)']<0.05)&(result['FDR(EXP)']>0.05),'gene'])
gene_TE_up_2=set(result.loc[(result['log2FoldChange(TE)']>0.5)&(result['pvalue(TE)']<0.05)&(result['FDR(EXP)']>0.05),'gene'])
gene_TE_no_up_2=gene_all_-gene_TE_up_2
gene_TE_down_2=set(result.loc[(result['log2FoldChange(TE)']<-0.5)&(result['pvalue(TE)']<0.05)&(result['FDR(EXP)']>0.05),'gene'])
gene_TE_no_down_2=gene_all_-gene_TE_down_2
#####结结构改变###################
#Have structure changed region(|delta gini index|>0.1)
shape_data_wt = pd.read_csv('/data/TA_QUIZ_RNA_regulation/result/PartIII.SHAPE-seq_analysis/merge/merge_data_WT.csv',sep='\t')
shape_data_wt = shape_data_wt.loc[(shape_data_wt['hit_z']>1)|(shape_data_wt['hit_f']>1),:]
shape_data_wt['up_down']= shape_data_wt['delta'].map(lambda x: 'up' if x>0 else 'down')
shape_data_wt['gene']=shape_data_wt['transcript_id'].map(lambda x:x.split('.')[0])
shape_up_1=set(shape_data_wt.loc[(shape_data_wt['up_down']=='up'),'gene'])&gene_all_
shape_no_up_1=gene_all_-shape_up_1
shape_down_1=set(shape_data_wt.loc[(shape_data_wt['up_down']=='down'),'gene'])&gene_all_
shape_no_down_1=gene_all_-shape_down_1
########################
dtest1=np.array([[len(gene_TE_down_2&shape_up_1),len(gene_TE_down_2&shape_no_up_1)],
[len(gene_TE_no_down_2&shape_up_1),len(gene_TE_no_down_2&shape_no_up_1)]])
# k,p,f,expctd =chi2_contingency(dtest)
o,p=fisher_exact(dtest1,alternative='greater')
print('shape up,TE down')
print(p)
dtest3=np.array([[len(gene_TE_up_2&shape_down_1),len(gene_TE_up_2&shape_no_down_1)],
[len(gene_TE_no_up_2&shape_down_1),len(gene_TE_no_up_2&shape_no_down_1)]])
o,p=fisher_exact(dtest3,alternative='greater')
print('shape down,TE up')
print(p)
给出假设检验结果
shape up,TE down
5.387478374073877e-23
shape down,TE up
0.7150732345900125
由此可知,结构变多可导致翻译效率下降。我们接下来用图片表示富集程度。计算脚本参考如下。
import pandas as pd # Data analysis
import numpy as np # Scientific computing
import matplotlib.pyplot as plt # Plotting
import matplotlib.colors as colors # Coloring
exp_data_wt = pd.read_csv('/data/TA_QUIZ_RNA_regulation/result/PartI.RNA-seq_analysis/differential_expression/5.DESeq2/wt/wt_rawdata.csv',sep=',')
exp_data_wt =exp_data_wt.rename(columns={'Row.names':'gene','log2FoldChange':'log2FoldChange(EXP)','pvalue':'pvalue(EXP)','padj':'FDR(EXP)'})
exp_data_wt = exp_data_wt[['gene','log2FoldChange(EXP)','pvalue(EXP)','FDR(EXP)']]
exp_wt_list=list(exp_data_wt.loc[exp_data_wt['FDR(EXP)']>0.05,'gene'])
ribo_wt = pd.read_csv('/data/TA_QUIZ_RNA_regulation/result/PartII.Ribo-seq_analysis/7.TE/wt.0-vs-1.TE_new.csv',sep='\t')
ribo_wt = ribo_wt.reset_index()
ribo_wt =ribo_wt.rename(columns={'index':'gene','log2FC_TE_final':'log2FoldChange(TE)','pvalue_final':'pvalue(TE)','pvalue.adjust':'FDR(TE)'})
ribo_wt = ribo_wt[['gene','log2FoldChange(TE)','pvalue(TE)','FDR(TE)']]
shape_data_wt = pd.read_csv('/data/TA_QUIZ_RNA_regulation/result/PartIII.SHAPE-seq_analysis/merge/merge_data_WT.csv',sep='\t')
shape_data_wt = shape_data_wt.loc[(shape_data_wt['hit_z']>1)|(shape_data_wt['hit_f']>1),:]
shape_data_wt['up_down']= shape_data_wt['delta'].map(lambda x: 'up' if x>0 else 'down')
shape_data_wt['gene']=shape_data_wt['transcript_id'].map(lambda x:x.split('.')[0])
shape_up_list=list(shape_data_wt.loc[(shape_data_wt['up_down']=='up'),'gene'])
shape_down_list=list(shape_data_wt.loc[(shape_data_wt['up_down']=='down'),'gene'])
result = pd.merge(ribo_wt,exp_data_wt,on='gene',how='left')
result=result.loc[result['FDR(EXP)']>0.05,:]
result['shape_up']=result['gene'].map(lambda x:1 if x in shape_up_list else 0)
result['shape_down']=result['gene'].map(lambda x:1 if x in shape_down_list else 0)
result['x'] = result['log2FoldChange(TE)']
result['y'] = -np.log10(result['pvalue(TE)'])
#设置显著性阈值
x_threshold=0.5
y_threshold=-np.log10(0.05)
result_2=result.loc[((result['x']<-0.5)|(result['x'] >0.5))&(result['y'] >y_threshold)&(result['shape_up']==1),:]
result_3=result.loc[((result['x']<-0.5)|(result['x'] >0.5))&(result['y'] >y_threshold)&(result['shape_down']==1),:]
#分组为up, normal, down
result['group'] = 'dimgrey'
# result.loc[(result.x > x_threshold)&(result.y > y_threshold),'group'] = 'tab:red' #x=-+x_threshold直接截断
# result.loc[(result.x < -x_threshold)&(result.y > y_threshold),'group'] = 'tab:blue' #x=-+x_threshold直接截断
# result.loc[result.y < y_threshold,'group'] = 'dimgrey' #阈值以下点为灰色
#确定坐标轴显示范围
xmin=-8
xmax=8
ymin=-1
ymax=8
#绘制散点图
fig = plt.figure(figsize=plt.figaspect(5/7)) #确定fig比例(h/w)
ax = fig.add_subplot()
ax.set(xlim=(xmin, xmax), ylim=(ymin, ymax), title='')
ax.scatter(result['x'], result['y'], s=2, c=result['group'])
# ax.scatter(result_2['x'], result_2['y'], s=10, marker='o',c='',edgecolors='tab:purple',label = 'More Structure in UV+')
ax.scatter(result_2['x'], result_2['y'], s=8, marker='o',c='',edgecolors='fuchsia',label = 'More Structure in UV+')
ax.scatter(result_3['x'], result_3['y'], s=5, marker='o',c='',edgecolors='tab:orange',label = 'Less Structure in UV+')
ax.spines['right'].set_visible(False) #去掉右边框
ax.spines['top'].set_visible(False) #去掉上边框
#水平和竖直线
ax.vlines(-x_threshold, ymin, ymax, color='dimgrey',linestyle='dashed', linewidth=1) #画竖直线
ax.vlines(x_threshold, ymin, ymax, color='dimgrey',linestyle='dashed', linewidth=1) #画竖直线
ax.hlines(y_threshold, xmin, xmax, color='dimgrey',linestyle='dashed', linewidth=1) #画竖水平线
plt.tick_params(labelsize=13)
ax.set_xticks(range(xmin,xmax,2)) #设置x轴刻度起点和步长
ax.set_yticks(range(ymin,ymax,2)) #设置y轴刻度起点和步长
# font2 = {'family': 'Times New Roman','weight': 'normal','size': 15}
font2 = {'weight': 'normal','size': 18}
plt.xlabel('Log2FoldChange(TE)',font2)
plt.ylabel('Log10Pvalue(TE)',font2)
plt.legend(fontsize=13)
plt.tight_layout()
plt.savefig('/data/liuxiaofan/test/volcano.png')
plt.close()
结果如下图
3) 翻译效率变化基因的motif分析
本节我们将分别对TE上调和TE下调的所有基因的3‘UTR和5'UTR富集序列motif。
3.1) 数据说明
2.2.Ribo-seq Analysis中计算得到的差异翻译矩阵(使用自己计算得到文件),参考文件/data/TA_QUIZ_RNA_regulation/result/PartII.Ribo-seq_analysis/7.TE/wt.0-vs-1.TE_new.csv
所有基因3’UTR/5'UTR的fastq文件
5'UTR的fastq文件:/data/TA_QUIZ_RNA_regulation/data/gene_list/motif/merge_5UTR.fasta
3'UTR的fastq文件:/data/TA_QUIZ_RNA_regulation/data/gene_list/motif/merge_3UTR.fasta
3.2) 提取所有3’UTR/5'UTR的fastq文件(可跳过)
input:
/data/TA_QUIZ_RNA_regulation/data/ATH/GFF/Ath_genes.gff
:参考基因组注释文件
/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/
:包含转录本全长序列信息文件
/data/TA_QUIZ_RNA_regulation/data/ATH/GTF/shape_map/result/transcript_exon_location.csv
:转录本位置信息文件
import numpy as np
import pandas as pd
col_uv_f = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/'
gff_path = '/data/TA_QUIZ_RNA_regulation/data/ATH/GFF/Ath_genes.gff'
data_gff = pd.read_csv(gff_path, sep='\t')
data_location = pd.read_csv( '/data/TA_QUIZ_RNA_regulation/data/ATH/GTF/shape_map/result/transcript_exon_location.csv',sep='\t')
data_gff=pd.merge(data_gff,data_location[['transcript_id','strand']],on='transcript_id',how='left')
gene=pd.read_csv('/data/TA_QUIZ_RNA_regulation/data/gene_list/wt/ribo_wt_gene_list.txt',sep='\t',header=None)
gene_list=list(gene[0])
data_1 = pd.read_csv(col_uv_f + '/final.modified_unmodified_new', sep='\t')
data_1.columns = ['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
'Untreated_mutations', 'Untreated_effective_depth', 'R1']
data_1 = data_1[['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
'Untreated_mutations', 'Untreated_effective_depth', 'R1']]
data_1 = data_1.drop_duplicates()
data_1['gene']=data_1['transcript_id'].map(lambda x:x.split('.')[0])
data=data_1
data_nuc=pd.DataFrame(columns={'gene','transcript_id','5UTR','CDS','3UTR'})
j=0
for i in gene_list:
data_ = data.loc[data['gene'] == i, :]
transcript_id = list(data_['transcript_id'])[0]
data_location_ = data_location.loc[data_location['transcript_id'] == transcript_id, :]
data_gff_ = data_gff.loc[data_gff['transcript_id'] == transcript_id, :]
if len(data_gff_) == 0:
continue
else:
if list(data_gff_['strand'])[0] == '+':
five_UTR_CDS = list(data_gff_.loc[data_gff_['location'] == 'CDS', 'strat'])[0]
three_UTR_CDS = list(data_gff_.loc[data_gff_['location'] == 'CDS', 'end'])[0]
else:
five_UTR_CDS = list(data_gff_.loc[data_gff_['location'] == 'CDS', 'end'])[0]
three_UTR_CDS = list(data_gff_.loc[data_gff_['location'] == 'CDS', 'strat'])[0]
Nuc = list(data_['Nucleotide'])[0]
location = np.array(list(data_location_['location'])[0].split(',')).astype('int')
five_UTR_CDS_ = np.where(location == five_UTR_CDS)[0][0]
three_UTR_CDS_ = np.where(location == three_UTR_CDS)[0][0]
data_5UTR_ = Nuc[:five_UTR_CDS_]
data_CDS_ = Nuc[five_UTR_CDS_:three_UTR_CDS_ + 1]
data_3UTR_ = Nuc[three_UTR_CDS_ + 1:]
data_nuc.loc[j, 'gene'] = i
data_nuc.loc[j, 'transcript_id'] = transcript_id
data_nuc.loc[j, '5UTR'] = data_5UTR_
data_nuc.loc[j, '5UTR_len'] = len(data_5UTR_)
data_nuc.loc[j, 'CDS'] = data_CDS_
data_nuc.loc[j, 'CDS_len'] = len(data_CDS_)
data_nuc.loc[j, '3UTR'] = data_3UTR_
data_nuc.loc[j, '3UTR_len'] = len(data_3UTR_)
j = j + 1
print(j)
data_nuc=data_nuc[['gene','transcript_id','5UTR_len','CDS_len','3UTR_len','5UTR','CDS','3UTR']]
print('ALL',len(data_nuc))
n=15
data_nuc_=data_nuc.loc[data_nuc['5UTR_len']>n]
data_nuc_.index=range(len(data_nuc_))
print('5UTR',len(data_nuc_))
f = open('/data/TA_QUIZ_RNA_regulation/data/gene_list/motif/merge_5UTR.fasta', 'w')
for i in range(len(data_nuc_)):
transcript = data_nuc_.loc[i, 'transcript_id']
nucleotide= data_nuc_.loc[i, '5UTR']
f.write('>' + transcript+'\n')
f.write(nucleotide + '\n')
f.close()
data_nuc_=data_nuc.loc[data_nuc['3UTR_len']>n]
data_nuc_.index=range(len(data_nuc_))
print('3UTR',len(data_nuc_))
f = open('/data/TA_QUIZ_RNA_regulation/data/gene_list/motif/merge_3UTR.fasta', 'w')
for i in range(len(data_nuc_)):
transcript = data_nuc_.loc[i, 'transcript_id']
nucleotide= data_nuc_.loc[i, '3UTR']
f.write('>' + transcript+'\n')
f.write(nucleotide + '\n')
f.close()
output:
merge_5UTR.fasta
:汇总的所有5‘UTR的序列信息
merge_3UTR.fasta
:汇总的所有3’UTR的序列信息
3.3) 提取TE变化区域的3‘UTR、5’UTR序列信息
请自行根据Part II 得到的TE上/下调的gene list
,在merge_5UTR.fasta和merge_3UTR.fasta中筛选,最终得到的即为TE变化基因的5'UTR,3‘UTR序列信息。
output:
merge_TE_up_5UTR.fasta
/merge_TE_down_5UTR.fasta
:汇总的TE上/下调的所有基因5’UTR的fasta信息。参考:/data/TA_QUIZ_RNA_regulation/data/gene_list/motif/TE_DOWN/merge_TE_down_5UTR.fasta;/data/TA_QUIZ_RNA_regulation/data/gene_list/motif/TE_UP/merge_TE_up_5UTR.fasta
merge_TE_up_3UTR.fasta
/merge_TE_down_3UTR.fasta
:汇总的TE上/下调的所有基因3’UTR的fasta信息。参考:/data/TA_QUIZ_RNA_regulation/data/gene_list/motif/TE_DOWN/merge_TE_down_3UTR.fasta;/data/TA_QUIZ_RNA_regulation/data/gene_list/motif/TE_UP/merge_TE_up_3UTR.fasta
MEME在线motif分析
我们利用MEME在线版进行motif分析,地址为http://meme-suite.org/tools/meme。 使用MEME的Motif discovery工具,预测输入序列上的motif信息
我们上传merge_TE_up_5UTR.fasta
/merge_TE_down_5UTR.fasta
后提交运算。
在得到的结果中我们根据Evalue进行筛选,Sites为找到的motif在几个结构改变区域中出现过,由此可以计算出该motif的覆盖率。我们选择覆盖率较高的motif,这里取大于4%。
4) 更多思考和分析
翻译效率(TE)发生变化的基因的3‘UTR和5'UTR分别富集到的motif是什么?有什么不同?有没有之前的研究的支持?
教程中利用 MEME分析的是 sequence motif,我们能否利用一些预测structure motif 的软件和方法探索一些影响和调控翻译的structure motif?