2.3 SHAPE Data Analysis
This chapter is a tutorial on processing SHAPE-MaP sequencing data, organized in two parts: Prepare Data Matrix and Data analysis.
Prepare Data Matrix
Preprocess the data with ShapeMapper and compute the reactivity of every transcript.
Data analysis
Compute the hit level and determine a threshold from the hit level distribution.
Compute the normalization factor, then preprocess and normalize the reactivities.
Compute sliding-window Gini indices, use the windows to locate the regions of each transcript whose structure changes upon the external stimulus, merge contiguous windows, and summarize the changed regions.
The SHAPE-MaP sequencing data consist of 12 samples covering treated and untreated conditions. Because ShapeMapper runs very slowly, only a subset of example files is provided here for practicing ShapeMapper; the per-transcript reactivity files for all samples, already processed with ShapeMapper, are provided at the end of this section.
Here we take 1/20 of sample C1_1 from the SHAPE-MaP data for the calculation below. The pre-treatment fastq files are CD1_1.clean.part1.R1.fastq.gz and CD1_1.clean.part1.R2.fastq.gz; the post-treatment fastq files are CN1_1.clean.part1.R1.fastq.gz and CN1_1.clean.part1.R2.fastq.gz. To further reduce the running time, the reference transcript file has also been simplified: Arabidopsis_thaliana.TAIR10.34.transcripts_new_2.fa contains only three transcripts for this exercise.
Reference script: /data/TA_QUIZ_RNA_regulation/data/test/test.shapemapper.sh
input:
Pre-treatment fastq files: CD1_1.clean.part1.R1.fastq.gz and CD1_1.clean.part1.R2.fastq.gz
Post-treatment fastq files: CN1_1.clean.part1.R1.fastq.gz and CN1_1.clean.part1.R2.fastq.gz
Reference transcript sequences: Arabidopsis_thaliana.TAIR10.34.transcripts_new_2.fa
Software/Parameters:
shapemapper
reference:
shapemapper \
--target ~/Arabidopsis_thaliana.TAIR10.34.transcripts_new_1_1.fa \
--name "Ath_test_C1_1" \
--min-depth 100 \
--min-qual-to-count 20 \
--overwrite \
--modified --folder ~/data/modified/ \
--untreated --folder ~/data/control/ \
--star-aligner \
--nproc 8 \
--verbose
result:
The reference output is in /data/TA_QUIZ_RNA_regulation/data/test/C1_1_result/shapemapper_out. Taking the transcript AT1G20620.4 as an example, C1_1_AT1G20620.4_profile.txt is the output file; its Sequence, Modified_mutations, Modified_effective_depth, Untreated_mutations, and Untreated_effective_depth columns carry the main information we need, from which the reactivity of every position can be computed.
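As a sketch of that calculation (the counts below are made-up toy values, not real AT1G20620.4 data), the per-nucleotide reactivity is the modified mutation rate minus the untreated mutation rate:

```python
import numpy as np

def reactivity(mod_mut, mod_depth, untr_mut, untr_depth):
    """Per-nucleotide reactivity: modified mutation rate minus
    untreated (background) mutation rate."""
    mod_mut, mod_depth = np.asarray(mod_mut, float), np.asarray(mod_depth, float)
    untr_mut, untr_depth = np.asarray(untr_mut, float), np.asarray(untr_depth, float)
    return mod_mut / mod_depth - untr_mut / untr_depth

# toy counts for two positions
print(reactivity([5, 20], [1000, 1000], [2, 2], [1000, 1000]))  # [0.003 0.018]
```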
In the subsequent calculations we use our precomputed files. We merged the three replicates of each experimental condition before computing reactivity, obtaining the final.modified_unmodified file. Each row is one transcript; the columns are transcript_id, Nucleotide, Modified_mutations, Modified_effective_depth, Untreated_mutations, Untreated_effective_depth.
We then identify structure-changed regions from the merged ShapeMapper results of all transcripts in col_nouv and col_uv.
The analysis comprises the following steps:
Compute the hit level
Inspect the hit level distribution and choose a threshold
Compute the normalization factor
Preprocess and normalize the reactivities
Merge structure-changed regions
To aid understanding, we first introduce a few basic concepts: reactivity, hit level, and Gini index.
Reactivity: R = mutation rate(S) - mutation rate(B), where S denotes a SHAPE-modified sample and B the untreated sample.
Hit level: the hit level metric quantifies the total background-subtracted signal per nucleotide of transcript.
Gini index: we apply this metric to sliding windows of 50 nucleotides, requiring a minimum of 100 total reads and fewer than 10% NaN values among the window's reactivities.
Reactivity and hit level are described in "RNA motif discovery by SHAPE and mutational profiling (SHAPE-MaP)"; the Gini index follows "A Stress Response that Monitors and Regulates mRNA Structure Is Central to Cold Shock Adaptation".
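As a small illustration of the Gini index on a reactivity window (the same accumulation scheme used by the get_gini function later in this section; toy values, with NaNs assumed already removed):

```python
import numpy as np

def gini(values):
    """Gini index of a window of reactivities (NaNs assumed removed):
    0 for a perfectly uniform window, approaching 1 as the signal
    concentrates on fewer nucleotides."""
    x = np.sort(np.asarray(values, dtype=float))
    height, area = 0.0, 0.0
    for v in x:
        height += v
        area += height - v / 2.0
    fair_area = height * len(x) / 2.0
    if fair_area == 0:
        return np.nan
    return (fair_area - area) / fair_area

print(gini([1.0, 1.0, 1.0, 1.0]))   # 0.0: uniform reactivity
print(gini([0.0, 0.0, 0.0, 1.0]))   # 0.75: signal concentrated on one nucleotide
```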
Compute hit level values from the results: report the hit level of each transcript, plus the mean hit level within depth-defined bins. hit_level.py
See /data/TA_QUIZ_RNA_regulation/data/script/PartIII.SHAPE-seq_analysis/hit_level.py
Reference script: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/1.calculate_hit_level.sh
reference: using col_nouv as an example
### hit level
### 1)calculate the average mutation events above background for each transcript
path0=/data/TA_QUIZ_RNA_regulation/data/script/PartIII.SHAPE-seq_analysis
path1=/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv
## Note: the input file here is final.modified_unmodified, not final.modified_unmodified.hit
python $path0/hit_level.py \
--data_path $path1/final.modified_unmodified \
--savepath_hit $path1/final.modified_unmodified.hit
echo -e "cutoff\ttranscript_id\tmodified_depth_median\tunmodified_depth_median\tmodified_depth_sum\tunmodified_depth_sum\thit" > $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=25 && $2>0{print "0\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=50 && $2 > 25{print "25\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=100 && $2 > 50{print "50\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=200 && $2 > 100{print "100\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=300 && $2 > 200{print "200\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=500 && $2 >300{print "300\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=750 && $2 >500{print "500\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=1000 && $2 >750{print "750\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=2000 && $2 >1000{print "1000\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=5000 && $2 >2000{print "2000\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2 > 5000{print "5000\t"$0}' $path1/final.modified_unmodified.hit >> $path1/cutoff.hit.group;
result:
final.modified_unmodified.hit: column 6 is the hit level value of each transcript.
cutoff.hit.group: the quality-filtered version of final.modified_unmodified.hit, keeping rows with unmodified.median > 0 and grouping them by modified.median range.
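The awk chain above bins transcripts by the median modified depth; the same binning can be sketched in pandas (toy values; column names are shortened stand-ins for the .hit file's columns):

```python
import pandas as pd

# toy .hit rows (hypothetical values); only rows with unmodified depth
# median > 0 pass quality control, then bin by modified depth median
hit = pd.DataFrame({
    'transcript_id': ['t1', 't2', 't3'],
    'modified_depth_median': [30, 600, 8000],
    'unmodified_depth_median': [10, 500, 7000],
})
bins = [0, 25, 50, 100, 200, 300, 500, 750, 1000, 2000, 5000, float('inf')]
labels = [0, 25, 50, 100, 200, 300, 500, 750, 1000, 2000, 5000]  # bin lower bounds
ok = hit[hit['unmodified_depth_median'] > 0].copy()
ok['cutoff'] = pd.cut(ok['modified_depth_median'], bins=bins, labels=labels)
print(ok['cutoff'].tolist())   # [25, 500, 5000]
```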
Reference script: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/2.hit_level.py
##### 1. Count the number of transcripts at different hit levels #####
import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from numpy import median
from numba import jit
######### Transcript annotation and data input #########
#### path #####
exon_gtf_path='/data/TA_QUIZ_RNA_regulation/data/ATH/GTF/shape_map/ath_exons.gtf'
col_uv_f_path='/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/'
col_uv_z_path='/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_uv/'
### Read the annotation file and extract 'transcript_id','gene_type','gene_name','chr' from the attributes column (column 9) of the exon GTF
gtf_data=pd.read_csv(exon_gtf_path,sep='\t',header=None)
gtf_data_new=pd.DataFrame(columns=['transcript_id','gene_type','gene_name','chr'])
gtf_data_new['transcript_id']=gtf_data.iloc[:,8].apply(lambda x:x.split(';')[1].split('"')[1])
gtf_data_new['gene_type'] = gtf_data.iloc[:,8].apply(lambda x:re.findall('gene_biotype ".*?"',x)[0].split('"')[1])
gtf_data_new['gene_name'] = gtf_data.iloc[:,8].apply(lambda x:re.findall('gene_name ".*?"',x)[0].split('"')[1] if 'gene_name' in x else np.nan)
gtf_data_new['chr'] = gtf_data.iloc[:,0].apply(lambda x: 6 if x=='Mt' else 7 if x=='Pt' else x ).astype('int')
gtf_data_new = gtf_data_new.drop_duplicates()
gtf_data_new.index = range(len(gtf_data_new))
# Extract the 'transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit' columns for col_nouv and col_uv, and merge in 'gene_type','gene_name','chr' from the ath_exons annotation
hit_level_col_uv_f = pd.read_csv(col_uv_f_path+'/final.modified_unmodified.hit',sep='\t',header=None)
hit_level_col_uv_f.columns =['transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit']
hit_level_col_uv_f = pd.merge(hit_level_col_uv_f,gtf_data_new,on='transcript_id',how='left')
hit_level_col_uv_f['Type'] = 'WT_UV-'
hit_level_col_uv_z = pd.read_csv(col_uv_z_path+'/final.modified_unmodified.hit',sep='\t',header=None)
hit_level_col_uv_z.columns =['transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit']
hit_level_col_uv_z = pd.merge(hit_level_col_uv_z,gtf_data_new,on='transcript_id',how='left')
hit_level_col_uv_z['Type'] = 'WT_UV+'
## Count the transcripts in col_nouv and col_uv with hit_level > -1000, > 0, > 1, > 2, > 5, > 10, and > 15, and arrange the counts as a data frame.
data1=pd.DataFrame(columns=['Type','Number of transcripts','hit'])
for num in [-1000,0,1,2,5,10,15]:
WT_UV_f = len(hit_level_col_uv_f.loc[(hit_level_col_uv_f['hit'] > num) & (hit_level_col_uv_f['modified.median'] > 100), :])
WT_UV_z = len(hit_level_col_uv_z.loc[(hit_level_col_uv_z['hit'] > num) & (hit_level_col_uv_z['modified.median'] > 100)])
data2 = pd.DataFrame(columns=['Type', 'Number of transcripts', 'hit'])
data2['Type'] = ['WT_UV-', 'WT_UV+']
data2['Number of transcripts'] = [WT_UV_f, WT_UV_z]
if num==-1000:
data2['hit'] ='ALL transcripts'
else:
data2['hit']='Hit level>'+str(num)
data1 = pd.concat([data1,data2])
# Plot the hit level distribution
# Note: matplotlib's default interactive backend cannot be used on a headless Linux server; call plt.switch_backend('agg') to select a non-interactive backend.
plt.switch_backend('agg')
plt.figure(figsize=(8, 6))
sns.set(style="ticks", context="talk")
sns.axes_style({'font.family': ['sans-serif'],'font.sans-serif': ['Arial']})
g = sns.barplot(y='Number of transcripts',x='hit',hue='Type',data=data1,palette=["#3498DB", "#1F618D"])
sns.despine()
font1 = {'family' : 'Arial','weight' : 'roman','size': 22}
plt.xticks(rotation=60)
plt.legend(fontsize='small')
plt.tight_layout()
plt.savefig('/data/TA_QUIZ_RNA_regulation/result/PartIII.SHAPE-seq_analysis/plot/2a.transcript_num_hit.png')
plt.close()
Note: matplotlib's default interactive backend cannot be used on a headless Linux server, so plt.switch_backend('agg') is called to select a non-interactive backend.
result:
2a.transcript_num_hit.png: a plot of the number of transcripts as a function of the hit level threshold.
Based on this trend, we take transcripts with hit level > 2 as credible and use them in the downstream analysis.
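In pandas this credibility filter is a one-liner (a sketch with made-up values; the column names follow the hit-level table built in the script above):

```python
import pandas as pd

# toy hit-level table (hypothetical values)
df = pd.DataFrame({'transcript_id': ['a', 'b', 'c'],
                   'modified.median': [500, 80, 300],
                   'hit': [5.0, 8.0, 1.5]})
credible = df[(df['hit'] > 2) & (df['modified.median'] > 100)]
print(credible['transcript_id'].tolist())   # ['a']: 'b' fails depth, 'c' fails hit level
```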
For normalization we use rRNAs, lncRNAs and tRNAs, which were sequenced to high depth and showed large hit levels (hit level > 10) across experimental probing conditions. Reference script for finding these rRNA/lncRNA/tRNA normalization transcripts: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/3.normalization_factor.py
import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from numpy import median
from numba import jit
######### Transcript annotation and data input #########
#### path #####
exon_gtf_path='/data/TA_QUIZ_RNA_regulation/data/ATH/GTF/shape_map/ath_exons.gtf'
col_uv_f_path='/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/'
col_uv_z_path='/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_uv/'
### Read the annotation file and extract 'transcript_id','gene_type','gene_name','chr' from the attributes column (column 9) of the exon GTF
gtf_data=pd.read_csv(exon_gtf_path,sep='\t',header=None)
gtf_data_new=pd.DataFrame(columns=['transcript_id','gene_type','gene_name','chr'])
gtf_data_new['transcript_id']=gtf_data.iloc[:,8].apply(lambda x:x.split(';')[1].split('"')[1])
gtf_data_new['gene_type'] = gtf_data.iloc[:,8].apply(lambda x:re.findall('gene_biotype ".*?"',x)[0].split('"')[1])
gtf_data_new['gene_name'] = gtf_data.iloc[:,8].apply(lambda x:re.findall('gene_name ".*?"',x)[0].split('"')[1] if 'gene_name' in x else np.nan)
gtf_data_new['chr'] = gtf_data.iloc[:,0].apply(lambda x: 6 if x=='Mt' else 7 if x=='Pt' else x ).astype('int')
gtf_data_new = gtf_data_new.drop_duplicates()
gtf_data_new.index = range(len(gtf_data_new))
# Extract the 'group','transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit' columns from 'cutoff.hit.group' for col_nouv and col_uv, and merge in 'gene_type','gene_name','chr' from the ath_exons annotation
hit_level_col_uv_f = pd.read_csv(col_uv_f_path+'/cutoff.hit.group',sep='\t')
hit_level_col_uv_f.columns =['group','transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit']
hit_level_col_uv_f = pd.merge(hit_level_col_uv_f,gtf_data_new,on='transcript_id',how='left')
hit_level_col_uv_f['spe'] = 'WT_UV-'
hit_level_col_uv_z = pd.read_csv(col_uv_z_path+'/cutoff.hit.group',sep='\t')
hit_level_col_uv_z.columns =['group','transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit']
hit_level_col_uv_z = pd.merge(hit_level_col_uv_z,gtf_data_new,on='transcript_id',how='left')
hit_level_col_uv_z['spe'] = 'WT_UV+'
### From col_uv_f and col_uv_z, select transcripts with gene_type in ['lncRNA','rRNA','tRNA'], hit level > 10 and modified.median > 5000, and print their transcript IDs ###
col_uv_f_id = hit_level_col_uv_f.loc[(hit_level_col_uv_f.gene_type.isin(['lncRNA','rRNA','tRNA']))&(hit_level_col_uv_f['modified.median']>5000)&(hit_level_col_uv_f['hit']>10),'transcript_id']
col_uv_z_id = hit_level_col_uv_z.loc[(hit_level_col_uv_z.gene_type.isin(['lncRNA','rRNA','tRNA']))&(hit_level_col_uv_z['modified.median']>5000)&(hit_level_col_uv_z['hit']>10),'transcript_id']
print(col_uv_f_id)
print(col_uv_z_id)
print(set(col_uv_f_id)&set(col_uv_z_id))
The transcripts selected for normalization are {'AT3G41768.1', 'AT3G06355.1', 'ATMG01390.1'}.
First we set the reactivity of positions with insufficient depth or unreliable signal to NaN. Then we normalize the reactivities using these normalization transcripts (AT3G41768.1, AT3G06355.1, ATMG01390.1).
(1) Reactivity set to NaN: nucleotides with apparent mutation rates above 0.02 in any untreated sample are excluded from the analysis. Artifact regions are identified as stretches of at least 10 nucleotides in which three or more of the 10 nucleotides show mutation rates above 0.03 in the absence of NAI treatment, or modified mutation rates above 0.1 in any condition; these are also excluded. Nucleotides with read depth below 100 in any condition are excluded as well.
(2) Reactivity normalization: reactivities are normalized within each probing condition to the mean of the 92nd-98th percentile reactivities of nucleotides from the rRNAs, which were sequenced to high depth and showed large hit levels (hit level > 0) across experimental probing conditions.
Reference script: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/4.data_normalization.py
import numpy as np
import pandas as pd
import re
from numba import jit
import math
@jit(nopython=False)
def calc_quartile(x, q, qtype=7):
# Compute the given quantile of x; q is the quantile as a decimal (e.g. 0.25 or 0.75), qtype selects the estimation method.
# source: http://adorio-research.org/wordpress/?p=125
# x = array, q = quartile (in % as a decimal)
y = np.copy(x)
n = len(y)
abcd = [(0, 0, 1, 0), # inverse empirical distrib.function., R type 1
(0.5, 0, 1, 0), # similar to type 1, averaged, R type 2
(0.5, 0, 0, 0), # nearest order statistic,(SAS) R type 3
(0, 0, 0, 1), # California linear interpolation, R type 4
(0.5, 0, 0, 1), # hydrologists method, R type 5
(0, 1, 0, 1), # mean-based estimate(Weibull method), (SPSS,Minitab), type 6
(1, -1, 0, 1), # mode-based method,(S, S-Plus), R type 7
(1.0 / 3, 1.0 / 3, 0, 1), # median-unbiased , R type 8
(3 / 8.0, 0.25, 0, 1) # normal-unbiased, R type 9.
]
a, b, c, d = abcd[qtype - 1]
g, j = math.modf(a + (n + b) * q - 1)
# The first quartile is usually computed as (n+3)/4 and the third as (3n+1)/4; here they become (n-1)/4 and (3n-3)/4 because array indices start at x[0], hence the -1.
if j < 0:
return x[0]
elif j >= n:
return x[n - 1]
j = int(math.floor(j))
if g == 0:
return x[j]
else:
return y[j] + (y[j + 1] - y[j]) * (c + d * g)
# @jit(nopython=False)
def find_boxplot_factor(array):
x, o, a = [], [], 0
# Following deprecated line is behavior that normalization and
# structure modeling were optimized with, but this behavior
# is probably not ideal. For RNAs with regions of poor sequencing
# depth, treating those regions as unreactive could bias the
# normalized reactivities. This is especially important for
# larger RNAs.
# x = np.fromiter((n if not isnan(n) else 0 for n in array))
x = array[np.where(np.isfinite(array))]
# Keep only the finite values in the array.
x = x[np.nonzero(x)]
# Keep only the nonzero elements (np.nonzero returns their indices).
if x.shape[0] < 10:
norm_factor = np.nan
else:
x.sort()
ten_pct = len(x) // 10
five_pct = len(x) // 20
# calculate the interquartile range *1.5
q_limit = 1.5 * abs(calc_quartile(x, 0.25) - calc_quartile(x, 0.75))
ten_limit = x[x.shape[0] - 1 - ten_pct]
five_limit = x[x.shape[0] - 1 - five_pct]
# choose the cutoff that eliminates the fewest points
limit = max(q_limit, ten_limit)
if len(x) < 100:
limit = max(q_limit, five_limit)
# make new list without the outliers
for i in range(len(x)):
if x[i] < limit:
o.append(x[i])
# avg next ten percent
try:
for i in range(-ten_pct, 0):
a = o[i] + a
norm_factor = a / ten_pct
except IndexError:
norm_factor = np.nan
return norm_factor
@jit(nopython=False)
def find_boxplot_factor_new(array):
####the mean of the 92-98th percentile reactivities#####
# x, o, a = [], [], []
x = array[np.where(np.isfinite(array))]
x = x[np.nonzero(x)]
if x.shape[0] < 10:
norm_factor = np.nan
else:
# calculate the interquartile range *1.5
o = x[(x>=np.quantile(x,0.92))&(x<=np.quantile(x,0.98))]
norm_factor = np.mean(o)
return norm_factor
@jit(nopython=False)
def filter(X,Mutru_cut_off=0.02,read_depth_cutoff=100,R_window=10,window_mutru_cutoff=0.03,window_mutru_num=3,window_mutrs_cutoff=0.1,window_mutrs_num=3):
name = X[0]
nucleotide = X[1]
modified = np.array(X[2].split(',')).astype('float').astype('int')
modified_depth = np.array(X[3].split(',')).astype('float').astype('int')
unmodified = np.array(X[4].split(',')).astype('float').astype('int')
unmodified_depth = np.array(X[5].split(',')).astype('float').astype('int')
# print(nucleotide)
Mutrs = modified/modified_depth
Mutru = unmodified/unmodified_depth
R = Mutrs - Mutru
n = len(modified)
##### Filter and correct the reactivity #####
R[Mutru>Mutru_cut_off] = np.nan
R[(modified_depth <= read_depth_cutoff)|(unmodified_depth <= read_depth_cutoff)] = np.nan
R[R<0]= np.nan
for i in range(len(R)-R_window+1):
data_Mutru = Mutru[i:i+R_window-1]
data_Mutrs = Mutrs[i:i+R_window-1]
if (len(data_Mutru[data_Mutru>window_mutru_cutoff])>=window_mutru_num)|(len(data_Mutrs[data_Mutrs>window_mutrs_cutoff])>=window_mutrs_num):
R[i:i + R_window-1] = np.nan
R_new = ",".join(list(R.astype('str'))).replace('nan','-999')
X = X.append(pd.Series(R_new))
return X
#
@jit(nopython=False)
def get_fatcor(data):
R_all = np.array(data.iloc[0,6].split(',')).astype('float')
for i in range(1,len(data)):
R_ = np.array(data.iloc[i,6].split(',')).astype('float')
R_all = np.hstack((R_all,R_))
R_all[R_all==-999]=np.nan
factor = find_boxplot_factor_new(R_all)
return factor
@jit(nopython=False)
def normalization_all(X,factor):
######### read the data #########
R = np.array(X[6].split(',')).astype('float')
R_nan = len(R[R == -999])/len(R)
R_0 = len(R[R == 0])/len(R)
R[R == -999] = np.nan
R_new = R/factor
R_new = ",".join(list(R_new.astype('str'))).replace('nan','-999')
X[6] =R_new
return X,R_nan,R_0
def main(data,transcript_list=[]):
data_filter = pd.DataFrame(np.zeros([len(data),7]))
data_filter.columns = ['transcript_id','Nucleotide' ,'Modified_mutations', 'Modified_effective_depth',
'Untreated_mutations', 'Untreated_effective_depth', 'reactivity']
data_new = pd.DataFrame(np.zeros([len(data), 7]))
data_new.columns = ['transcript_id','Nucleotide' ,'Modified_mutations', 'Modified_effective_depth',
'Untreated_mutations', 'Untreated_effective_depth', 'reactivity']
print(len(data))
for i in range(len(data)):
if i %1000 == 0:
print(i)
X=filter(data.iloc[i,:])
data_filter.iloc[i,:] = list(X)
if len(transcript_list)> 0:
factor = get_fatcor(data_filter.loc[data_filter.transcript_id.isin(transcript_list),:])
else:
factor = get_fatcor(data_filter)
for i in range(len(data_filter)):
if i %1000 == 0:
print(i)
# print data_
data_new.iloc[i,:],R_nan,R_0 = normalization_all(data_filter.iloc[i,:],factor)
# data_new.loc[i, 'ratio_nan']=R_nan
# data_new.loc[i, 'ratio_0'] = R_0
data_new = data_new[['transcript_id','Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
'Untreated_mutations', 'Untreated_effective_depth', 'reactivity']]
return data_new
def loctaion(X):
X_1 = X.split('(')[1].split(')')[0].split(',')
loctaion_list = []
for i in range(len(X_1)):
loctaion_list_ = [i for i in range(int(X_1[i].split(':')[0]),int(X_1[i].split(':')[1])+1)]
loctaion_list.extend(loctaion_list_)
loctaion_list =np.array(loctaion_list)
loctaion_list = ",".join(list(loctaion_list.astype('str')))
return loctaion_list
def mapping(exon_data):
data_location = pd.DataFrame(np.zeros([len(exon_data),4]))
data_location.columns = ['transcript_id','chr','strand','location']
for i in range(len(exon_data)):
if i%1000 ==0:
print(i)
data_location.loc[i, 'transcript_id'] = exon_data.loc[i, 'transcript_id']
data_location.loc[i,'chr'] = exon_data.loc[i,'chr'].split('.')[0]
data_location.loc[i, 'strand'] = exon_data.loc[i, 'chr'].split('.')[1]
data_location.loc[i, 'location'] = loctaion(exon_data.loc[i, 'start_end'])
return data_location
if __name__ == '__main__':
col_uv_f = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/'
col_uv_z = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_uv/'
path=[col_uv_f,col_uv_z]
for i in range(len(path)):
transcript_list = [ 'AT3G41768.1', 'AT3G06355.1', 'ATMG01390.1' ]
data_uv_z = pd.read_csv(path[i] + '/final.modified_unmodified', sep='\t')
data_uv_z.columns = ['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth','Untreated_mutations', 'Untreated_effective_depth']
data_uv_z_new=main(data_uv_z,transcript_list)
print(data_uv_z_new)
data_uv_z_new.to_csv(path[i]+'/final.modified_unmodified_new',sep='\t',header=True,index=False)
output:
final.modified_unmodified_new: identical to final.modified_unmodified plus one extra column, reactivity.
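The core of the percentile normalization above can be condensed to a few lines (a sketch mirroring find_boxplot_factor_new, run on toy reactivities; the normalized reactivity is then R divided by this factor):

```python
import numpy as np

def norm_factor(reactivities):
    """Normalization factor: mean of reactivities between the 92nd and
    98th percentiles, computed over finite, nonzero values."""
    x = np.asarray(reactivities, dtype=float)
    x = x[np.isfinite(x)]
    x = x[x != 0]
    if x.size < 10:
        return np.nan        # too few usable positions
    lo, hi = np.quantile(x, 0.92), np.quantile(x, 0.98)
    return np.mean(x[(x >= lo) & (x <= hi)])

# toy reactivities 1..100: the factor is the mean of the values 93..98
factor = norm_factor(np.arange(1, 101, dtype=float))
print(factor)   # 95.5
```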
We use sliding windows to identify the structure-changed regions of a transcript. Note the following points:
Each sliding window must contain more than 90% non-NaN values (ratio_nan = 0.9).
The window size is 50 nt and the step is 1 nt.
Only transcripts with hit_level > 0 and modified_depth_median > 100 are considered.
If, comparing the UV- and UV+ data, a window shows abs(delta_gini) > 0.1, the transcript is called structure-changed and that window is a structure-changed region; one transcript can have several such regions.
Gini indices are computed for the whole transcript and for its 3'UTR, 5'UTR and CDS regions.
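The merging of significant windows into regions can be sketched as follows (an illustration of the merging step with made-up delta Gini values, not the authors' exact implementation; window and step follow the text, 50 nt and 1 nt):

```python
import numpy as np

def merge_windows(delta_gini, window=50, threshold=0.1):
    """Merge overlapping sliding windows with |delta Gini| > threshold
    into structure-changed regions. delta_gini[i] belongs to the window
    starting at nucleotide i; returns (start, end) intervals, 0-based,
    end exclusive."""
    d = np.asarray(delta_gini, dtype=float)
    hits = np.where(np.abs(d) > threshold)[0]   # NaN compares False, so NaN windows are skipped
    regions = []
    for i in hits:
        start, end = int(i), int(i) + window
        if regions and start <= regions[-1][1]:  # overlaps the previous region
            regions[-1][1] = end
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]

delta = np.zeros(200)
delta[10], delta[100] = 0.2, -0.3          # two significant windows, far apart
print(merge_windows(delta))                 # [(10, 60), (100, 150)]
```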
Reference script: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/5.calculate_gini_index.py
######################## Compute the Gini index #############################
import numpy as np
import pandas as pd
import re
from numba import jit
from scipy.stats import ks_2samp
import statsmodels.stats.multitest as multi
@jit(nopython=False)
def get_gini(R_, nucleotide_, cut_AC=False, ratio_nan=0.9, ratio_nan_AC=0.8):
########
if len(R_)==0:
gini=np.nan
else:
R_nonull = R_[np.isnan(R_) == False]
# R_nonull = R_[R_>0]
ratio = len(R_nonull) / len(R_)
if ratio <= ratio_nan:
gini = np.nan
else:
if cut_AC:
R_AC = R_[(nucleotide_ == b'A') | (nucleotide_ == b'C')]
R_AC_nonull = R_AC[np.isnan(R_AC) == False]
ratio_AC = len(R_AC_nonull) / len(R_AC)
if (ratio_AC <= ratio_nan_AC) or (len(R_AC) <= 1) or (len(R_AC_nonull) <= 1):
gini = np.nan
else:
sorted = np.sort(R_AC_nonull)
height, area = 0, 0
for i in range(0, len(sorted)):
height += sorted[i]
area += height - sorted[i] / 2.
fair_area = height * len(sorted) / 2.
if fair_area == 0:
gini = np.nan
else:
gini = (fair_area - area) / fair_area
else:
sorted = np.sort(R_nonull)
height, area = 0, 0
for i in range(0, len(sorted)):
height += sorted[i]
area += height - sorted[i] / 2.
fair_area = height * len(sorted) / 2.
if fair_area == 0:
gini = np.nan
else:
gini = (fair_area - area) / fair_area
return gini
@jit(nopython=False)
def get_p(R_1,R_2,ratio_nan=0.9):
R_1_nonull = R_1[np.isnan(R_1) == False]
ratio_1 = len(R_1_nonull) / len(R_1)
R_2_nonull = R_2[np.isnan(R_2) == False]
ratio_2 = len(R_2_nonull) / len(R_2)
if (ratio_1 <= ratio_nan)|(ratio_2 <= ratio_nan):
p = np.nan
else:
s,p=ks_2samp(R_1, R_2)
return p
@jit(nopython=False)
def get_window(X, window=50, step=1, cut_AC=False, ratio_nan=0.9, ratio_nan_AC=0.8):
######### read the data #########
name = X[0]
nucleotide = np.array(list(X[1]))
modified = np.array(X[2].split(',')).astype('float').astype('int')
modified_depth = np.array(X[3].split(',')).astype('float').astype('int')
#
unmodified = np.array(X[4].split(',')).astype('float').astype('int')
unmodified_depth = np.array(X[5].split(',')).astype('float').astype('int')
Mutrs = modified / modified_depth
Mutru = unmodified / unmodified_depth
# R = Mutrs - Mutru
R = np.array(X[6].split(',')).astype('float')
R[R == -999] = np.nan
n = len(nucleotide)
##### Sliding-window computation #####
nucleotide_ = np.zeros([len(range(0, n - (window), step)), window], dtype=np.string_)
R_ = np.zeros([len(range(0, n - (window), step)), window])
gini = np.zeros([len(range(0, n - (window), step)), ])
j = 0
for i in range(0, n - (window), step):
nucleotide_[j, :] = nucleotide[i:i + window]
R_[j, :] = R[i:i + window]
gini[j] = get_gini(R[i:i + window], nucleotide[i:i + window], cut_AC=cut_AC, ratio_nan=ratio_nan,
ratio_nan_AC=ratio_nan_AC)
j = j + 1
return nucleotide_, R_, gini
@jit(nopython=False)
def get_window_p(X1,X2, window=50, step=1,ratio_nan=0.9):
######### read the data #########
name = X1[0]
nucleotide = np.array(list(X1[1]))
R_z = np.array(X1[6].split(',')).astype('float')
R_z[R_z == -999] = np.nan
R_f = np.array(X2[6].split(',')).astype('float')
R_f[R_f == -999] = np.nan
n = len(nucleotide)
##### Sliding-window computation #####
nucleotide_ = np.zeros([len(range(0, n - (window), step)), window], dtype=np.string_)
R_z_ = np.zeros([len(range(0, n - (window), step)), window])
R_f_ = np.zeros([len(range(0, n - (window), step)), window])
p = np.zeros([len(range(0, n - (window), step)), ])
j = 0
for i in range(0, n - (window), step):
nucleotide_[j, :] = nucleotide[i:i + window]
R_z_[j, :] = R_z[i:i + window]
R_f_[j, :] = R_f[i:i + window]
p[j] = get_p(R_z[i:i + window], R_f[i:i + window],ratio_nan=ratio_nan,)
j = j + 1
p_=p.copy()
if len(p[~np.isnan(p)])>0:
a,p_bh,b,c =multi.multipletests(p[~np.isnan(p)],method='fdr_bh')
p_[~np.isnan(p_)]=p_bh
return p_,p
@jit(nopython=False)
def calculate_delta_gini(R_1, R_2, gini_1, gini_2, ratio_nan=0.9):
'''Compute the delta Gini index of two reactivity vectors of the same length'''
# Check that the vectors are of equal length.
# Rules: (1) if R_1 and R_2 differ in length, return NaN; (2) if the fraction of non-NaN positions in the paired window is <= ratio_nan, return NaN; (3) otherwise return gini_1 - gini_2.
if len(R_1) != len(R_2):
return np.nan
else:
R = R_1 - R_2
if len(R[np.isnan(R) == False]) / len(R) <= ratio_nan:
return np.nan
else:
delta_gini = gini_1 - gini_2
return delta_gini
@jit(nopython=False)
def get_all(X,cut_AC=False, ratio_nan=0, ratio_nan_AC=0.8):
######### read the data #########
name = X[0]
nucleotide = np.array(list(X[1]))
R = np.array(X[6].split(',')).astype('float')
R[R == -999] = np.nan
n = len(nucleotide)
gini= get_gini(R, nucleotide, cut_AC=cut_AC, ratio_nan=ratio_nan,ratio_nan_AC=ratio_nan_AC)
return gini
@jit(nopython=False)
def get_se(X,location ,start,end,strand,cut_AC=False, ratio_nan=0, ratio_nan_AC=0.8):
######### read the data #########
name = X[0]
nucleotide = np.array(list(X[1]))
R = np.array(X[6].split(',')).astype('float')
R[R == -999] = np.nan
n = len(nucleotide)
if strand=='+':
R_=R[location<start]
nucleotide_=nucleotide[location<start]
gini_5= get_gini(R_, nucleotide_, cut_AC=cut_AC, ratio_nan=ratio_nan,ratio_nan_AC=ratio_nan_AC)
R_ = R[location > end]
nucleotide_ = nucleotide[location > end]
gini_3 = get_gini(R_, nucleotide_, cut_AC=cut_AC, ratio_nan=ratio_nan, ratio_nan_AC=ratio_nan_AC)
else:
R_ = R[location < start]
nucleotide_ = nucleotide[location < start]
gini_3 = get_gini(R_, nucleotide_, cut_AC=cut_AC, ratio_nan=ratio_nan, ratio_nan_AC=ratio_nan_AC)
R_ = R[location > end]
nucleotide_ = nucleotide[location > end]
gini_5 = get_gini(R_, nucleotide_, cut_AC=cut_AC, ratio_nan=ratio_nan, ratio_nan_AC=ratio_nan_AC)
R_ = R[(location<= end)&(location>=start)]
nucleotide_ = nucleotide[(location<= end)&(location>=start)]
gini_cds = get_gini(R_, nucleotide_, cut_AC=cut_AC, ratio_nan=ratio_nan, ratio_nan_AC=ratio_nan_AC)
return gini_3,gini_cds,gini_5
@jit(nopython=False)
def get_statistics(data_z, data_f,data_gff_,location_list,strand):
data_z = np.array(data_z).reshape([7, ])
data_f = np.array(data_f).reshape([7, ])
nucleotide_z, R_z, gini_z = get_window(data_z)
nucleotide_f, R_f, gini_f = get_window(data_f)
### UV+ ###
gini_z_=gini_z[pd.notnull(gini_z)]
if len(gini_z_)==0:
gini_max_z=np.nan
else:
gini_max_z=gini_z[pd.notnull(gini_z)].max()
gini_all_z=get_all(data_z)
if len(data_gff_.loc[data_gff_['location']=='CDS','strat'])==0:
gini_3_z=np.nan
gini_5_z=np.nan
gini_cds_z=np.nan
else:
CDS_strat=list(data_gff_.loc[data_gff_['location']=='CDS','strat'])[0]
CDS_end = list(data_gff_.loc[data_gff_['location']=='CDS','end'])[0]
gini_3_z, gini_cds_z, gini_5_z = get_se(data_z,location_list,CDS_strat,CDS_end,strand)
### UV- ###
gini_f_ = gini_f[pd.notnull(gini_f)]
if len(gini_f_) == 0:
gini_max_f = np.nan
else:
gini_max_f = gini_f[pd.notnull(gini_f)].max()
gini_all_f = get_all(data_f)
if len(data_gff_.loc[data_gff_['location'] == 'CDS', 'strat']) == 0:
gini_3_f = np.nan
gini_5_f = np.nan
gini_cds_f = np.nan
else:
CDS_strat = list(data_gff_.loc[data_gff_['location'] == 'CDS', 'strat'])[0]
CDS_end = list(data_gff_.loc[data_gff_['location'] == 'CDS', 'end'])[0]
gini_3_f, gini_cds_f, gini_5_f = get_se(data_f, location_list, CDS_strat, CDS_end, strand)
if len(R_z) != len(R_f):
print("error")
else:
delta_gini = np.zeros([len(R_z), ])
for i in range(len(R_z)):
delta_gini[i] = calculate_delta_gini(R_z[i, :], R_f[i, :], gini_z[i], gini_f[i])
return delta_gini,gini_max_z,gini_all_z,gini_3_z, gini_cds_z, gini_5_z,gini_max_f,gini_all_f,gini_3_f, gini_cds_f, gini_5_f
def main_sum_gini(data_uv_z, data_uv_f, data_location, data_gff, transcript_id_list):
statistics_sum = pd.DataFrame(columns=['transcript_id', 'num', 'num_0.1', 'delta_max', 'delta_min'])
statistics_z = pd.DataFrame(columns=['transcript_id', 'gini_all', 'gini_max', 'gini_3UTR', 'gini_5UTR', 'gini_CDS'])
statistics_f = pd.DataFrame(columns=['transcript_id', 'gini_all', 'gini_max', 'gini_3UTR', 'gini_5UTR', 'gini_CDS'])
i = 0
for transcript in transcript_id_list:
data_z = data_uv_z.loc[data_uv_z['transcript_id'] == transcript, :]
data_f = data_uv_f.loc[data_uv_f['transcript_id'] == transcript, :]
data_location_ = data_location.loc[data_location['transcript_id'] == transcript,:]
data_gff_ = data_gff.loc[data_gff['transcript_id'] == transcript, :]
location_list=np.array(list(data_location_['location'])[0].split(',')).astype('int')
chr = list(data_location_['chr'])[0]
strand=list(data_location_['strand'])[0]
delta_gini, gini_max_z, gini_all_z, gini_3_z, gini_cds_z, gini_5_z, gini_max_f, gini_all_f, gini_3_f, gini_cds_f, gini_5_f= get_statistics(data_z,data_f, data_gff_, location_list, strand)
statistics_sum.loc[i, 'transcript_id'] = transcript
num = sum(np.abs(delta_gini) >=0.2)
statistics_sum.loc[i, 'num'] = num
num_1 = sum(np.abs(delta_gini) >=0.1)
statistics_sum.loc[i, 'num_0.1'] = num_1
statistics_sum.loc[i, 'delta_gini_list'] = ','.join(list(delta_gini.astype('str')))
if len(delta_gini[np.isnan(delta_gini) == False]) > 0:
statistics_sum.loc[i, 'delta_max'] = np.max(delta_gini[np.isnan(delta_gini) == False])
statistics_sum.loc[i, 'delta_min'] = np.min(delta_gini[np.isnan(delta_gini) == False])
else:
statistics_sum.loc[i, 'delta_max'] = np.nan
statistics_sum.loc[i, 'delta_min'] = np.nan
statistics_z.loc[i,'transcript_id']=transcript
statistics_z.loc[i,'gini_all']=gini_all_z
statistics_z.loc[i, 'gini_max'] = gini_max_z
statistics_z.loc[i, 'gini_3UTR'] = gini_3_z
statistics_z.loc[i, 'gini_5UTR'] = gini_5_z
statistics_z.loc[i, 'gini_CDS'] = gini_cds_z
statistics_f.loc[i,'transcript_id']=transcript
statistics_f.loc[i,'gini_all']=gini_all_f
statistics_f.loc[i, 'gini_max'] = gini_max_f
statistics_f.loc[i, 'gini_3UTR'] = gini_3_f
statistics_f.loc[i, 'gini_5UTR'] = gini_5_f
statistics_f.loc[i, 'gini_CDS'] = gini_cds_f
i = i + 1
if i % 100 == 0:
print(i)
return statistics_sum,statistics_z,statistics_f
if __name__ == '__main__':
col_uv_f = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/'
col_uv_z = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_uv/'
###col###
result = '/data/TA_QUIZ_RNA_regulation/result/PartIII.SHAPE-seq_analysis/col'
uv_z = col_uv_z
uv_f = col_uv_f
data_uv_z = pd.read_csv(uv_z + '/final.modified_unmodified_new', sep='\t')
data_uv_z = data_uv_z[['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
'Untreated_mutations', 'Untreated_effective_depth', 'reactivity']]
data_uv_z = data_uv_z.drop_duplicates()
data_uv_f = pd.read_csv(uv_f + '/final.modified_unmodified_new', sep='\t')
data_uv_f = data_uv_f[['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
'Untreated_mutations', 'Untreated_effective_depth', 'reactivity']]
data_uv_f = data_uv_f.drop_duplicates()
transcript_uv_z = pd.read_csv(uv_z + '/cutoff.hit.group', sep='\t')
transcript_uv_z.columns = ['cut_off', 'transcript_id', 'modified_depth_median',
'unmodified_depth_median', 'modified_depth_sum', 'unmodified_depth_sum', 'hit_level']
transcript_uv_f = pd.read_csv(uv_f + '/cutoff.hit.group', sep='\t')
transcript_uv_f.columns = ['cut_off', 'transcript_id', 'modified_depth_median',
'unmodified_depth_median', 'modified_depth_sum', 'unmodified_depth_sum', 'hit_level']
uv_z_transcript = list(transcript_uv_z.loc[(transcript_uv_z['modified_depth_median'] > 100) & (
transcript_uv_z['hit_level'] > 0), 'transcript_id'])
uv_f_transcript = list(transcript_uv_f.loc[(transcript_uv_f['modified_depth_median'] > 100) & (
transcript_uv_f['hit_level'] > 0), 'transcript_id'])
transcript_all = list(set(uv_z_transcript) & set(uv_f_transcript))
print(len(transcript_all))
gff_path = '/data/TA_QUIZ_RNA_regulation/data/ATH/GFF/Ath_genes.gff'
data_gff = pd.read_csv(gff_path, sep='\t')
data_location = pd.read_csv(
'/data/TA_QUIZ_RNA_regulation/data/ATH/GTF/shape_map/result/transcript_exon_location.csv',
sep='\t')
statistics_sum, statistics_z, statistics_f = main_sum_gini(data_uv_z, data_uv_f, data_location, data_gff, transcript_all)
pd.DataFrame(statistics_sum).to_csv(result + '/gini_summary_50_1.csv', sep='\t', index=False, header=True)
pd.DataFrame(statistics_z).to_csv(result + '/gini_summary_UV+_50_1.csv', sep='\t', index=False, header=True)
pd.DataFrame(statistics_f).to_csv(result