2.3 SHAPE Data Analysis

This chapter is a tutorial on processing SHAPE-MaP sequencing data, divided into two parts: Prepare Data Matrix and Data analysis.

  • Prepare Data Matrix

    • Preprocess the data with Shapemapper and compute the reactivity of each transcript.

  • Data analysis

    • Compute the hit level and choose a threshold from the hit-level distribution.

    • Compute the normalization factor; preprocess and normalize the reactivities.

    • Compute sliding-window Gini indices, use the sliding windows to identify the regions of each transcript whose structure changes upon external stimulation, merge consecutive windows, and summarize the structures.

Part I. Prepare Data Matrix

0) Data description

The SHAPE-MaP sequencing data comprise 12 samples covering two conditions, treated and untreated. Because Shapemapper runs very slowly, we provide only a few example files here for practicing Shapemapper; at the end of this section we provide the per-transcript reactivity files for all of the data, already processed with Shapemapper.

Here we take a 1/20 subsample of sample C1_1 from the SHAPE-MaP data for the computations below. The untreated FASTQ files are CD1_1.clean.part1.R1.fastq.gz and CD1_1.clean.part1.R2.fastq.gz; the treated FASTQ files are CN1_1.clean.part1.R1.fastq.gz and CN1_1.clean.part1.R2.fastq.gz. To further reduce the running time, the reference transcript file has also been simplified: Arabidopsis_thaliana.TAIR10.34.transcripts_new_2.fa, containing only three transcripts, is used for practice.
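The 1/20 subsample used here was prepared in advance. As an illustration only (file names hypothetical; this is not the script actually used), such a subsample can be produced with the Python standard library by keeping one of every 20 reads:

```python
import gzip
from itertools import islice

def subsample_fastq(in_path, out_path, keep_every=20):
    # Keep one of every `keep_every` reads; a FASTQ record is 4 lines.
    # For paired-end data, apply the same scheme to R1 and R2 so that
    # mates stay synchronized.
    seen = 0
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        while True:
            record = list(islice(fin, 4))
            if len(record) < 4:
                break
            if seen % keep_every == 0:
                fout.writelines(record)
            seen += 1
    return seen  # total number of reads seen in the input
```

Random sampling (e.g. with a fixed seed) would also work, as long as R1 and R2 are sampled identically.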

| data | path |
| --- | --- |
| untreated FASTQ data | /data/TA_QUIZ_RNA_regulation/data/shapemapper_test/control/ |
| treated FASTQ data | /data/TA_QUIZ_RNA_regulation/data/shapemapper_test/modified/ |
| simplified reference transcript file | /data/TA_QUIZ_RNA_regulation/data/shapemapper_test/Arabidopsis_thaliana.TAIR10.34.transcripts_new_2.fa |

1) Using Shapemapper

Reference script: /data/TA_QUIZ_RNA_regulation/data/test/test.shapemapper.sh

input:

  • untreated FASTQ data: CD1_1.clean.part1.R1.fastq.gz and CD1_1.clean.part1.R2.fastq.gz

  • treated FASTQ data: CN1_1.clean.part1.R1.fastq.gz and CN1_1.clean.part1.R2.fastq.gz

  • reference transcript sequences: Arabidopsis_thaliana.TAIR10.34.transcripts_new_2.fa

Software/Parameters:

shapemapper

| options | function |
| --- | --- |
| --target /path/to/transcript_index | FASTA file or list of files containing one or more target DNA sequences ('T' not 'U') |
| --name "Ath_C1_1" | Prefix for all output files; strongly recommended when running multiple Shapemapper jobs in the same folder |
| --out | Output folder; defaults to "shapemapper_out" |
| --temp | Temporary folder; defaults to "shapemapper_temp" |
| --min-depth 100 | Minimum effective sequencing depth for including data |
| --min-qual-to-count 20 | Only count mutations with all basecall quality scores meeting this minimum score |
| --overwrite | Overwrite existing files in the output and temporary folders. Default = False |
| --modified --folder /path/to/modified | Path to the folder of modified (treated) samples |
| --untreated --folder /path/to/control | Path to the folder of untreated (control) samples |
| --star-aligner | Use STAR instead of Bowtie2 for alignment |
| --nproc 8 | Number of threads |
| --verbose | Display the full command for each executed process, and more process output when an error occurs. Default = False |

reference:

shapemapper \
--target ~/Arabidopsis_thaliana.TAIR10.34.transcripts_new_1_1.fa \
--name "Ath_test_C1_1" \
--min-depth 100 \
--min-qual-to-count 20 \
--overwrite \
--modified --folder ~/data/modified/  \
--untreated --folder ~/data/control/ \
--star-aligner \
--nproc 8 \
--verbose

result:

The reference output is in /data/TA_QUIZ_RNA_regulation/data/test/C1_1_result/shapemapper_out. Taking the transcript AT1G20620.4 as an example, C1_1_AT1G20620.4_profile.txt is the output file; its Sequence, Modified_mutations, Modified_effective_depth, Untreated_mutations, and Untreated_effective_depth columns are the main information we need, from which the reactivity of each position can be computed.
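From those columns, the raw reactivity of each position is the modified mutation rate minus the untreated mutation rate. A minimal pandas sketch (the profile path is illustrative, and this ignores the further filtering applied in Part II):

```python
import numpy as np
import pandas as pd

def profile_reactivity(df):
    # Raw reactivity per nucleotide: modified mutation rate minus
    # untreated mutation rate. Zero-depth positions become NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        mutr_s = df["Modified_mutations"] / df["Modified_effective_depth"]
        mutr_b = df["Untreated_mutations"] / df["Untreated_effective_depth"]
    return (mutr_s - mutr_b).replace([np.inf, -np.inf], np.nan)

# usage (path illustrative):
# prof = pd.read_csv("C1_1_AT1G20620.4_profile.txt", sep="\t")
# prof["reactivity"] = profile_reactivity(prof)
```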

Part II. Data analysis

0) Data description

For the subsequent computations we use files that we have computed in advance. For each experimental condition, the three replicates were merged before computing reactivity, yielding a final.modified_unmodified file. Each row of the file is one transcript; the columns are transcript_id, Nucleotide, Modified_mutations, Modified_effective_depth, Untreated_mutations, and Untreated_effective_depth.

| data | path |
| --- | --- |
| WT without UV | /data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/final.modified_unmodified |
| WT with UV | /data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_uv/final.modified_unmodified |

1) Structure change analysis

We identify regions of structural change from the merged Shapemapper results for all transcripts in col_nouv and col_uv.

The computation consists of the following steps:

  • Compute the hit level

  • Inspect the hit-level distribution and choose a threshold

  • Compute the normalization factor

  • Preprocess and normalize the reactivities

  • Merge regions of structural change

1.1) Principles of the computation

To aid understanding, we first introduce a few basic concepts: reactivity, hit level, and Gini index.

  • Reactivity: R = mutation rate(S) - mutation rate(B), where S corresponds to a SHAPE-modified sample and B to an untreated sample.

  • Hit level

The hit level metric quantifies the total background-subtracted signal per nucleotide of transcript.

  • Gini index

We then apply these metrics to sliding windows of 50 nucleotides, requiring a minimum of 100 total reads and fewer than 10% NaN reactivity values per window.

Reactivity and hit level follow "RNA motif discovery by SHAPE and mutational profiling (SHAPE-MaP)"; the Gini index follows "A Stress Response that Monitors and Regulates mRNA Structure Is Central to Cold Shock Adaptation".
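As a concrete reading of the hit-level definition above, here is a hedged sketch (the exact formula in hit_level.py may differ in detail): untreated mutation counts are scaled to the modified sample's depth, subtracted, and the resulting signal is averaged over covered positions.

```python
import numpy as np

def hit_level(mod_mut, mod_depth, unt_mut, unt_depth):
    # Total background-subtracted mutation signal per nucleotide:
    # scale untreated counts to the modified sample's depth before
    # subtracting, then average over positions covered in both samples.
    mod_mut, mod_depth = np.asarray(mod_mut, float), np.asarray(mod_depth, float)
    unt_mut, unt_depth = np.asarray(unt_mut, float), np.asarray(unt_depth, float)
    ok = (mod_depth > 0) & (unt_depth > 0)
    if not ok.any():
        return np.nan
    signal = mod_mut[ok] - unt_mut[ok] * (mod_depth[ok] / unt_depth[ok])
    return signal.sum() / ok.sum()
```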

1.2) Computing the hit level

Compute the hit level from the merged results: one hit level per transcript, plus the mean hit level within each depth bin. hit_level.py: /data/TA_QUIZ_RNA_regulation/data/script/PartIII.SHAPE-seq_analysis/hit_level.py

Reference script: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/1.calculate_hit_level.sh. The example below uses col_nouv.

### hit level
### 1)calculate the average mutation events above background for each transcript

path0=/data/TA_QUIZ_RNA_regulation/data/script/PartIII.SHAPE-seq_analysis
path1=/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv

## Note: the input file here is final.modified_unmodified, not final.modified_unmodified.hit
python $path0/hit_level.py \
--data_path $path1/final.modified_unmodified \
--savepath_hit $path1/final.modified_unmodified.hit

echo -e "cutoff\ttranscript_id\tmodified_depth_median\tunmodified_depth_median\tmodified_depth_sum\tunmodified_depth_sum\thit"       >       $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=25 && $2>0{print "0\t"$0}'     $path1/final.modified_unmodified.hit  >>        $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=50 && $2 > 25{print "25\t"$0}' $path1/final.modified_unmodified.hit  >>        $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=100 && $2 > 50{print "50\t"$0}' $path1/final.modified_unmodified.hit >>      $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=200 && $2 > 100{print "100\t"$0}' $path1/final.modified_unmodified.hit >>      $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=300 && $2 > 200{print "200\t"$0}' $path1/final.modified_unmodified.hit >>      $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=500 && $2 >300{print "300\t"$0}' $path1/final.modified_unmodified.hit >>      $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=750 && $2 >500{print "500\t"$0}' $path1/final.modified_unmodified.hit >>      $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=1000 && $2 >750{print "750\t"$0}' $path1/final.modified_unmodified.hit >>      $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=2000 && $2 >1000{print "1000\t"$0}' $path1/final.modified_unmodified.hit >>      $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2<=5000 && $2 >2000{print "2000\t"$0}' $path1/final.modified_unmodified.hit >>      $path1/cutoff.hit.group;
awk -F '\t' '$3>0 && $2 > 5000{print "5000\t"$0}' $path1/final.modified_unmodified.hit >>      $path1/cutoff.hit.group;

result:

  • final.modified_unmodified.hit: column 6 is the hit level of each transcript.

  • cutoff.hit.group: the quality-filtered version of final.modified_unmodified.hit, keeping rows with unmodified.median > 0 and labeling each row with the depth bin its modified.median falls into.
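The awk commands above bin transcripts by modified.median (column 2 of the .hit file) after requiring unmodified.median > 0. The same grouping can be sketched with pandas pd.cut (column names as used when the .hit file is read in the later scripts; the function name is ours):

```python
import pandas as pd

def depth_bins(hit_df):
    # Reproduce the awk bins: keep rows with unmodified.median > 0 and
    # modified.median > 0, then label each row with its depth-bin lower bound.
    edges = [0, 25, 50, 100, 200, 300, 500, 750, 1000, 2000, 5000, float("inf")]
    labels = [0, 25, 50, 100, 200, 300, 500, 750, 1000, 2000, 5000]
    kept = hit_df[(hit_df["unmodified.median"] > 0) & (hit_df["modified.median"] > 0)].copy()
    # right=True gives intervals like (0, 25], matching `$2<=25 && $2>0`
    kept["cutoff"] = pd.cut(kept["modified.median"], bins=edges, labels=labels, right=True)
    return kept
```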

1.3) Inspect the hit-level distribution and choose a threshold

Reference script: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/2.hit_level.py

#####1. Count the number of transcripts at each hit-level cutoff#######
import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from numpy import median
from numba import jit

#########Transcript annotation and data loading#################
#### path #####
exon_gtf_path='/data/TA_QUIZ_RNA_regulation/data/ATH/GTF/shape_map/ath_exons.gtf'
col_uv_f_path='/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/'
col_uv_z_path='/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_uv/'


### Read the annotation file; extract 'transcript_id','gene_type','gene_name','chr' etc. from column 9 of the exon GTF
gtf_data=pd.read_csv(exon_gtf_path,sep='\t',header=None)
gtf_data_new=pd.DataFrame(columns=['transcript_id','gene_type','gene_name','chr'])  # a list (not a set) keeps the column order deterministic
gtf_data_new['transcript_id']=gtf_data.iloc[:,8].apply(lambda x:x.split(';')[1].split('"')[1])
gtf_data_new['gene_type'] = gtf_data.iloc[:,8].apply(lambda x:re.findall('gene_biotype ".*?"',x)[0].split('"')[1])
gtf_data_new['gene_name'] = gtf_data.iloc[:,8].apply(lambda x:re.findall('gene_name ".*?"',x)[0].split('"')[1] if 'gene_name' in x else np.nan)
gtf_data_new['chr'] = gtf_data.iloc[:,0].apply(lambda x: 6 if x=='Mt' else 7 if x=='Pt' else x ).astype('int')
gtf_data_new = gtf_data_new.drop_duplicates()
gtf_data_new.index = range(len(gtf_data_new))

#  Extract the 'transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit' columns for col_nouv and col_uv, and merge in 'gene_type','gene_name','chr' from the ath_exons annotation

hit_level_col_uv_f = pd.read_csv(col_uv_f_path+'/final.modified_unmodified.hit',sep='\t',header=None)
hit_level_col_uv_f.columns =['transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit']
hit_level_col_uv_f = pd.merge(hit_level_col_uv_f,gtf_data_new,on='transcript_id',how='left')
hit_level_col_uv_f['Type'] = 'WT_UV-'

hit_level_col_uv_z = pd.read_csv(col_uv_z_path+'/final.modified_unmodified.hit',sep='\t',header=None)
hit_level_col_uv_z.columns =['transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit']
hit_level_col_uv_z = pd.merge(hit_level_col_uv_z,gtf_data_new,on='transcript_id',how='left')
hit_level_col_uv_z['Type'] = 'WT_UV+'


## Count, for col_nouv and col_uv, the number of transcripts with hit_level > -1000, > 0, > 1, > 2, > 5, > 10, and > 15, and collect the counts into a data frame.
data1=pd.DataFrame(columns=['Type','Number of transcripts','hit'])
for num in [-1000,0,1,2,5,10,15]:
    WT_UV_f = len(hit_level_col_uv_f.loc[(hit_level_col_uv_f['hit'] > num) & (hit_level_col_uv_f['modified.median'] > 100), :])
    WT_UV_z = len(hit_level_col_uv_z.loc[(hit_level_col_uv_z['hit'] > num) & (hit_level_col_uv_z['modified.median'] > 100)])

    data2 = pd.DataFrame(columns=['Type', 'Number of transcripts', 'hit'])
    data2['Type'] = ['WT_UV-', 'WT_UV+']
    data2['Number of transcripts'] = [WT_UV_f, WT_UV_z]

    if num==-1000:
        data2['hit'] ='ALL transcripts'
    else:
        data2['hit']='Hit level>'+str(num)
    data1 = pd.concat([data1,data2])

# Plot the hit-level distribution
# Note: matplotlib's default interactive backend cannot be used on a headless Linux server; call plt.switch_backend('agg') to select a non-interactive backend.
plt.switch_backend('agg')
plt.figure(figsize=(8, 6))
sns.set(style="ticks", context="talk")
sns.axes_style({'font.family': ['sans-serif'],'font.sans-serif': ['Arial']})
g = sns.barplot(y='Number of transcripts',x='hit',hue='Type',data=data1,palette=["#3498DB", "#1F618D"])

sns.despine()
font1 = {'family' : 'Arial','weight' : 'roman','size': 22}
plt.xticks(rotation=60)

plt.legend(fontsize='small')
plt.tight_layout()
plt.savefig('/data/TA_QUIZ_RNA_regulation/result/PartIII.SHAPE-seq_analysis/plot/2a.transcript_num_hit.png')
plt.close()

Note: matplotlib's default interactive backend cannot be used on a headless Linux server; use plt.switch_backend('agg') to select a non-interactive backend.

result:

The output is 2a.transcript_num_hit.png: the number of transcripts as a function of the hit-level cutoff.

Based on this trend, we keep transcripts with hit level > 2 as credible transcripts for the subsequent analysis.

1.4) Computing the normalization factor

We use rRNAs, lncRNAs, and tRNAs, which were sequenced to high depths and showed large hit levels (hit level > 10) across experimental probing conditions. Reference script for finding the rRNAs/lncRNAs/tRNAs used for normalization: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/3.normalization_factor.py

import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from numpy import median
from numba import jit

#########Transcript annotation and data loading#################
#### path #####
exon_gtf_path='/data/TA_QUIZ_RNA_regulation/data/ATH/GTF/shape_map/ath_exons.gtf'
col_uv_f_path='/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/'
col_uv_z_path='/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_uv/'

### Read the annotation file; extract 'transcript_id','gene_type','gene_name','chr' etc. from column 9 of the exon GTF
gtf_data=pd.read_csv(exon_gtf_path,sep='\t',header=None)
gtf_data_new=pd.DataFrame(columns=['transcript_id','gene_type','gene_name','chr'])  # a list (not a set) keeps the column order deterministic
gtf_data_new['transcript_id']=gtf_data.iloc[:,8].apply(lambda x:x.split(';')[1].split('"')[1])
gtf_data_new['gene_type'] = gtf_data.iloc[:,8].apply(lambda x:re.findall('gene_biotype ".*?"',x)[0].split('"')[1])
gtf_data_new['gene_name'] = gtf_data.iloc[:,8].apply(lambda x:re.findall('gene_name ".*?"',x)[0].split('"')[1] if 'gene_name' in x else np.nan)
gtf_data_new['chr'] = gtf_data.iloc[:,0].apply(lambda x: 6 if x=='Mt' else 7 if x=='Pt' else x ).astype('int')
gtf_data_new = gtf_data_new.drop_duplicates()
gtf_data_new.index = range(len(gtf_data_new))

#  Extract the 'group','transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit' columns from cutoff.hit.group for col_nouv and col_uv, and merge in 'gene_type','gene_name','chr' from the ath_exons annotation

hit_level_col_uv_f = pd.read_csv(col_uv_f_path+'/cutoff.hit.group',sep='\t')
hit_level_col_uv_f.columns =['group','transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit']
hit_level_col_uv_f = pd.merge(hit_level_col_uv_f,gtf_data_new,on='transcript_id',how='left')
hit_level_col_uv_f['spe'] = 'WT_UV-'

hit_level_col_uv_z = pd.read_csv(col_uv_z_path+'/cutoff.hit.group',sep='\t')
hit_level_col_uv_z.columns =['group','transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit']
hit_level_col_uv_z = pd.merge(hit_level_col_uv_z,gtf_data_new,on='transcript_id',how='left')
hit_level_col_uv_z['spe'] = 'WT_UV+'


###From col_uv_f and col_uv_z, select transcripts with gene_type in ['lncRNA','rRNA','tRNA'], hit level > 10, and modified.median > 5000, and print their transcript IDs###
col_uv_f_id = hit_level_col_uv_f.loc[(hit_level_col_uv_f.gene_type.isin(['lncRNA','rRNA','tRNA']))&(hit_level_col_uv_f['modified.median']>5000)&(hit_level_col_uv_f['hit']>10),'transcript_id']
col_uv_z_id = hit_level_col_uv_z.loc[(hit_level_col_uv_z.gene_type.isin(['lncRNA','rRNA','tRNA']))&(hit_level_col_uv_z['modified.median']>5000)&(hit_level_col_uv_z['hit']>10),'transcript_id']

print(col_uv_f_id)
print(col_uv_z_id)
print(set(col_uv_f_id)&set(col_uv_z_id))

The transcripts used for normalization are {'AT3G41768.1', 'AT3G06355.1', 'ATMG01390.1'}.

1.5) Reactivity preprocessing and normalization

First we set the reactivity of positions with insufficient depth or unreliable signal to NaN. We then normalize the reactivities using the normalization transcripts (here AT3G41768.1, AT3G06355.1, and ATMG01390.1).

  • (1) Setting reactivity to NaN: nucleotides with apparent mutation rates above 0.02 in any untreated sample were excluded from analysis. Artifact regions of at least 10 nucleotides, in which three or more of the 10 nucleotides showed mutation rates above 0.03 without NAI treatment or modified mutation rates above 0.1 in any condition, were also excluded. Nucleotides with read depth below 100 in any condition were excluded as well.

  • (2) Reactivity normalization: reactivities were normalized within each probing condition to the mean of the 92nd-98th percentile reactivities of nucleotides from the rRNAs, which were sequenced to high depths and showed large hit levels (hit level > 0) across experimental probing conditions.
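A minimal numpy sketch of this 92nd-98th percentile normalization (the full script below implements it as find_boxplot_factor_new, with additional NaN bookkeeping):

```python
import numpy as np

def percentile_norm_factor(reactivity):
    # Normalization factor = mean of the reactivities lying between the
    # 92nd and 98th percentiles, computed over finite, non-zero values.
    x = np.asarray(reactivity, float)
    x = x[np.isfinite(x)]
    x = x[x != 0]
    lo, hi = np.quantile(x, 0.92), np.quantile(x, 0.98)
    return np.mean(x[(x >= lo) & (x <= hi)])

# usage: normalized = raw_reactivity / percentile_norm_factor(raw_reactivity)
```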

Reference script: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/4.data_normalization.py

import numpy as np
import pandas as pd
import re
from numba import jit
import math

@jit(nopython=False)
def calc_quartile(x, q, qtype=7):
    # Compute the given quartile. x is the input array, q the quantile (as a decimal, e.g. 0.25 for the 25th percentile, 0.75 for the 75th), qtype the method used.
    # source: http://adorio-research.org/wordpress/?p=125
    # x = array, q = quartile (in % as a decimal)
    y = np.copy(x)
    n = len(y)
    abcd = [(0, 0, 1, 0),  # inverse empirical distrib.function., R type 1
            (0.5, 0, 1, 0),  # similar to type 1, averaged, R type 2
            (0.5, 0, 0, 0),  # nearest order statistic,(SAS) R type 3
            (0, 0, 0, 1),  # California linear interpolation, R type 4
            (0.5, 0, 0, 1),  # hydrologists method, R type 5
            (0, 1, 0, 1),  # mean-based estimate(Weibull method), (SPSS,Minitab), type 6
            (1, -1, 0, 1),  # mode-based method,(S, S-Plus), R type 7
            (1.0 / 3, 1.0 / 3, 0, 1),  # median-unbiased ,  R type 8
            (3 / 8.0, 0.25, 0, 1)  # normal-unbiased, R type 9.
            ]
    a, b, c, d = abcd[qtype - 1]
    g, j = math.modf(a + (n + b) * q - 1)
    # The textbook position of the 1st quartile is (n+3)/4 and of the 3rd quartile (3n+1)/4; here they are (n-1)/4 and (3n-3)/4 because array indexing starts at x[0], hence the -1.
    if j < 0:
        return x[0]
    elif j >= n:
        return x[n - 1]
    j = int(math.floor(j))
    if g == 0:
        return x[j]
    else:
        return y[j] + (y[j + 1] - y[j]) * (c + d * g)

# @jit(nopython=False)
def find_boxplot_factor(array):
    x, o, a = [], [], 0
    # Following deprecated line is behavior that normalization and
    # structure modeling were optimized with, but this behavior
    # is probably not ideal. For RNAs with regions of poor sequencing
    # depth, treating those regions as unreactive could bias the
    # normalized reactivities. This is especially important for
    # larger RNAs.
    # x = np.fromiter((n if not isnan(n) else 0 for n in array))
    x = array[np.where(np.isfinite(array))]
    # drop non-finite values
    x = x[np.nonzero(x)]
    # keep only the non-zero entries
    if x.shape[0] < 10:
        norm_factor = np.nan
    else:
        x.sort()
        ten_pct = len(x) // 10
        five_pct = len(x) // 20
        # calculate the interquartile range *1.5
        q_limit = 1.5 * abs(calc_quartile(x, 0.25) - calc_quartile(x, 0.75))
        ten_limit = x[x.shape[0] - 1 - ten_pct]
        five_limit = x[x.shape[0] - 1 - five_pct]
        # choose the cutoff that eliminates the fewest points
        limit = max(q_limit, ten_limit)
        if len(x) < 100:
            limit = max(q_limit, five_limit)
        # make new list without the outliers
        for i in range(len(x)):
            if x[i] < limit:
                o.append(x[i])
        # avg next ten percent
        try:
            for i in range(-ten_pct, 0):
                a = o[i] + a
            norm_factor = a / ten_pct
        except IndexError:
            norm_factor = np.nan
    return norm_factor

@jit(nopython=False)
def find_boxplot_factor_new(array):
    ####the mean of the 92-98th percentile reactivities#####
    # x, o, a = [], [], []
    x = array[np.where(np.isfinite(array))]
    x = x[np.nonzero(x)]
    if x.shape[0] < 10:
        norm_factor = np.nan
    else:

        # calculate the interquartile range *1.5
        o = x[(x>=np.quantile(x,0.92))&(x<=np.quantile(x,0.98))]
        norm_factor = np.mean(o)
    return norm_factor


@jit(nopython=False)
def filter(X,Mutru_cut_off=0.02,read_depth_cutoff=100,R_window=10,window_mutru_cutoff=0.03,window_mutru_num=3,window_mutrs_cutoff=0.1,window_mutrs_num=3):

    name = X[0]
    nucleotide = X[1]
    modified = np.array(X[2].split(',')).astype('float').astype('int')
    modified_depth = np.array(X[3].split(',')).astype('float').astype('int')
    unmodified = np.array(X[4].split(',')).astype('float').astype('int')
    unmodified_depth = np.array(X[5].split(',')).astype('float').astype('int')

    # print(nucleotide)
    Mutrs = modified/modified_depth
    Mutru = unmodified/unmodified_depth
    R = Mutrs - Mutru
    n = len(modified)
    ##### Correct the reactivities ###########
    R[Mutru>Mutru_cut_off] = np.nan
    R[(modified_depth <= read_depth_cutoff)|(unmodified_depth <= read_depth_cutoff)] = np.nan
    R[R<0]= np.nan
    for i in range(len(R)-R_window+1):
        # slice the full R_window nucleotides (i:i+R_window, not i:i+R_window-1)
        data_Mutru = Mutru[i:i+R_window]
        data_Mutrs = Mutrs[i:i+R_window]
        if (len(data_Mutru[data_Mutru>window_mutru_cutoff])>=window_mutru_num)|(len(data_Mutrs[data_Mutrs>window_mutrs_cutoff])>=window_mutrs_num):
            R[i:i + R_window] = np.nan
    R_new = ",".join(list(R.astype('str'))).replace('nan','-999')
    X = X.append(pd.Series(R_new))
    return X
    #
@jit(nopython=False)
def get_fatcor(data):
    R_all = np.array(data.iloc[0,6].split(',')).astype('float')
    for i in range(1,len(data)):
        R_ = np.array(data.iloc[i,6].split(',')).astype('float')
        R_all = np.hstack((R_all,R_))
    R_all[R_all==-999]=np.nan
    factor = find_boxplot_factor_new(R_all)
    return factor

@jit(nopython=False)
def normalization_all(X,factor):
    #########Read the data#############
    R = np.array(X[6].split(',')).astype('float')
    R_nan = len(R[R == -999])/len(R)
    R_0 = len(R[R == 0])/len(R)
    R[R == -999] = np.nan
    R_new = R/factor
    R_new = ",".join(list(R_new.astype('str'))).replace('nan','-999')
    X[6] =R_new
    return X,R_nan,R_0
def main(data,transcript_list=[]):

    data_filter = pd.DataFrame(np.zeros([len(data),7]))
    data_filter.columns = ['transcript_id','Nucleotide' ,'Modified_mutations', 'Modified_effective_depth',
                             'Untreated_mutations', 'Untreated_effective_depth', 'reactivity']
    data_new = pd.DataFrame(np.zeros([len(data), 7]))
    data_new.columns = ['transcript_id','Nucleotide' ,'Modified_mutations', 'Modified_effective_depth',
                           'Untreated_mutations', 'Untreated_effective_depth', 'reactivity']
    print(len(data))
    for i in range(len(data)):
        if i %1000 == 0:
            print(i)
        X=filter(data.iloc[i,:])
        data_filter.iloc[i,:] = list(X)
    if len(transcript_list)> 0:
        factor = get_fatcor(data_filter.loc[data_filter.transcript_id.isin(transcript_list),:])
    else:
        factor = get_fatcor(data_filter)
    for i in range(len(data_filter)):
        if i %1000 == 0:
            print(i)
            # print data_
        data_new.iloc[i,:],R_nan,R_0 = normalization_all(data_filter.iloc[i,:],factor)
        # data_new.loc[i, 'ratio_nan']=R_nan
        # data_new.loc[i, 'ratio_0'] = R_0
    data_new = data_new[['transcript_id','Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
                           'Untreated_mutations', 'Untreated_effective_depth', 'reactivity']]
    return data_new

def loctaion(X):
    X_1 = X.split('(')[1].split(')')[0].split(',')
    loctaion_list = []
    for i in range(len(X_1)):
        loctaion_list_ = [i for i in range(int(X_1[i].split(':')[0]),int(X_1[i].split(':')[1])+1)]
        loctaion_list.extend(loctaion_list_)
    loctaion_list =np.array(loctaion_list)
    loctaion_list = ",".join(list(loctaion_list.astype('str')))

    return loctaion_list



def mapping(exon_data):
    data_location = pd.DataFrame(np.zeros([len(exon_data),4]))
    data_location.columns = ['transcript_id','chr','strand','location']

    for i in range(len(exon_data)):
        if i%1000 ==0:
            print(i)
        data_location.loc[i, 'transcript_id'] = exon_data.loc[i, 'transcript_id']
        data_location.loc[i,'chr'] = exon_data.loc[i,'chr'].split('.')[0]
        data_location.loc[i, 'strand'] = exon_data.loc[i, 'chr'].split('.')[1]
        data_location.loc[i, 'location'] = loctaion(exon_data.loc[i, 'start_end'])
    return data_location

if __name__ == '__main__':
    col_uv_f = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/'
    col_uv_z = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_uv/'
    path=[col_uv_f,col_uv_z]
    for i in range(len(path)):
        transcript_list = [ 'AT3G41768.1', 'AT3G06355.1', 'ATMG01390.1' ]
        data_uv_z = pd.read_csv(path[i] + '/final.modified_unmodified', sep='\t')
        data_uv_z.columns = ['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth','Untreated_mutations', 'Untreated_effective_depth']
        data_uv_z_new=main(data_uv_z,transcript_list)
        print(data_uv_z_new)
        data_uv_z_new.to_csv(path[i]+'/final.modified_unmodified_new',sep='\t',header=True,index=False)

output:

The output file final.modified_unmodified_new is final.modified_unmodified with one extra column, reactivity.

1.6) Sliding-window Gini index computation

We use sliding windows to find the regions of structural change within a transcript. Note the following points:

  • Each sliding window must be more than 90% non-NaN (ratio_nan=0.9).

  • The window size is 50 nt and the step size 1 nt.

  • Only transcripts with hit_level > 0 and modified_depth_median > 100 are considered.

  • If a transcript has sliding windows with abs(delta_gini) > 0.1 between the UV- and UV+ data, it is called a structurally changed transcript and those windows mark its structure-change regions; one transcript may have several such regions.

  • Gini indices are computed for the whole transcript and for its 3'UTR, 5'UTR, and CDS regions.
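The merging of consecutive significant windows mentioned above can be sketched as follows (function name hypothetical; threshold and window size as in the bullets):

```python
import numpy as np

def merge_changed_windows(delta_gini, threshold=0.1, window=50):
    # Each entry of delta_gini is a sliding window starting at that index
    # (step 1 nt). Mark windows with |delta Gini| > threshold, then merge
    # overlapping windows into (start, end) nucleotide regions. NaN entries
    # compare False and are skipped automatically.
    hits = np.where(np.abs(np.asarray(delta_gini, float)) > threshold)[0]
    regions = []
    for idx in hits:
        if regions and idx <= regions[-1][1]:
            # window overlaps the previous region: extend it
            regions[-1] = (regions[-1][0], idx + window - 1)
        else:
            regions.append((idx, idx + window - 1))
    return regions
```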

Reference script: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/5.calculate_gini_index.py

########################Compute the gini_index#############################
import numpy as np
import pandas as pd
import re
from numba import jit
from scipy.stats import ks_2samp
import statsmodels.stats.multitest as multi

@jit(nopython=False)
def get_gini(R_, nucleotide_, cut_AC=False, ratio_nan=0.9, ratio_nan_AC=0.8):
    ########
    if len(R_)==0:
        gini=np.nan
    else:
        R_nonull = R_[np.isnan(R_) == False]
        # R_nonull = R_[R_>0]
        ratio = len(R_nonull) / len(R_)
        if ratio <= ratio_nan:
            gini = np.nan
        else:
            if cut_AC:
                R_AC = R_[(nucleotide_ == b'A') | (nucleotide_ == b'C')]
                R_AC_nonull = R_AC[np.isnan(R_AC) == False]
                ratio_AC = len(R_AC_nonull) / len(R_AC)
                if (ratio_AC <= ratio_nan_AC) | (len(R_AC) <= 1) | (len(R_AC_nonull) <= 1):
                    gini = np.nan
                else:
                    sorted = np.sort(R_AC_nonull)
                    height, area = 0, 0
                    for i in range(0, len(sorted)):
                        height += sorted[i]
                        area += height - sorted[i] / 2.
                    fair_area = height * len(sorted) / 2.
                    if fair_area == 0:
                        gini = np.nan
                    else:
                        gini = (fair_area - area) / fair_area
            else:
                sorted = np.sort(R_nonull)
                height, area = 0, 0
                for i in range(0, len(sorted)):
                    height += sorted[i]
                    area += height - sorted[i] / 2.
                fair_area = height * len(sorted) / 2.
                if fair_area == 0:
                    gini = np.nan
                else:
                    gini = (fair_area - area) / fair_area
    return gini

@jit(nopython=False)
def get_p(R_1,R_2,ratio_nan=0.9):
    R_1_nonull = R_1[np.isnan(R_1) == False]
    ratio_1 = len(R_1_nonull) / len(R_1)
    R_2_nonull = R_2[np.isnan(R_2) == False]
    ratio_2 = len(R_2_nonull) / len(R_2)
    if (ratio_1 <= ratio_nan)|(ratio_2 <= ratio_nan):
        p = np.nan
    else:
        s,p=ks_2samp(R_1, R_2)
    return p

@jit(nopython=False)
def get_window(X, window=50, step=1, cut_AC=False, ratio_nan=0.9, ratio_nan_AC=0.8):
    #########Read the data#############
    name = X[0]

    nucleotide = np.array(list(X[1]))

    modified = np.array(X[2].split(',')).astype('float').astype('int')


    modified_depth = np.array(X[3].split(',')).astype('float').astype('int')

    #
    unmodified = np.array(X[4].split(',')).astype('float').astype('int')


    unmodified_depth = np.array(X[5].split(',')).astype('float').astype('int')
    Mutrs = modified / modified_depth
    Mutru = unmodified / unmodified_depth
    # R = Mutrs - Mutru
    R = np.array(X[6].split(',')).astype('float')
    R[R == -999] = np.nan
    n = len(nucleotide)

    ##### Sliding-window computation ##################
    nucleotide_ = np.zeros([len(range(0, n - (window), step)), window], dtype=np.string_)
    R_ = np.zeros([len(range(0, n - (window), step)), window])
    gini = np.zeros([len(range(0, n - (window), step)), ])
    j = 0
    for i in range(0, n - (window), step):
        nucleotide_[j, :] = nucleotide[i:i + window]
        R_[j, :] = R[i:i + window]
        gini[j] = get_gini(R[i:i + window], nucleotide[i:i + window], cut_AC=cut_AC, ratio_nan=ratio_nan,
                           ratio_nan_AC=ratio_nan_AC)
        j = j + 1

    return nucleotide_, R_, gini

@jit(nopython=False)
def get_window_p(X1,X2, window=50, step=1,ratio_nan=0.9):
    #########Read the data#############
    name = X1[0]
    nucleotide = np.array(list(X1[1]))
    R_z = np.array(X1[6].split(',')).astype('float')
    R_z[R_z == -999] = np.nan
    R_f = np.array(X2[6].split(',')).astype('float')
    R_f[R_f == -999] = np.nan
    n = len(nucleotide)

    ##### Sliding-window computation ##################
    nucleotide_ = np.zeros([len(range(0, n - (window), step)), window], dtype=np.string_)
    R_z_ = np.zeros([len(range(0, n - (window), step)), window])
    R_f_ = np.zeros([len(range(0, n - (window), step)), window])
    p = np.zeros([len(range(0, n - (window), step)), ])
    j = 0
    for i in range(0, n - (window), step):
        nucleotide_[j, :] = nucleotide[i:i + window]
        R_z_[j, :] = R_z[i:i + window]
        R_f_[j, :] = R_f[i:i + window]
        p[j] = get_p(R_z[i:i + window], R_f[i:i + window],ratio_nan=ratio_nan,)
        j = j + 1
    p_=p.copy()
    if len(p[~np.isnan(p)])>0:
        a,p_bh,b,c =multi.multipletests(p[~np.isnan(p)],method='fdr_bh')
        p_[~np.isnan(p_)]=p_bh
    return p_,p


@jit(nopython=False)
def calculate_delta_gini(R_1, R_2, gini_1, gini_2, ratio_nan=0.9):
    '''Compute the delta Gini index between two reactivity windows of equal length'''
    # (1) If R_1 and R_2 have different lengths, return NaN; (2) if the fraction
    # of positions that are non-NaN in both windows is <= ratio_nan, return NaN;
    # (3) otherwise return gini_1 - gini_2.
    if len(R_1) != len(R_2):
        return np.nan
    else:
        R = R_1 - R_2
        if len(R[np.isnan(R) == False]) / len(R) <= ratio_nan:
            return np.nan
        else:
            delta_gini = gini_1 - gini_2
    return delta_gini

@jit(nopython=False)
def get_all(X,cut_AC=False, ratio_nan=0, ratio_nan_AC=0.8):
    #########Read the data#############
    name = X[0]
    nucleotide = np.array(list(X[1]))
    R = np.array(X[6].split(',')).astype('float')
    R[R == -999] = np.nan
    n = len(nucleotide)
    gini= get_gini(R, nucleotide, cut_AC=cut_AC, ratio_nan=ratio_nan,ratio_nan_AC=ratio_nan_AC)
    return gini

@jit(nopython=False)
def get_se(X,location ,start,end,strand,cut_AC=False, ratio_nan=0, ratio_nan_AC=0.8):
    #########Read the data#############
    name = X[0]
    nucleotide = np.array(list(X[1]))
    R = np.array(X[6].split(',')).astype('float')
    R[R == -999] = np.nan
    n = len(nucleotide)
    if strand=='+':
        R_=R[location<start]
        nucleotide_=nucleotide[location<start]
        gini_5= get_gini(R_, nucleotide_, cut_AC=cut_AC, ratio_nan=ratio_nan,ratio_nan_AC=ratio_nan_AC)
        R_ = R[location > end]
        nucleotide_ = nucleotide[location > end]
        gini_3 = get_gini(R_, nucleotide_, cut_AC=cut_AC, ratio_nan=ratio_nan, ratio_nan_AC=ratio_nan_AC)
    else:
        R_ = R[location < start]
        nucleotide_ = nucleotide[location < start]
        gini_3 = get_gini(R_, nucleotide_, cut_AC=cut_AC, ratio_nan=ratio_nan, ratio_nan_AC=ratio_nan_AC)
        R_ = R[location > end]
        nucleotide_ = nucleotide[location > end]
        gini_5 = get_gini(R_, nucleotide_, cut_AC=cut_AC, ratio_nan=ratio_nan, ratio_nan_AC=ratio_nan_AC)
    R_ = R[(location<= end)&(location>=start)]
    nucleotide_ = nucleotide[(location<= end)&(location>=start)]
    gini_cds = get_gini(R_, nucleotide_, cut_AC=cut_AC, ratio_nan=ratio_nan, ratio_nan_AC=ratio_nan_AC)
    return gini_3,gini_cds,gini_5


@jit(nopython=False)
def get_statistics(data_z, data_f,data_gff_,location_list,strand):
    data_z = np.array(data_z).reshape([7, ])
    data_f = np.array(data_f).reshape([7, ])
    nucleotide_z, R_z, gini_z = get_window(data_z)
    nucleotide_f, R_f, gini_f = get_window(data_f)
    ###uv+###
    gini_z_=gini_z[pd.notnull(gini_z)]
    if len(gini_z_)==0:
        gini_max_z=np.nan
    else:
        gini_max_z=gini_z[pd.notnull(gini_z)].max()
    gini_all_z=get_all(data_z)
    if len(data_gff_.loc[data_gff_['location']=='CDS','strat'])==0:
        gini_3_z=np.nan
        gini_5_z=np.nan
        gini_cds_z=np.nan
    else:
        CDS_strat=list(data_gff_.loc[data_gff_['location']=='CDS','strat'])[0]
        CDS_end = list(data_gff_.loc[data_gff_['location']=='CDS','end'])[0]
        gini_3_z, gini_cds_z, gini_5_z = get_se(data_z,location_list,CDS_strat,CDS_end,strand)
    ###uv-###
    gini_f_ = gini_f[pd.notnull(gini_f)]
    if len(gini_f_) == 0:
        gini_max_f = np.nan
    else:
        gini_max_f = gini_f[pd.notnull(gini_f)].max()
    gini_all_f = get_all(data_f)
    if len(data_gff_.loc[data_gff_['location'] == 'CDS', 'strat']) == 0:
        gini_3_f = np.nan
        gini_5_f = np.nan
        gini_cds_f = np.nan
    else:
        CDS_strat = list(data_gff_.loc[data_gff_['location'] == 'CDS', 'strat'])[0]
        CDS_end = list(data_gff_.loc[data_gff_['location'] == 'CDS', 'end'])[0]
        gini_3_f, gini_cds_f, gini_5_f = get_se(data_f, location_list, CDS_strat, CDS_end, strand)

    if len(R_z) != len(R_f):
        # Window counts must match between conditions; otherwise delta gini is undefined.
        raise ValueError('window count mismatch between UV+ and UV- data')
    else:
        delta_gini = np.zeros([len(R_z), ])
        for i in range(len(R_z)):
            delta_gini[i] = calculate_delta_gini(R_z[i, :], R_f[i, :], gini_z[i], gini_f[i])
    return delta_gini,gini_max_z,gini_all_z,gini_3_z, gini_cds_z, gini_5_z,gini_max_f,gini_all_f,gini_3_f, gini_cds_f, gini_5_f


def main_sum_gini(data_uv_z, data_uv_f, data_location, data_gff, transcript_id_list):
    statistics_sum = pd.DataFrame(columns=['transcript_id', 'num', 'num_0.1', 'delta_max', 'delta_min'])
    statistics_z = pd.DataFrame(columns=['transcript_id', 'gini_all', 'gini_max', 'gini_3UTR', 'gini_5UTR', 'gini_CDS'])
    statistics_f = pd.DataFrame(columns=['transcript_id', 'gini_all', 'gini_max', 'gini_3UTR', 'gini_5UTR', 'gini_CDS'])
    i = 0
    for transcript in transcript_id_list:
        data_z = data_uv_z.loc[data_uv_z['transcript_id'] == transcript, :]
        data_f = data_uv_f.loc[data_uv_f['transcript_id'] == transcript, :]
        data_location_ = data_location.loc[data_location['transcript_id'] == transcript,:]
        data_gff_ = data_gff.loc[data_gff['transcript_id'] == transcript, :]
        location_list=np.array(list(data_location_['location'])[0].split(',')).astype('int')
        chr = list(data_location_['chr'])[0]
        strand=list(data_location_['strand'])[0]
        delta_gini, gini_max_z, gini_all_z, gini_3_z, gini_cds_z, gini_5_z, gini_max_f, gini_all_f, gini_3_f, gini_cds_f, gini_5_f= get_statistics(data_z,data_f, data_gff_, location_list, strand)

        statistics_sum.loc[i, 'transcript_id'] = transcript
        num = sum(np.abs(delta_gini) >=0.2)
        statistics_sum.loc[i, 'num'] = num
        num_1 = sum(np.abs(delta_gini) >=0.1)
        statistics_sum.loc[i, 'num_0.1'] = num_1
        statistics_sum.loc[i, 'delta_gini_list'] = ','.join(list(delta_gini.astype('str')))

        if len(delta_gini[np.isnan(delta_gini) == False]) > 0:
            statistics_sum.loc[i, 'delta_max'] = np.max(delta_gini[np.isnan(delta_gini) == False])
            statistics_sum.loc[i, 'delta_min'] = np.min(delta_gini[np.isnan(delta_gini) == False])
        else:
            statistics_sum.loc[i, 'delta_max'] = np.nan
            statistics_sum.loc[i, 'delta_min'] = np.nan

        statistics_z.loc[i,'transcript_id']=transcript
        statistics_z.loc[i,'gini_all']=gini_all_z
        statistics_z.loc[i, 'gini_max'] = gini_max_z
        statistics_z.loc[i, 'gini_3UTR'] = gini_3_z
        statistics_z.loc[i, 'gini_5UTR'] = gini_5_z
        statistics_z.loc[i, 'gini_CDS'] = gini_cds_z

        statistics_f.loc[i,'transcript_id']=transcript
        statistics_f.loc[i,'gini_all']=gini_all_f
        statistics_f.loc[i, 'gini_max'] = gini_max_f
        statistics_f.loc[i, 'gini_3UTR'] = gini_3_f
        statistics_f.loc[i, 'gini_5UTR'] = gini_5_f
        statistics_f.loc[i, 'gini_CDS'] = gini_cds_f
        i = i + 1
        if i % 100 == 0:
            print(i)
    return statistics_sum,statistics_z,statistics_f

if __name__ == '__main__':
    col_uv_f = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/'
    col_uv_z = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_uv/'


###col###
    result = '/data/TA_QUIZ_RNA_regulation/result/PartIII.SHAPE-seq_analysis/col'
    uv_z = col_uv_z
    uv_f = col_uv_f

    data_uv_z = pd.read_csv(uv_z + '/final.modified_unmodified_new', sep='\t')
    data_uv_z = data_uv_z[['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
                         'Untreated_mutations', 'Untreated_effective_depth', 'reactivity']]
    data_uv_z = data_uv_z.drop_duplicates()

    data_uv_f = pd.read_csv(uv_f + '/final.modified_unmodified_new', sep='\t')
    data_uv_f = data_uv_f[['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
                         'Untreated_mutations', 'Untreated_effective_depth', 'reactivity']]
    data_uv_f = data_uv_f.drop_duplicates()
    transcript_uv_z = pd.read_csv(uv_z + '/cutoff.hit.group', sep='\t')
    transcript_uv_z.columns = ['cut_off', 'transcript_id', 'modified_depth_median',
                               'unmodified_depth_median', 'modified_depth_sum', 'unmodified_depth_sum', 'hit_level']

    transcript_uv_f = pd.read_csv(uv_f + '/cutoff.hit.group', sep='\t')
    transcript_uv_f.columns = ['cut_off', 'transcript_id', 'modified_depth_median',
                               'unmodified_depth_median', 'modified_depth_sum', 'unmodified_depth_sum', 'hit_level']

    uv_z_transcript = list(transcript_uv_z.loc[(transcript_uv_z['modified_depth_median'] > 100) & (
            transcript_uv_z['hit_level'] > 0), 'transcript_id'])

    uv_f_transcript = list(transcript_uv_f.loc[(transcript_uv_f['modified_depth_median'] > 100) & (
            transcript_uv_f['hit_level'] > 0), 'transcript_id'])
    transcript_all = list(set(uv_z_transcript) & set(uv_f_transcript))
    print(len(transcript_all))
    gff_path = '/data/TA_QUIZ_RNA_regulation/data/ATH/GFF/Ath_genes.gff'
    data_gff = pd.read_csv(gff_path, sep='\t')
    data_location = pd.read_csv(
        '/data/TA_QUIZ_RNA_regulation/data/ATH/GTF/shape_map/result/transcript_exon_location.csv',
        sep='\t')
    statistics_sum, statistics_z, statistics_f = main_sum_gini(data_uv_z, data_uv_f, data_location, data_gff, transcript_all)
    pd.DataFrame(statistics_sum).to_csv(result + '/gini_summary_50_1.csv', sep='\t', index=False, header=True)
    pd.DataFrame(statistics_z).to_csv(result + '/gini_summary_UV+_50_1.csv', sep='\t', index=False, header=True)
    pd.DataFrame(statistics_f).to_csv(result + '/gini_summary_UV-_50_1.csv', sep='\t', index=False, header=True)

output:

  • gini_summary_50_1.csv: sliding-window delta Gini index results.

  • gini_summary_UV+_50_1.csv and gini_summary_UV-_50_1.csv: Gini indices for the whole transcript and for its 3'UTR, 5'UTR and CDS regions under UV+ and UV-, respectively.
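The Gini index that these summaries are built on can be illustrated in isolation. The sketch below uses toy reactivity values and a hypothetical `gini_index` helper that mirrors the Lorenz-curve area computation inside `get_gini` (without its NaN-ratio and AC-only filters):

```python
import numpy as np

def gini_index(values):
    """Gini index of non-negative reactivities, computed as in get_gini:
    (fair_area - area under the Lorenz curve) / fair_area."""
    v = np.sort(np.asarray(values, dtype=float))
    height, area = 0.0, 0.0
    for x in v:
        height += x
        area += height - x / 2.0          # accumulate the Lorenz-curve area
    fair_area = height * len(v) / 2.0     # area under perfect equality
    return np.nan if fair_area == 0 else (fair_area - area) / fair_area

# Uniform reactivities give Gini 0; reactivity concentrated on few
# positions (a more structured window) pushes the index toward 1.
print(gini_index([1, 1, 1, 1]))    # 0.0
print(gini_index([0, 0, 0, 10]))   # 0.75
```

A low windowed Gini therefore indicates evenly distributed reactivity, while a high one indicates that a few nucleotides dominate the signal.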

1.7) Merging structure-change regions

We use sliding windows to identify, for each transcript, the regions whose structure changes after UV exposure; a transcript may contain several windows with |delta gini index| > 0.1. Consecutive windows are merged with a stepwise-forward strategy: whenever a new window is added, the |delta gini index| of the whole merged region is recomputed; the window is merged if the region still satisfies |delta gini index| > 0.1, and rejected otherwise.

Windows are "consecutive" when their start positions are consecutive, e.g. [1,51] and [2,52]; windows merely "overlap" when they share positions, e.g. [1,51] and [20,70]. Here we merge only consecutive windows.
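The grouping step can be sketched on its own. This minimal example (hypothetical `group_consecutive_starts` helper, not part of the reference script) only groups window start positions into runs of consecutive integers; the full script additionally recomputes the merged region's |delta gini index| with `merge_gini` and trims the region's end until the 0.1 threshold is met:

```python
def group_consecutive_starts(starts):
    """Group sorted window start positions into runs of consecutive
    integers, e.g. [1, 2, 3, 20, 21] -> [[1, 2, 3], [20, 21]]."""
    groups = []
    run = [starts[0]]
    for s in starts[1:]:
        if s == run[-1] + 1:      # start adjacent to the previous window
            run.append(s)
        else:                     # gap: close the current region
            groups.append(run)
            run = [s]
    groups.append(run)
    return groups

# Windows starting at 1, 2, 3 form one candidate region spanning
# [1, 3 + window - 1]; the window starting at 20 opens a new region.
print(group_consecutive_starts([1, 2, 3, 20, 21]))  # [[1, 2, 3], [20, 21]]
```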

Reference script: /data/TA_QUIZ_RNA_regulation/script/PartIII.SHAPE-seq_analysis/structure_changes_analysis/6.gini_transcript_merge_windows_new.py

import numpy as np
import pandas as pd
import re
from numba import jit
from scipy.stats import ks_2samp
import statsmodels.stats.multitest as multi

#################### Function definitions ####################
@jit(nopython=False)
def get_gini(R_, nucleotide_, cut_AC=False, ratio_nan=0.9, ratio_nan_AC=0.8):
    R_nonull = R_[np.isnan(R_)==False]
    ratio = len(R_nonull)/len(R_)
    if ratio <= ratio_nan:
        gini = np.nan
    else:
        if cut_AC:
            R_AC = R_[(nucleotide_==b'A')|(nucleotide_==b'C')]
            R_AC_nonull =  R_AC[np.isnan(R_AC)==False]
            ratio_AC = len(R_AC_nonull) / len(R_AC)
            if (ratio_AC <= ratio_nan_AC) | (len(R_AC) <= 1) | (len(R_AC_nonull) <= 1):
                gini = np.nan
            else:
                sorted = np.sort(R_AC_nonull)
                height, area = 0, 0
                for i in range(0,len(sorted)):
                    height += sorted[i]
                    area += height - sorted[i] / 2.
                fair_area = height * len(sorted) / 2.
                if fair_area ==0:
                    gini=np.nan
                else:
                    gini =  (fair_area - area) / fair_area
        else:
            sorted = np.sort(R_nonull)
            height, area = 0, 0
            for i in range(0, len(sorted)):
                height += sorted[i]
                area += height - sorted[i] / 2.
            fair_area = height * len(sorted) / 2.
            if fair_area == 0:
                gini = np.nan
            else:
                gini = (fair_area - area) / fair_area
    return gini

@jit(nopython=False)
def get_p(R_1,R_2,ratio_nan=0.9):
    R_1_nonull = R_1[np.isnan(R_1) == False]
    ratio_1 = len(R_1_nonull) / len(R_1)
    R_2_nonull = R_2[np.isnan(R_2) == False]
    ratio_2 = len(R_2_nonull) / len(R_2)
    if (ratio_1 <= ratio_nan)|(ratio_2 <= ratio_nan):
        p = np.nan
    else:
        s,p=ks_2samp(R_1, R_2)
    return p

@jit(nopython=False)
def get_window(X,location,window=50,step=1,cut_AC=False,ratio_nan=0.9,ratio_nan_AC=0.8):
    ######### read data #############
    name = X[0]
    nucleotide = np.array(list(X[1]))
    modified = np.array(X[2].split(',')).astype('int')
    modified_depth = np.array(X[3].split(',')).astype('int')
    unmodified = np.array(X[4].split(',')).astype('int')
    unmodified_depth = np.array(X[5].split(',')).astype('int')
    Mutrs = modified/modified_depth
    Mutru = unmodified/unmodified_depth
    # R = Mutrs - Mutru
    R = np.array(X[6].split(',')).astype('float')
    R[R == -999] = np.nan
    n = len(nucleotide)
    index = np.array(range(0, n))
    location_index = np.array(np.array(location['location'])[0].split(',')).astype('int')
    ##### sliding-window change computation ##################
    nucleotide_ = np.zeros([len(range(0, n-(window), step)), window],dtype=np.string_)
    R_ = np.zeros([len(range(0, n-(window), step)), window])
    gini = np.zeros([len(range(0, n-(window), step)),])
    index_ = np.zeros([len(range(0, n-(window), step)), window])
    location_index_ = np.zeros([len(range(0, n - (window), step)), window])
    j =0
    for i in range(0, n-(window), step):
        index_[j,:]=index[i:i+window]
        location_index_[j,:] = location_index[i:i+window]
        nucleotide_[j,:] = nucleotide[i:i+window]
        R_[j,:] = R[i:i+window]
        gini[j] = get_gini(R[i:i+window],nucleotide[i:i+window],cut_AC=cut_AC,ratio_nan=ratio_nan,ratio_nan_AC=ratio_nan_AC)
        j=j+1

    return index_,location_index_,nucleotide_,R_,gini


@jit(nopython=False)
def get_window_p(X1,X2, window=50, step=1,ratio_nan=0.9):
    ######### read data #############
    name = X1[0]
    nucleotide = np.array(list(X1[1]))
    R_z = np.array(X1[6].split(',')).astype('float')
    R_z[R_z == -999] = np.nan
    R_f = np.array(X2[6].split(',')).astype('float')
    R_f[R_f == -999] = np.nan
    n = len(nucleotide)

    ##### sliding-window change computation ##################
    nucleotide_ = np.zeros([len(range(0, n - (window), step)), window], dtype=np.string_)
    R_z_ = np.zeros([len(range(0, n - (window), step)), window])
    R_f_ = np.zeros([len(range(0, n - (window), step)), window])
    p = np.zeros([len(range(0, n - (window), step)), ])
    j = 0
    for i in range(0, n - (window), step):
        nucleotide_[j, :] = nucleotide[i:i + window]
        R_z_[j, :] = R_z[i:i + window]
        R_f_[j, :] = R_f[i:i + window]
        p[j] = get_p(R_z[i:i + window], R_f[i:i + window],ratio_nan=ratio_nan,)
        j = j + 1
    p_=p.copy()
    if len(p[~np.isnan(p)])>0:
        a,p_bh,b,c =multi.multipletests(p[~np.isnan(p)],method='fdr_bh')
        p_[~np.isnan(p_)]=p_bh
    return p_,p


@jit(nopython=False)
def calculate_delta_gini(R_1, R_2, gini_1, gini_2, ratio_nan=0.9):
    '''Delta Gini index (gini_1 - gini_2) of two equal-length reactivity windows;
    returns NaN when too many positions are missing.'''
    # Check to see the vectors are of equal length.
    if len(R_1) != len(R_2):
        return np.nan
    else:
        R = R_1 - R_2
        if len(R[np.isnan(R) == False]) / len(R) <= ratio_nan:
            return np.nan
        else:
            delta_gini = gini_1 - gini_2
    return delta_gini

@jit(nopython=False)
def get_gff(X,chr,strand,data_gff_):
    gff=[]
    if len(data_gff_.loc[data_gff_['location'] == 'CDS', 'strat']) == 0:
        for i in range(len(X)):
            gff_='erro'
            gff.append(gff_)
    else:
        CDS_strat = list(data_gff_.loc[data_gff_['location'] == 'CDS', 'strat'])[0]
        CDS_end = list(data_gff_.loc[data_gff_['location'] == 'CDS', 'end'])[0]
        for i in range(len(X)):
            if strand == '+':
                if X[i]<CDS_strat:
                    gff_='five_prime_UTR'
                elif X[i]>CDS_end:
                    gff_ = 'three_prime_UTR'
                else:
                    gff_='CDS'
            else:
                if X[i]<CDS_strat:
                    gff_='three_prime_UTR'
                elif X[i]>CDS_end:
                    gff_ = 'five_prime_UTR'
                else:
                    gff_='CDS'
            gff.append(gff_)
    return gff


@jit(nopython=False)
def merge_gini(X1,X2,strat,end):
    ######### read data #############
    name = X1[0]
    nucleotide = np.array(list(X1[1]))
    R_z = np.array(X1[6].split(',')).astype('float')
    R_z[R_z == -999] = np.nan
    R_f = np.array(X2[6].split(',')).astype('float')
    R_f[R_f == -999] = np.nan
    n = len(nucleotide)
    R_z_=R_z[strat:end+1]
    R_f_=R_f[strat:end+1]
    gini_z=get_gini(R_z_,nucleotide_=[],ratio_nan=0)
    gini_f = get_gini(R_f_, nucleotide_=[],ratio_nan=0)
    delta=gini_z-gini_f
    p=get_p(R_z_,R_f_,ratio_nan=0)
    R_f_mean = R_f_[R_f_>=0].mean()
    R_z_mean = R_z_[R_z_ >= 0].mean()
    return delta,p,R_f_mean,R_z_mean,gini_z,gini_f



@jit(nopython=False)
def get_statistics(data_z,data_f,data_location_,data_gff_,strand):
    data_z = np.array(data_z).reshape([7,])
    data_f = np.array(data_f).reshape([7,])
    chr=list(data_location_['chr'])[0]
    index_z,location_index_z,nucleotide_z, R_z, gini_z = get_window(data_z,data_location_)
    index_f,location_index_f,nucleotide_f, R_f, gini_f = get_window(data_f,data_location_)
    # print('ok')
    if len(R_z) != len(R_f):
        # Window counts must match between conditions.
        raise ValueError('window count mismatch between UV+ and UV- data')
    else:
        location_exon = get_gff(location_index_z[:,0],chr,strand,data_gff_)
        # print('ok')
        location_exon = np.array(location_exon)
        p_bh, p = get_window_p(data_z, data_f)
        delta_gini = np.zeros([len(R_z), ])
        index_list = []
        nucleotide_list = []
        location_index_list = []
        for i in range(len(R_z)):
            index_list.append( ','.join(list(index_z[i,:].astype('int').astype('str'))))
            nucleotide_list.append(','.join(list(nucleotide_z[i, :].astype('str'))))
            location_index_list.append( ','.join(list(location_index_z[i,:].astype('int').astype('str'))))
            delta_gini[i] = calculate_delta_gini(R_z[i, :], R_f[i, :], gini_z[i], gini_f[i])
        index_list = np.array(index_list)
        location_index_list = np.array(location_index_list)
        nucleotide_list = np.array(nucleotide_list)
    return delta_gini,p_bh,p, gini_z, gini_f,index_list,location_index_list,nucleotide_list,location_exon


# @jit(nopython=False,error_model="numpy")
def merge_data(delta_gini,index_list,data_z,data_f,location_list,chr,strand,data_gff_,windows=50):
    data_z = np.array(data_z).reshape([7,])
    data_f = np.array(data_f).reshape([7,])
    nucleotide = np.array(list(data_z[1]))
    index = np.array([x.split(',')[0] for x in index_list]).astype('int')
    delta_gini_ =delta_gini.copy()
    delta_gini_[np.isnan(delta_gini_)]=0
    i=0
    windows_num=[]
    n=index[abs(delta_gini_)>0.1][0]+windows-1
    for j in range(len(index)-1):
        if abs(delta_gini_[j])<0.1:
            windows_num.append(np.nan)
        else:
            if index[j]>n+1:
                i=i+1
            n=index[j]+windows-1
            windows_num.append(i)
    if abs(delta_gini_[len(index)-1])<0.1:
        windows_num.append(np.nan)
    else:
        if index[j] > n + 1:
            i = i + 1
        windows_num.append(i)

    data_all= pd.DataFrame(np.zeros([len(set(windows_num) - set([np.nan])),9]))
    for i in list(set(windows_num) - set([np.nan])):
        windows_num_=np.array(windows_num)
        strat = index[windows_num_==i].astype('int')[0]
        end = index[windows_num_==i].astype('int')[-1]+windows-1
        strat_chr=location_list[strat]
        end_chr = location_list[end]
        # merge_gini returns six values; only delta and p are used here
        delta, p = merge_gini(data_z, data_f, strat, end)[:2]
        location = get_gff(np.array([strat_chr,end_chr]),chr,strand,data_gff_)
        nucleotide_ =nucleotide[strat:end+1]
        nucleotide_str = ','.join(list(nucleotide_.astype('str')))
        data_all.iloc[i,:]=[strat,end,strat_chr,end_chr,location[0],location[1],delta,p,nucleotide_str]
    data_all.columns=['start','end','start_chr','end_chr','location_start','location_end','delta','p','nucleotide']
    return data_all


def merge_data_2(delta_gini,index_list,data_z,data_f,location_list,chr,strand,data_gff_,windows=50):
    data_z = np.array(data_z).reshape([7,])
    data_f = np.array(data_f).reshape([7,])
    nucleotide = np.array(list(data_z[1]))
    index = np.array([x.split(',')[0] for x in index_list]).astype('int')
    delta_gini_ =delta_gini.copy()
    delta_gini_[np.isnan(delta_gini_)]=0
    i=0
    windows_num=[]
    if abs(delta_gini_[0]) < 0.1:
        windows_num.append(np.nan)
    else:
        i = i + 1
        windows_num.append(i)

    for j in range(1,len(index)):
        if abs(delta_gini_[j])<0.1:
            windows_num.append(np.nan)
        else:
            if abs(delta_gini_[j-1])>0.1:
                windows_num.append(i)
            else:
                i=i+1
                windows_num.append(i)

    data_all= pd.DataFrame(np.zeros([len(set(windows_num) - set([np.nan])),13]))
    for i in list(set(windows_num) - set([np.nan])):
        windows_num_=np.array(windows_num)
        strat = index[windows_num_==i].astype('int')[0]
        end = index[windows_num_==i].astype('int')[-1]+windows-1
        delta, p, R_f_mean, R_z_mean,gini_z,gini_f=merge_gini(data_z,data_f,strat,end)
        if abs(delta)<=0.1:
            num = len(index[windows_num_==i].astype('int'))
            for n in range(1,num):
                end_=end-n
                delta, p, R_f_mean, R_z_mean,gini_z,gini_f = merge_gini(data_z, data_f, strat, end_)
                if abs(delta)>0.1:
                    end = end_
                    break
        strat_chr=location_list[strat]
        end_chr = location_list[end]
        location = get_gff(np.array([strat_chr,end_chr]),chr,strand,data_gff_)
        nucleotide_ =nucleotide[strat:end+1]
        nucleotide_str = ','.join(list(nucleotide_.astype('str')))
        data_all.iloc[i-1,:]=[strat,end,strat_chr,end_chr,location[0],location[1],delta,p,R_f_mean, R_z_mean,gini_z,gini_f,nucleotide_str]
    data_all.columns=['start','end','start_chr','end_chr','location_start','location_end','delta','p','R_f_mean','R_z_mean','gini_z','gini_f','nucleotide']
    return data_all



def main_sum(data_uv_z,data_uv_f,transcript_id_list,data_location,data_gff,save_path):
    i=0
    data_all = pd.DataFrame(columns=['transcript_id','start','end','start_chr','end_chr','location_start','location_end','delta','p','R_f_mean','R_z_mean','gini_z','gini_f','nucleotide'])
    for transcript in transcript_id_list:
        data_z = data_uv_z.loc[data_uv_z['transcript_id']==transcript,:]
        data_f = data_uv_f.loc[data_uv_f['transcript_id'] == transcript,:]
        data_location_ = data_location.loc[data_location['transcript_id'] == transcript,:]
        data_gff_ = data_gff.loc[data_gff['transcript_id']== transcript,:]
        location_list=np.array(list(data_location_['location'])[0].split(',')).astype('int')
        chr = list(data_location_['chr'])[0]
        strand = list(data_location_['strand'])[0]
        delta_gini, p_bh,p,gini_z, gini_f, index_list, location_index_list, nucleotide_list,location_exon = get_statistics(data_z,data_f,data_location_,data_gff_,strand)
        # print(delta_gini)

        data_new = pd.DataFrame(np.vstack([location_exon,index_list,location_index_list,nucleotide_list,gini_z, gini_f,delta_gini,p_bh,p])).T
        data_new.columns =['location_exon','index','location_index','nucleotide','gini_z', 'gini_f','delta_gini','p_bh','p']
        data_new['transcript']=transcript
        # pd.DataFrame(data_new).to_csv(save_path+'/'+transcript+'_gini.csv',sep='\t',header=True,index=False)
        data_all_=merge_data_2(delta_gini, index_list, data_z, data_f, location_list,chr,strand,data_gff_)
        data_all_['transcript_id']=transcript
        data_all = pd.concat([data_all,data_all_])
        i = i+1
        print(i)
    return data_all

#################### Main ####################
if __name__ == '__main__':
    col_uv_f = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_nouv/'
    col_uv_z = '/data/TA_QUIZ_RNA_regulation/data/riboshape_liulab_batch4/final.modified_umodified/col_uv/'


    ###col###
    result = '/data/TA_QUIZ_RNA_regulation/result/PartIII.SHAPE-seq_analysis/col'
    uv_z = col_uv_z
    uv_f = col_uv_f

    data_uv_z = pd.read_csv(uv_z + '/final.modified_unmodified_new', sep='\t')
    data_uv_z.columns = ['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
                         'Untreated_mutations', 'Untreated_effective_depth', 'R1']
    data_uv_z = data_uv_z[['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
                         'Untreated_mutations', 'Untreated_effective_depth', 'R1']]
    data_uv_z = data_uv_z.drop_duplicates()

    data_uv_f = pd.read_csv(uv_f + '/final.modified_unmodified_new', sep='\t')
    data_uv_f.columns = ['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
                         'Untreated_mutations', 'Untreated_effective_depth', 'R1']
    data_uv_f = data_uv_f[['transcript_id', 'Nucleotide', 'Modified_mutations', 'Modified_effective_depth',
                         'Untreated_mutations', 'Untreated_effective_depth', 'R1']]
    data_uv_f = data_uv_f.drop_duplicates()
    statistics_sum=pd.read_csv(result + '/gini_summary_50_1.csv', sep='\t')
    # statistics_sum_new = pd.read_csv(result + '/summary_result_merge_50_1_0.csv',sep='\t')

    transcript_all = list(statistics_sum.loc[statistics_sum['num_0.1']>0,'transcript_id'])
    # transcript_all =transcript_all[:1000]
    # transcript_all=['AT5G26000.1','AT5G26000.2']
    # transcript_all = ['AT5G45260.2']
    print(len(transcript_all))
    gff_path = '/data/TA_QUIZ_RNA_regulation/data/ATH/GFF/Ath_genes.gff'
    data_gff = pd.read_csv(gff_path, sep='\t')
    # data_gff.columns = ['exosome','name','location','strat','end','.','+/-','num','id']
    data_location = pd.read_csv('/data/TA_QUIZ_RNA_regulation/data/ATH/GTF/shape_map/result/transcript_exon_location.csv', sep='\t')
    data_all = main_sum(data_uv_z,data_uv_f,transcript_all,data_location,data_gff,result+'/transcript_gini_merge')
    data_all.to_csv(result+'/summary_result_merge_50_1.csv',sep='\t',index=False)
    hit_level_coil_uv_f = pd.read_csv(uv_f+'cutoff.hit.group',sep='\t')
    #hit_level_coil_uv_f = hit_level_coil_uv_f.reset_index()
    hit_level_coil_uv_f.columns =['group','transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit_f']
    hit_level_coil_uv_z = pd.read_csv(uv_z+'cutoff.hit.group',sep='\t')
    #hit_level_coil_uv_z = hit_level_coil_uv_z.reset_index()
    hit_level_coil_uv_z.columns =['group','transcript_id','modified.median','unmodified.median','modified.sum','unmodified.sum','hit_z']
    shape_data_coil =pd.merge(pd.merge(data_all,hit_level_coil_uv_f[['transcript_id','hit_f']],on='transcript_id',how='left'),hit_level_coil_uv_z[['transcript_id','hit_z']],on='transcript_id',how='left')
    shape_data_coil_2 =shape_data_coil.loc[(shape_data_coil['hit_f']>2)&(shape_data_coil['hit_z']>2),:]
    shape_data_coil_2.to_csv(result+'/summary_result_merge_50_1_new.csv',sep='\t',index=False)

output:

  • summary_result_merge_50_1.csv: merged structure-change regions for transcripts with hit level > 0.

  • summary_result_merge_50_1_new.csv: the subset of those regions from transcripts with hit level > 2.
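The final hit-level filter is a plain left merge followed by a boolean selection, as in the last lines of the script. A minimal sketch with toy stand-ins for `data_all` and the two `cutoff.hit.group` tables (all transcript IDs and values below are hypothetical):

```python
import pandas as pd

# Toy merged-region table and per-transcript hit levels for UV- (hit_f) and UV+ (hit_z).
regions = pd.DataFrame({'transcript_id': ['AT1G01010.1', 'AT1G01020.1'],
                        'delta': [0.15, -0.20]})
hits_f = pd.DataFrame({'transcript_id': ['AT1G01010.1', 'AT1G01020.1'], 'hit_f': [3.1, 1.2]})
hits_z = pd.DataFrame({'transcript_id': ['AT1G01010.1', 'AT1G01020.1'], 'hit_z': [4.0, 2.5]})

# Attach both hit levels to each region, then keep transcripts passing both cutoffs.
merged = (regions.merge(hits_f, on='transcript_id', how='left')
                 .merge(hits_z, on='transcript_id', how='left'))
high_confidence = merged.loc[(merged['hit_f'] > 2) & (merged['hit_z'] > 2), :]
print(high_confidence['transcript_id'].tolist())  # only AT1G01010.1 passes both cutoffs
```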
