1.3 Helps
Since this assignment requires a fair amount of computing resources, we have provided each group with a cluster account; each group may use at most four cores and 32 GB of memory. The assignment must be completed in Python (Python 3 is recommended), and several Python packages are needed for parts of the work. We suggest using Jupyter Notebook to edit, run, and debug your code. Some parts of the assignment may also require R; you can likewise write and run R code in Jupyter Notebook, where an R kernel is preinstalled.
To save time, we have already set up a shared Jupyter platform on the cluster, so there is no need to configure Python, R, or the related environments yourself. Please contact the teaching assistants for instructions on using the shared Jupyter platform.
```python
# Import required libraries
import gc, argparse, sys, os, errno
%pylab inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from tqdm import tqdm_notebook as tqdm
from scipy.stats import pearsonr
import warnings
warnings.filterwarnings('ignore')

# Plotting settings
styles = ["white", "dark", 'whitegrid', "darkgrid"]
contexts = ['paper', 'talk', 'poster', 'notebook']
sns.set_context(contexts[1])
sns.set_style(styles[2])
# Tableau 20 palette, scaled to the 0-1 RGB range expected by matplotlib
tableau20 = np.array([(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),
                      (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),
                      (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),
                      (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),
                      (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)])/255.
```
Complete the mapping and the RNA ratio and length statistics for the five samples Sample_N1, Sample_N7, Sample_N13, Sample_N19, Sample_N25. The BAM files produced here will be used to build the expression matrix in the next step. Overall workflow diagram:

```
~/proj_exRNA/
|-- data
|   |-- RNA_index    # indexes for the RNA types mapped against in 1.2.4 d.
|   |-- hg38_index   # index needed for the last step of 1.2.4 d.
|   |-- raw_data
|   |-- stat         # statistics files from the final step
|-- output           # organize output as you like; e.g., following the pipeline:
    |-- 01.trim          # corresponds to 1.2.2
    |   |-- QC1          # corresponds to 1.2.2 step one
    |   |-- trim         # corresponds to 1.2.2 step two
    |   |-- QC2          # corresponds to 1.2.2 step three
    |-- 02.mapping       # corresponds to 1.2.3 and 1.2.4
    |   |-- 1.no_rRNA
    |   |   |-- fastq    # *.rRNA.unAligned.fastq, see 1.2.3
    |   |   |-- sam      # *.rRNA.sam, see 1.2.3
    |   |   |-- rsem_bam # .sam converted to .bam, see 1.2.3
    |   |-- 2.no_miRNA
    |   |-- ...
    |   |-- 12.no_hg38other
    |       |-- fastq
    |       |-- sam
    |       |-- bam      # a different sam-to-bam tool is used here, hence bam instead of rsem_bam
    |-- 03.tags          # path needed by homer to build the expression matrix; not used in this tutorial
    |   |-- Sample_N1
    |   |   |-- miRNA
    |   |   |-- ...
    |   |-- ...
    |-- 04.counts        # expression matrix construction
    |-- 05.matrix        # expression matrix construction
    |-- tmp              # intermediate files
```
Inputs

| File format | Information contained in file | File description |
| ----------- | ----------------------------- | ---------------- |
| fastq | reads | five samples, GEO link: GSE71008 |

Outputs

| File format | Information contained in file |
| ----------- | ----------------------------- |
| sam/bam | mapped reads to different kinds of indexes |
| tsv format | stats of RNA ratio and length |
Copy the genome data (hg38), the genome annotation data (gtf), the index files (RNA_index), and the raw data (fastq files) from /BioII/chenxupeng/student/ into the data directory under your own account:

| data | path |
| ---- | ---- |
| hg38 | /BioII/chenxupeng/student/data/hg38_index/GRCh38.p10.genome.fa |
| gtf | /BioII/chenxupeng/student/data/gtf |
| RNA index | /BioII/chenxupeng/student/data/RNA_index/ |
| raw data | /BioII/chenxupeng/student/data/raw_data/*.fastq |

We recommend the `ln` or `cp` command, e.g. as sketched below.

The next step (1.2.2) has two main purposes: checking the quality of the data and cutting away the adapter sequences.
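A minimal sketch of linking the shared data into your own `data` directory with Python's os module (`ln -s` on the shell achieves the same); the destination path is an assumption, adjust it to your layout:

```python
import os

src_dirs = {
    'hg38_index': '/BioII/chenxupeng/student/data/hg38_index',
    'gtf':        '/BioII/chenxupeng/student/data/gtf',
    'RNA_index':  '/BioII/chenxupeng/student/data/RNA_index',
    'raw_data':   '/BioII/chenxupeng/student/data/raw_data',
}
dest_root = os.path.expanduser('~/proj_exRNA/data')   # assumed destination

os.makedirs(dest_root, exist_ok=True)
for name, src in src_dirs.items():
    dest = os.path.join(dest_root, name)
    if not os.path.exists(dest):
        os.symlink(src, dest)   # a symlink avoids copying large files, unlike cp
```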
- Step one - QC of raw data

Input:

| data type | path |
| --------- | ---- |
| raw data | /BioII/chenxupeng/student/data/raw_data/*.fastq |

Software/Parameters: fastqc

| options | function |
| ------- | -------- |
| -q --quiet | Suppress all progress messages on stdout and only report errors. |
| -o --outdir | Create all output files in the specified output directory. |
| -h --help | Detailed introduction of the options. |

Output:

QC files
- Step two - cut adaptor & trim long read

Input:

| data type | path |
| --------- | ---- |
| raw data | /BioII/chenxupeng/student/data/raw_data/*.fastq |

Software/Parameters:

`cutadapt` removes adapter sequences from high-throughput sequencing reads. Usage:

```
cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
```

| options with Parameter Setting | function |
| ------------------------------ | -------- |
| -q 30 | trim bases with quality below 30 from the read ends |
| -m 16 | discard reads shorter than 16 nt after trimming |
| -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC | adapter sequence to cut |
| --trim-n | trim N's on ends of reads |

Output:

`*.cutadapt.fastq`
- Step three - QC after trim

The input is the trimmed data; the procedure is identical to step one.
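As an illustration, a hedged Python loop that chains steps one to three for the five samples via subprocess; the sample names, adapter, and parameters come from this section, while the raw-file naming (`<sample>.fastq`) and the output layout are assumptions:

```python
import subprocess, os

samples = ['Sample_N1', 'Sample_N7', 'Sample_N13', 'Sample_N19', 'Sample_N25']
raw_dir = '/BioII/chenxupeng/student/data/raw_data'
adapter = 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'

for d in ('output/01.trim/QC1', 'output/01.trim/trim', 'output/01.trim/QC2'):
    os.makedirs(d, exist_ok=True)

for s in samples:
    raw = f'{raw_dir}/{s}.fastq'
    trimmed = f'output/01.trim/trim/{s}.cutadapt.fastq'
    # step one: QC of the raw reads
    subprocess.run(['fastqc', '-q', '-o', 'output/01.trim/QC1', raw], check=True)
    # step two: cut the adapter, trim low-quality ends, drop too-short reads
    subprocess.run(['cutadapt', '-a', adapter, '-q', '30', '-m', '16',
                    '--trim-n', '-o', trimmed, raw], check=True)
    # step three: QC again on the trimmed reads
    subprocess.run(['fastqc', '-q', '-o', 'output/01.trim/QC2', trimmed], check=True)
```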
bowtie2 maps the `.fastq` file to the rRNA index, yielding a `.fastq` file free of rRNA reads together with a `.sam` file of the reads that mapped to the rRNA index.

Input:

The `*.cutadapt.fastq` files produced in 1.2.2.

Software/Parameters:

```
bowtie2 -p 4 [options] -x <bt2-idx> --un <address of unmapped reads> $input_file [-S <sam>]
```

| options with Parameter Setting | function |
| ------------------------------ | -------- |
| --sensitive-local | (default) allow no mismatch, etc |
| --norc | do not align the reverse-complement version of the read; pay attention to this option |
| --no-unal | suppress SAM records for unaligned reads |
| --un <path to unmapped reads> | store unmapped reads |
| -x <path to index>/rRNA | indexed genome/transcriptome |
| -S <path to output file> | output file in sam format |
The `.sam` files of reads mapped to the rRNA index can be converted to `.bam` files with the `rsem-tbam2gbam` command:

```
rsem-tbam2gbam <rRNA-idx> <sam> genome_bam_output
```
Output:

- the rRNA-free `.fastq` file `*.rRNA.unAligned.fastq`, placed in the fastq folder (see Data Structure)
- the `*.rRNA.sam` file of reads mapped to the rRNA index, placed in the sam folder
- the `*.<rRNA>.rsem.clean.bam` file, placed in the rsem_bam folder

The goal of this step is to obtain the `.sam` files produced by mapping to the indexes of the various RNA types (e.g. miRNA, piRNA, Y RNA, srp RNA, and so on). The mapping procedure is just like cleaning the rRNA reads in 1.2.3, except that 1) the rRNA index is replaced by the index of another RNA type, and 2) the `*.<some type of RNA>.unAligned.fastq` produced in the previous round serves as the input. Repeat 1) and 2) until every RNA type has been mapped; this yields the alignment results for all RNA types.

Input:
*.rRNA.unAligned.fastq
Software/Parameters:

Similar to 1.2.3; only the index and the input change:

```
bowtie2 -p 4 [options] -x <some type of RNA-idx> --un <address of unmapped reads> $input_file [-S <sam>]
```

| Parameter Setting |
| ----------------- |
| -x <path to index>/<some type of RNA> |
| *.<some type of RNA>.unAligned.fastq , output of the previous step, as $input_file |
| --un <path to output>/*.<some type of RNA>.unAligned.fastq |
| -S <path to .sam file>/<some type of RNA>.sam |
The `.sam` files of reads mapped to an RNA index can be converted to `.bam` files with the `rsem-tbam2gbam` command. For the `.sam` file of reads mapped to hg38, use samtools' view function for the conversion; type `samtools view -h` to see how.

Output:

- the `.fastq` file free of reads of the given RNA type, `*.<some type of RNA>.unAligned.fastq`
- the `*.<some type of RNA>.sam` file of reads mapped to that RNA-type index
- the `*.<some type of RNA>.rsem.clean.bam` file

Tips:

- Reads are mapped to the RNA-type indexes one after another; the recommended order is miRNA, piRNA, Y_RNA, srpRNA, tRNA, snRNA, snoRNA, lncRNA, mRNA, tucp, and finally hg38other (a sketch of this loop follows the list).
- The last mapping step is special: 1) the index is no longer an RNA index but hg38, which does not live in the RNA_index folder, so take care; 2) the sam-to-bam tool also differs, so strictly speaking the output should be named `*.hg38other.bam` rather than `*.hg38other.rsem.clean.bam`, but the choice of file names affects how clean the downstream code is, so take care.
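A minimal sketch of the sequential mapping loop, driving bowtie2 from Python with subprocess. The paths and the naming scheme merely follow the assumed directory layout above, and the reads are passed with `-U` (bowtie2's flag for unpaired reads); adapt everything to your own setup:

```python
import subprocess, os

sample = 'Sample_N1'
rna_types = ['rRNA', 'miRNA', 'piRNA', 'Y_RNA', 'srpRNA', 'tRNA',
             'snRNA', 'snoRNA', 'lncRNA', 'mRNA', 'tucp']  # hg38other is handled separately
index_dir = '/BioII/chenxupeng/student/data/RNA_index'
fastq = f'output/01.trim/trim/{sample}.cutadapt.fastq'     # output of 1.2.2

for i, rna in enumerate(rna_types, start=1):
    outdir = f'output/02.mapping/{i}.no_{rna}'
    for sub in ('fastq', 'sam', 'rsem_bam'):
        os.makedirs(os.path.join(outdir, sub), exist_ok=True)
    unaligned = f'{outdir}/fastq/{sample}.{rna}.unAligned.fastq'
    sam = f'{outdir}/sam/{sample}.{rna}.sam'
    subprocess.run(['bowtie2', '-p', '4', '--sensitive-local', '--norc', '--no-unal',
                    '-x', f'{index_dir}/{rna}', '--un', unaligned,
                    '-U', fastq, '-S', sam], check=True)
    fastq = unaligned  # unmapped reads become the input of the next round
```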
For the reads mapped to the index of each RNA type, we can compute the read-length distribution to see how lengths differ between RNA types; we can also compute the proportion of reads belonging to each RNA type as a reference for sample QC.
length

A shell script for the length statistics is provided at /BioII/chenxupeng/student/bin/length.sh; it produces a file containing the length information. That file can be read with Python and condensed into a tidier format, which can then be visualized. In addition, the folder /BioII/chenxupeng/student/data/other_annotations/length/ holds pre-computed length files for all samples other than the five above, in the same format as the output of length.sh.

```python
import pandas as pd

def get_length_table(samplename):
    '''
    sample name: Sample_N14
    '''
    df = pd.read_table('/BioII/chenxupeng/student/data/other_annotations/length/'+samplename+'.lengthN.stat.tsv')
    df = df.pivot(index='type', columns='len', values='num')
    df = df.fillna(0)
    return df

get_length_table('Sample_N14')
```
The core of this code is `df.pivot`, which rearranges the data with type as the index and len as the columns; the result can be drawn with the line plot of length in 3.2) to show how read counts are distributed over read lengths. Applying `*.apply(lambda x: (x/x.sum()))` normalizes the table, so the distribution of read proportions over read lengths can be plotted instead (see the sketch after the table below). The file with the length information looks like the following: each row states that for sample = Sample_N1 and type = < some type of RNA >, the number of reads of length len = < int > is num = < int >.
| sample | type | len | num |
| ------ | ---- | --- | --- |
| Sample_N1 | rRNA | 16 | 4569 |
| Sample_N1 | rRNA | 17 | 13178 |
| Sample_N1 | rRNA | ... | ... |
| Sample_N1 | rRNA | 51 | ... |
| Sample_N1 | miRNA | 16 | ... |
| Sample_N1 | miRNA | ... | ... |
| Sample_N1 | miRNA | 51 | ... |
| Sample_N1 | ... | ... | ... |
| Sample_N1 | hg38other | 16 | ... |
| Sample_N1 | hg38other | ... | ... |
| Sample_N1 | hg38other | 51 | ... |
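A short sketch of the normalization mentioned above. The explicit `axis=1` is our assumption that each RNA type (row) should sum to 1; drop it if your table is oriented the other way:

```python
df = get_length_table('Sample_N14')

# Turn per-length read counts into per-length read proportions, one RNA type per row
ratio = df.apply(lambda x: x / x.sum(), axis=1)

# Sanity check: each RNA type's proportions now sum to 1
assert np.allclose(ratio.sum(axis=1), 1)
# `ratio` can be drawn with the "line plot of length" in 3.2)
```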
ratio

A shell script for counting the reads belonging to the different RNA types is provided at /BioII/chenxupeng/student/bin/read.sh; it produces a file with the read-count information. That file can be read with Python and condensed into a tidier format for visualization. In addition, the folder /BioII/chenxupeng/student/data/other_annotations/counts/ holds pre-computed read-count files for all samples other than the five above, in the same format as the output of read.sh.

```python
def get_counts(samplename):
    '''
    samplename: Sample_N14
    '''
    df = pd.read_table('/BioII/chenxupeng/student/data/other_annotations/counts/'+samplename+'.readsN.stat.tsv',
                       names=['sample','method','type','counts']).pivot(index='type', columns='sample', values='counts')
    return df

get_counts('Sample_N14')
```
The file with the read-count information looks like the following. The value 13178 in the second row (cleanN) is the number of reads of Sample_N1 left after the cutadapt step of preprocessing. The entries in the third column mean: libSizeN is the number of reads in the raw data; rRNA_N is the number of reads mapped to the rRNA index; keptN is the number of reads remaining in the fastq file after rRNA removal; under sequentialMap, < some type of RNA > is the number of reads mapped to the index of that RNA type; hg38other is the number of reads mapped to hg38; and nonHuman_N is the number of unmapped reads:
| sample | method | type | counts |
| ------ | ------ | ---- | ------ |
| Sample_N1 | preprocess | libSizeN | 13190 |
| Sample_N1 | preprocess | cleanN | 13178 |
| Sample_N1 | preprocess | rRNA_N | ... |
| Sample_N1 | preprocess | keptN | ... |
| Sample_N1 | sequentialMap | miRNA | ... |
| Sample_N1 | sequentialMap | piRNA | ... |
| Sample_N1 | sequentialMap | Y_RNA | ... |
| Sample_N1 | sequentialMap | snRNA | ... |
| Sample_N1 | sequentialMap | snoRNA | ... |
| Sample_N1 | sequentialMap | srpRNA | ... |
| Sample_N1 | sequentialMap | tRNA | ... |
| Sample_N1 | sequentialMap | lncRNA | ... |
| Sample_N1 | sequentialMap | mRNA | ... |
| Sample_N1 | sequentialMap | tucp | ... |
| Sample_N1 | map2hg38other | hg38other | ... |
| Sample_N1 | map2hg38other | nonHuman_N | ... |
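As an illustration of how such a table feeds the ratio plots later on, a sketch that turns the counts into per-sample RNA-type fractions; the choice of denominator (cleanN) and of which types to include is our assumption:

```python
counts = get_counts('Sample_N14')
sample = counts.columns[0]

rna_types = ['miRNA', 'piRNA', 'Y_RNA', 'srpRNA', 'tRNA',
             'snRNA', 'snoRNA', 'lncRNA', 'mRNA', 'tucp']

# Fraction of clean reads assigned to each RNA type
ratios = counts.loc[rna_types, sample] / counts.loc['cleanN', sample]
print(ratios.sort_values(ascending=False))
```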
Build the expression matrix for the five samples Sample_N1, Sample_N7, Sample_N13, Sample_N19, Sample_N25 from the BAM files produced by mapping, and use the counts of Sample_N1 and Sample_N7 to check whether the mapping and the expression-matrix construction are correct.

inputs

| File format | Information contained in file | File description | Notes |
| ----------- | ----------------------------- | ---------------- | ----- |
| bam | alignments | Produced by mapping reads to the transcriptome. | Reads are trimmed using a proprietary version of cutAdapt. We map to transcriptome for a better sensitivity (see details in protocol and example). |

outputs

| File format | Information contained in file | File description | Notes |
| ----------- | ----------------------------- | ---------------- | ----- |
| tsv | gene (ncRNA) quantifications | Non-normalized counts. | |
- FeatureCounts

For the `<sample>.<some type of RNA>.rsem.clean.bam` files of the different samples and RNA types produced by the mapping step, compute the raw counts (hg38other does not need to be counted); the results can be written to `.../04.counts/<sample>/<sample>.<some type of RNA>.featureCounts.counts`.

After the reads are aligned to the RNA indexes, every read carries a position. This position must be converted from transcript coordinates to genome coordinates before featureCounts can compare the reads against the gtf and count the reads per transcript_id.

Input1:

`...02.mapping/*.no_<some type of RNA>/rsem_bam/<sample>.<some type of RNA>.rsem.clean.bam`

`<annotation_file>`: `/BioII/chenxupeng/student/data/gtf/<some type of RNA>.gtf`

Software usage:

```
featureCounts -t exon -g transcript_id -s 1 -a <annotation_file> -o <output_file> input_file1
```

Output:

`<sample>.<some type of RNA>.featureCounts.counts`
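A hedged sketch of running featureCounts over all samples and RNA types from Python; the folder layout, the glob for the numbered mapping folders, and the list of counted types are assumptions based on the structure above:

```python
import subprocess, os, glob

samples = ['Sample_N1', 'Sample_N7', 'Sample_N13', 'Sample_N19', 'Sample_N25']
rna_types = ['miRNA', 'piRNA', 'Y_RNA', 'srpRNA', 'tRNA',
             'snRNA', 'snoRNA', 'lncRNA', 'mRNA', 'tucp']  # adjust to the types you count
gtf_dir = '/BioII/chenxupeng/student/data/gtf'

for sample in samples:
    os.makedirs(f'output/04.counts/{sample}', exist_ok=True)
    for rna in rna_types:
        # the mapping folders are numbered (2.no_miRNA, ...), so match them with a glob
        bams = glob.glob(f'output/02.mapping/*.no_{rna}/rsem_bam/{sample}.{rna}.rsem.clean.bam')
        if not bams:
            continue
        out = f'output/04.counts/{sample}/{sample}.{rna}.featureCounts.counts'
        subprocess.run(['featureCounts', '-t', 'exon', '-g', 'transcript_id', '-s', '1',
                        '-a', f'{gtf_dir}/{rna}.gtf', '-o', out, bams[0]], check=True)
```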
The previous step gave us the raw counts for each sample and RNA type; these files now have to be merged into one. The merging code is at /BioII/chenxupeng/student/bin/merge.sh; `proj_exRNA.featureCounts.counts.merged.mx` is the file we need.

Use the expression-matrix data of Sample_N1 and Sample_N7 together with the reference data for the corresponding two samples in /BioII/chenxupeng/student/data/expression_matrix/GSE71008.txt to compute correlation coefficients as a check on the results. The Pearson correlation coefficient can be used to measure the correlation:

```python
from scipy.stats import pearsonr
pearsonr(X, Y)
```

Reference Python code is at /BioII/chenxupeng/student/bin/corr.py.
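A minimal sketch of that check (the provided corr.py is the authoritative version); the file orientation and the shared feature index are assumptions:

```python
mx = pd.read_table('proj_exRNA.featureCounts.counts.merged.mx', index_col=0)
ref = pd.read_table('/BioII/chenxupeng/student/data/expression_matrix/GSE71008.txt', index_col=0)

common = mx.index.intersection(ref.index)   # compare only the shared features
for sample in ['Sample_N1', 'Sample_N7']:
    r, p = pearsonr(mx.loc[common, sample], ref.loc[common, sample])
    print('{}: r = {:.4f} (p = {:.2e})'.format(sample, r, p))
```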
*Reminder: the steps above build the matrix (mx) for the five samples. During the subsequent normalization and related steps, the following problem may appear: two gtf files contain one identical exon annotation, so the resulting mx contains two rows with the same transcript. When merging these counts with the counts of the other samples that already exist under the path into one mx, such duplicates make the concatenation fail with an error. Solution:

```python
# For reference only
# Locate duplicated `transcript` entries: the second line returns the duplicated
# positions, the third line removes the offending row
judge = mx.duplicated('geneID')
judge[judge==1]
mx = mx.drop([821794])
mx = mx.set_index('geneID')
mx.shape
```

Reference code for the merge:

```python
# Note: join_axes was removed in pandas >= 1.0, where the equivalent is
# pd.concat([ref, mx], axis=1).reindex(ref.index)
pd.concat([ref, mx], axis=1, join_axes=[ref.index])
```
Because our data come from small-RNA sequencing, besides building the expression matrix from full-length features, another approach is to use peak calling to obtain domain features for the expression matrix, e.g. calling peaks with tools such as Piranha. After mapping, for each gene the reads map to, we analyze the positions of the reads along the full length, find the peaks of read coverage, and use the peak regions instead of the full-length regions as the features of the expression matrix.

We provide an expression matrix built from domain features for all samples, excluding miRNA and piRNA (think about why). In the later steps you can compare the classification performance of the full-length and domain features, or even combine the two kinds of features and compare the results; extra credit will be given in proportion to the work.

Matrix location: /BioII/chenxupeng/student/data/expression_matrix/domains_05.txt

In addition, to make your work easier, an expression matrix that merges the domain features of all samples with the full-length features of miRNA and piRNA is provided at /BioII/chenxupeng/student/data/expression_matrix/domains_combined.txt.

Note: the five samples you mapped yourself in the previous step must also be included in the statistics; do not restrict the statistics to the samples that were already provided.
- Plot the proportions of reads of the different RNA types as a pie chart
- Plot the length distribution of the mapped reads for the different RNA types
- For one dataset, tabulate the counts of the different RNA types across samples
- For one dataset, tabulate the counts of one RNA type across the different samples
- Analyze the proportion of each RNA type within every sample
pie plot of ratio

```python
rnanames = ['miRNA', 'piRNA', 'Y_RNA', 'snRNA', 'srpRNA', 'tRNA',
            'lncRNA', 'mRNA', 'other_genomic_region', 'non_human']
rna_ratio = pd.read_table(file1)
x = rnanames
colours = tableau20[:len(x)]   # tableau20 was already scaled to 0-1 above
y = rna_ratio.mean()
z = np.array([float('{:.4f}'.format(y[i])) for i in range(y.shape[0])])*100
fig1, ax1 = plt.subplots(figsize=(10, 10))
patches, texts = ax1.pie(y, colors=colours,  # autopct='%1.1f%%',
                         shadow=True, startangle=90)
labels = ['{0} - {1:1.2f} %'.format(i, j) for i, j in zip(x, z)]
sort_legend = True
if sort_legend:
    patches, labels, dummy = zip(*sorted(zip(patches, labels, y),
                                         key=lambda x: x[2],
                                         reverse=True))
plt.legend(patches, labels, loc='center', bbox_to_anchor=(1.1, .7),
           fontsize=8)
```
Alternatively, with bokeh:

```python
def plot_pie(data, rnanames):
    '''
    data: table_ratio
    rnanames: rna type names
    adjustment: merge RNA with small percent together
    '''
    from math import pi   # needed for the angle computation below
    from bokeh.io import output_file, show
    from bokeh.palettes import Category20c
    from bokeh.plotting import figure
    from bokeh.transform import cumsum
    x = np.array(rnanames)
    y = np.array(data.loc[:, x].mean())
    z = np.array([float('{:.10f}'.format(y[i])) for i in range(y.shape[0])])*100
    labels = rnanames
    dataframe = pd.DataFrame(np.concatenate((x.reshape(-1, 1), z.reshape(-1, 1)), axis=1))
    dataframe.columns = ['rna', 'percent']
    dataframe["percent"] = pd.to_numeric(dataframe["percent"])
    dataframe['angle'] = dataframe['percent']/dataframe['percent'].sum() * 2*pi
    dataframe['color'] = Category20c[len(x)]
    p = figure(plot_height=400, title="Pie Chart", toolbar_location=None,
               tools="hover", tooltips="@rna: @percent", x_range=(-0.5, 1.0))
    p.wedge(x=0, y=1, radius=0.4,
            start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
            line_color="white", fill_color='color', legend="rna", source=dataframe)
    p.axis.axis_label = None
    p.axis.visible = False
    p.grid.grid_line_color = None
    show(p)
```
line plot of length

```python
lengthdata = pd.read_table(file2)
length = np.array(lengthdata.T)
fig, ax = plt.subplots(10, 1, figsize=(20, 50))
for i in range(length.shape[0]):
    ax[i].plot(length[i], label=rnanames[i], color=colours[i])
    ax[i].legend(loc='upper right')
    ax[i].set_xticklabels(np.arange(lengthdata.index[0]-5, lengthdata.index[-1], 5))
```
boxplot of ratio

```python
fig, ax = plt.subplots(figsize=(100, 20))
sns.boxplot(data=rna_ratio, ax=ax, boxprops=dict(alpha=.5))
ax.set_title(u'RNA percentage in different samples', fontsize=80)
ax.set_xticks(range(10))
ax.set_xticklabels(rnanames, fontsize=40)
ax.set_yticks(np.arange(0, 1, 0.1))
ax.set_yticklabels(['{:.1f}%'.format(i*10) for i in range(10)], fontsize=40)
```
table with color

```python
def gradient_func(val):
    return '<span style="background: linear-gradient(90deg, #d65f5f {0}%, transparent 0%)">{0:.3f}</span>'.format(val)

table_percent = table_ratio * 100
table_percent.style.format(gradient_func)
```
To make the alignment results more convincing, we defined a set of criteria for sample quality control based on the proportions of reads of the different RNA types:

| Check point | Threshold | Notes |
| ----------- | --------- | ----- |
| Raw reads quality | reads quality > 28 (median lines in green area) | Check fastqc results (*.html) |
| Clean reads number | > 10 million | Reads left after adaptors and too-short sequences are removed |
| rRNAs% | < 10% | Reads mapped to rRNAs (all following % are relative to the total number of clean reads) |
| HG% | > 60% (optional) | Reads mapped to Human Genome except rRNAs |
| Transcriptome% | > 50% | Reads mapped to Human Transcriptome (including rRNA, miRNA, piRNA, Y RNA, srpRNA, snRNA, snoRNA, tRNA, mRNA exons, lncRNA exons, TUCP exons) |
| Y RNA% | judge from the data | Reads mapped to Y RNA |
| miRNA% | judge from the data | Reads mapped to miRNA |

Based on these criteria, or on thresholds derived from your own observations of the data, perform quality control on the samples, and consider visualizing the QC conditions. Do not remove samples lightly; give a reason for excluding any sample.
Outliers revealed by PCA or t-SNE visualization can also serve as a basis for sample QC, as in the sketch below.
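A minimal sketch of such a check, assuming `mx` is a feature-by-sample expression matrix like the merged one above:

```python
# Log-transform and standardize, then project the samples onto the first two PCs
X = np.log2(mx.T + 1)                      # samples as rows; +1 avoids log(0)
X = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(X_pca[:, 0], X_pca[:, 1], s=20)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
# Samples lying far from the main cloud are candidate outliers
```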

Note that the normalization here corrects per-sample systematic biases (such as sequencing depth); normalizing features (e.g. scaling each column to 0-1) should be done in the next step, feature selection.
Comparison of different normalization strategies (a sketch of the first two follows this list):

- Use CPM (counts per million)
- Or scale by possible endogenous reference genes:
  - gene name: 'MIR1228', 'MIR16-1', 'MIR16-2', 'MIR21', 'MIR23A', 'MIR23B', 'MIR23C', 'MIR451A', 'MIR15A', 'MIR15B'
  - gene ID: ENST00000408438.1, ENST00000385271.1, ENST00000607334.3, ENST00000385059.1, ENST00000362134.1, ENST00000385245.1, ENST00000385045.1, ENST00000362117.1, ENST00000384832.1, ENST00000579846.3
- Remove piRNA and miRNA first and then use CPM (counts per million)
- Use packages such as SCnorm, RLE, TMM
- Scale the top-k genes by expression and the remaining genes separately
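A minimal sketch of the first two strategies, assuming `mx` is a feature-by-sample raw-count DataFrame and that the reference-gene IDs listed above occur in its index:

```python
def cpm(mx):
    """Counts per million: divide each sample (column) by its library size."""
    return mx / mx.sum(axis=0) * 1e6

def scale_by_reference(mx, ref_genes):
    """Scale each sample by the summed counts of the reference genes."""
    size_factor = mx.loc[ref_genes].sum(axis=0)
    return mx / size_factor

mx_cpm = cpm(mx)
ref_genes = ['ENST00000385271.1', 'ENST00000607334.3', 'ENST00000385059.1']  # example subset
mx_ref = scale_by_reference(mx, ref_genes)
```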

Choice of reference genes

We can draw density plots or violin plots to examine the coefficient of variation of the candidate reference genes and choose stable miRNAs with a small coefficient of variation as the reference. One can see that MIR1228 and MIR15B have rather large coefficients of variation and are not stable enough, so they should not be used as reference genes.
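A short sketch of this selection, computing the coefficient of variation (std/mean across samples) of the candidate reference genes on the CPM matrix from the sketch above; the ID list repeats the candidates named earlier:

```python
ref_ids = ['ENST00000408438.1', 'ENST00000385271.1', 'ENST00000607334.3',
           'ENST00000385059.1', 'ENST00000362134.1', 'ENST00000385245.1',
           'ENST00000385045.1', 'ENST00000362117.1', 'ENST00000384832.1',
           'ENST00000579846.3']

expr = np.log2(mx_cpm.loc[ref_ids] + 1)      # log scale stabilizes the spread
cv = expr.std(axis=1) / expr.mean(axis=1)    # coefficient of variation per gene
print(cv.sort_values())

# A violin plot shows each candidate's spread across samples at a glance
fig, ax = plt.subplots(figsize=(12, 4))
sns.violinplot(data=expr.T, ax=ax)
ax.set_ylabel('log2(CPM+1)')
```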


visualize batch effect

Methods for removing batch effects:

- RUVs, where the number of factors can be set
- ComBat, which requires the batch information to be given
PCA and t-SNE give an intuitive picture of how strongly the samples currently cluster, but they do not quantify it, and in particular make comparisons hard. We therefore provide the following functions to quantify how well samples of the same class cluster, for both binary and multi-class labels; the closer the value is to 1, the more tightly samples of the same class cluster. With these scores you can quantify the effect of your imputation, normalization, and batch-effect-removal methods.
```python
def alignment_score(X, y, K=10):
    # Rescaled mean same-class neighbor count among each point's K nearest
    # neighbors (0 when all K neighbors share the class, ~1 near the random baseline)
    N = X.shape[0]
    nn = NearestNeighbors(n_neighbors=K)
    nn.fit(X)
    distances, indices = nn.kneighbors(X, n_neighbors=K + 1)
    neighbor_classes = np.take(y, indices[:, 1:])   # drop each point itself
    same_class_fractions = np.sum(neighbor_classes == y[:, np.newaxis], axis=1)
    score = 1.0 - (np.mean(same_class_fractions) - K/N)/(K - K/N)
    print(same_class_fractions.shape, np.mean(same_class_fractions), K/N, neighbor_classes)
    return score

def knn_score(X, y, K=10):
    # Per-sample count of same-class neighbors, normalized between its random
    # expectation (mean_r) and its maximum (max_r); 1 means perfectly separated classes
    N = X.shape[0]
    assert K < N
    nn = NearestNeighbors(n_neighbors=K)
    nn.fit(X)
    distances, indices = nn.kneighbors(X, n_neighbors=K + 1)
    neighbor_classes = np.take(y, indices[:, 1:])
    same_class_fractions = np.sum(neighbor_classes == y[:, np.newaxis], axis=1)
    classes, counts = np.unique(y, return_counts=True)
    classes = np.argmax(y.reshape((-1, 1)) == classes.reshape((1, -1)), axis=1)
    counts = np.take(counts, classes)
    mean_r = K/(N - 1)*counts
    max_r = np.minimum(K, counts)
    scores = (same_class_fractions - mean_r)/(max_r - mean_r)
    return scores.mean()
```
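For example (with hypothetical variable names), the scores can be used to compare two normalization strategies on the same class labels:

```python
# X_cpm, X_ref: sample-by-feature matrices after two normalization strategies;
# labels: integer class labels per sample (hypothetical names)
print('CPM:       ', knn_score(np.asarray(X_cpm), np.asarray(labels)))
print('reference: ', knn_score(np.asarray(X_ref), np.asarray(labels)))
# The strategy with the higher score keeps same-class samples closer together
```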
```python
def convert_label_to_int(sample_class):
    # Map arbitrary class labels to integer codes 0..n_classes-1
    classes, counts = np.unique(sample_class, return_counts=True)
    classes = np.argmax(sample_class.reshape((-1, 1)) == classes.reshape((1, -1)), axis=1)
    return classes

def unsupervised_clustering_accuracy(y, y_pred):
    # Best accuracy over all one-to-one matchings between predicted and true
    # clusters (Hungarian algorithm). Note: sklearn.utils.linear_assignment_ was
    # removed in scikit-learn >= 0.23; scipy.optimize.linear_sum_assignment is the
    # modern replacement.
    from sklearn.utils.linear_assignment_ import linear_assignment
    assert len(y_pred) == len(y)
    u = np.unique(np.concatenate((y, y_pred)))
    n_clusters = len(u)
    mapping = dict(zip(u, range(n_clusters)))
    reward_matrix = np.zeros((n_clusters, n_clusters), dtype=np.int64)
    for y_pred_, y_ in zip(y_pred, y):
        if y_ in mapping:
            reward_matrix[mapping[y_pred_], mapping[y_]] += 1
    cost_matrix = reward_matrix.max() - reward_matrix
    ind = linear_assignment(cost_matrix)
    return sum([reward_matrix[i, j] for i, j in ind]) * 1.0 / y_pred.size, ind

def clustering_scores(X, y, prediction_algorithm='knn'):
    from sklearn.metrics import adjusted_rand_score as ARI
    from sklearn.metrics import normalized_mutual_info_score as NMI
    from sklearn.metrics import silhouette_score
    from sklearn.mixture import GaussianMixture as GMM
    from sklearn.cluster import KMeans
    cluster_num = np.unique(y).shape[0]
    if prediction_algorithm == 'knn':
        labels_pred = KMeans(cluster_num, n_init=200).fit_predict(X)
    elif prediction_algorithm == 'gmm':
        gmm = GMM(cluster_num)
        gmm.fit(X)
        labels_pred = gmm.predict(X)
    labels = y
    asw_score = silhouette_score(X, labels)
    nmi_score = NMI(labels, labels_pred)
    ari_score = ARI(labels, labels_pred)
    labels_int = convert_label_to_int(labels)
    uca_score = unsupervised_clustering_accuracy(labels_int, labels_pred)[0]
    return asw_score, nmi_score, ari_score, uca_score

def get_clustering_score(data, sampleclass, method_PCA=True, prediction_algorithm='knn'):
    # Log-transform and standardize the matrix, reduce its dimension with PCA or
    # t-SNE, then score how well the samples cluster by class
    X = np.log2(data + 0.001).T
    X = StandardScaler().fit_transform(X)
    if method_PCA == True:
        transform = PCA()
    else:
        transform = TSNE()
    X_pca = transform.fit_transform(X)
    # The snippet is truncated here in the original; presumably (our assumption) it
    # continues along the lines of:
    # return clustering_scores(X_pca, sampleclass, prediction_algorithm)
```