
1.1. Data Introduction

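The data consist mainly of samples from two cancer types plus healthy donors: Colorectal Cancer, Prostate Cancer, and Healthy Control, with 99, 36, and 50 samples respectively. (The label file described in section 3 also lists 6 Pancreatic Cancer samples.) The shared data directory is /BioII/chenxupeng/student/ on the cnode server.
  • The data directory holds the prebuilt expression matrix together with the corresponding labels and annotations.
  • The other folders under data hold the files readers need to carry out mapping and build the expression matrix themselves for five healthy-control samples: Sample_N1, Sample_N7, Sample_N13, Sample_N19, and Sample_N25.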

1) Files for mapping

Path: the hg38_index, raw_data, and RNA_index folders under /BioII/chenxupeng/student/data/.
data       path
raw data   /BioII/chenxupeng/student/data/raw_data/*.fastq
hg38       /BioII/chenxupeng/student/data/hg38_index/GRCh38.p10.genome.fa
gtf        /BioII/chenxupeng/student/data/gtf
RNA index  /BioII/chenxupeng/student/data/RNA_index/
For details, see 11.1 Helps: mapping guide.

2) expression matrix

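Path: /BioII/chenxupeng/student/data/expression_matrix/
In the expression matrix, each row is a feature and each column is a sample. The data for three samples (Sample_N13, Sample_N19, and Sample_N25) have been removed; readers need to run the mapping and build the expression matrix for these samples themselves (see 11.2 Requirement: Expression Matrix).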
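import pandas as pd
import numpy as np
# Load the comma-separated count matrix; the first column holds the feature IDs
scirepount = pd.read_table('data/expression_matrix/GSE71008.txt', sep=',', index_col=0)
# Preview the first five sample columns
scirepount.iloc[:, :5].head()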
transcript                              Sample_1S10  Sample_1S11  Sample_1S12  Sample_1S13  Sample_1S14
ENST00000473358.1|MIR1302-2HG-202|1544            0            0            0            0            0
ENST00000469289.1|MIR1302-2HG-201|843             0            0            0            0            0
ENST00000466430.5|AL627309.1-201|31638            0            0            0            0            0
ENST00000471248.1|AL627309.1-203|18221            0            0            0            0            0
ENST00000610542.1|AL627309.1-205|12999            0            0            0            0            0
scirepount.shape
(89619, 188)
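
The row labels in the preview above look like three fields joined by `|`. The following is a minimal sketch of splitting such a label, assuming the fields are transcript ID, transcript name, and transcript length; the source does not define the third field, so treat that reading as an assumption. `parse_feature_id` is a hypothetical helper, not part of the course material.

```python
def parse_feature_id(feature_id):
    """Split an 'id|name|length' row label into its three fields.

    Assumption: the third field is the transcript length.
    """
    transcript_id, transcript_name, length = feature_id.split("|")
    return {
        "transcript_id": transcript_id,
        "transcript_name": transcript_name,
        "length": int(length),
    }


# Row labels copied from the preview above
example_ids = [
    "ENST00000473358.1|MIR1302-2HG-202|1544",
    "ENST00000469289.1|MIR1302-2HG-201|843",
]
parsed = [parse_feature_id(fid) for fid in example_ids]
for record in parsed:
    print(record)
```

With the real matrix, the same helper could be applied to each entry of `scirepount.index`.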

3) sample labels

Path: /BioII/chenxupeng/student/data/labels
scirep_samplenames = pd.read_table('data/labels/scirep_classes.txt', delimiter=',', index_col=0)
scirep_samplenames.head()
sample_id    label
Sample_1S3   Colorectal Cancer
Sample_1S6   Colorectal Cancer
Sample_1S9   Colorectal Cancer
Sample_1S12  Colorectal Cancer
Sample_1S15  Colorectal Cancer
delete_sample = ['Sample_N1','Sample_N7','Sample_N13','Sample_N19','Sample_N25']
check_sample = ['Sample_N1','Sample_N7']
np.unique(scirep_samplenames['label'],return_counts=True)
(array(['Colorectal Cancer', 'Healthy Control', 'Pancreatic Cancer',
'Prostate Cancer'], dtype=object), array([99, 50, 6, 36]))
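
The label counts above sum to 99 + 50 + 6 + 36 = 191, while the expression matrix has 188 columns, which appears consistent with the three removed samples. The sketch below shows one way to list which sample IDs are labelled but absent from the matrix. `compare_samples` is a hypothetical helper and the toy IDs are only for illustration.

```python
def compare_samples(matrix_columns, label_ids):
    """Report sample IDs that appear in only one of the two collections."""
    matrix_set = set(matrix_columns)
    label_set = set(label_ids)
    return {
        "labelled_but_missing_from_matrix": sorted(label_set - matrix_set),
        "in_matrix_but_unlabelled": sorted(matrix_set - label_set),
    }


# Toy input: three labelled samples, one of which has no matrix column
toy_matrix_columns = ["Sample_1S3", "Sample_N1"]
toy_label_ids = ["Sample_1S3", "Sample_N1", "Sample_N13"]

report = compare_samples(toy_matrix_columns, toy_label_ids)
print(report)
```

On the real data, the call would be `compare_samples(scirepount.columns, scirep_samplenames.index)`.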

4) other annotations

Path: /BioII/chenxupeng/student/data/other_annotations

4a) gene annotation

A feature's transcript ID can be used to look up its transcript_name, gene_type, and other information.
geneannotation = pd.read_table('data/other_annotations/transcript_anno.txt')
# Preview the first five annotation columns
geneannotation.iloc[:, :5].head()
   chrom  start  end    name               score
0  chr1   14629  14657  piR-hsa-18438      0
1  chr1   17368  17436  ENSG00000278267.1  0
2  chr1   18535  18563  piR-hsa-7508       0
3  chr1   26805  26836  piR-hsa-23387      0
4  chr1   29553  31097  ENSG00000243485.5  0
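
As a rough illustration of the lookup described above, the sketch below indexes annotation records by their `name` field. Only the five columns shown in the preview are used; the remaining columns of the file are not shown in this section, so which column actually holds the transcript ID is left open here. `index_by_name` is a hypothetical helper.

```python
def index_by_name(records):
    """Map each record's 'name' field to the full record.

    If a name occurs more than once, the last record wins.
    """
    return {record["name"]: record for record in records}


# Two rows copied from the preview above
records = [
    {"chrom": "chr1", "start": 14629, "end": 14657, "name": "piR-hsa-18438", "score": 0},
    {"chrom": "chr1", "start": 17368, "end": 17436, "name": "ENSG00000278267.1", "score": 0},
]

by_name = index_by_name(records)
hit = by_name["ENSG00000278267.1"]
print(hit["chrom"], hit["start"], hit["end"])
```

A list of records in this shape can be obtained from the DataFrame with `geneannotation.to_dict("records")`.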

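4b) batch information

Batch information records the differing experimental conditions applied to the samples, such as processing time and differences in the specifications of the materials used. These can cause large differences between samples of the same class, which is known as the batch effect.
For the exoRBase data, the samples of each cancer type come from a different laboratory, so batch coincides with sample class. For the scirep and hcc data, the batch information is as follows: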
scirepbatch = pd.read_csv('data/other_annotations/scirep_batch.txt',index_col=0)
scirepbatch.head()
            RNA Isolation batch  library prepration day  gel cut size selection
Sample_1S1  2                    22                      7
Sample_1S2  2                    22                      8
Sample_1S3  2                    22                      1
Sample_2S1  2                    22                      2
Sample_2S2  2                    22                      3
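
To see whether a batch variable overlaps with sample class, as described for exoRBase above, one option is to count samples per (batch, label) pair. The sketch below does this with plain dictionaries; `batch_by_label` is a hypothetical helper and the toy sample IDs and values are invented for illustration, not taken from the data.

```python
from collections import Counter


def batch_by_label(batch_of, label_of):
    """Count samples for every (batch, label) pair.

    Only samples present in both mappings are counted.
    """
    shared = sorted(set(batch_of) & set(label_of))
    return Counter((batch_of[sample], label_of[sample]) for sample in shared)


# Toy mappings (invented values): sample ID -> batch, sample ID -> class label
toy_batch = {"S1": 1, "S2": 1, "S3": 2, "S4": 2}
toy_label = {
    "S1": "Colorectal Cancer",
    "S2": "Colorectal Cancer",
    "S3": "Healthy Control",
    "S4": "Colorectal Cancer",
}

counts = batch_by_label(toy_batch, toy_label)
for (batch, label), n in sorted(counts.items()):
    print(batch, label, n)
```

If every batch contained a single class, batch and class could not be separated; a mixed table like the toy one suggests they can. With DataFrames, `pd.crosstab` produces a similar table.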

5) RNA type statistics

scireprnastats = pd.read_csv('data/other_annotations/scirep_rna_stats.txt',index_col=0)
scireprnastats.iloc[:,:5].head()
           Sample_1S10  Sample_1S11  Sample_1S12  Sample_1S13  Sample_1S14
Y_RNA      88835        127497       145142       90106        105377
cleanN     9034303      10963430     11077344     10262615     11065325
hg38other  1462269      2044478      2624270      1476586      1806268
libSizeN   11362190     13437632     13905951     12271219     13619701
lncRNA     26733        38346        35639        25523        31489
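
Raw counts like those above are easier to compare across samples as fractions. The sketch below divides each RNA type count by a total row, assuming `cleanN` is the number of reads remaining after cleaning and `libSizeN` is the total library size. The source does not define these rows, so both readings and the choice of denominator are assumptions. `rna_fractions` is a hypothetical helper.

```python
def rna_fractions(counts, total_key="cleanN", skip=("cleanN", "libSizeN")):
    """Divide every RNA type count by the chosen total row.

    Rows listed in `skip` are treated as totals and left out of the result.
    """
    total = counts[total_key]
    return {key: value / total for key, value in counts.items() if key not in skip}


# Values for Sample_1S10 copied from the table above
sample_1s10 = {
    "Y_RNA": 88835,
    "cleanN": 9034303,
    "hg38other": 1462269,
    "libSizeN": 11362190,
    "lncRNA": 26733,
}

fractions = rna_fractions(sample_1s10)
for rna_type, fraction in fractions.items():
    print(f"{rna_type}: {fraction:.4%}")
```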