# 1.1.Data Introduction

{% hint style="info" %}

## **数据介绍**

{% endhint %}

我们使用的[数据](https://www.nature.com/articles/srep19413)主要包括两种癌症和正常人样本，其中Colorectal Cancer, Prostate Cancer和Healthy Control的样本数量分别为99，36和50。数据存放公共目录为cnode服务器的`/BioII/chenxupeng/student/`目录。

* `data`目录下为已经建好的expression matrix，相应的label和annotation
* `data`目录下另外的文件夹中存放的文件是读者用于自己完成对五个正常人样本`Sample_N1, Sample_N7, Sample_N13, Sample_N19, Sample_N25`进行mapping和创建expression matrix等操作的。

## 1) mapping相关文件

路径：包括`/BioII/chenxupeng/student/data/`目录下的`hg38_index`, `raw_data`, `RNA_index`文件夹。

| data        | path                                                             |
| ----------- | ---------------------------------------------------------------- |
| `raw data`  | `/BioII/chenxupeng/student/data/raw_data/*.fastq`                |
| `hg38`      | `/BioII/chenxupeng/student/data/hg38_index/GRCh38.p10.genome.fa` |
| `gtf`       | `/BioII/chenxupeng/student/data/gtf`                             |
| `RNA index` | `/BioII/chenxupeng/student/data/RNA_index/`                      |

具体内容参考 **11.1 Helps:** ***mapping指南***

## 2) expression matrix

路径：`/BioII/chenxupeng/student/data/expression_matrix/`

expression matrix每一行为一个feature，每一列为一个样本，其中我们去掉了`Sample_N13, Sample_N19, Sample_N25`三个样本的相应数据，需要读者自己完成mapping和构建expression matrix（详见 **11.2 Requirement:** ***Expression Matrix***）。

```python
import pandas as pd
import numpy as np
scirepount = pd.read_table('data/expression_matrix/GSE71008.txt',sep=',',index_col=0)
```

```python
scirepount.iloc[:,:5].head()
```

|                                          | Sample\_1S10 | Sample\_1S11 | Sample\_1S12 | Sample\_1S13 | Sample\_1S14 |
| ---------------------------------------- | ------------ | ------------ | ------------ | ------------ | ------------ |
| transcript                               |              |              |              |              |              |
| ENST00000473358.1\|MIR1302-2HG-202\|1544 | 0            | 0            | 0            | 0            | 0            |
| ENST00000469289.1\|MIR1302-2HG-201\|843  | 0            | 0            | 0            | 0            | 0            |
| ENST00000466430.5\|AL627309.1-201\|31638 | 0            | 0            | 0            | 0            | 0            |
| ENST00000471248.1\|AL627309.1-203\|18221 | 0            | 0            | 0            | 0            | 0            |
| ENST00000610542.1\|AL627309.1-205\|12999 | 0            | 0            | 0            | 0            | 0            |

```python
scirepount.shape
```

```
(89619, 188)
```

## 3) sample labels

路径：`/BioII/chenxupeng/student/data/labels`

```python
scirep_samplenames = pd.read_table('data/labels/scirep_classes.txt',delimiter=',' , index_col=0)
```

```python
scirep_samplenames.head()
```

|              | label             |
| ------------ | ----------------- |
| sample\_id   |                   |
| Sample\_1S3  | Colorectal Cancer |
| Sample\_1S6  | Colorectal Cancer |
| Sample\_1S9  | Colorectal Cancer |
| Sample\_1S12 | Colorectal Cancer |
| Sample\_1S15 | Colorectal Cancer |

```python
delete_sample = ['Sample_N1','Sample_N7','Sample_N13','Sample_N19','Sample_N25']
check_sample = ['Sample_N1','Sample_N7']
```

```python
np.unique(scirep_samplenames['label'],return_counts=True)
```

```
(array(['Colorectal Cancer', 'Healthy Control', 'Pancreatic Cancer',
        'Prostate Cancer'], dtype=object), array([99, 50,  6, 36]))
```

## 4) other annotations

路径：`/BioII/chenxupeng/student/data/other_annotations`

### 4a) gene annotation

可以通过feature的transcript id找到feature的transcript\_name, gene\_type等信息

```python
geneannotation = pd.read_table('data/other_annotations/transcript_anno.txt')
```

```python
geneannotation.iloc[:,:5].head()
```

|   | chrom | start | end   | name              | score |
| - | ----- | ----- | ----- | ----------------- | ----- |
| 0 | chr1  | 14629 | 14657 | piR-hsa-18438     | 0     |
| 1 | chr1  | 17368 | 17436 | ENSG00000278267.1 | 0     |
| 2 | chr1  | 18535 | 18563 | piR-hsa-7508      | 0     |
| 3 | chr1  | 26805 | 26836 | piR-hsa-23387     | 0     |
| 4 | chr1  | 29553 | 31097 | ENSG00000243485.5 | 0     |

### 4b) batch信息

batch信息记录了对不同样本采取的不同实验条件，包括处理时间，处理材料的规格差异等，可能会造成同类样本的较大差异，称为batch effect。

对于exoRBase数据，每一种癌症样本均来自不同的实验室，因此其batch与样本类别重合。对于scirep数据和hcc数据，batch信息如下：

```python
scirepbatch = pd.read_csv('data/other_annotations/scirep_batch.txt',index_col=0)
```

```python
scirepbatch.head()
```

|             | RNA Isolation batch | library prepration day | gel cut size selection |
| ----------- | ------------------- | ---------------------- | ---------------------- |
| Sample\_1S1 | 2                   | 22                     | 7                      |
| Sample\_1S2 | 2                   | 22                     | 8                      |
| Sample\_1S3 | 2                   | 22                     | 1                      |
| Sample\_2S1 | 2                   | 22                     | 2                      |
| Sample\_2S2 | 2                   | 22                     | 3                      |

## 5) RNA type 统计信息

```python
scireprnastats = pd.read_csv('data/other_annotations/scirep_rna_stats.txt',index_col=0)
```

```python
scireprnastats.iloc[:,:5].head()
```

|           | Sample\_1S10 | Sample\_1S11 | Sample\_1S12 | Sample\_1S13 | Sample\_1S14 |
| --------- | ------------ | ------------ | ------------ | ------------ | ------------ |
| Y\_RNA    | 88835        | 127497       | 145142       | 90106        | 105377       |
| cleanN    | 9034303      | 10963430     | 11077344     | 10262615     | 11065325     |
| hg38other | 1462269      | 2044478      | 2624270      | 1476586      | 1806268      |
| libSizeN  | 11362190     | 13437632     | 13905951     | 12271219     | 13619701     |
| lncRNA    | 26733        | 38346        | 35639        | 25523        | 31489        |
