# 6.1.Alternative Splicing

* 分析可变剪接的工具本质上还是在对基因表达进行定量，只不过在定量的过程中对不同的可变剪接事件进行了区分。
* 我们前面已经介绍过，RNA-seq常见的分析一类是在基因水平上直接count有多少reads落在一个gene的exons对应的genomic interval上，如featureCounts；另一类是利用一些统计方法在给出转录本水平上的丰度估计，如cufflinks,Rsem等。
* 类似的，用于可变剪接分析的工具也可以分为两类。一类是只考虑能确定的assign给一个可变剪接事件的reads，直接count对应splicing事件和retention事件的reads，然后计算它们的相对比例。我们这里主要介绍的[rMATS](http://rnaseq-mats.sourceforge.net/rmats4.0.2/user_guide.htm)就属于这类工具。另外一类工具基本上和cufflinks,Rsem这类工具使用的方法是相似的，但是针对可变剪接的分析提供了更多的功能，例如[MISO](https://miso.readthedocs.io/en/fastmiso/#)。

## 1) Pipeline

![](/files/-LPVvF90wS88PggGWrp8)

## 2) Data Structure

### 2a) getting software & data

#### 方法1: 使用docker

1. install software (already available in Docker，docker images的下载链接如[附表](/teaching/appendix/appendix-iv.-teaching.md#teaching-docker)所示)

   [rMATS](http://rnaseq-mats.sourceforge.net/rmats4.0.2/user_guide.htm)
2. Get data (already available in Docker)，我们使用 [PRJNA130865](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA130865) 中的两个样本：
   * [SRR065545](https://www.ebi.ac.uk/ena/data/view/SRR065545): C2C12 with control shRNA vector
   * [SRR065544](https://www.ebi.ac.uk/ena/data/view/SRR065544): C2C12 with shRNA against CUGBP1

> 我们已经准备好注释`` `.gtf ``文件和map好的`.bam` 文件（仅包含 mapping 到 X 染色体上的部分），位于 Docker 中的 `/home/test/alter-spl/input`。
>
> 如果希望从头做起，读者也可以点击上述相应链接下载原始的 FASTQ 文件； *Mus musculus* 的基因组注释文件可以从Ensembl下载： [GRCm38/mm10](ftp://ftp.ensembl.org/pub/release-93/gtf/mus_musculus/Mus_musculus.GRCm38.93.gtf.gz)。

#### 方法2: 自行下载和安装

1. Install software：[rMATS](http://rnaseq-mats.sourceforge.net/rmats4.0.2/user_guide.htm)
2. Download data: [link](https://lulab2.gitbook.io/teaching/appendix/appendix-iv.-teaching);

### 2b) input

| Format | Description          | Notes |
| ------ | -------------------- | ----- |
| `.bam` | 将样本中的 Reads 比对到参考基因组 | -     |
| `.gtf` | 参考基因组注释文件            | -     |

### 2c) output

| Format   | Description                                                                | Notes |
| -------- | -------------------------------------------------------------------------- | ----- |
| many TSV | all possible alternative splicing (AS) events derived from GTF and RNA-seq | -     |

详细说明请参见 <http://rnaseq-mats.sourceforge.net/rmats4.0.2/user_guide.htm#output>

## 3) Running Steps

和之前章节一样，首先进入到容器：

```bash
docker exec -it bioinfo_tsinghua bash
```

以下步骤均在 `/home/test/alter-spl/` 下进行:

```bash
cd /home/test/alter-spl/
```

### 3a) check read length

```bash
samtools view input/SRR065544_chrX.bam | cut -f 10 | \
    perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
```

```
1448805 35
```

```bash
samtools view input/SRR065545_chrX.bam | cut -f 10 | \
    perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
```

```
1964089 35
```

也就是说 read length 均为 35

### 3b) run

```bash
echo "input/SRR065544_chrX.bam" > input/b1.txt
echo "input/SRR065545_chrX.bam" > input/b2.txt
```

```bash
python2 /usr/local/rMATS-turbo-Linux-UCS4/rmats.py \
    --b1 input/b1.txt --b2 input/b2.txt --gtf input/Mus_musculus_chrX.gtf --od output \
    -t paired --readLength 35
```

第二行指定输入和输出文件（夹）。\
第三行是一些必需参数：

* 这里我们的数据是 paired-end, 所以选择 `-t paired`
* 根据第一步，我们指定 `--readLength 35`

### 3c) output

输出文件位于 `output/` 中。

最重要的是以下两类文件：

* `AS_Event.MATS.JC.txt`: evaluates splicing with only reads that span splicing junctions
* `AS_Event.MATS.JCEC.txt`: evaluates splicing with reads that span splicing junctions and reads on target (striped regions on [MATS home page figure](http://rnaseq-mats.sourceforge.net/rmats4.0.2/splicing.jpg))

其中，`AS_Event` 包含以下几种：

1. `A5SS`: alternative 5' splice site
2. `A3SS`: alternative 3' splice site
3. `SE`: skipped exon
4. `MXE`: mutually exclusive exons
5. `RI`: retained intron

For example, `A5SS.MATS.JC.txt` includes alternative 5' splice site (A5SS) using only reads that span splicing junctions:

```
ID    GeneID    geneSymbol    chr    strand    longExonStart_0base    longExonEnd    shortES    shortEE    flankingES    flankingEE    ID    IJC_SAMPLE_1    SJC_SAMPLE_1    IJC_SAMPLE_2    SJC_SAMPLE_2    IncFormLen    SkipFormLen    PValue    FDR    IncLevel1    IncLevel2    IncLevelDifference
2    "ENSMUSG00000004221"    "Ikbkg"    chrX    +    74427761    74427902    74427761    74427874    74432804    74433006    2    1    0    0    17    61    34    9.36092704494e-05    0.000468046352247    1.0    0.0    1.0
20    "ENSMUSG00000025332"    "Kdm5c"    chrX    +    152271859    152271938    152271859    152271929    152272074    152272274    20    0    2    4    9    42    34    0.00591200967211    0.00985334945352    0.0    0.265    -0.265
63    "ENSMUSG00000031167"    "Rbm3"    chrX    -    8143246    8143332    8143250    8143332    8142367    8142955    63    26    140    51    145    37    34    0.0299275020479    0.0374093775599    0.146    0.244    -0.098
84    "ENSMUSG00000037369"    "Kdm6a"    chrX    +    18277375    18277563    18277375    18277546    18278625    18279936    84    4    5    0    5    50    34    0.00235425083108    0.0058856270777    0.352    0.0    0.352
124    "ENSMUSG00000025283"    "Sat1"    chrX    -    155215119    155215684    155215600    155215684    155214032    155214134    124    2    23    4    32    68    34    0.549749061659    0.549749061659    0.042    0.059    -0.017
```

其中最重要的列意义如下：

|                    |                                                                                                            |
| ------------------ | ---------------------------------------------------------------------------------------------------------- |
| IncFormLen         | length of inclusion form, used for normalization                                                           |
| SkipFormLen        | length of skipping form, used for normalization                                                            |
| P-Value            | Significance of splicing difference between two sample groups. (Only available if statistical model is on) |
| FDR                | False Discovery Rate calculated from p-value. (Only available if statistical model is on)                  |
| IncLevel1          | inclusion level for SAMPLE\_1 replicates (comma separated) calculated from normalized counts               |
| IncLevel2          | inclusion level for SAMPLE\_2 replicates (comma separated) calculated from normalized counts               |
| IncLevelDifference | average(IncLevel1) - average(IncLevel2)                                                                    |

## 4) Tips/Utilities

### 4a) 准备bam文件

可变剪接分析需要用到的是普通的RNA-seq数据，所以用我们前面介绍的STAR和hisat2都是可以的，mapping的参数通常也不需要特殊的设置。我们这里提供了一个hisat2的例子。

(1) install [**hisat**](ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip), **bamtools**, **samtools**

hisat 下载后解压到当前目录下，另外两个软件在 Docker 中已经装好

(2) 基因组和基因组注释

注意gtf文件的坐标应当和fasta文件是对应的

* [chromFa.tar.gz](https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz)
* [Musmusculus.GRCm38.93.gtf.gz](ftp://ftp.ensembl.org/pub/release-93/gtf/mus_musculus/Mus_musculus.GRCm38.93.gtf.gz)

(3) 下载原始数据

* <ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065544/SRR065544\\_1.fastq.gz>
* <ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065544/SRR065544\\_2.fastq.gz>
* <ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065545/SRR065545\\_1.fastq.gz>
* <ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065545/SRR065545\\_2.fastq.gz>

(4) now your working directory looks like this

```
.
├── chromFa.tar.gz
├── hisat2-2.1.0
├── Mus_musculus.GRCm38.93.gtf.gz
├── SRR065544_1.fastq.gz
├── SRR065544_2.fastq.gz
├── SRR065545_1.fastq.gz
├── SRR065545_2.fastq.gz
```

(5) make hisat index

```bash
# extract X chromosome sequence
tar -xz -f chromFa.tar.gz chrX.fa
mv chrX.fa Mus_musculus_chrX.fa

# use only X chromosome
zcat Mus_musculus.GRCm38.93.gtf.gz | grep -P '(#!)|(X\t)' > Mus_musculus_chrX.gtf

# make hisat index
hisat2-2.1.0/hisat2_extract_splice_sites.py Mus_musculus_chrX.gtf > Mus_musculus_chrX.ss
hisat2-2.1.0/hisat2_extract_exons.py        Mus_musculus_chrX.gtf > Mus_musculus_chrX.exon

mkdir hisat2_indexes
hisat2-2.1.0/hisat2-build -p 4 \
    --ss Mus_musculus_chrX.ss --exon Mus_musculus_chrX.exon \
    Mus_musculus_chrX.fa hisat2_indexes/Mus_musculus_chrX
```

(6) mapping

```bash
# mapping
hisat2-2.1.0/hisat2 -p 4 --dta \
    -S SRR065544_chrX.sam -x hisat2_indexes/Mus_musculus_chrX \
    -1 SRR065544_1.fastq.gz -2 SRR065544_2.fastq.gz
hisat2-2.1.0/hisat2 -p 4 --dta \
    -S SRR065545_chrX.sam -x hisat2_indexes/Mus_musculus_chrX \
    -1 SRR065545_1.fastq.gz -2 SRR065545_2.fastq.gz

# covert to .bam
samtools sort -@ 4 -o SRR065544_chrX_raw.bam SRR065544_chrX.sam
samtools sort -@ 4 -o SRR065545_chrX_raw.bam SRR065545_chrX.sam

# filter only mapped reads
bamtools index -in SRR065544_chrX_raw.bam
bamtools index -in SRR065545_chrX_raw.bam

bamtools filter -isMapped true -in SRR065544_chrX_raw.bam \
    -out SRR065544_chrX.bam
bamtools filter -isMapped true -in SRR065545_chrX_raw.bam \
    -out SRR065545_chrX.bam
```

## 5) Homework

1. 为了鉴定 CUGBP1 对 mRNA isoform 的调控，科学家在 C2C12 小鼠成肌细胞（myoblast）中分别表达空载体（SRR065546）和含有干扰 CUGBP1 的 shRNA 的载体（SRR065547）。请同学们至[该链接](/teaching/appendix/appendix-iv.-teaching.md#teaching-files)中**Files needed by this Tutorial**中的清华云Bioinformatics Tutorial / Files路径下的相应文件夹中下载 `.bam` 输入文件（只含有 map 到 X 染色体的 reads），探索在 X 染色体上存在 differential alternative splicing 的基因。（需要上交代码和输出结果中所有以 `.MATS.JCEC.txt` 结尾的文件）
2. 阅读文献（"[rMATS: Robust and Flexible Detection of Differential Alternative Splicing from Replicate RNA-Seq Data](http://dx.doi.org/10.1073/pnas.1419161111)"），简要解释rMATS是如何对PSI(percentage spliced in)的组间差异进行统计检验的(只需解释Unpaired Replicates的情形即可)。

## 6) References

rMATS is introduced in "[rMATS: Robust and Flexible Detection of Differential Alternative Splicing from Replicate RNA-Seq Data](http://dx.doi.org/10.1073/pnas.1419161111)" in *PNAS*

目前已有大量的工具可以用于可变剪接的分析。读者可参考以下文献，探索其他的工具:

* [A survey of software for genome-wide discovery of differential splicing in RNA-Seq data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3903050/)
* [A survey of computational methods in transcriptome-wide alternative splicing analysis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5203768/)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://book.ncrnalab.org/teaching/part-iii.-ngs-data-analyses/6.rna-regulation/alternative-splicing.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
