
6.1.Alternative Splicing


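  • Tools for alternative splicing analysis are, in essence, still quantifying gene expression; they simply distinguish between different alternative splicing events while doing so.

  • As introduced earlier, common RNA-seq analyses fall into two classes. One counts, at the gene level, how many reads fall within the genomic intervals of a gene's exons (for example featureCounts). The other uses statistical methods to estimate abundance at the transcript level (for example Cufflinks and RSEM).

  • Similarly, tools for alternative splicing analysis fall into two classes. The first considers only reads that can be unambiguously assigned to one alternative splicing event, directly counts the reads supporting the splicing form and the retention form, and computes their relative proportion. rMATS, the tool mainly introduced here, belongs to this class. The second class uses methods similar to those of Cufflinks and RSEM but offers more functionality for alternative splicing analysis; MISO is an example.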

1) Pipeline

2) Data Structure

2a) getting software & data

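Method 1: use Docker

The annotation .gtf file and the mapped .bam files (containing only reads mapped to the X chromosome) are already prepared in Docker under /home/test/alter-spl/input.

  • Software: already available in Docker (the download links for the Docker images are listed in the appendix table).

  • Data: already available in Docker. We use two samples from PRJNA130865 (SRR065544 and SRR065545): C2C12 with control shRNA vector, and C2C12 with shRNA against CUGBP1.

To start from scratch, you can also download the raw FASTQ files of these two runs; the Mus musculus genome annotation (GRCm38/mm10) can be downloaded from Ensembl.

Method 2: download and install yourself

Install rMATS and download the data yourself.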

2b) input

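| Format | Description | Notes |
|--------|-------------|-------|
| .bam | Reads from each sample aligned to the reference genome | - |
| .gtf | Reference genome annotation file | - |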

2c) output

| Format | Description | Notes |
|--------|-------------|-------|
| many TSV | all possible alternative splicing (AS) events derived from GTF and RNA-seq | - |

3) Running Steps

As in the previous chapters, first enter the container:

docker exec -it bioinfo_tsinghua bash

All of the following steps are run under /home/test/alter-spl/:

cd /home/test/alter-spl/

3a) check read length

samtools view input/SRR065544_chrX.bam | cut -f 10 | \
    perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
1448805 35
samtools view input/SRR065545_chrX.bam | cut -f 10 | \
    perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
1964089 35

In other words, every read in both samples is 35 nt long.
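The same tally can be done with awk alone. The sketch below is a minimal example, assuming SAM-formatted text on standard input; `read_length_hist` is a hypothetical helper name, and the three records are made-up toy data rather than reads from the tutorial BAM files.

```shell
# Hypothetical helper: tabulate read lengths from SAM text on stdin.
# Prints "length<TAB>count", sorted by length.
read_length_hist() {
    # Column 10 of a SAM record is the read sequence; header lines start with '@'.
    awk -F'\t' '!/^@/ { n[length($10)]++ } END { for (l in n) print l "\t" n[l] }' | sort -n
}

# Toy input (made up): one header line, two 5-nt reads and one 3-nt read.
printf '@HD\tVN:1.6\nr1\t0\tchrX\t1\t60\t5M\t*\t0\t0\tACGTA\tIIIII\nr2\t0\tchrX\t9\t60\t5M\t*\t0\t0\tTTGCA\tIIIII\nr3\t0\tchrX\t20\t60\t3M\t*\t0\t0\tGGC\tIII\n' \
    | read_length_hist
```

With the real data, something like `samtools view input/SRR065544_chrX.bam | read_length_hist` should report a single length, consistent with the counts shown above.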

3b) run

echo "input/SRR065544_chrX.bam" > input/b1.txt
echo "input/SRR065545_chrX.bam" > input/b2.txt
python2 /usr/local/rMATS-turbo-Linux-UCS4/rmats.py \
    --b1 input/b1.txt --b2 input/b2.txt --gtf input/Mus_musculus_chrX.gtf --od output \
    -t paired --readLength 35

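The second line of the rmats.py command specifies the input and output files (folders). The third line gives some required parameters:

  • Our data are paired-end, so we choose -t paired

  • Based on the first step, we set --readLength 35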
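rMATS reads each of --b1 and --b2 as a text file holding a comma-separated list of BAM paths, so a group with replicates goes on a single line. The sketch below shows one way to write such files; `write_group_file` and the BAM file names are hypothetical and only serve as an illustration (the tutorial data have one BAM per group).

```shell
# Hypothetical helper: write the given BAM paths to a file as one comma-separated line.
# Usage: write_group_file OUTPUT_FILE BAM [BAM ...]
write_group_file() {
    out=$1
    shift
    # Inside the subshell, IFS=, makes "$*" join the remaining arguments with commas.
    ( IFS=,; printf '%s\n' "$*" ) > "$out"
}

tmp=$(mktemp -d)
# Made-up file names standing in for replicate BAM files.
write_group_file "$tmp/b1.txt" ctrl_rep1.bam ctrl_rep2.bam
write_group_file "$tmp/b2.txt" kd_rep1.bam kd_rep2.bam
cat "$tmp/b1.txt" "$tmp/b2.txt"
```

The two files can then be passed to rmats.py through --b1 and --b2 exactly as in the command above.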

3c) output

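The output files are in output/. See the rMATS user guide for a detailed description.

The two most important kinds of file are:

  • AS_Event.MATS.JC.txt: evaluates splicing with only reads that span splicing junctions

  • AS_Event.MATS.JCEC.txt: evaluates splicing with reads that span splicing junctions and reads on target (striped regions on the MATS home page figure)

Here AS_Event is one of the following: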

  1. A5SS: alternative 5' splice site

  2. A3SS: alternative 3' splice site

  3. SE: skipped exon

  4. MXE: mutually exclusive exons

  5. RI: retained intron

For example, A5SS.MATS.JC.txt includes alternative 5' splice site (A5SS) using only reads that span splicing junctions:

ID    GeneID    geneSymbol    chr    strand    longExonStart_0base    longExonEnd    shortES    shortEE    flankingES    flankingEE    ID    IJC_SAMPLE_1    SJC_SAMPLE_1    IJC_SAMPLE_2    SJC_SAMPLE_2    IncFormLen    SkipFormLen    PValue    FDR    IncLevel1    IncLevel2    IncLevelDifference
2    "ENSMUSG00000004221"    "Ikbkg"    chrX    +    74427761    74427902    74427761    74427874    74432804    74433006    2    1    0    0    17    61    34    9.36092704494e-05    0.000468046352247    1.0    0.0    1.0
20    "ENSMUSG00000025332"    "Kdm5c"    chrX    +    152271859    152271938    152271859    152271929    152272074    152272274    20    0    2    4    9    42    34    0.00591200967211    0.00985334945352    0.0    0.265    -0.265
63    "ENSMUSG00000031167"    "Rbm3"    chrX    -    8143246    8143332    8143250    8143332    8142367    8142955    63    26    140    51    145    37    34    0.0299275020479    0.0374093775599    0.146    0.244    -0.098
84    "ENSMUSG00000037369"    "Kdm6a"    chrX    +    18277375    18277563    18277375    18277546    18278625    18279936    84    4    5    0    5    50    34    0.00235425083108    0.0058856270777    0.352    0.0    0.352
124    "ENSMUSG00000025283"    "Sat1"    chrX    -    155215119    155215684    155215600    155215684    155214032    155214134    124    2    23    4    32    68    34    0.549749061659    0.549749061659    0.042    0.059    -0.017

The most important columns are:

| Column | Description |
|--------|-------------|
| IncFormLen | length of inclusion form, used for normalization |
| SkipFormLen | length of skipping form, used for normalization |
| PValue | Significance of splicing difference between two sample groups. (Only available if statistical model is on) |
| FDR | False Discovery Rate calculated from p-value. (Only available if statistical model is on) |
| IncLevel1 | inclusion level for SAMPLE_1 replicates (comma separated) calculated from normalized counts |
| IncLevel2 | inclusion level for SAMPLE_2 replicates (comma separated) calculated from normalized counts |
| IncLevelDifference | average(IncLevel1) - average(IncLevel2) |
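To make the columns concrete, here is a minimal sketch with two hypothetical helpers. `inc_level` computes a length-normalized inclusion level as (IJC / IncFormLen) / (IJC / IncFormLen + SJC / SkipFormLen); this formula is an assumption on my part, although it reproduces the IncLevel values in the example rows above. `significant_events` filters a results table by FDR and |IncLevelDifference|; the thresholds 0.05 and 0.1 are arbitrary defaults, and the toy table copies only three columns from the example rows (with the quotes around gene symbols dropped).

```shell
# Hypothetical helper: inclusion level from junction counts and effective lengths.
# Usage: inc_level IJC SJC IncFormLen SkipFormLen
inc_level() {
    awk -v i="$1" -v s="$2" -v li="$3" -v ls="$4" 'BEGIN {
        inc = i / li; skip = s / ls            # length-normalized counts
        if (inc + skip == 0) { print "NA"; exit }
        printf "%.3f\n", inc / (inc + skip)
    }'
}

# Hypothetical helper: keep events with FDR <= max_fdr and |IncLevelDifference| >= min_diff.
# Usage: significant_events FILE [max_fdr] [min_diff]
significant_events() {
    awk -F'\t' -v fdr="${2:-0.05}" -v diff="${3:-0.1}" '
        NR == 1 { for (c = 1; c <= NF; c++) col[$c] = c; next }  # map header names to column numbers
        {
            d = $col["IncLevelDifference"] + 0
            if (d < 0) d = -d
            if ($col["FDR"] + 0 <= fdr && d >= diff)
                print $col["geneSymbol"] "\t" $col["FDR"] "\t" $col["IncLevelDifference"]
        }' "$1"
}

inc_level 4 9 42 34      # Kdm5c, SAMPLE_2: prints 0.265
inc_level 26 140 37 34   # Rbm3, SAMPLE_1: prints 0.146

# Toy table built from three columns of the example rows above.
toy=$(mktemp)
printf 'geneSymbol\tFDR\tIncLevelDifference\n' > "$toy"
printf 'Ikbkg\t0.000468046352247\t1.0\nKdm5c\t0.00985334945352\t-0.265\nRbm3\t0.0374093775599\t-0.098\nKdm6a\t0.0058856270777\t0.352\nSat1\t0.549749061659\t-0.017\n' >> "$toy"
significant_events "$toy"   # keeps Ikbkg, Kdm5c and Kdm6a
```

Because `significant_events` looks columns up by header name, it could in principle be pointed at a real file such as output/A5SS.MATS.JC.txt; events whose IncLevelDifference is not numeric would be treated as 0 and dropped.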

4) Tips/Utilities

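4a) Preparing the BAM files

Alternative splicing analysis uses ordinary RNA-seq data, so either STAR or hisat2, both introduced earlier, can be used for mapping, and the mapping parameters usually need no special settings. Here we give a hisat2 example.

(1) install hisat, bamtools, samtools

Download hisat and extract it into the current directory; the other two tools are already installed in Docker.

(2) genome and genome annotation

Download chromFa.tar.gz and Mus_musculus.GRCm38.93.gtf.gz. Note that the coordinates in the GTF file must correspond to the FASTA file.

(3) download raw data

Download the paired-end FASTQ files of SRR065544 and SRR065545.

(4) now your working directory looks like this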

.
├── chromFa.tar.gz
├── hisat2-2.1.0
├── Mus_musculus.GRCm38.93.gtf.gz
├── SRR065544_1.fastq.gz
├── SRR065544_2.fastq.gz
├── SRR065545_1.fastq.gz
├── SRR065545_2.fastq.gz

(5) make hisat index

# extract X chromosome sequence
tar -xz -f chromFa.tar.gz chrX.fa
mv chrX.fa Mus_musculus_chrX.fa

# use only X chromosome: keep "#!" header lines and records whose first column is X
zcat Mus_musculus.GRCm38.93.gtf.gz | grep -P '^(#!|X\t)' > Mus_musculus_chrX.gtf

# make hisat index
hisat2-2.1.0/hisat2_extract_splice_sites.py Mus_musculus_chrX.gtf > Mus_musculus_chrX.ss
hisat2-2.1.0/hisat2_extract_exons.py        Mus_musculus_chrX.gtf > Mus_musculus_chrX.exon

mkdir hisat2_indexes
hisat2-2.1.0/hisat2-build -p 4 \
    --ss Mus_musculus_chrX.ss --exon Mus_musculus_chrX.exon \
    Mus_musculus_chrX.fa hisat2_indexes/Mus_musculus_chrX

(6) mapping

# mapping
hisat2-2.1.0/hisat2 -p 4 --dta \
    -S SRR065544_chrX.sam -x hisat2_indexes/Mus_musculus_chrX \
    -1 SRR065544_1.fastq.gz -2 SRR065544_2.fastq.gz
hisat2-2.1.0/hisat2 -p 4 --dta \
    -S SRR065545_chrX.sam -x hisat2_indexes/Mus_musculus_chrX \
    -1 SRR065545_1.fastq.gz -2 SRR065545_2.fastq.gz

# sort and convert to .bam
samtools sort -@ 4 -o SRR065544_chrX_raw.bam SRR065544_chrX.sam
samtools sort -@ 4 -o SRR065545_chrX_raw.bam SRR065545_chrX.sam

# filter only mapped reads
bamtools index -in SRR065544_chrX_raw.bam
bamtools index -in SRR065545_chrX_raw.bam

bamtools filter -isMapped true -in SRR065544_chrX_raw.bam \
    -out SRR065544_chrX.bam
bamtools filter -isMapped true -in SRR065545_chrX_raw.bam \
    -out SRR065545_chrX.bam
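"Mapped" here refers to bit 0x4 of the SAM FLAG field: a read is unmapped when that bit is set. As a sanity check on the filtered files, the sketch below counts unmapped records in SAM text; `count_unmapped` is a hypothetical helper name and the four records are made-up toy data.

```shell
# Hypothetical helper: count records whose FLAG (SAM column 2) has bit 0x4 set.
count_unmapped() {
    awk -F'\t' '!/^@/ && int($2 / 4) % 2 == 1 { n++ } END { print n + 0 }'
}

# Toy records (made up): flags 0 and 99 are mapped, flags 4 and 77 are unmapped.
printf 'r1\t0\tchrX\t100\t60\t35M\t*\t0\t0\t*\t*\nr2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\nr3\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*\nr4\t99\tchrX\t200\t60\t35M\t=\t300\t135\t*\t*\n' \
    | count_unmapped    # prints 2
```

On a filtered file, `samtools view SRR065544_chrX.bam | count_unmapped` would be expected to print 0; `samtools view -c -f 4 SRR065544_chrX.bam` performs the same count directly.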

5) Homework

  • To identify how CUGBP1 regulates mRNA isoforms, researchers expressed an empty vector (SRR065546) and a vector carrying an shRNA against CUGBP1 (SRR065547) in C2C12 mouse myoblasts. Download the .bam input files (containing only reads mapped to the X chromosome) from the corresponding folder under "Bioinformatics Tutorial / Files" on Tsinghua Cloud, linked from "Files needed by this Tutorial", and find the genes on the X chromosome that show differential alternative splicing. (Submit your code and all output files ending in .MATS.JCEC.txt.)

  • Read the paper "rMATS: Robust and Flexible Detection of Differential Alternative Splicing from Replicate RNA-Seq Data" and briefly explain how rMATS statistically tests for a between-group difference in PSI (percentage spliced in). Only the unpaired-replicates case needs to be explained.

6) References

Many tools are now available for alternative splicing analysis. Readers can consult the following papers to explore other tools:

  • rMATS is introduced in "rMATS: Robust and Flexible Detection of Differential Alternative Splicing from Replicate RNA-Seq Data" in PNAS

  • A survey of software for genome-wide discovery of differential splicing in RNA-Seq data

  • A survey of computational methods in transcriptome-wide alternative splicing analysis

Software and data used in this chapter:

  • rMATS; the output format is described in the user guide: http://rnaseq-mats.sourceforge.net/rmats4.0.2/user_guide.htm#output. The MATS home page figure marks the on-target regions as striped regions.

  • MISO

  • hisat

  • Docker images with the software pre-installed: download links are listed in the appendix table

  • PRJNA130865 (runs SRR065544 and SRR065545): C2C12 with control shRNA vector, and C2C12 with shRNA against CUGBP1

  • Genome sequence and annotation: chromFa.tar.gz; Mus_musculus.GRCm38.93.gtf.gz (GRCm38/mm10 annotation from Ensembl)

  • Raw FASTQ files:
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065544/SRR065544_1.fastq.gz
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065544/SRR065544_2.fastq.gz
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065545/SRR065545_1.fastq.gz
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065545/SRR065545_2.fastq.gz