3.3.GSEA

1) Introduction to GSEA

我们前面介绍的GO和KEGG是收集了大量基因集注释的两个数据库，而这里介绍的GSEA (Gene Set Enrichment Analysis) 通常指的是一种统计方法。
GSEA最早在2005年的PNAS文章Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles中提出，此后在生物信息分析中一直被广泛使用，至今已有三万次以上的引用。

GSEA的基本步骤是：

首先，将所有（或者一批）基因按照上图A中Phenotype A/B两种条件下的表达差异的多少（例如Fold Change）排序；
同时，研究者已经有了一个基因集，如上图B中的Gene set S，S通常来自功能注释或其他的实验结果。
然后，检验Gene set S是否富集在这个排序表的顶端或者底端，计算一个Enrichment Score (ES)。
通常还会利用1000个或更多的shuffle的Gene Sets做为背景，计算很多背景的ES来估计一个FDR值。

1a) Graphic Version of GSEA

进入 GSEA官方下载页面（需要登录方可进入，只需输入邮箱即可），可在安装列表中查找电脑对应版本安装图形化界面的版本。

1b) R Version of GSEA - fGSEA

安装fgsea包

# if you install in windows, make sure Rtools 3.5 is installed
library(devtools)
install_github("ctlab/fgsea")

注意：每次运行时注意要加载安装好的fgsea以及依赖的R包

library(data.table)
library(fgsea)
library(ggplot2)

2) Files Needed

我们只需要2个输入文件，所用到的文件可以直接从该链接中Files needed by this Tutorial中的清华云的Bioinformatics Tutorial / Files路径下的相应文件夹中下载响应文件。

基因排序(Rank list)：如Introduction图中，把所有（或者一批）基因按照Phenotype A/B两种条件下的表达差异的多少（如Fold Change）排个序。
基因集(Gene set): 如Introduction图中预设好的一个Gene Set S，通常来自功能注释或其他的实验结果。

Format	Description	Notes
test.rnk	The RNK file contains a single, rank-ordered gene list (not gene set) in a simple newline-delimited text format.	Rank list
test.gmx/gmt	The GMX/GMT file format is a tab-delimited file format that describes gene sets. This file can contain multiple gene sets: In the GMX format, each column represents a gene set. In the GMT format, each line represents a gene set.	Gene Sets

Format

Description

Notes

test.rnk

The RNK file contains a single, rank-ordered gene list (not gene set) in a simple newline-delimited text format.

Rank list

test.gmx/gmt

The GMX/GMT file format is a tab-delimited file format that describes gene sets. This file can contain multiple gene sets:

In the GMX format, each column represents a gene set.

In the GMT format, each line represents a gene set.

Gene Sets

Example of test.rnk：

GeneID    Rank
ENSG00000177106.14    1062.683366
ENSG00000005206.16    469.7135917
ENSG00000169994.18    390.5745216
ENSG00000185189.17    368.5153284
ENSG00000167702.11    359.6025582
ENSG00000203697.11    295.0883717
...
...

example of test.gmx (推荐桌面版GSEA使用，用户还可以使用GSEA软件自带的gene set，例如可以在软件参数Gene sets database中选择相应的GMX文件）:

Gene_set_1    Gene_set_2
Normal    Tumor
ENSG00000278505.4    G032605
ENSG00000169994.18    ENSG00000088256.8
ENSG00000148702.14    ENSG00000277739.1
ENSG00000134827.7    G061517
ENSG00000148600.14    ENSG00000235333.3
...
...

example of test.gmt (推荐fGSEA使用，GMT文件为GMX文件的转置）:

Gene_set_1    Normal    ENSG00000278505.4    ENSG00000169994.18    ENSG00000148702.14    ENSG00000134827.7    ENSG00000148600.14    ENSG00000083782.7    ENSG00000089356.18    G066936    ENSG00000251026.1    ENSG00000130827.6    G028862    ENSG00000185479.5    ENSG00000137745.11    ENSG00000266714.7    ENSG00000123500.9    ENSG00000166165.12    ENSG00000143850.13    ENSG00000136059.14    ENSG00000119514.6    ENSG00000257046.5
Gene_set_2    Tumor    G032605    ENSG00000088256.8    ENSG00000277739.1    G061517    ENSG00000235333.3    ENSG00000214021.15    G002038    G040639    ENSG00000125851.9    ENSG00000175785.12    ENSG00000123560.13    ENSG00000142959.4    ENSG00000183034.12    ENSG00000108231.12    ENSG00000054793.13    ENSG00000111404.6    ENSG00000180900.18    ENSG00000136546.13    ENSG00000107331.16    ENSG00000215908.10

注意：
GMX文件开头两行格式固定，第一行为Gene sets的名字，第二行为Gene sets的label，前两行不可省略。
fGSEA现在只兼容GMT文件，第一列为Gene Sets的名字，第二列为Gene sets的label，前两列不可省略。
基因集文件中包含了两个Gene Sets，GSEA会分别对这两个Sets做两个GSEA的分析。

3) GSEA Basic

本节我们讲介绍如何利用一个自定义的基因排序(Rank list)和自定义的基因集(Gene set)进行富集分析，主要使用GseaPreranked模块。我们的示例文件中包含了两个Gene Sets，GSEA会分别对这两个Sets做两个GSEA的分析。

3a) Load Data

打开 GSEA，点击Load data，选择文件夹下相应文件test.rnk和test.gmx。

3b) Set Pamameters for GseaPreranked

点击Run GseaPreranked, 选择合适的参数，如下图所示:

Gene set database选择上传的GMX文件，test.gmt。
Number of permutations选择1000。置换检验的次数，数字越大结果越准确，但是太大会占用太多内存，软件默认检验1000次。软件分析时会得到一个基因富集的评分（ES），但是富集评分是否具有统计学意义，软件就会采用随机模拟的方法，根据指定参数随机打乱1000次，得到1000个富集评分，然后判断得到的ES是否在这1000个随机产生的得分中有统计学意义。测试使用时建议填一个很小的数如10，先让程序跑通。真正分析时再换为1000。
Ranked List选择上传的RNK文件，test.rnk。
Collapse/Remap to gene symbols选择No_Collapse。因为这里不需要进行转换，如果GMX文件和RNK文件基因格式不同，咋需要上传.chip进行转换，如上述教程。
Basic fields中Max size需大于GMX文件中gene set大小，Min size需小于GMX文件gene set大小。这里Min size选1。
其他为默认参数。

3c) Check output

运行成功后，Status 会显示为 Success，点击其即可查看输出。网页格式的输出信息，使用说明见 http://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm#_Interpreting_GSEA_Results。

4) fGSEA in R

对R语言比较熟悉的同学可以利用叫做fgsea的R包进行批量的分析和绘图。

library(data.table)
library(fgsea)
library(ggplot2)

4b) Test software

加载示例数据

data(examplePathways)
data(exampleRanks)

运行fgsea进行自定义排序的分析

fgseaRes <- fgsea(pathways = examplePathways, 
                    stats    = exampleRanks,
                    eps      = 0.0,
                    minSize  = 15,
                    maxSize  =500)

检查结果（按照p-value排序）

head(fgseaRes[order(pval), ])

可视化基因set的富集结果

plotEnrichment(examplePathways[["5991130_Programmed_Cell_Death"]],
                exampleRanks) + labs(title="Programmed Cell Death")

可视化多个基因set的富集结果

topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(examplePathways[topPathways], exampleRanks, fgseaRes, gseaParam=0.5)

4c) Run fGSEA

我们可以利用如下脚本重复之前在图形界面完成的GSEA的分析（我们的示例文件中包含了两个Gene Sets，GSEA会分别对这两个Sets进行分析）:

# first input .rnk and .gmt 
rnk.file <- "test.rnk"
gmt.file <- "test.gmt"

# extract ranklist
ranks <- read.table(rnk.file, header=T, colClasses = c("character", "numeric"))
ranks <- setNames(ranks$Rank, ranks$GeneID)

# extract gene sets
pathways <- gmtPathways(gmt.file)

# check your ranklist and geneset
str(ranks)
str(head(pathways))

# run user-defined GSEA. 
# Question about warning message, refer to https://www.gitmemory.com/issue/ctlab/fgsea/79/703686364
fgseaRes <- fgsea(pathways, ranks, nperm = 10, minSize=0, maxSize=500)
head(fgseaRes[order(pval), ])

5) Tips

5a) More input files

GSEA还可以有更多的分析，接受更多的输入文件格式，例如：

Format	Description	Notes
txt, res, gct, pcl	A dataset containing expression value for each feature in each sample.	一批基因的表达值
cls	Associates each sample with a phenotype label.	基因在不同样品下的表达值
Chip	Lists each probe and its matching HUGO gene symbol.	基因 ID 和基因名字对应表

Format

Description

Notes

txt, res, gct, pcl

A dataset containing expression value for each feature in each sample.

一批基因的表达值

cls

Associates each sample with a phenotype label.

基因在不同样品下的表达值

Chip

Lists each probe and its matching HUGO gene symbol.

基因 ID 和基因名字对应表

具体参见 http://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm#_Preparing_Data_Files

5b) Download data with qGSEA

可以通过qGSEA(一个R package)远程从在 GEO 中下载感兴趣的数据集（本例中我们使用 GSE19161)：

R
# 进入 R 操作界面
> library(qGSEA)
> dir.create('raw/')
> download.file(rGEO::gse_soft_ftp('GSE19161'), 'raw/GSE19161_family.soft.gz')
> download.file(rGEO::gse_matrix_ftp('GSE19161'), 'raw/GSE19161_series_matrix.txt.gz')

> dir.create('input/')
> qGSEA::make_gsea_input(
matrix_file = 'raw/GSE19161_series_matrix.txt.gz',
soft_file = 'raw/GSE19161_family.soft.gz',
output_dir = 'input/',
gene = 'EIF4G2'
)
> q()
Save workspace image? [y/n/c]: n # 按 n 再按 Enter

可得到文件目录结构如下：

gsea
├── input
│   ├── GSE19161.chip
│   ├── GSE19161.cls
│   └── GSE19161.txt
└── raw
    ├── GSE19161_family.soft.gz
    └── GSE19161_series_matrix.txt.gz

5c) Binary phenotype in GSEA

GSEA也可以分析与离散型表型显著相关的 gene set （与差异表达相似），不同的地方在于 .cls 文件。以 GSE2251 例，打开链接后，可以在 Samples 一栏找到样品信息（样本数过多时，网页会显示不全，这时可以解压并打开 _series_matrix.txt.gz 查看样本信息。）。本例共有12个样本，其中第1, 2, 3, 7, 8, 9 号为"E2-8"，第4, 5, 6, 10, 11, 12 号为"vehicle"。针对这一表型的.cls 文件如下所示**：**

12 2 1
# E2-8 vehicle
0 0 0 1 1 1 0 0 0 1 1 1

12 是样本总数，2 是表型类型数目，0 代表第一种表型，1 代表第二种表型。

注意：
由于表型文件(cls文件)为离散型，选择参数时Basci fields中Metric for ranking genes应选择signal2noise、t-Test、ratio_of_class、 diff_of_class、log2_ratio_of_class之一。

5d) Common error messages

由于基因芯片仅分析一部分基因，有些 gene set 会因为包含的基因过少而被剔除。在极端情况下，可能所有的 gene set 都会被剔除，这时就会发生如下错误：

6) Homework

对一套Gene Set (KEGG_CELL_CYCLE.v2022.1.Hs.ensg.gmt) 在我们的Ranked Genes (COAD.tumor.vs.normal.rnk) 上做GSEA分析，数据可从https://cloud.tsinghua.edu.cn/d/ad22768345664924b202/?p=%2FFiles%2FPART_II%2F3.2.gsea下载。